Data model duplication#3477
Conversation
Hi @mlodic,
Can you please clean up the commit history and fix the conflicts? Thank you.
Force-pushed 836a3fe to 499515e
Hi @mlodic, I've rebased, resolved the conflicts, and squashed the history into one clean commit. I've also verified the CAS logic and migration safety in the Docker environment. Ready for review!
mlodic
left a comment
There are neither unit tests nor any other proof that the migration works as intended. Please show how the data model is changed in the database before and after the migration is applied.
Force-pushed f336b65 to d0d0e04
Force-pushed d0d0e04 to 316a7a6
Force-pushed fa1bf1a to 97f3c80
Force-pushed 97f3c80 to eaca64d
Hi @mlodic, ready for review. Thanks!
```python
encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
return hashlib.sha256(encoded_data).hexdigest()


def populate_fingerprints(apps, schema_editor):
```
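Expanded into a self-contained sketch (the function name and the shape of `normalized_data` are assumptions, not the PR's exact code), the fingerprinting step above could read:

```python
import hashlib
import json


def calculate_fingerprint(normalized_data: dict) -> str:
    """Hash a normalized JSON serialization so that semantically
    identical payloads map to the same 64-char hex digest."""
    # sort_keys=True makes the serialization independent of key order
    encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded_data).hexdigest()
```

Because the keys are sorted before hashing, two payloads that differ only in key order produce the same fingerprint, which is what makes the fingerprint usable as a content address.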
The problem with adding data migrations is that no failing tests show up when running the classic CI, because the CI works on a fresh environment.
This is a very risky change if not tested against already existing environments. The benefit could easily be destroyed by introducing an unwanted breaking change. Additional, more comprehensive tests should be required. We can't merge this in the next release; we would need to wait for a major release, as we will do for other critical PRs.
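One way to exercise the backfill logic against pre-existing data, sketched here with plain-Python stand-ins for the ORM rows (the row shape and function name are hypothetical, not the migration's actual code):

```python
import hashlib
import json


def fingerprint(data: dict) -> str:
    # Same scheme as the PR: sorted-key JSON hashed with SHA-256
    return hashlib.sha256(
        json.dumps(data, sort_keys=True).encode("utf-8")
    ).hexdigest()


def populate_fingerprints_sim(rows: list) -> dict:
    """Backfill simulation: stamp every legacy row with its fingerprint
    and keep only the first row per fingerprint, as a CAS migration would."""
    kept = {}
    for row in rows:
        fp = fingerprint(row["data"])
        row["fingerprint"] = fp
        kept.setdefault(fp, row)  # duplicates collapse onto the first row
    return kept


rows = [
    {"id": 1, "data": {"ip": "1.2.3.4", "malicious": True}},
    {"id": 2, "data": {"malicious": True, "ip": "1.2.3.4"}},  # duplicate content
    {"id": 3, "data": {"ip": "5.6.7.8", "malicious": False}},
]
unique = populate_fingerprints_sim(rows)
print(len(rows), "rows before,", len(unique), "after")  # prints: 3 rows before, 2 after
```

A real check along these lines would run against a populated database snapshot rather than in-memory dicts, comparing row counts before and after `migrate` is applied.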
Hi @mlodic, I totally get the worry about the CI missing potential migration bugs; it's actually why I spent some extra time making sure this is really safe for anyone with a big existing database.
I've verified the full suite in Docker. Here is the proof from my manual before-vs-after check:
I also completely understand the decision to wait for a major release. Let me know if you want me to check anything else!
This will be reconsidered after the next release in April.







Description
This PR implements Content-Addressable Storage (CAS) for Data Models to resolve the linear storage bloat issue identified in #3450.
Previously, IntelOwl would create a new database row for every successful analyzer execution, even if the intelligence gathered was bit-for-bit identical to an existing record. This led to redundant data and unnecessary index growth. This change transitions the system to deduplicate intelligence at the point of ingestion.
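The dedup-at-ingestion idea can be sketched as a get-or-create keyed by fingerprint. This toy store (all class and method names are illustrative, not the PR's API) returns the existing record when an identical payload has already been stored:

```python
import hashlib
import json


class DataModelStore:
    """Toy content-addressable store: ingest() returns an existing record
    when a semantically identical payload was already stored."""

    def __init__(self):
        self._by_fingerprint = {}

    @staticmethod
    def _fingerprint(data: dict) -> str:
        return hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")
        ).hexdigest()

    def ingest(self, data: dict):
        fp = self._fingerprint(data)
        if fp in self._by_fingerprint:
            return self._by_fingerprint[fp], False  # deduplicated, no new row
        record = {"fingerprint": fp, "data": data}
        self._by_fingerprint[fp] = record
        return record, True  # newly created


store = DataModelStore()
_, created_first = store.ingest({"domain": "example.com", "rank": 1})
_, created_again = store.ingest({"rank": 1, "domain": "example.com"})
print(created_first, created_again)  # prints: True False
```

In the database this shape maps naturally onto a unique index on the fingerprint column plus a get-or-create at the point of ingestion, so identical analyzer results never create a second row.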
Key Architectural Changes:
This directly addresses the # TODO in api_app/analyzers_manager/models.py:124.
Type of change
Checklist
I have followed the contribution guidelines (https://intelowlproject.github.io/docs/IntelOwl/contribute/) to this project.
All tests (existing tests and new deduplication tests) passed with 0 errors.
Closes #3450: Reusing identical Data Models to prevent Data Bloat (Found a # TODO)