
Data model duplication #3477

Open
ayushgupta704 wants to merge 4 commits into intelowlproject:develop from ayushgupta704:feature/data-model-deduplication

Conversation

@ayushgupta704
Contributor

@ayushgupta704 ayushgupta704 commented Mar 14, 2026

Description

This PR implements Content-Addressable Storage (CAS) for Data Models to resolve the linear storage bloat identified in #3450.
Previously, IntelOwl created a new database row for every successful analyzer execution, even when the intelligence gathered was bit-for-bit identical to an existing record. This led to redundant data and unnecessary index growth. This change transitions the system to deduplicate intelligence at the point of ingestion.

Key Architectural Changes:

  • Stable Hashing: Added a normalize_dict helper to recursively sort dictionary keys and lists. This ensures the fingerprint stays the same even if an external API returns JSON keys in a different order.
  • Unique Fingerprinting: Introduced an indexed fingerprint field to the BaseDataModel. This acts as a unique ID for the actual intelligence content, allowing for fast deduplication lookups.
  • Smart Deduplication: Updated the create_data_model logic to use a "get-or-create" pattern. If the intelligence is already in the database, the system now just links it instead of creating a duplicate row.
  • Concurrency & Safety: Wrapped the logic in an atomic database transaction to prevent race conditions during parallel scans, while strictly maintaining existing return types to ensure zero regressions.
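The stable-hashing step described above can be sketched in plain Python. This is a minimal illustration based on the PR description and the `json.dumps`/`sha256` excerpt shown later in the review; the function names `normalize_dict` and `compute_fingerprint` are taken from or modeled on that description, but the exact implementation here is an assumption, not the PR's code:

```python
import hashlib
import json


def normalize_dict(value):
    """Recursively canonicalize a payload: sort dict keys and list
    elements so that key/element ordering from an external API never
    changes the serialized bytes."""
    if isinstance(value, dict):
        return {key: normalize_dict(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return sorted((normalize_dict(item) for item in value), key=repr)
    return value


def compute_fingerprint(data):
    """SHA-256 hex digest over the canonical JSON encoding of the payload."""
    normalized_data = normalize_dict(data)
    encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded_data).hexdigest()
```

Two reports whose JSON differs only in key or list order hash to the same fingerprint, so they resolve to the same stored row.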

This directly addresses the # TODO in api_app/analyzers_manager/models.py:124.
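The "get-or-create" flow can be illustrated with a toy in-memory stand-in for the data-model table. All names here are hypothetical; the actual PR presumably performs the equivalent lookup-then-insert with Django's ORM inside `transaction.atomic()`:

```python
import hashlib
import json


def _fingerprint(data):
    # Canonical JSON -> SHA-256, mirroring the stable-hashing step.
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class DataModelStore:
    """Toy in-memory stand-in for the data-model table."""

    def __init__(self):
        self._rows = {}  # fingerprint -> stored payload

    def create_data_model(self, data):
        """Return (row, created): reuse the existing row when the
        fingerprint is already known, otherwise insert a new one."""
        fp = _fingerprint(data)
        if fp in self._rows:
            return self._rows[fp], False
        self._rows[fp] = data
        return data, True
```

In the real implementation this lookup-then-insert must run inside an atomic transaction, otherwise two parallel analyzer runs could both miss the lookup and insert the same fingerprint twice.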

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read and understood the rules about [How to Contribute](https://intelowlproject.github.io/docs/IntelOwl/contribute/) to this project.
  • I have inserted the copyright banner at the start of the file.
  • Linters (Ruff) gave 0 errors.
  • I have added tests for the feature I solved. All tests (including the existing analyzer suite and new deduplication tests) passed with 0 errors.

closes Reusing identical Data Models to prevent Data Bloat (Found a # TODO) #3450

@ayushgupta704
Contributor Author

Hi @mlodic,
I’ve implemented the content-addressable storage (CAS) to fix the data model bloat. Identical intelligence reports now correctly reuse the same records via fingerprints. I've verified it manually and all tests are passing. Ready for your review when you have a moment!

Screenshot 2026-03-15 093056

@mlodic
Member

mlodic commented Mar 17, 2026

can you please clean up the commit history and fix the conflicts. thank you

@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 4 times, most recently from 836a3fe to 499515e Compare March 18, 2026 05:59
@ayushgupta704
Contributor Author

Hi @mlodic, I've rebased, resolved conflicts, and squashed the history into one clean commit and I've verified the CAS logic and migration safety in the Docker environment. Ready for review!

Member

@mlodic mlodic left a comment


there are neither unit tests nor any other proof that the migration works as intended. Please show how the data model is changed in the database before and after the migration is applied

@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 2 times, most recently from f336b65 to d0d0e04 Compare March 19, 2026 20:25
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch from d0d0e04 to 316a7a6 Compare March 20, 2026 06:06
@ayushgupta704 ayushgupta704 reopened this Mar 20, 2026
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 2 times, most recently from fa1bf1a to 97f3c80 Compare March 20, 2026 07:22
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch from 97f3c80 to eaca64d Compare March 20, 2026 07:47
@ayushgupta704
Contributor Author

Hi @mlodic,
I’ve refactored this PR to address your feedback. It is now a single clean commit that includes the core logic, required migrations, and dedicated unit tests for CAS verification.

  • Database Schema Proof: The migration adds the fingerprint column and a performance index to the base data models.
Screenshot 2026-03-20 133103
  • Passing CAS Verification Tests: I've added a test suite (test_cas_deduplication.py) that confirms identical reports correctly deduplicate across different jobs.
Screenshot 2026-03-20 133442 Screenshot 2026-03-20 133507
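For reference, the schema change described in the first bullet would look roughly like this as a Django migration. Only the field name (`fingerprint`) and the fact that it is indexed come from the PR description; the model name, migration dependencies, and field options are a sketch, not the PR's actual migration file:

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        # previous data_model_manager migration, elided here
    ]

    operations = [
        migrations.AddField(
            model_name="basedatamodel",
            name="fingerprint",
            # Indexed for fast deduplication lookups; nullable so the
            # column can be added before existing rows are backfilled.
            field=models.CharField(max_length=64, null=True, db_index=True),
        ),
    ]
```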

Ready for review. Thanks!

encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
return hashlib.sha256(encoded_data).hexdigest()

def populate_fingerprints(apps, schema_editor):
Member

the problem with adding data migrations is that no failing tests show up in classic CI, because the CI works on a fresh environment.
This is a very risky change if not tested against already existing environments. The benefit of this could easily be destroyed by introducing an unwanted breaking change. Additional, more comprehensive tests should be required. We can't merge this in the next release; we would need to wait for a major, like we will do for other critical PRs

@ayushgupta704
Contributor Author

ayushgupta704 commented Mar 21, 2026

Hi @mlodic, I totally get the worry about the CI missing potential migration bugs; it's actually why I spent some extra time making sure this is really safe for anyone with a big existing database.
I ended up writing a migration test (test_migrations.py) that fakes an old installation, seeds it with messy duplicate data, and then runs the upgrade. Here's a breakdown of the changes:

  • Migration Integrity Test: I added tests/api_app/data_model_manager/test_migrations.py. Instead of starting fresh, it forces the DB backward to 0012, seeds "dirty" duplicate data, and verifies that the 0013 upgrade merges everything and re-links all Reports/Jobs/Events safely without data loss.

  • Scale & Resilience: The migration now uses bulk_update (500-row batches) and try-except blocks. This prevents table-locking and ensures a single malformed row won't crash the upgrade on large production environments.

  • Bug Fix: Caught and fixed a potential AttributeError in the signal handlers that triggers during the record pruning process.
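The batching strategy in the second bullet can be sketched generically. The 500-row batch size comes from the comment above; `apply_batch` is a hypothetical stand-in for the real `bulk_update` call, and the per-batch error handling mirrors the described try-except approach:

```python
def batched(rows, size=500):
    """Split rows into fixed-size chunks so each bulk_update call
    touches at most `size` rows and holds locks only briefly."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def migrate_in_batches(rows, apply_batch, size=500):
    """Apply `apply_batch` to each chunk; a failure in one chunk is
    counted and skipped instead of aborting the whole migration."""
    failures = 0
    for chunk in batched(rows, size):
        try:
            apply_batch(chunk)
        except Exception:
            failures += 1  # real code would log the offending rows
    return failures
```

The trade-off is that a malformed batch is skipped rather than rolled back globally, which matches the stated goal of not crashing the upgrade on large production environments.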

I’ve verified the full suite in Docker. Here is the proof from my manual Before vs After check:

  • Before Migration: Created 2 identical IP records linked to 2 separate reports.
Screenshot 2026-03-21 185550
  • After Migration: Redundant IP row deleted. Both reports successfully re-linked to the single remaining canonical record.
Screenshot 2026-03-21 185640
  • All 6 tests are passing in my local setup.
Screenshot 2026-03-21 184408

I also completely understand the decision to wait for a major release, so let me know if you want me to check anything else!
Thanks

@mlodic
Member

mlodic commented Mar 24, 2026

this will be reconsidered after the next release in April

@mlodic mlodic added the keep-open To avoid workflow closing PRs label Mar 24, 2026