
Data model duplication #3477

Open
ayushgupta704 wants to merge 4 commits into intelowlproject:develop from ayushgupta704:feature/data-model-deduplication

Conversation

@ayushgupta704
Contributor

@ayushgupta704 ayushgupta704 commented Mar 14, 2026

Description

This PR implements Content-Addressable Storage (CAS) for Data Models to resolve the linear storage bloat identified in #3450.
Previously, IntelOwl created a new database row for every successful analyzer execution, even when the intelligence gathered was bit-for-bit identical to an existing record. This led to redundant data and unnecessary index growth. This change transitions the system to deduplicate intelligence at the point of ingestion.

Key Architectural Changes:

  • Stable Hashing: Added a normalize_dict helper to recursively sort dictionary keys and lists. This ensures the fingerprint stays the same even if an external API returns JSON keys in a different order.
  • Unique Fingerprinting: Introduced an indexed fingerprint field to the BaseDataModel. This acts as a unique ID for the actual intelligence content, allowing for fast deduplication lookups.
  • Smart Deduplication: Updated the create_data_model logic to use a "get-or-create" pattern. If the intelligence is already in the database, the system now just links it instead of creating a duplicate row.
  • Concurrency & Safety: Wrapped the logic in an atomic database transaction to prevent race conditions during parallel scans, while strictly maintaining existing return types to ensure zero regressions.
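The stable-hashing step described above can be sketched in plain Python. This is a minimal illustration based on the PR description and the `json.dumps`/`sha256` excerpt shown later in the review; the function names `normalize_dict` and `compute_fingerprint` are taken from or modeled on that description, but the exact implementation here is an assumption, not the PR's code:

```python
import hashlib
import json


def normalize_dict(value):
    """Recursively canonicalize a payload: sort dict keys and list
    elements so that key/element ordering from an external API never
    changes the serialized bytes."""
    if isinstance(value, dict):
        return {key: normalize_dict(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return sorted((normalize_dict(item) for item in value), key=repr)
    return value


def compute_fingerprint(data):
    """SHA-256 hex digest over the canonical JSON encoding of the payload."""
    normalized_data = normalize_dict(data)
    encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded_data).hexdigest()
```

Two reports whose JSON differs only in key or list order hash to the same fingerprint, so they resolve to the same stored row.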

This directly addresses the # TODO in api_app/analyzers_manager/models.py:124.
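The "get-or-create" flow can be illustrated with a toy in-memory stand-in for the data-model table. All names here are hypothetical; the actual PR presumably performs the equivalent lookup-then-insert with Django's ORM inside `transaction.atomic()`:

```python
import hashlib
import json


def _fingerprint(data):
    # Canonical JSON -> SHA-256, mirroring the stable-hashing step.
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class DataModelStore:
    """Toy in-memory stand-in for the data-model table."""

    def __init__(self):
        self._rows = {}  # fingerprint -> stored payload

    def create_data_model(self, data):
        """Return (row, created): reuse the existing row when the
        fingerprint is already known, otherwise insert a new one."""
        fp = _fingerprint(data)
        if fp in self._rows:
            return self._rows[fp], False
        self._rows[fp] = data
        return data, True
```

In the real implementation this lookup-then-insert must run inside an atomic transaction, otherwise two parallel analyzer runs could both miss the lookup and insert the same fingerprint twice.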

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read and understood the rules about [How to Contribute](https://intelowlproject.github.io/docs/IntelOwl/contribute/) to this project.
  • I have inserted the copyright banner at the start of the file.
  • Linters (Ruff) gave 0 errors.
  • I have added tests for the feature I solved. All tests (including the existing analyzer suite and new deduplication tests) passed with 0 errors.

closes Reusing identical Data Models to prevent Data Bloat (Found a # TODO) #3450

@ayushgupta704
Contributor Author

Hi @mlodic,
I’ve implemented the content-addressable storage (CAS) to fix the data model bloat. Identical intelligence reports now correctly reuse the same records via fingerprints. I've verified it manually and all tests are passing. Ready for your review when you have a moment!

Screenshot 2026-03-15 093056

@mlodic
Member

mlodic commented Mar 17, 2026

can you please clean up the commit history and fix the conflicts. thank you

@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 4 times, most recently from 836a3fe to 499515e Compare March 18, 2026 05:59
@ayushgupta704
Contributor Author

Hi @mlodic, I've rebased, resolved conflicts, and squashed the history into one clean commit and I've verified the CAS logic and migration safety in the Docker environment. Ready for review!

Member

@mlodic mlodic left a comment


there are neither unit tests nor any other proof that the migration works as intended. Please show how the data model is changed in the database before and after the migration is applied

@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 2 times, most recently from f336b65 to d0d0e04 Compare March 19, 2026 20:25
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch from d0d0e04 to 316a7a6 Compare March 20, 2026 06:06
@ayushgupta704 ayushgupta704 reopened this Mar 20, 2026
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch 2 times, most recently from fa1bf1a to 97f3c80 Compare March 20, 2026 07:22
@ayushgupta704 ayushgupta704 force-pushed the feature/data-model-deduplication branch from 97f3c80 to eaca64d Compare March 20, 2026 07:47
@ayushgupta704
Contributor Author

Hi @mlodic,
I’ve refactored this PR to address your feedback. It is now a single clean commit that includes the core logic, required migrations, and dedicated unit tests for CAS verification.

  • Database Schema Proof: The migration adds the fingerprint column and a performance index to the base data models.
Screenshot 2026-03-20 133103
  • Passing CAS Verification Tests: I've added a test suite (test_cas_deduplication.py) that confirms identical reports correctly deduplicate across different jobs.
Screenshot 2026-03-20 133442 Screenshot 2026-03-20 133507
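For reference, the schema change described in the first bullet would look roughly like this as a Django migration. Only the field name (`fingerprint`) and the fact that it is indexed come from the PR description; the model name, migration dependencies, and field options are a sketch, not the PR's actual migration file:

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        # previous data_model_manager migration, elided here
    ]

    operations = [
        migrations.AddField(
            model_name="basedatamodel",
            name="fingerprint",
            # Indexed for fast deduplication lookups; nullable so the
            # column can be added before existing rows are backfilled.
            field=models.CharField(max_length=64, null=True, db_index=True),
        ),
    ]
```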

Ready for review. Thanks!

encoded_data = json.dumps(normalized_data, sort_keys=True).encode("utf-8")
return hashlib.sha256(encoded_data).hexdigest()

def populate_fingerprints(apps, schema_editor):
Member

the problem with adding data migrations is that no failing tests show up in classic CI, because the CI works on a fresh environment.
This is a very risky change if not tested against already existing environments. The benefit of this could easily be destroyed by introducing an unwanted breaking change. Additional, more comprehensive tests should be required. We can't merge this in the next release; we would need to wait for a major, like we will do for other critical PRs

@ayushgupta704
Contributor Author

ayushgupta704 commented Mar 21, 2026

Hi @mlodic, I totally get the worry about the CI missing potential migration bugs; it's actually why I spent some extra time making sure this is really safe for anyone with a big existing database.
I ended up writing a migration test (test_migrations.py) that fakes an old installation, seeds it with messy duplicate data, and then runs the upgrade. Here's a breakdown of the changes:

  • Migration Integrity Test: I added tests/api_app/data_model_manager/test_migrations.py. Instead of starting fresh, it forces the DB backward to 0012, seeds "dirty" duplicate data, and verifies that the 0013 upgrade merges everything and re-links all Reports/Jobs/Events safely without data loss.

  • Scale & Resilience: The migration now uses bulk_update (500-row batches) and try-except blocks. This prevents table-locking and ensures a single malformed row won't crash the upgrade on large production environments.

  • Bug Fix: Caught and fixed a potential AttributeError in the signal handlers that triggers during the record pruning process.
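The batching strategy in the second bullet can be sketched generically. The 500-row batch size comes from the comment above; `apply_batch` is a hypothetical stand-in for the real `bulk_update` call, and the per-batch error handling mirrors the described try-except approach:

```python
def batched(rows, size=500):
    """Split rows into fixed-size chunks so each bulk_update call
    touches at most `size` rows and holds locks only briefly."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def migrate_in_batches(rows, apply_batch, size=500):
    """Apply `apply_batch` to each chunk; a failure in one chunk is
    counted and skipped instead of aborting the whole migration."""
    failures = 0
    for chunk in batched(rows, size):
        try:
            apply_batch(chunk)
        except Exception:
            failures += 1  # real code would log the offending rows
    return failures
```

The trade-off is that a malformed batch is skipped rather than rolled back globally, which matches the stated goal of not crashing the upgrade on large production environments.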

I’ve verified the full suite in Docker. Here is the proof from my manual Before vs After check:

  • Before Migration: Created 2 identical IP records linked to 2 separate reports.
Screenshot 2026-03-21 185550
  • After Migration: Redundant IP row deleted. Both reports successfully re-linked to the single remaining canonical record.
Screenshot 2026-03-21 185640
  • All 6 tests are passing in my local setup.
Screenshot 2026-03-21 184408

I also completely understand the decision to wait for a major release, so let me know if you want me to check anything else!
Thanks

@mlodic
Member

mlodic commented Mar 24, 2026

this will be reconsidered after the next release in April

@mlodic mlodic added the keep-open To avoid workflow closing PRs label Mar 24, 2026