Feature addition: NormalizedArticle Pydantic schema for unified article representation#193

Open
Abhishek-Kumar-Rai5 wants to merge 1 commit into
c2siorg:mainfrom
Abhishek-Kumar-Rai5:clean-normalized

Conversation

@Abhishek-Kumar-Rai5

Closes #191
Introduces a unified NormalizedArticle Pydantic schema for representing cybersecurity news articles across the system.

Description

Related Issue

Currently, articles are passed across the pipeline as loosely structured dictionaries (scraper → embedding → retrieval → LLM), which leads to:

  • Inconsistent field formats across modules
  • No deterministic article identity, which allows duplicate entries in the vector store
  • No content-based deduplication across sources
  • Difficulty extending ingestion to RSS and other APIs

This schema introduces a single, validated data contract to standardize article representation across the system.
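As a rough illustration of what such a data contract might look like, here is a minimal sketch. It is not the code in models/article.py: the class name NormalizedArticleSketch, the allowed credibility_tier values, the default tier, and the scraper dictionary keys ("url", "title", "text") are all assumptions made for this example.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class NormalizedArticleSketch(BaseModel):
    """Hypothetical, reduced stand-in for the NormalizedArticle schema."""

    source_url: str
    title: str
    content: str
    # Assumption: the real set of allowed tiers is not stated in this PR.
    credibility_tier: Literal["high", "medium", "low"]

    @classmethod
    def from_scraper_dict(cls, raw: dict) -> "NormalizedArticleSketch":
        # Assumption: the scraper emits "url", "title" and "text" keys.
        return cls(
            source_url=raw["url"],
            title=raw["title"].strip(),
            content=raw["text"].strip(),
            # Assumption: fall back to "medium" when no tier is supplied.
            credibility_tier=raw.get("credibility_tier", "medium"),
        )


article = NormalizedArticleSketch.from_scraper_dict(
    {"url": "https://example.com/advisory", "title": " Sample advisory ", "text": "Body text"}
)

try:
    NormalizedArticleSketch(
        source_url="https://example.com/advisory",
        title="Sample advisory",
        content="Body text",
        credibility_tier="unknown",  # not an allowed value, so validation fails
    )
except ValidationError as exc:
    rejected = str(exc)
```

Every module downstream of the factory method then receives the same validated shape instead of an ad hoc dictionary, and a malformed record fails at ingestion rather than deep inside retrieval.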

Motivation and Context

  • This is a foundational, additive change and does not modify or break existing pipeline behavior
  • Future ingestion, retrieval, and agent pipeline improvements will build on this schema

How Has This Been Tested?

  • Added unit tests in tests/test_article_model.py covering:

    • Deterministic ID generation (generate_id)
    • Content hash consistency (compute_content_hash)
    • Valid schema creation with correct field types
    • Validation errors for invalid credibility_tier
    • Factory method from_scraper_dict for converting scraper output
    • Factory method from_rss_entry using a mocked RSS entry
  • All tests were executed locally using:

    pytest tests/test_article_model.py
  • Result:

    6 passed, 0 failed
    

The test run is shown in the following screenshot:

Screenshots (if appropriate):

image

Types of changes

  • Added models/article.py with NormalizedArticle schema

  • Includes deterministic ID generation (md5(source_url))

  • Includes content-based deduplication (sha256(title + content))

  • Added factory methods for scraper and RSS inputs

  • Added full pytest coverage for validation and helpers

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to change)
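The two hashing helpers listed above could be sketched roughly as follows. The function names generate_id and compute_content_hash come from the test list in this PR, but the signatures, the UTF-8 encoding, and the hex-string return type are assumptions; the actual code in models/article.py may differ.

```python
import hashlib


def generate_id(source_url: str) -> str:
    """Deterministic article ID: MD5 hex digest of the source URL.

    MD5 serves only as a stable identifier here, not as a security measure.
    """
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()


def compute_content_hash(title: str, content: str) -> str:
    """Content fingerprint: SHA-256 hex digest of title + content.

    Assumption: plain concatenation with no separator, as the
    sha256(title + content) notation in this PR suggests.
    """
    return hashlib.sha256((title + content).encode("utf-8")).hexdigest()


# The same URL always maps to the same ID, so re-ingesting an article
# can overwrite the existing record instead of creating a duplicate.
first_id = generate_id("https://example.com/advisory")
second_id = generate_id("https://example.com/advisory")

# The same text published under two different URLs yields the same
# content hash, which is what allows deduplication across sources.
hash_a = compute_content_hash("Sample advisory", "Body text")
hash_b = compute_content_hash("Sample advisory", "Body text")
```

One caveat with plain concatenation: different (title, content) pairs can produce the same joined string, for example ("ab", "c") and ("a", "bc"). If that matters in practice, inserting a separator between the two fields would avoid it.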

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Development

Successfully merging this pull request may close these issues.

Add unified NormalizedArticle schema for standardized article representation