Feature addition: NormalizedArticle Pydantic schema for unified article representation by Abhishek-Kumar-Rai5 · Pull Request #193 · c2siorg/b0bot

Abhishek-Kumar-Rai5 · 2026-04-14T11:56:24Z

Closes #191
Introduces a unified NormalizedArticle Pydantic schema for representing cybersecurity news articles across the system.

Description

Related Issue

Currently, articles are passed across the pipeline as loosely structured dictionaries (scraper → embedding → retrieval → LLM), which leads to:

Inconsistent field formats across modules
No deterministic article identity (duplicates in vector store)
No content-based deduplication across sources
Difficulty extending ingestion to RSS and other APIs

This schema introduces a single, validated data contract to standardize article representation across the system.
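As a rough illustration of what a validated data contract means here, the sketch below shows a Pydantic model rejecting a malformed article dictionary. It is not the code from models/article.py: the field list and the tier names ("high", "medium", "low") are assumptions made for the example, and `is_valid_article` is a hypothetical helper.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class NormalizedArticle(BaseModel):
    """Illustrative stand-in for the schema; field names are assumed."""

    id: str
    source_url: str
    title: str
    content: str
    content_hash: str
    # Hypothetical tier names; the real allowed values are not listed here.
    credibility_tier: Literal["high", "medium", "low"]


def is_valid_article(raw: dict) -> bool:
    """Return True when a loosely structured dict satisfies the schema."""
    try:
        NormalizedArticle(**raw)
    except ValidationError:
        return False
    return True


# Example input shaped like a scraper result (all values are made up).
sample = {
    "id": "abc123",
    "source_url": "https://example.com/news/1",
    "title": "Example advisory",
    "content": "Example body text.",
    "content_hash": "deadbeef",
    "credibility_tier": "high",
}
```

With a model like this, a dictionary that is missing fields or carries an unknown `credibility_tier` fails at construction time instead of propagating through embedding and retrieval.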

Motivation and Context

This is a foundational, additive change; it does not modify or break existing pipeline behavior.
Future ingestion, retrieval, and agent pipeline improvements will build on this schema.

How Has This Been Tested?

Added unit tests in tests/test_article_model.py covering:
- Deterministic ID generation (generate_id)
- Content hash consistency (compute_content_hash)
- Valid schema creation with correct field types
- Validation errors for invalid credibility_tier
- Factory method from_scraper_dict for converting scraper output
- Factory method from_rss_entry using a mocked RSS entry
All tests were executed locally using:
```
pytest tests/test_article_model.py
```
Result:
```
6 passed, 0 failed
```

The test run is shown in the following screenshot:

Screenshots (if appropriate):

Types of changes

Added models/article.py with NormalizedArticle schema
Includes deterministic ID generation (md5(source_url))
Includes content-based deduplication (sha256(title + content))
Added factory methods for scraper and RSS inputs
Added pytest tests covering validation and the helper methods
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
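The two hashing helpers listed above could look roughly like the following. This is a minimal sketch using only `hashlib`; the function signatures and the UTF-8 encoding are assumptions, since the PR text only states `md5(source_url)` and `sha256(title + content)`.

```python
import hashlib


def generate_id(source_url: str) -> str:
    """Deterministic article ID: MD5 hex digest of the source URL.

    Assumes the URL is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(source_url.encode("utf-8")).hexdigest()


def compute_content_hash(title: str, content: str) -> str:
    """Content fingerprint: SHA-256 hex digest of title + content.

    Plain concatenation is used because that is what the PR describes.
    """
    return hashlib.sha256((title + content).encode("utf-8")).hexdigest()
```

Because the ID depends only on the URL, re-ingesting the same link maps to the same vector store key, while the content hash lets the same story from two different URLs be detected as a duplicate. One property of plain concatenation worth keeping in mind: `("ab", "c")` and `("a", "bc")` produce the same hash, so inserting a separator between title and content may be worth considering.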

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

add NormalizedArticle Pydantic schema for unified article representation

3801761

Abhishek-Kumar-Rai5 wants to merge 1 commit into c2siorg:main from Abhishek-Kumar-Rai5:clean-normalized
