feature: add query preprocessing pipeline with CVE normalization and stopword filtering #195

Open
Abhishek-Kumar-Rai5 wants to merge 2 commits into c2siorg:main from Abhishek-Kumar-Rai5:query-preprocessor

@Abhishek-Kumar-Rai5

Introduces a query preprocessing pipeline to normalize and clean user queries before embedding and retrieval.

Description

Currently, queries are passed directly to the embedding model without preprocessing, which leads to:

  • Poor matching of cybersecurity-specific terms (e.g., inconsistent CVE formats)
  • Noisy queries caused by generic words such as "latest" and "news"
  • Reduced retrieval accuracy in Pinecone

Related Issue

Closes #194

Motivation and Context

  • Fully additive change (no existing code modified)
  • Intended to be integrated into models/NewsModel.py in a future step

How Has This Been Tested?

  • Added unit tests in tests/test_query_preprocessor.py covering:

    • CVE normalization across multiple input formats (e.g., "cve 2024 1234", "CVE_2024_1234", "cve2024-1234")
    • Stopword removal for generic terms ("latest", "news", "update", etc.)
    • Lowercasing and whitespace normalization
    • End-to-end pipeline behavior via preprocess_query
  • Tests were executed locally using:

    python -m pytest tests/test_query_preprocessor.py
  • Result:

    4 passed, 0 failed
    

Screenshots (if appropriate):

[image]

Types of changes

  • Added utils/query_preprocessor.py with:

    • normalize_query (lowercase + whitespace cleanup)
    • standardize_cve_patterns (normalize CVE formats)
    • filter_stopwords (remove generic noise terms)
    • preprocess_query (pipeline function)
  • Added unit tests covering CVE normalization, stopword removal, and full pipeline behavior
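
For reviewers, the four helpers listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in utils/query_preprocessor.py: the function names come from this description, but the regex, the canonical output form (upper-case `CVE-YYYY-NNNN`), the contents of the stopword set, and the order of steps inside `preprocess_query` are guesses.

```python
import re

# Assumption: a CVE reference is "cve", a 4-digit year, then 4+ digits,
# with optional spaces, underscores, or hyphens between the parts.
_CVE_RE = re.compile(r"\bcve[\s_-]*(\d{4})[\s_-]*(\d{4,})\b", re.IGNORECASE)

# Assumption: illustrative stopword set; the PR names "latest", "news",
# and "update" but does not give the full list.
STOPWORDS = {"latest", "news", "update", "updates"}


def normalize_query(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def standardize_cve_patterns(query: str) -> str:
    """Rewrite loosely formatted CVE IDs to the form CVE-YYYY-NNNN."""
    return _CVE_RE.sub(r"CVE-\1-\2", query)


def filter_stopwords(query: str) -> str:
    """Drop whitespace-separated tokens found in STOPWORDS."""
    kept = [tok for tok in query.split() if tok.lower() not in STOPWORDS]
    return " ".join(kept)


def preprocess_query(query: str) -> str:
    """Run the three steps in sequence (assumed order)."""
    query = normalize_query(query)
    query = standardize_cve_patterns(query)
    return filter_stopwords(query)


if __name__ == "__main__":
    print(preprocess_query("  Latest NEWS on cve 2024 1234 exploit "))
    # prints: on CVE-2024-1234 exploit
```

Running the CVE step after lowercasing means the canonical upper-case ID survives into the final query; if the embedding model is case-insensitive, the order matters less.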

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
