Feat/extraction new source bundestag dip client #52
Conversation
Implements comprehensive DIP (Dokumentations- und Informationssystem) API integration to fetch parliamentary documents alongside existing BundestagMine speeches.

Key additions:
- New DIPClient with support for protocols, drucksachen, and proceedings
- Markdown conversion for parliamentary documents
- Enhanced parser to handle both BundestagMine and DIP content
- Updated reader to support multiple data sources with separate limits
- Extended configuration with DIP-specific settings
- Comprehensive metadata extraction for all document types
- Documentation for the DIP API integration

The reader now supports fetching from both sources, with each getting the full export_limit independently.
Adds a --clear-collection command-line flag to allow clearing vector store collections before embedding new documents.

Key changes:
- New argparse-based CLI with --clear-collection flag
- Clear the collection before embedding if the flag is provided
- Skip validation when clearing to avoid false positives
- Improved logging and user guidance in error messages
- Lazy configuration initialization in get_data_layer to prevent NameError

This allows users to re-embed documents without manually clearing the collection, improving the embedding workflow.
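A minimal sketch of how such a flag can be wired up with argparse. The flag name matches the PR; `run`, `clear_fn`, and `embed_fn` are hypothetical stand-ins for the real embedding entry points.

```python
import argparse


def parse_args(argv=None):
    # Illustrative CLI for the embedding script; only the flag name
    # is taken from the PR, the parser structure is an assumption.
    parser = argparse.ArgumentParser(
        description="Embed documents into the vector store."
    )
    parser.add_argument(
        "--clear-collection",
        action="store_true",
        help="Clear the vector store collection before embedding.",
    )
    return parser.parse_args(argv)


def run(argv, clear_fn, embed_fn):
    # Clear first (when requested), then embed fresh documents.
    args = parse_args(argv)
    if args.clear_collection:
        clear_fn()
    embed_fn()
```

With this shape, `python src/embed.py --clear-collection` would wipe the collection before re-embedding, while plain `python src/embed.py` keeps existing vectors.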
Replaces COMPACT mode with SIMPLE_SUMMARIZE to eliminate the iterative refinement that makes N LLM calls for N documents.

Key changes:
- Override _get_response_synthesizer to use SIMPLE_SUMMARIZE mode
- Concatenate all retrieved documents into a single LLM call
- Eliminate the refinement loop that iteratively processes documents
- Maintain context and system prompt handling

This significantly reduces LLM API calls and improves response latency when processing multiple retrieved documents.
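The call-count difference can be illustrated with a toy synthesizer in plain Python (no llama_index; both functions and `fake_llm` are illustrative, not the actual implementation):

```python
def refine_answer(llm_call, question, docs):
    # REFINE/COMPACT-style synthesis: one LLM call per document,
    # each call refining the previous answer.
    answer = ""
    for doc in docs:
        answer = llm_call(f"Q: {question}\nContext: {doc}\nPrevious: {answer}")
    return answer


def simple_summarize(llm_call, question, docs):
    # SIMPLE_SUMMARIZE-style synthesis: concatenate all documents
    # and answer with a single LLM call.
    context = "\n\n".join(docs)
    return llm_call(f"Q: {question}\nContext: {context}")


calls = {"n": 0}

def fake_llm(prompt):
    calls["n"] += 1
    return "answer"

docs = ["doc1", "doc2", "doc3", "doc4"]

refine_answer(fake_llm, "q", docs)
n_refine = calls["n"]       # one call per document

calls["n"] = 0
simple_summarize(fake_llm, "q", docs)
n_simple = calls["n"]       # a single call regardless of document count
```

The trade-off: a single call requires the concatenated documents to fit within the model's context window, which is why the PR also pays attention to context window limits.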
Fixes a NameError when get_data_layer is called before app_startup by initializing the configuration lazily if it is not already set.

Key changes:
- Initialize global configuration variables early to avoid NameError
- Add lazy initialization in the get_data_layer function
- Improved docstring explaining the initialization-order issue

This prevents errors when Chainlit calls get_data_layer before the app_startup hook has run.
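A sketch of the lazy-initialization pattern described above; `DataLayer` and `load_default_configuration` are placeholders for the real classes and loader, not the actual API.

```python
class DataLayer:
    # Placeholder for the real data layer class.
    def __init__(self, config):
        self.config = config


def load_default_configuration():
    # Stand-in for the real configuration loader.
    return {"source": "default"}


# Initialized early so the name always exists; normally set by app_startup.
_configuration = None


def app_startup(config):
    global _configuration
    _configuration = config


def get_data_layer():
    # Chainlit may call this before the app_startup hook has run;
    # initializing lazily here avoids the NameError described above.
    global _configuration
    if _configuration is None:
        _configuration = load_default_configuration()
    return DataLayer(_configuration)
```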
Adds required dependencies for the new DIP API client integration and torch-based embedding models.

Key changes:
- Add the deutschland[dip_bundestag] package for DIP API access
- Add torch>=2.0.0 for embedding model support
- Update more-itertools to >=8.10.0 for compatibility
- Configure uv overrides for macOS Intel torch compatibility
- Add a numpy<2 constraint for torch 2.2.2 compatibility
- Update the lock file with all transitive dependencies

These dependencies support the new Bundestag DIP data source and multilingual embedding models.
Adds comprehensive end-to-end tests for the Bundestag data extraction pipeline covering BundestagMine, the DIP API, and combined sources.

Key additions:
- Base test class with common setup and teardown
- Full pipeline tests for BundestagMine speeches
- Full pipeline tests for DIP protocols, drucksachen, and proceedings
- Combined-sources test validating both data sources
- Test runner script with configuration management
- Shell script for easy test execution
- README with test documentation
- pytest.ini configuration for test discovery and options

The tests validate extraction, parsing, embedding, and retrieval for all Bundestag data sources.
Updates the VectorIndexAutoRetriever metadata schema to match the actual Bundestag document metadata fields, enabling proper filtering for temporal and person-specific queries.

Key changes:
- Replace generic creation_date/last_update_date with created_time/last_edited_time
- Add legislature_period and protocol_number for session queries
- Add speaker and speaker_party for person-specific queries
- Add document_type and source_client for content filtering
- Add detailed descriptions to help the LLM extract correct filters

This fixes date-based queries like "current Chancellor" or "last session" by allowing the Auto Retriever LLM to correctly identify and apply temporal filters on the metadata fields actually present in documents.
…filtering
Removes all metadata info from VectorIndexAutoRetriever configuration after
discovering that the LLM-based filter generation was causing a regression:
Problems with metadata filtering:
- LLM hallucinated dates not mentioned in queries (e.g., applying 2025-01-29
from question 1 to unrelated question 2)
- Generated overly specific filters that eliminated all results
- Inferred temporal relationships incorrectly ("recent" → 2022 date range)
- Reduced success rate from 9/11 (Oct 31) to 3-6/11 questions answered
Solution:
- Disable all metadata filtering
- Rely purely on semantic search with embeddings
- Improved success rate to 11/11 questions (100%, exceeding baseline)
The semantic search proves more effective than strict metadata filtering
for this use case, as it finds relevant documents based on content
similarity rather than attempting to match arbitrary metadata criteria.
Implements a new HybridFilterPostprocessor that applies intelligent multi-stage filtering to retrieved documents:
1. Score threshold - fast removal of low-similarity documents (default: 0.65)
2. Semantic deduplication - removes near-duplicate content (default: 0.90 similarity)
3. LLM relevance check (optional) - verifies semantic relevance to the query
4. Max documents limit - final cap on results (default: 8)

Key features:
- Balances quality and performance by applying cheap filters first
- Optional LLM-based relevance validation for high-accuracy mode
- Uses lenient prompting to avoid over-filtering (1500-char context)
- Provides detailed logging of filtering decisions
- Properly integrates with Pydantic models using Field/PrivateAttr

The postprocessor addresses the issue of semantic search returning too many results or near-duplicates while maintaining recall by using a lenient "when in doubt, keep it" approach for LLM filtering.
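The cheap stages of such a pipeline can be sketched in plain Python. This is illustrative only: the real postprocessor operates on llama_index nodes and Pydantic models, and the node shape here (dicts with `score` and `embedding`) is an assumption.

```python
import math


def cosine(a, b):
    # Cosine similarity of two non-zero embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def hybrid_filter(nodes, score_threshold=0.65, dedup_threshold=0.90, max_docs=8):
    # Stage 1: drop low-similarity hits (cheapest filter, applied first).
    kept = [n for n in nodes if n["score"] >= score_threshold]
    # Stage 2: semantic deduplication against already-kept embeddings.
    unique = []
    for n in kept:
        if all(cosine(n["embedding"], u["embedding"]) < dedup_threshold
               for u in unique):
            unique.append(n)
    # (The optional LLM relevance check would run here, on the survivors.)
    # Final stage: cap the result count.
    return unique[:max_docs]
```

Applying the cheap filters first means the expensive LLM check only ever sees a small, already-deduplicated candidate set.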
Enhances retrieval quality by rewriting queries before processing:
- Wrap VectorIndexAutoRetriever with QueryRewritingRetriever
- Add PromptHelper with explicit context window limits (16384) to prevent token overflow
- Return the BaseRetriever interface for flexibility

This improves retrieval on specific query patterns that benefit from query transformation.
Updates the default prompts in Langfuse to improve accuracy for parliamentary queries:
- Add a temporal-aware condense prompt that preserves time-related keywords (current, recent, latest, aktuell, jetzt, etc.) during query reformulation
- Add a comprehensive system prompt with:
  - Temporal context (the 21st Bundestag is current, the 20th is historical)
  - Strict grounding in retrieved documents (no hallucination from training data)
  - Special instructions for party composition queries using metadata
  - Guidelines for neutral, period-aware responses

This addresses issues where the assistant would:
- Drop temporal keywords during conversation, retrieving the wrong time periods
- Use outdated training data instead of current document information
- Mix information from different legislative periods
Enhances Bundestag document processing with party metadata and text filtering:

Document metadata:
- Add a parliamentary_composition field to track parties/fractions in documents
- Extract the composition from speaker party metadata and protocol text
- Use PartyExtractor for consistent party information extraction

Protocol text filtering:
- Remove non-informative Anlage sections (attendance lists, voting records)
- Filter out name-list sections (consecutive proper-noun lines without verbs)
- Detect and remove content after the "Anlagen zum Stenografischen Bericht" marker
- Add helper methods: _is_name_list_line() and _has_verbs()

This improves retrieval quality by:
- Providing structured party metadata for composition queries
- Reducing noise from procedural protocol content
- Focusing embeddings on substantive parliamentary discussions
Adds QueryRewriter and QueryRewritingRetriever to enhance semantic search by expanding queries with domain-specific terminology:

QueryRewriter:
- Pattern-based detection of party composition and temporal queries
- Expands party queries with parliamentary terms (Fraktionen, Bundestagsfraktionen)
- Expands temporal queries with period identifiers (21. Wahlperiode, 2025)
- Language-aware expansions (German/English detection)

QueryRewritingRetriever:
- Wrapper for BaseRetriever that applies query rewriting before retrieval
- Preserves the original query in logs for debugging
- Delegates to the underlying retriever after rewriting

This addresses issues where:
- Queries for "current parties" failed to retrieve procedural documents
- Temporal queries retrieved outdated information from the wrong periods
- Semantic search missed documents due to terminology mismatch
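A pattern-based rewriter along these lines might look like the following; the regexes are assumptions, and the expansion terms are taken from the examples in the commit message:

```python
import re

# Illustrative detection patterns; the real rewriter's vocabulary may differ.
PARTY_PATTERN = re.compile(r"\b(parties|party|partei(en)?)\b", re.IGNORECASE)
TEMPORAL_PATTERN = re.compile(r"\b(current|recent|latest|aktuell|jetzt)\b",
                              re.IGNORECASE)


def rewrite_query(query: str) -> str:
    # Append domain terminology so embeddings match procedural documents
    # that use parliamentary vocabulary rather than everyday phrasing.
    expanded = query
    if PARTY_PATTERN.search(query):
        expanded += " Fraktionen Bundestagsfraktionen"
    if TEMPORAL_PATTERN.search(query):
        expanded += " 21. Wahlperiode 2025"
    return expanded
```

A wrapping retriever would then call `rewrite_query` on the incoming query string, log both versions, and delegate to the underlying retriever with the expanded text.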
Implements PartyExtractor for extracting parliamentary composition metadata WITHOUT hardcoded party names (future-proof design):

Extraction features:
- Dynamic pattern matching using the "Name (PARTY)" format in protocol text
- Heuristic-based party detection (length, capitalization, patterns)
- Groups related variations (CDU/CSU/CDU → CDU/CSU, GRÜNE variations)
- Filters non-party keywords (roles, locations, organizations)
- Minimum mention threshold (2+) to reduce noise
- Confidence scoring based on fraction count and mentions

Design principles:
- NO hardcoded party names (works for future parties)
- Uses a Union-Find algorithm for grouping variations
- Extracts from both protocol text and speaker metadata
- Robust to changing parliamentary compositions

Tests:
- Comprehensive test suite for 21st/20th Bundestag scenarios
- Tests grouping, filtering, and confidence scoring
- Validates dynamic extraction without hardcoding
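The core "Name (PARTY)" extraction with a minimum-mention threshold can be sketched as follows. The Union-Find grouping and confidence scoring are omitted, and the regex is an assumption rather than the actual pattern:

```python
import re
from collections import Counter

# Matches a parenthesized label starting with an uppercase letter,
# e.g. "(CDU/CSU)" or "(AfD)" after a speaker name in protocol text.
PARTY_IN_PARENS = re.compile(r"\(([A-ZÄÖÜ][A-ZÄÖÜa-zäöü/ .]{1,20})\)")
MIN_MENTIONS = 2  # matches the threshold described in the commit


def extract_parties(text: str) -> dict:
    # Count every parenthesized label and keep only those mentioned at
    # least MIN_MENTIONS times; no party names are hardcoded, so new
    # parties in future legislative periods are picked up automatically.
    counts = Counter(m.group(1).strip() for m in PARTY_IN_PARENS.finditer(text))
    return {party: n for party, n in counts.items() if n >= MIN_MENTIONS}
```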
Adds a test suite for HybridFilterPostprocessor covering:

Basic filtering:
- Score threshold filtering (removes low-similarity documents)
- Max documents limit enforcement
- Semantic deduplication (removes near-duplicate embeddings)
- Empty-node and missing-embedding handling

LLM filtering:
- LLM relevance filtering when enabled
- Graceful error handling for LLM failures
- Proper delegation to the base retriever
- Mock LLM integration tests

Test fixtures:
- Configurable filter configurations (strict/lenient)
- Sample nodes with embeddings and metadata
- Duplicate nodes for deduplication testing
- Mock LLM with YES/NO responses

Validates that all 5 filtering stages work correctly.
Enhances HybridFilterPostprocessor with Stage 2 temporal filtering:

Temporal filtering features:
- Detects temporal keywords in the query (current, recent, latest, aktuell, jetzt, etc.)
- Filters to only the current legislative period (21) when temporal keywords are present
- Falls back to document_number if legislature_period metadata is missing
- Logs detailed filtering decisions for debugging
- Gracefully handles the edge case where filtering would remove everything (returns all)

The filter pipeline now has 5 stages:
1. Score threshold filtering
2. Temporal filtering (NEW)
3. Semantic deduplication
4. LLM relevance check (optional)
5. Max documents limit

This prevents mixing historical data (e.g., FDP in the 20th Bundestag) with current information when users ask about the "current" parliament composition.
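The temporal stage's keyword gate and empty-result fallback can be sketched like this (the keyword set comes from the commit; the node shape and function name are illustrative):

```python
TEMPORAL_KEYWORDS = {"current", "recent", "latest", "aktuell", "jetzt"}
CURRENT_PERIOD = 21  # the current legislative period per the commit


def temporal_filter(query: str, nodes: list) -> list:
    # Only filter when the query actually asks about the present.
    words = set(query.lower().split())
    if not words & TEMPORAL_KEYWORDS:
        return nodes
    filtered = [n for n in nodes
                if n.get("legislature_period") == CURRENT_PERIOD]
    # Edge case: never let this stage empty the result set entirely.
    return filtered or nodes
```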
Force-pushed from 9c40d81 to 031cc08
Test fixes:
- Add an api_key field to LiteLLMConfiguration in the hybrid filter tests
- Update party extractor tests to have 2+ mentions per party (MIN_MENTIONS threshold)
- Add sys.path.append to test_party_extractor.py for imports
- Increase mention counts in test assertions to match the updated fixtures

This ensures all tests pass with the MIN_MENTIONS=2 filtering logic.
Are we able to fix the Pytests?
For the time being, it is okay, but eventually we should separate the DIP and Bundestag data sources rather than blend them into a single one. What do you think?
```python
context_window=16384,
num_output=4096,
)
Settings.prompt_helper = prompt_helper
```
Can't we set this up at the LLMConfiguration level, e.g., in LiteLLMFactory._create_instance, moving context_window and num_output to LLMConfiguration?
```python
# Custom condense prompt that preserves temporal keywords for query rewriting
TEMPORAL_AWARE_CONDENSE_PROMPT = """Given the following conversation between a user and an AI assistant and a follow up question from user,
```
So, the default prompts are meant to be generic, for any use case. The codebase should apply to other scenarios as well. Such a prompt is specifically tailored for the Bundestag use case and, because of that, shouldn't reside in the main branch.
I would suggest bringing back the old default prompts, and then either:
- prepare the next PR to feld-m-main-ragbt after this PR, where we apply these changes, or
- don't change the code, and instead change the default prompts in Langfuse after deployment.
I think it's fine if we just add it in Langfuse, also because it's easier to change on the fly in case we spot any issue last minute
Yes, I wanted to do that but didn't come up with a good way that wouldn't require more effort to split them. In general, I think we should keep the main generic pieces together and then plug in datasource-specific code, so the two aren't really aware of each other.
Moves context_window and num_output from the global Settings.prompt_helper to LLMConfiguration so that each LLM can have its own values.
- Add optional context_window and num_output fields to LLMConfiguration
- Create a ConfigurableLiteLLM subclass that overrides the metadata property
- Remove the global Settings.prompt_helper from the augment.py app startup
- Existing config files with context_window now actually use those values
- Backward compatible: optional fields default to the model metadata values
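The config-level override might be modeled roughly like this. It is a sketch: the real ConfigurableLiteLLM overrides LiteLLM's metadata property, whereas here a free function and the default values stand in for the model metadata fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMConfiguration:
    model: str
    # Optional overrides; None means "use the model's own metadata",
    # which keeps existing config files backward compatible.
    context_window: Optional[int] = None
    num_output: Optional[int] = None


def effective_metadata(config: LLMConfiguration,
                       default_window: int = 8192,
                       default_output: int = 256) -> dict:
    # Config values win when provided; otherwise fall back to the
    # (here hypothetical) model metadata defaults.
    return {
        "context_window": config.context_window or default_window,
        "num_output": config.num_output or default_output,
    }
```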
Force-pushed from 8d5f289 to a570b6e
AugmentationPackageLoader already handles postprocessor registration when loading the src.augmentation.components.postprocessors package. The explicit import and register() calls in __init__.py are redundant.
Replaces Bundestag-specific default prompts with generic RAG assistant prompts to keep the main branch reusable for any domain.

Changes:
- Replace the temporal-aware condense prompt with a generic version
- Replace the Bundestag-specific system prompt with a generic RAG prompt
- Create a prompts/bundestag/ directory for domain-specific prompts
- Add a prompt loader utility in tests/utils/ for loading domain prompts
- Add documentation and usage examples

Domain-specific prompts can now be:
1. Loaded from the prompts/ directory in tests using load_bundestag_prompt()
2. Configured in Langfuse for specific deployments
3. Version-controlled while keeping the framework generic
Add DIP API Integration for Bundestag Data Source
Summary
This PR adds comprehensive DIP (Dokumentations- und Informationssystem) API integration to the Bundestag data source, enabling extraction of parliamentary documents alongside existing BundestagMine speeches. It also includes performance optimizations, developer experience improvements, and comprehensive end-to-end testing.
Key Features
🎯 DIP API Integration
⚡ Performance Improvements
🛠️ Developer Experience
- `--clear-collection` CLI flag for the embedding script to easily re-embed documents

🧪 Testing
Changes by Category
New Files (20)
- `src/extraction/datasources/bundestag/client_dip.py` - DIP API client implementation
- `docs/datasources/bundestag_dip_api.md` - DIP API documentation
- `docs/datasources/bundesapi_deutschland_package.md` - Package documentation
- `docs/datasources/bundestag_dip_api_examples.md` - API usage examples
- `docs/datasources/bundesapi_deutschland_examples.py` - Code examples
- `tests/e2e/` - Complete e2e test suite (9 files)
- `pytest.ini` - pytest configuration

Modified Files (9)
- `src/extraction/datasources/bundestag/parser.py` - Enhanced parser for both sources
- `src/extraction/datasources/bundestag/reader.py` - Multi-source reader support
- `src/extraction/datasources/bundestag/configuration.py` - Extended configuration
- `src/extraction/datasources/bundestag/document.py` - Additional metadata fields
- `src/embed.py` - Added clear-collection flag and improved error handling
- `src/augment.py` - Fixed lazy configuration initialization
- `src/augmentation/components/chat_engines/langfuse/chat_engine.py` - Performance optimization
- `pyproject.toml` - Added deutschland[dip_bundestag], torch dependencies
- `uv.lock` - Updated dependency lock file

Test Plan
- `./tests/e2e/run_e2e_tests.sh`
- `pytest tests/e2e/test_bundestag_mine_full_pipeline.py -v`
- `pytest tests/e2e/test_bundestag_dip_full_pipeline.py -v`
- `pytest tests/e2e/test_bundestag_combined_sources.py -v`
- `python src/embed.py --clear-collection`

Migration Notes
Configuration Changes
The Bundestag datasource configuration now supports additional fields:
```json
{
  "name": "bundestag",
  "include_bundestag_mine": true,
  "include_dip": true,
  "dip_api_key": "optional-api-key",
  "dip_wahlperiode": 21,
  "dip_sources": ["protocols", "drucksachen", "proceedings"]
}
```
New dependencies added:
- `deutschland[dip_bundestag]>=0.4.2` - DIP API client
- `torch>=2.0.0` - For embedding models
- `more-itertools>=8.10.0` - For compatibility

Run `uv sync` to install the new dependencies.
None. All changes are backward compatible with existing BundestagMine configurations.
Commits
feat: add DIP API client for Bundestag data extraction (e842b23)
feat: add clear-collection flag to embedding script (038caa1)
perf: optimize chat engine response synthesis (d174486)
fix: add lazy configuration initialization in augment.py (50bcd8f)
chore: add dependencies for DIP API and embedding improvements (778adce)
test: add e2e tests for Bundestag data sources (599c37b)
Related Documentation