Feat/extraction new source bundestag dip client #52
Conversation
Implements comprehensive DIP (Dokumentations- und Informationssystem) API integration to fetch parliamentary documents alongside existing BundestagMine speeches.

Key additions:
- New DIPClient with support for protocols, drucksachen, and proceedings
- Markdown conversion for parliamentary documents
- Enhanced parser to handle both BundestagMine and DIP content
- Updated reader to support multiple data sources with separate limits
- Extended configuration with DIP-specific settings
- Comprehensive metadata extraction for all document types
- Documentation for the DIP API integration

The reader now supports fetching from both sources, with each getting the full export_limit independently.
Adds a --clear-collection command-line flag to allow clearing vector store collections before embedding new documents.

Key changes:
- New argparse-based CLI with --clear-collection flag
- Clear the collection before embedding if the flag is provided
- Skip validation when clearing to avoid false positives
- Improved logging and user guidance in error messages
- Lazy configuration initialization in get_data_layer to prevent NameError

This allows users to re-embed documents without manually clearing the collection, improving the embedding workflow.
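A minimal sketch of how such a flag can be wired up with argparse. The flag name matches the PR; `run`, `clear_fn`, and `embed_fn` are hypothetical stand-ins for the real embedding entry points.

```python
import argparse


def parse_args(argv=None):
    # Illustrative CLI for the embedding script; only the flag name
    # is taken from the PR, the parser structure is an assumption.
    parser = argparse.ArgumentParser(
        description="Embed documents into the vector store."
    )
    parser.add_argument(
        "--clear-collection",
        action="store_true",
        help="Clear the vector store collection before embedding.",
    )
    return parser.parse_args(argv)


def run(argv, clear_fn, embed_fn):
    # Clear first (when requested), then embed fresh documents.
    args = parse_args(argv)
    if args.clear_collection:
        clear_fn()
    embed_fn()
```

With this shape, `python src/embed.py --clear-collection` would wipe the collection before re-embedding, while plain `python src/embed.py` keeps existing vectors.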
Replaces COMPACT mode with SIMPLE_SUMMARIZE to eliminate the iterative refinement that makes N LLM calls for N documents.

Key changes:
- Override _get_response_synthesizer to use SIMPLE_SUMMARIZE mode
- Concatenate all retrieved documents into a single LLM call
- Eliminate the refinement loop that iteratively processes documents
- Maintain context and system prompt handling

This significantly reduces LLM API calls and improves response latency when processing multiple retrieved documents.
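The call-count difference can be illustrated with a toy synthesizer in plain Python (no llama_index; both functions and `fake_llm` are illustrative, not the actual implementation):

```python
def refine_answer(llm_call, question, docs):
    # REFINE/COMPACT-style synthesis: one LLM call per document,
    # each call refining the previous answer.
    answer = ""
    for doc in docs:
        answer = llm_call(f"Q: {question}\nContext: {doc}\nPrevious: {answer}")
    return answer


def simple_summarize(llm_call, question, docs):
    # SIMPLE_SUMMARIZE-style synthesis: concatenate all documents
    # and answer with a single LLM call.
    context = "\n\n".join(docs)
    return llm_call(f"Q: {question}\nContext: {context}")


calls = {"n": 0}

def fake_llm(prompt):
    calls["n"] += 1
    return "answer"

docs = ["doc1", "doc2", "doc3", "doc4"]

refine_answer(fake_llm, "q", docs)
n_refine = calls["n"]       # one call per document

calls["n"] = 0
simple_summarize(fake_llm, "q", docs)
n_simple = calls["n"]       # a single call regardless of document count
```

The trade-off: a single call requires the concatenated documents to fit within the model's context window, which is why the PR also pays attention to context window limits.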
Fixes a NameError when get_data_layer is called before app_startup by initializing the configuration lazily if it is not already set.

Key changes:
- Initialize global configuration variables early to avoid NameError
- Add lazy initialization in the get_data_layer function
- Improved docstring explaining the initialization-order issue

This prevents errors when Chainlit calls get_data_layer before the app_startup hook has run.
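A sketch of the lazy-initialization pattern described above; `DataLayer` and `load_default_configuration` are placeholders for the real classes and loader, not the actual API.

```python
class DataLayer:
    # Placeholder for the real data layer class.
    def __init__(self, config):
        self.config = config


def load_default_configuration():
    # Stand-in for the real configuration loader.
    return {"source": "default"}


# Initialized early so the name always exists; normally set by app_startup.
_configuration = None


def app_startup(config):
    global _configuration
    _configuration = config


def get_data_layer():
    # Chainlit may call this before the app_startup hook has run;
    # initializing lazily here avoids the NameError described above.
    global _configuration
    if _configuration is None:
        _configuration = load_default_configuration()
    return DataLayer(_configuration)
```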
Adds required dependencies for the new DIP API client integration and torch-based embedding models.

Key changes:
- Add the deutschland[dip_bundestag] package for DIP API access
- Add torch>=2.0.0 for embedding model support
- Update more-itertools to >=8.10.0 for compatibility
- Configure uv overrides for macOS Intel torch compatibility
- Add a numpy<2 constraint for torch 2.2.2 compatibility
- Update the lock file with all transitive dependencies

These dependencies support the new Bundestag DIP data source and multilingual embedding models.
Adds comprehensive end-to-end tests for the Bundestag data extraction pipeline covering BundestagMine, the DIP API, and combined sources.

Key additions:
- Base test class with common setup and teardown
- Full pipeline tests for BundestagMine speeches
- Full pipeline tests for DIP protocols, drucksachen, and proceedings
- Combined-sources test validating both data sources
- Test runner script with configuration management
- Shell script for easy test execution
- README with test documentation
- pytest.ini configuration for test discovery and options

The tests validate extraction, parsing, embedding, and retrieval for all Bundestag data sources.
Updates the VectorIndexAutoRetriever metadata schema to match the actual Bundestag document metadata fields, enabling proper filtering for temporal and person-specific queries.

Key changes:
- Replace generic creation_date/last_update_date with created_time/last_edited_time
- Add legislature_period and protocol_number for session queries
- Add speaker and speaker_party for person-specific queries
- Add document_type and source_client for content filtering
- Add detailed descriptions to help the LLM extract correct filters

This fixes date-based queries like "current Chancellor" or "last session" by allowing the Auto Retriever LLM to correctly identify and apply temporal filters on the metadata fields actually present in documents.
…filtering
Removes all metadata info from VectorIndexAutoRetriever configuration after
discovering that the LLM-based filter generation was causing a regression:
Problems with metadata filtering:
- LLM hallucinated dates not mentioned in queries (e.g., applying 2025-01-29
from question 1 to unrelated question 2)
- Generated overly specific filters that eliminated all results
- Inferred temporal relationships incorrectly ("recent" → 2022 date range)
- Reduced success rate from 9/11 (Oct 31) to 3-6/11 questions answered
Solution:
- Disable all metadata filtering
- Rely purely on semantic search with embeddings
- Improved success rate to 11/11 questions (100%, exceeding baseline)
The semantic search proves more effective than strict metadata filtering
for this use case, as it finds relevant documents based on content
similarity rather than attempting to match arbitrary metadata criteria.
Implements a new HybridFilterPostprocessor that applies intelligent multi-stage filtering to retrieved documents:
1. Score threshold - fast removal of low-similarity documents (default: 0.65)
2. Semantic deduplication - removes near-duplicate content (default: 0.90 similarity)
3. LLM relevance check (optional) - verifies semantic relevance to the query
4. Max documents limit - final cap on results (default: 8)

Key features:
- Balances quality and performance by applying cheap filters first
- Optional LLM-based relevance validation for high-accuracy mode
- Uses lenient prompting to avoid over-filtering (1500-char context)
- Provides detailed logging of filtering decisions
- Properly integrates with Pydantic models using Field/PrivateAttr

The postprocessor addresses the issue of semantic search returning too many results or near-duplicates while maintaining recall by using a lenient "when in doubt, keep it" approach for LLM filtering.
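The cheap stages of such a pipeline can be sketched in plain Python. This is illustrative only: the real postprocessor operates on llama_index nodes and Pydantic models, and the node shape here (dicts with `score` and `embedding`) is an assumption.

```python
import math


def cosine(a, b):
    # Cosine similarity of two non-zero embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def hybrid_filter(nodes, score_threshold=0.65, dedup_threshold=0.90, max_docs=8):
    # Stage 1: drop low-similarity hits (cheapest filter, applied first).
    kept = [n for n in nodes if n["score"] >= score_threshold]
    # Stage 2: semantic deduplication against already-kept embeddings.
    unique = []
    for n in kept:
        if all(cosine(n["embedding"], u["embedding"]) < dedup_threshold
               for u in unique):
            unique.append(n)
    # (The optional LLM relevance check would run here, on the survivors.)
    # Final stage: cap the result count.
    return unique[:max_docs]
```

Applying the cheap filters first means the expensive LLM check only ever sees a small, already-deduplicated candidate set.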
Enhances retrieval quality by rewriting queries before processing:
- Wrap VectorIndexAutoRetriever with QueryRewritingRetriever
- Add PromptHelper with explicit context window limits (16384) to prevent token overflow
- Return the BaseRetriever interface for flexibility

This improves retrieval on specific query patterns that benefit from query transformation.
Updates the default prompts in Langfuse to improve accuracy for parliamentary queries:
- Add a temporal-aware condense prompt that preserves time-related keywords (current, recent, latest, aktuell, jetzt, etc.) during query reformulation
- Add a comprehensive system prompt with:
  - Temporal context (the 21st Bundestag is current, the 20th is historical)
  - Strict grounding in retrieved documents (no hallucination from training data)
  - Special instructions for party composition queries using metadata
  - Guidelines for neutral, period-aware responses

This addresses issues where the assistant would:
- Drop temporal keywords during conversation, retrieving the wrong time periods
- Use outdated training data instead of current document information
- Mix information from different legislative periods
Enhances Bundestag document processing with party metadata and text filtering:

Document metadata:
- Add a parliamentary_composition field to track parties/fractions in documents
- Extract the composition from speaker party metadata and protocol text
- Use PartyExtractor for consistent party information extraction

Protocol text filtering:
- Remove non-informative Anlage sections (attendance lists, voting records)
- Filter out name-list sections (consecutive proper-noun lines without verbs)
- Detect and remove content after the "Anlagen zum Stenografischen Bericht" marker
- Add helper methods: _is_name_list_line() and _has_verbs()

This improves retrieval quality by:
- Providing structured party metadata for composition queries
- Reducing noise from procedural protocol content
- Focusing embeddings on substantive parliamentary discussions
Adds QueryRewriter and QueryRewritingRetriever to enhance semantic search by expanding queries with domain-specific terminology:

QueryRewriter:
- Pattern-based detection of party composition and temporal queries
- Expands party queries with parliamentary terms (Fraktionen, Bundestagsfraktionen)
- Expands temporal queries with period identifiers (21. Wahlperiode, 2025)
- Language-aware expansions (German/English detection)

QueryRewritingRetriever:
- Wrapper for BaseRetriever that applies query rewriting before retrieval
- Preserves the original query in logs for debugging
- Delegates to the underlying retriever after rewriting

This addresses issues where:
- Queries for "current parties" failed to retrieve procedural documents
- Temporal queries retrieved outdated information from the wrong periods
- Semantic search missed documents due to terminology mismatch
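A pattern-based rewriter along these lines might look like the following; the regexes are assumptions, and the expansion terms are taken from the examples in the commit message:

```python
import re

# Illustrative detection patterns; the real rewriter's vocabulary may differ.
PARTY_PATTERN = re.compile(r"\b(parties|party|partei(en)?)\b", re.IGNORECASE)
TEMPORAL_PATTERN = re.compile(r"\b(current|recent|latest|aktuell|jetzt)\b",
                              re.IGNORECASE)


def rewrite_query(query: str) -> str:
    # Append domain terminology so embeddings match procedural documents
    # that use parliamentary vocabulary rather than everyday phrasing.
    expanded = query
    if PARTY_PATTERN.search(query):
        expanded += " Fraktionen Bundestagsfraktionen"
    if TEMPORAL_PATTERN.search(query):
        expanded += " 21. Wahlperiode 2025"
    return expanded
```

A wrapping retriever would then call `rewrite_query` on the incoming query string, log both versions, and delegate to the underlying retriever with the expanded text.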
Implements PartyExtractor for extracting parliamentary composition metadata WITHOUT hardcoded party names (future-proof design):

Extraction features:
- Dynamic pattern matching using the "Name (PARTY)" format in protocol text
- Heuristic-based party detection (length, capitalization, patterns)
- Groups related variations (CDU/CSU/CDU → CDU/CSU, GRÜNE variations)
- Filters non-party keywords (roles, locations, organizations)
- Minimum mention threshold (2+) to reduce noise
- Confidence scoring based on fraction count and mentions

Design principles:
- NO hardcoded party names (works for future parties)
- Uses a Union-Find algorithm for grouping variations
- Extracts from both protocol text and speaker metadata
- Robust to changing parliamentary compositions

Tests:
- Comprehensive test suite for 21st/20th Bundestag scenarios
- Tests grouping, filtering, and confidence scoring
- Validates dynamic extraction without hardcoding
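The core "Name (PARTY)" extraction with a minimum-mention threshold can be sketched as follows. The Union-Find grouping and confidence scoring are omitted, and the regex is an assumption rather than the actual pattern:

```python
import re
from collections import Counter

# Matches a parenthesized label starting with an uppercase letter,
# e.g. "(CDU/CSU)" or "(AfD)" after a speaker name in protocol text.
PARTY_IN_PARENS = re.compile(r"\(([A-ZÄÖÜ][A-ZÄÖÜa-zäöü/ .]{1,20})\)")
MIN_MENTIONS = 2  # matches the threshold described in the commit


def extract_parties(text: str) -> dict:
    # Count every parenthesized label and keep only those mentioned at
    # least MIN_MENTIONS times; no party names are hardcoded, so new
    # parties in future legislative periods are picked up automatically.
    counts = Counter(m.group(1).strip() for m in PARTY_IN_PARENS.finditer(text))
    return {party: n for party, n in counts.items() if n >= MIN_MENTIONS}
```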
Adds a test suite for HybridFilterPostprocessor covering:

Basic filtering:
- Score threshold filtering (removes low-similarity documents)
- Max documents limit enforcement
- Semantic deduplication (removes near-duplicate embeddings)
- Empty-node and missing-embedding handling

LLM filtering:
- LLM relevance filtering when enabled
- Graceful error handling for LLM failures
- Proper delegation to the base retriever
- Mock LLM integration tests

Test fixtures:
- Configurable filter configurations (strict/lenient)
- Sample nodes with embeddings and metadata
- Duplicate nodes for deduplication testing
- Mock LLM with YES/NO responses

Validates that all 5 filtering stages work correctly.
Enhances HybridFilterPostprocessor with Stage 2 temporal filtering:

Temporal filtering features:
- Detects temporal keywords in the query (current, recent, latest, aktuell, jetzt, etc.)
- Filters to only the current legislative period (21) when temporal keywords are present
- Falls back to document_number if legislature_period metadata is missing
- Logs detailed filtering decisions for debugging
- Gracefully handles the edge case where filtering would remove everything (returns all)

The filter pipeline now has 5 stages:
1. Score threshold filtering
2. Temporal filtering (NEW)
3. Semantic deduplication
4. LLM relevance check (optional)
5. Max documents limit

This prevents mixing historical data (e.g., FDP in the 20th Bundestag) with current information when users ask about the "current" parliament composition.
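The temporal stage's keyword gate and empty-result fallback can be sketched like this (the keyword set comes from the commit; the node shape and function name are illustrative):

```python
TEMPORAL_KEYWORDS = {"current", "recent", "latest", "aktuell", "jetzt"}
CURRENT_PERIOD = 21  # the current legislative period per the commit


def temporal_filter(query: str, nodes: list) -> list:
    # Only filter when the query actually asks about the present.
    words = set(query.lower().split())
    if not words & TEMPORAL_KEYWORDS:
        return nodes
    filtered = [n for n in nodes
                if n.get("legislature_period") == CURRENT_PERIOD]
    # Edge case: never let this stage empty the result set entirely.
    return filtered or nodes
```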
Force-pushed from 9c40d81 to 031cc08
Test fixes:
- Add an api_key field to LiteLLMConfiguration in the hybrid filter tests
- Update party extractor tests to have 2+ mentions per party (MIN_MENTIONS threshold)
- Add sys.path.append to test_party_extractor.py for imports
- Increase mention counts in test assertions to match the updated fixtures

This ensures all tests pass with the MIN_MENTIONS=2 filtering logic.
Are we able to fix the Pytests?
For the time being, it is okay, but eventually we should separate the DIP and Bundestag data sources rather than blend them into a single one. What do you think?
```python
context_window=16384,
num_output=4096,
)
Settings.prompt_helper = prompt_helper
```
Can't we set this up at the LLMConfiguration level, e.g., in LiteLLMFactory._create_instance, moving context_window and num_output to LLMConfiguration?
```python
# Custom condense prompt that preserves temporal keywords for query rewriting
TEMPORAL_AWARE_CONDENSE_PROMPT = """Given the following conversation between a user and an AI assistant and a follow up question from user,
```
So, the default prompts are meant to be generic, for any use case. The codebase should apply to other scenarios as well. Such a prompt is specifically tailored for the Bundestag use case and, because of that, shouldn't reside in the main branch.
I would suggest bringing back the old default prompts, and then either:
- prepare the next PR to feld-m-main-ragbt after this PR, where we apply these changes, or
- don't change the code, and instead change the default prompts in Langfuse after deployment.
I think it's fine if we just add it in Langfuse, also because it's easier to change on the fly in case we spot any issue last minute
Yes, I wanted to do that but didn't come up with a good way that wouldn't require more effort to split them. In general, I think we should keep the main generic pieces together and then plug in datasource-specific code, so the two aren't really aware of each other.
Moves context_window and num_output from the global Settings.prompt_helper to LLMConfiguration so that each LLM can have its own values.
- Add optional context_window and num_output fields to LLMConfiguration
- Create a ConfigurableLiteLLM subclass that overrides the metadata property
- Remove the global Settings.prompt_helper from the augment.py app startup
- Existing config files with context_window now actually use those values
- Backward compatible: optional fields default to the model metadata values
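The config-level override might be modeled roughly like this. It is a sketch: the real ConfigurableLiteLLM overrides LiteLLM's metadata property, whereas here a free function and the default values stand in for the model metadata fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMConfiguration:
    model: str
    # Optional overrides; None means "use the model's own metadata",
    # which keeps existing config files backward compatible.
    context_window: Optional[int] = None
    num_output: Optional[int] = None


def effective_metadata(config: LLMConfiguration,
                       default_window: int = 8192,
                       default_output: int = 256) -> dict:
    # Config values win when provided; otherwise fall back to the
    # (here hypothetical) model metadata defaults.
    return {
        "context_window": config.context_window or default_window,
        "num_output": config.num_output or default_output,
    }
```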
Force-pushed from 8d5f289 to a570b6e
AugmentationPackageLoader already handles postprocessor registration when loading the src.augmentation.components.postprocessors package. The explicit import and register() calls in __init__.py are redundant.
Replaces Bundestag-specific default prompts with generic RAG assistant prompts to keep the main branch reusable for any domain.

Changes:
- Replace the temporal-aware condense prompt with a generic version
- Replace the Bundestag-specific system prompt with a generic RAG prompt
- Create a prompts/bundestag/ directory for domain-specific prompts
- Add a prompt loader utility in tests/utils/ for loading domain prompts
- Add documentation and usage examples

Domain-specific prompts can now be:
1. Loaded from the prompts/ directory in tests using load_bundestag_prompt()
2. Configured in Langfuse for specific deployments
3. Version-controlled while keeping the framework generic
Add DIP API Integration for Bundestag Data Source
Summary
This PR adds comprehensive DIP (Dokumentations- und Informationssystem) API integration to the Bundestag data source, enabling extraction of parliamentary documents alongside existing BundestagMine speeches. It also includes performance optimizations, developer experience improvements, and comprehensive end-to-end testing.
Key Features
🎯 DIP API Integration
⚡ Performance Improvements
🛠️ Developer Experience
- `--clear-collection` CLI flag for the embedding script to easily re-embed documents

🧪 Testing
Changes by Category
New Files (20)
- `src/extraction/datasources/bundestag/client_dip.py` - DIP API client implementation
- `docs/datasources/bundestag_dip_api.md` - DIP API documentation
- `docs/datasources/bundesapi_deutschland_package.md` - Package documentation
- `docs/datasources/bundestag_dip_api_examples.md` - API usage examples
- `docs/datasources/bundesapi_deutschland_examples.py` - Code examples
- `tests/e2e/` - Complete e2e test suite (9 files)
- `pytest.ini` - pytest configuration

Modified Files (9)
- `src/extraction/datasources/bundestag/parser.py` - Enhanced parser for both sources
- `src/extraction/datasources/bundestag/reader.py` - Multi-source reader support
- `src/extraction/datasources/bundestag/configuration.py` - Extended configuration
- `src/extraction/datasources/bundestag/document.py` - Additional metadata fields
- `src/embed.py` - Added clear-collection flag and improved error handling
- `src/augment.py` - Fixed lazy configuration initialization
- `src/augmentation/components/chat_engines/langfuse/chat_engine.py` - Performance optimization
- `pyproject.toml` - Added deutschland[dip_bundestag], torch dependencies
- `uv.lock` - Updated dependency lock file

Test Plan
- `./tests/e2e/run_e2e_tests.sh`
- `pytest tests/e2e/test_bundestag_mine_full_pipeline.py -v`
- `pytest tests/e2e/test_bundestag_dip_full_pipeline.py -v`
- `pytest tests/e2e/test_bundestag_combined_sources.py -v`
- `python src/embed.py --clear-collection`

Migration Notes
Configuration Changes
The Bundestag datasource configuration now supports additional fields:
```json
{
  "name": "bundestag",
  "include_bundestag_mine": true,
  "include_dip": true,
  "dip_api_key": "optional-api-key",
  "dip_wahlperiode": 21,
  "dip_sources": ["protocols", "drucksachen", "proceedings"]
}
```
New dependencies added:
- `deutschland[dip_bundestag]>=0.4.2` - DIP API client
- `torch>=2.0.0` - For embedding models
- `more-itertools>=8.10.0` - For compatibility

Run `uv sync` to install the new dependencies.
None. All changes are backward compatible with existing BundestagMine configurations.
Commits
feat: add DIP API client for Bundestag data extraction (e842b23)
feat: add clear-collection flag to embedding script (038caa1)
perf: optimize chat engine response synthesis (d174486)
fix: add lazy configuration initialization in augment.py (50bcd8f)
chore: add dependencies for DIP API and embedding improvements (778adce)
test: add e2e tests for Bundestag data sources (599c37b)
Related Documentation