Citation Finder System

This document describes the BMLibrarian citation finder system for extracting verifiable citations from scored documents to support evidence-based reporting.

Overview

The Citation Finder Agent processes documents that exceed a specified relevance score threshold to extract specific passages that answer user questions. It builds a queue of verifiable citations that can be used by reporting agents to synthesize evidence-based responses with proper references.

Architecture

Core Components

CitationFinderAgent: Main agent class for citation extraction
Citation: Data structure representing extracted citations
Queue Integration: Memory-efficient processing via SQLite queues
Document Verification: Ensures citation integrity and prevents hallucination

Citation Data Structure

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Citation:
    passage: str              # Exact text from document
    summary: str              # Brief explanation of relevance
    relevance_score: float    # Relevance score between 0 and 1
    document_id: str          # Verified database document ID
    document_title: str       # Document title
    authors: List[str]        # Author list
    publication_date: str     # Publication date
    pmid: Optional[str]       # PubMed ID if available
    created_at: datetime      # Citation creation timestamp

Processing Workflow

Scored Documents → Threshold Filter → Citation Extraction → Verification → Citation Queue

Input: Documents with relevance scores from DocumentScoringAgent
Filtering: Only process documents above score threshold (e.g., >2.0)
Extraction: Use LLM to identify relevant passages and create summaries
Verification: Ensure document IDs match database records
Output: Queue of verified citations for reporting
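
The workflow above can be sketched as a single function. This is a minimal illustration, not the BMLibrarian implementation: the function name run_citation_pipeline, the "score" key in the scoring result, the "id" key in the document, and the known_ids set (standing in for a database lookup) are all assumptions, and the extract callable stands in for the LLM call.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple

# (document, scoring result) pairs; the dictionary keys used below are assumptions.
ScoredDoc = Tuple[Dict[str, Any], Dict[str, Any]]


def run_citation_pipeline(
    scored_documents: Iterable[ScoredDoc],
    extract: Callable[[Dict[str, Any]], Optional[Dict[str, Any]]],
    known_ids: Set[str],
    score_threshold: float = 2.0,
    min_relevance: float = 0.7,
) -> List[Dict[str, Any]]:
    """Filter by document score, extract, verify the ID, and collect citations."""
    queue: List[Dict[str, Any]] = []
    for document, scoring in scored_documents:
        # Threshold filter: skip documents at or below the score threshold.
        if scoring.get("score", 0) <= score_threshold:
            continue
        # Extraction: the callable is a hypothetical stand-in for the LLM call.
        candidate = extract(document)
        if candidate is None or candidate["relevance_score"] < min_relevance:
            continue
        # Verification: the ID is taken from the document record, never from the LLM.
        doc_id = str(document["id"])
        if doc_id not in known_ids:
            continue
        queue.append({**candidate, "document_id": doc_id})
    return queue
```

Passing the extractor as a callable keeps the sketch testable without a running model.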

Key Features

1. Threshold-Based Processing

Only documents exceeding a configurable score threshold are processed:

citations = citation_agent.process_scored_documents_for_citations(
    user_question="What is COVID-19 vaccine effectiveness?",
    scored_documents=scored_docs,
    score_threshold=2.0,      # Only process docs scoring > 2.0
    min_relevance=0.7         # Only accept citations with relevance ≥ 0.7
)

2. LLM-Based Passage Extraction

Uses structured prompts to extract relevant passages:

prompt = f"""Extract the most relevant passage from this abstract that answers: "{question}"

Response format (JSON):
{{
    "relevant_passage": "exact text from abstract",
    "summary": "brief explanation of relevance", 
    "relevance_score": 0.8,
    "has_relevant_content": true
}}"""
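
A response in this format still has to be parsed and checked before it becomes a citation. The following is a minimal sketch of that step, assuming the field names from the prompt above; the function name parse_extraction_response and the exact rejection rules are illustrative, not taken from the BMLibrarian source.

```python
import json
from typing import Any, Dict, Optional


def parse_extraction_response(raw: str, min_relevance: float = 0.7) -> Optional[Dict[str, Any]]:
    """Return the parsed extraction, or None if the response is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON from the model
    if not isinstance(data, dict) or not data.get("has_relevant_content"):
        return None
    passage = data.get("relevant_passage")
    score = data.get("relevance_score")
    if not isinstance(passage, str) or not passage.strip():
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0 or score < min_relevance:
        return None
    return {
        "passage": passage.strip(),
        "summary": str(data.get("summary", "")),
        "relevance_score": float(score),
    }
```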

3. Document ID Verification

Document IDs are programmatically assigned from database records to prevent:

Hallucinated references
Malformed citations
Non-existent document references
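
One way to read "programmatically assigned" is that the citation's ID and metadata are copied from the database record and anything the model says about identity is discarded. The sketch below illustrates that idea under stated assumptions: attach_verified_id is a hypothetical helper, the "id" and "title" keys are guesses at the record layout, and an in-memory mapping stands in for the real database.

```python
from typing import Any, Dict, Mapping, Optional


def attach_verified_id(
    llm_output: Mapping[str, Any],
    source_document: Mapping[str, Any],
    database: Mapping[str, Mapping[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Build a citation whose ID and title come only from the database record."""
    doc_id = str(source_document.get("id", ""))
    record = database.get(doc_id)
    if record is None:
        return None  # the document is not in the database, so it cannot be cited
    return {
        "passage": llm_output["relevant_passage"],
        "summary": llm_output.get("summary", ""),
        # Any "document_id" the model may have produced is ignored on purpose.
        "document_id": doc_id,
        "document_title": record["title"],
    }
```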

4. Queue-Based Processing

Supports memory-efficient processing of large document sets:

# Process citations via queue system
for doc, citation in citation_agent.process_citation_queue(
    user_question=question,
    scored_documents=large_document_set,
    batch_size=25
):
    if citation:
        verified_citations.append(citation)

API Reference

CitationFinderAgent Methods

Core Citation Extraction

def extract_citation_from_document(self, user_question: str, 
                                 document: Dict[str, Any], 
                                 min_relevance: float = 0.7) -> Optional[Citation]

Extract a citation from a single document, applying the min_relevance filter.

Batch Processing

def process_scored_documents_for_citations(self, user_question: str,
                                         scored_documents: List[Tuple[Dict, Dict]],
                                         score_threshold: float = 2.0,
                                         min_relevance: float = 0.7) -> List[Citation]

Process multiple scored documents to extract qualifying citations.

Queue Integration

def submit_citation_extraction_tasks(self, user_question: str,
                                   scored_documents: List[Tuple[Dict, Dict]],
                                   score_threshold: float = 2.0,
                                   priority: TaskPriority = TaskPriority.NORMAL) -> Optional[List[str]]

Submit citation extraction tasks to the processing queue.

def process_citation_queue(self, user_question: str,
                         scored_documents: List[Tuple[Dict, Dict]],
                         score_threshold: float = 2.0,
                         batch_size: int = 25) -> Iterator[Tuple[Dict, Optional[Citation]]]

Memory-efficient citation processing through the queue system; yields (document, citation) pairs, where the citation may be None.

Statistics and Analysis

def get_citation_stats(self, citations: List[Citation]) -> Dict[str, Any]

Generate statistics about extracted citations including:

Total citations and unique documents
Average, min, max relevance scores
Publication date ranges
Citations per document ratio
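
The statistics listed above can be computed in a few lines. This is a sketch rather than the actual get_citation_stats: it works on plain dictionaries instead of Citation objects, the output key names are assumptions, and it assumes publication dates are ISO-formatted strings so that sorting them as text gives chronological order.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def citation_stats(citations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a list of citation dictionaries (key names are illustrative)."""
    if not citations:
        return {"total_citations": 0}
    scores = [c["relevance_score"] for c in citations]
    per_document = Counter(c["document_id"] for c in citations)
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    dates = sorted(c["publication_date"] for c in citations if c.get("publication_date"))
    return {
        "total_citations": len(citations),
        "unique_documents": len(per_document),
        "avg_relevance": mean(scores),
        "min_relevance": min(scores),
        "max_relevance": max(scores),
        "date_range": (dates[0], dates[-1]) if dates else None,
        "citations_per_document": len(citations) / len(per_document),
    }
```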

Implementation Details

LLM Integration

The system uses Ollama for citation extraction with:

Low temperature (0.1) for consistent extraction
Structured JSON responses for reliable parsing
Timeout handling for robust processing
Error recovery for failed extractions
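
As an illustration of these settings, the sketch below calls Ollama's /api/generate HTTP endpoint directly with the standard library. How BMLibrarian actually talks to Ollama is not stated here, so the transport, the function names build_generate_payload and call_ollama, and the choice to return None on failure are all assumptions.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_generate_payload(prompt: str, model: str = "gpt-oss:20b") -> Dict[str, Any]:
    """Request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                   # one JSON object instead of a stream
        "format": "json",                  # ask the model for structured JSON output
        "options": {"temperature": 0.1},   # low temperature for consistent extraction
    }


def call_ollama(
    prompt: str,
    base_url: str = "http://localhost:11434",
    timeout: float = 60.0,
) -> Optional[str]:
    """Return the model's response text, or None on network or parse failure."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return None
    return body.get("response") if isinstance(body, dict) else None
```

Returning None instead of raising lets a batch loop skip a failed document and continue with the next one.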

Queue Processing

Citations integrate with the existing queue system:

# Task submission
task_ids = citation_agent.submit_citation_extraction_tasks(
    user_question=question,
    scored_documents=qualifying_docs,
    priority=TaskPriority.HIGH
)

# Result collection (task_ids may be None, per the Optional return type)
if task_ids:
    results = orchestrator.wait_for_completion(task_ids, timeout=60.0)

Memory Efficiency

Large document sets are processed in configurable batches:

Batch processing: Process documents in chunks (default 25)
Streaming results: Yield citations as they're processed
Queue persistence: Tasks persist across process restarts
Progress tracking: Optional progress callbacks
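
The batching, streaming, and progress-callback behavior described above can be sketched with a plain generator. This is only an in-memory illustration: stream_in_batches is a hypothetical name, the extract callable stands in for the per-document LLM call, and queue persistence is left out entirely.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Sequence, Tuple

Document = Dict[str, Any]
Extraction = Optional[Dict[str, Any]]


def stream_in_batches(
    documents: Sequence[Document],
    extract: Callable[[Document], Extraction],
    batch_size: int = 25,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Iterator[Tuple[Document, Extraction]]:
    """Yield (document, extraction) pairs batch by batch, reporting progress."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = len(documents)
    done = 0
    for start in range(0, total, batch_size):
        batch = documents[start:start + batch_size]
        for document in batch:
            # Results are yielded one at a time, so the caller never holds
            # more than it chooses to keep.
            yield document, extract(document)
        done += len(batch)
        if progress_callback is not None:
            progress_callback(done, total)  # called once per completed batch
```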

Usage Examples

Basic Citation Extraction

from bmlibrarian.agents import CitationFinderAgent

# Initialize agent
citation_agent = CitationFinderAgent(orchestrator=orchestrator)

# Process scored documents
citations = citation_agent.process_scored_documents_for_citations(
    user_question="What are the side effects of drug X?",
    scored_documents=scored_docs,
    score_threshold=2.5,
    min_relevance=0.8
)

# Display citations
for citation in citations:
    print(f"Document: {citation.document_title}")
    print(f"Passage: {citation.passage}")
    print(f"Summary: {citation.summary}")
    print(f"Relevance: {citation.relevance_score:.2f}")
    print(f"Reference: {citation.document_id}")

Queue-Based Processing

# Process large dataset efficiently
all_citations = []

def progress_callback(current, total):
    print(f"Progress: {current}/{total} ({current/total*100:.1f}%)")

for doc, citation in citation_agent.process_citation_queue(
    user_question=question,
    scored_documents=large_scored_dataset,
    score_threshold=2.0,
    progress_callback=progress_callback,
    batch_size=50
):
    if citation:
        all_citations.append(citation)

print(f"Extracted {len(all_citations)} citations")

Integration with Scoring Workflow

# Complete workflow: Query → Score → Cite
documents = query_agent.search_documents(user_question)

# Score documents for relevance
scored_docs = []
for doc in documents:
    score = scoring_agent.score_document_relevance(user_question, doc)
    if score:
        scored_docs.append((doc, score))

# Extract citations from high-scoring documents
citations = citation_agent.process_scored_documents_for_citations(
    user_question=user_question,
    scored_documents=scored_docs,
    score_threshold=3.0
)

# Generate report with verified citations
report_agent.generate_evidence_based_report(user_question, citations)

Configuration

Agent Initialization

citation_agent = CitationFinderAgent(
    orchestrator=orchestrator,      # Required for queue processing
    ollama_url="http://localhost:11434",  # Ollama service URL
    model="gpt-oss:20b"            # LLM model for extraction (default)
)

Processing Parameters

score_threshold: Minimum document score to process (default: 2.0)
min_relevance: Minimum citation relevance to accept (default: 0.7)
batch_size: Queue processing batch size (default: 25)
timeout: Task completion timeout (default: 60s)

Quality Control

Relevance Filtering

Citations undergo multiple quality checks:

Document Score Filter: Only high-scoring documents processed
LLM Relevance Score: Each citation rated 0-1 for relevance
Minimum Threshold: Only citations at or above min_relevance are accepted
Passage Validation: Extracted text must exist in source document
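
The passage validation check can be as simple as a substring test. The sketch below is one possible version, with assumptions called out: passage_in_source is a hypothetical helper, and ignoring whitespace and letter case is a design choice made here to tolerate line-wrapped abstracts, not documented BMLibrarian behavior.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passage_in_source(passage: str, source_text: str) -> bool:
    """True if the passage occurs in the source, ignoring whitespace and case."""
    needle = _normalize(passage)
    # An empty passage is rejected rather than trivially matching everything.
    return bool(needle) and needle in _normalize(source_text)
```

A stricter variant would drop the lowercasing and require a byte-for-byte match.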

Document Verification

Document IDs are verified to ensure:

Database Existence: ID exists in literature database
Proper Format: ID follows expected format patterns
No Hallucination: Prevents fabricated document references

Error Handling

The agent handles the following failure modes:

Network Failures: Ollama service unavailable
Malformed Responses: Invalid JSON from LLM
Missing Data: Documents without abstracts
Processing Timeouts: Long-running extractions

Performance Considerations

Scaling Factors

Processing time depends on:

Document Count: Linear scaling with queue batching
Abstract Length: Longer abstracts take more processing
Model Performance: Faster models reduce latency
Network Latency: Local Ollama faster than remote

Optimization Strategies

Batch Processing: Process documents in parallel batches
Score Pre-filtering: Only process high-scoring documents
Model Selection: Use a faster model such as medgemma4B_it_q8:latest when latency matters
Caching: Cache similar extractions (not currently implemented)
Queue Persistence: Resume processing after interruptions

Monitoring

Track key metrics:

Processing Rate: Documents per second
Citation Yield: Citations per processed document
Quality Scores: Average relevance scores
Error Rates: Failed extractions per batch
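
A small tracker is enough to collect these metrics. The class below is a sketch with hypothetical names (ExtractionMetrics, record); it computes the error rate over all documents recorded so far, so creating one tracker per batch would give the per-batch figure mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionMetrics:
    """Running totals for one processing window (for example, one batch)."""
    processed: int = 0
    failed: int = 0
    elapsed_seconds: float = 0.0
    relevance_scores: List[float] = field(default_factory=list)

    def record(self, seconds: float, relevance: Optional[float] = None,
               failed: bool = False) -> None:
        """Record one document; relevance is None when no citation was produced."""
        self.processed += 1
        self.elapsed_seconds += seconds
        if failed:
            self.failed += 1
        elif relevance is not None:
            self.relevance_scores.append(relevance)

    @property
    def processing_rate(self) -> float:
        """Documents per second."""
        return self.processed / self.elapsed_seconds if self.elapsed_seconds else 0.0

    @property
    def citation_yield(self) -> float:
        """Citations per processed document."""
        return len(self.relevance_scores) / self.processed if self.processed else 0.0

    @property
    def average_relevance(self) -> float:
        scores = self.relevance_scores
        return sum(scores) / len(scores) if scores else 0.0

    @property
    def error_rate(self) -> float:
        """Failed extractions as a fraction of processed documents."""
        return self.failed / self.processed if self.processed else 0.0
```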

Integration Points

Upstream Dependencies

QueryAgent: Provides initial document search
DocumentScoringAgent: Provides relevance scores
PostgreSQL Database: Source of truth for document metadata

Downstream Consumers

ReportingAgent: Uses citations to generate evidence-based reports
SummaryAgent: Creates summaries with proper citations
ExportAgent: Formats citations for external systems

Queue System Integration

QueueManager: Handles task persistence and scheduling
AgentOrchestrator: Coordinates multi-agent workflows
Recovery System: Handles process interruptions and failures

Security Considerations

Data Integrity

Document ID Verification: Prevents citation of non-existent documents
Source Attribution: All citations traceable to database records
Audit Trail: Citation creation timestamps and process tracking

Privacy

No Data Logging: Citations are not logged to external systems
Local Processing: LLM processing stays on the local machine when ollama_url points to a local Ollama instance
Secure Storage: Queue database uses SQLite with proper permissions

Future Enhancements

Planned Improvements

Semantic Similarity: Use embeddings for better passage matching
Multi-language Support: Extract citations from non-English papers
Citation Clustering: Group similar citations from different papers
Quality Learning: Improve extraction based on user feedback
Export Formats: Support multiple citation formats (APA, MLA, etc.)

Advanced Features

Citation Networks: Build citation relationship graphs
Temporal Analysis: Track how findings evolve over time
Contradiction Detection: Identify conflicting findings
Evidence Synthesis: Automatically synthesize multiple citations
Interactive Refinement: Allow users to refine extraction criteria
