Production-shaped Agentic RAG prototype for SEC 10-K Financial Document Analysis with cost-aware model routing. Uses a trained ML classifier to route queries by complexity, hybrid retrieval with RRF fusion, cross-encoder reranking, and a LangGraph-based agentic loop with self-reflection.

         ┌──────────────────────────────┐
         │          User Query          │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │       Query Processor        │
         │  Rewriting + HyDE + Multi-   │
         │       Query Expansion        │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │      Cost-Aware Router       │
         │ TF-IDF + LogisticRegression  │
         │   ┌─────────┬───────────┐    │
         │   │ simple  │  complex  │    │
         │   └────┬────┴─────┬─────┘    │
         │        │          │          │
         └────────┼──────────┼──────────┘
                  │          │
    ┌─────────────▼──┐  ┌────▼───────────────┐
    │   gemma3:4b    │  │     gemma3:27b     │
    │ (fast + cheap) │  │ (capable + vision) │
    └─────────────┬──┘  └────┬───────────────┘
                  │          │
                  └─────┬────┘
                        │
         ┌──────────────▼───────────────┐
         │     LangGraph Agent Loop     │
         │                              │
         │ ┌─────────┐  ┌────────────┐  │
         │ │ Planner │─▶│ Tool Exec  │  │
         │ └─────────┘  └─────┬──────┘  │
         │                    │         │
         │ ┌──────────────────▼───────┐ │
         │ │     Hybrid Retriever     │ │
         │ │  ┌───────┐  ┌──────────┐ │ │
         │ │  │Vector │  │   BM25   │ │ │
         │ │  │Chroma │  │  +stem   │ │ │
         │ │  └───┬───┘  └────┬─────┘ │ │
         │ │      └─────┬─────┘       │ │
         │ │    RRF Fusion (k=60)     │ │
         │ │            │             │ │
         │ │   Cross-Encoder Rerank   │ │
         │ │    (ms-marco-MiniLM)     │ │
         │ └────────────┬─────────────┘ │
         │              │               │
         │ ┌────────────▼─────────────┐ │
         │ │        Generator         │ │
         │ │ (context + query → LLM)  │ │
         │ └────────────┬─────────────┘ │
         │              │               │
         │ ┌────────────▼─────────────┐ │
         │ │        Reflector         │ │
         │ │ (self-evaluate → retry)  │ │
         │ └────────────┬─────────────┘ │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │     Response + Citations     │
         └──────────────────────────────┘

| Component | Technology | Purpose |
|---|---|---|
| LLM (simple) | Ollama Cloud (gemma3:4b) | Fast factual queries |
| LLM (complex) | Ollama Cloud (gemma3:27b) | Multi-hop reasoning, vision |
| LLM (judge) | Ollama Cloud (minimax-m3:cloud) | LLM-as-Judge evaluation |
| Embeddings | BAAI/bge-small-en-v1.5 (384d, local) | Dense retrieval vectors |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 (local) | Precision reranking |
| Vector DB | ChromaDB (PersistentClient) | Document storage + cosine search |
| Sparse Retrieval | rank_bm25 with stemming + stopwords | BM25 keyword matching |
| Fusion | Reciprocal Rank Fusion (RRF, k=60) | Score-agnostic list merging |
| Query Processing | Rule-based rewriting + HyDE + multi-query expansion | Recall improvement |
| Knowledge Graph | NetworkX | Entity/relation extraction |
| Routing Classifier | TF-IDF + LogisticRegression (sklearn) | Complexity classification |
| Agent Framework | LangGraph with self-reflection loop | Multi-step orchestration |
| API | FastAPI (30+ endpoints) | REST + SSE streaming |
| Frontend | Jinja2 HTML (9 pages) served by FastAPI | Web dashboard |
| Auth | bcrypt password hashing (file-based) | Admin access control |
| Cache | Redis (optional) | Query result + rate-limit caching |
| Observability | Langfuse | Per-query tracing |
| Evaluation | Golden set (55+ Q&A), LLM-as-Judge, retrieval metrics, CI gating | Quality assurance |
| Containerization | Docker Compose (api + redis) | Deployment |
Note: The frontend-archived/ (Next.js) and dashboard-archived/ (Streamlit) directories contain previous frontend implementations that are no longer actively maintained. The Jinja2-based web dashboard (web/templates/) is the primary and only supported frontend.
A trained TF-IDF + LogisticRegression classifier determines query complexity. Simple factual queries ("What was Microsoft's revenue?") route to gemma3:4b for speed and low cost. Complex multi-hop queries ("Compare Microsoft and Amazon revenue growth") route to gemma3:27b. When classifier confidence is below 0.6, the router falls back to LLM-based classification.
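The confidence fallback can be sketched in a few lines. This is a minimal illustration, not the code in src/ml/routing.py: the `route_query` function, the shape of `probabilities`, the `llm_fallback` callable, and the choice to treat the highest class probability as the confidence are all assumptions.

```python
from typing import Callable, Dict

SIMPLE_MODEL = "gemma3:4b"
COMPLEX_MODEL = "gemma3:27b"
CONFIDENCE_THRESHOLD = 0.6  # below this, defer to the LLM-based classifier


def route_query(
    query: str,
    probabilities: Dict[str, float],
    llm_fallback: Callable[[str], str],
) -> str:
    """Pick a model name from classifier probabilities.

    `probabilities` maps "simple"/"complex" to the classifier's predicted
    probability (hypothetical shape). `llm_fallback` is a hypothetical
    callable that returns "simple" or "complex" for a query.
    """
    # The most likely class and its probability act as label and confidence.
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < CONFIDENCE_THRESHOLD:
        label = llm_fallback(query)
    return COMPLEX_MODEL if label == "complex" else SIMPLE_MODEL
```

With a scikit-learn pipeline, the probabilities would typically come from `predict_proba`; the fallback is only invoked for low-confidence predictions, so most queries avoid the extra LLM call.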
Parallel dense (ChromaDB cosine) and sparse (BM25) retrieval with automatic ticker/year filter extraction from the query. BM25 uses a custom Porter-like stemmer with English stopword removal.
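One way to picture the ticker/year filter extraction is a small regex-based helper. This is a sketch under stated assumptions: the `TICKER_ALIASES` table, the `extract_filters` name, and the returned dictionary shape are hypothetical and may not match the project's retriever.

```python
import re
from typing import Dict, Union

# Hypothetical alias table; a real one would cover every indexed company.
TICKER_ALIASES = {
    "microsoft": "MSFT", "msft": "MSFT",
    "amazon": "AMZN", "amzn": "AMZN",
    "tesla": "TSLA", "tsla": "TSLA",
}
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_filters(query: str) -> Dict[str, Union[str, int]]:
    """Return a metadata filter with `ticker` and/or `year` when the query names them."""
    filters: Dict[str, Union[str, int]] = {}
    # Use the first word that matches a known company name or ticker.
    for word in re.findall(r"[a-z]+", query.lower()):
        if word in TICKER_ALIASES:
            filters["ticker"] = TICKER_ALIASES[word]
            break
    # Use the first four-digit year found in the query.
    year = YEAR_PATTERN.search(query)
    if year:
        filters["year"] = int(year.group())
    return filters
```

The resulting dictionary could then be passed as a metadata filter to both the dense and the sparse retriever, so that they search the same subset of filings.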
Score-agnostic RRF combines the vector and BM25 ranked lists. Each document's fused score is the sum of 1/(k + rank) over the lists in which it appears, with k=60. This avoids the need to normalize scores between heterogeneous retrieval methods.
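A minimal sketch of Reciprocal Rank Fusion over lists of document IDs follows. The `rrf_fuse` name and the ID-based interface are assumptions for illustration; the project's fusion.py may be organized differently.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (highest first); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]
bm25_hits = ["c3", "c1", "c4"]
print(rrf_fuse([vector_hits, bm25_hits]))  # ['c1', 'c3', 'c2', 'c4']
```

Documents that appear in both lists (c1 and c3) rise above those found by only one retriever, without the raw cosine or BM25 scores ever being compared.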
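After RRF fusion, a cross-encoder/ms-marco-MiniLM-L-6-v2 model reranks the top candidates for precision. Because the cross-encoder scores each query-passage pair jointly rather than comparing independently computed embeddings, it is typically more precise than embedding-only similarity, at the cost of extra latency per candidate.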
Three-stage query transformation before retrieval:
- Rewriting: pronoun resolution, abbreviation expansion, context addition
- HyDE: generates a hypothetical answer paragraph and retrieves with that embedding
- Multi-Query Expansion: generates 3 alternative phrasings for better recall
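
The rule-based rewriting stage can be illustrated with abbreviation expansion alone. This is a sketch, assuming a small lookup table; the `ABBREVIATIONS` entries and the `expand_abbreviations` name are hypothetical, and HyDE and multi-query expansion are left out because they need an LLM call.

```python
import re

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {
    "rev": "revenue",
    "yoy": "year-over-year",
    "opex": "operating expenses",
    "fy": "fiscal year",
}

# Match any known abbreviation as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(query: str) -> str:
    """Replace known abbreviations with their full terms."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1).lower()], query)


print(expand_abbreviations("MSFT rev YoY"))  # MSFT revenue year-over-year
```

The word-boundary anchors keep words such as "revenue" intact even though they contain an abbreviation as a prefix.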
A four-node state graph (planner → tools → generator → reflector) with conditional edges. The reflector self-evaluates answer quality and can trigger a retry (up to 2 reflections).
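The control flow of the bounded reflection loop can be sketched in plain Python. This is a stand-in for illustration and deliberately does not use the LangGraph API; the four callables and the `run_agent` name are hypothetical.

```python
from typing import Callable, List, Tuple

MAX_REFLECTIONS = 2  # matches the retry bound described above


def run_agent(
    query: str,
    plan: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    reflect: Callable[[str, str, List[str]], bool],
) -> Tuple[str, int]:
    """Run planner -> tools -> generator -> reflector, retrying from the tools step.

    Returns the final answer and the number of reflections performed.
    """
    search_query = plan(query)
    reflections = 0
    while True:
        context = retrieve(search_query)
        answer = generate(query, context)
        if reflections >= MAX_REFLECTIONS:
            return answer, reflections  # retry budget exhausted; keep the last answer
        reflections += 1
        if reflect(query, answer, context):
            return answer, reflections  # reflector accepted the answer
```

In a state-graph framework, the `while` loop corresponds to a conditional edge from the reflector back to the tools node, and the reflection counter lives in the graph state.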
NetworkX-based entity and relation extraction from SEC filings. Extracts companies, monetary values, dates, metrics, and persons using regex patterns. Builds knowledge triples (e.g., Microsoft → REPORTED_REVENUE → $245.1 billion).
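A regex-based triple extractor for the revenue example might look like the sketch below. The pattern and the `extract_revenue_triples` name are hypothetical; the project's own patterns cover more entity types and phrasings.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: "<Company> reported [total] revenue of $<amount> <scale>".
REVENUE_PATTERN = re.compile(
    r"(?P<company>[A-Z][A-Za-z]+) reported (?:total )?revenue of "
    r"(?P<amount>\$\d+(?:\.\d+)? (?:million|billion))"
)


def extract_revenue_triples(text: str) -> List[Tuple[str, str, str]]:
    """Return (subject, relation, object) triples for reported revenue."""
    return [
        (match.group("company"), "REPORTED_REVENUE", match.group("amount"))
        for match in REVENUE_PATTERN.finditer(text)
    ]


sentence = "Microsoft reported revenue of $245.1 billion in fiscal 2024."
print(extract_revenue_triples(sentence))
# [('Microsoft', 'REPORTED_REVENUE', '$245.1 billion')]
```

Each triple maps naturally onto a directed edge in a NetworkX graph, for example with `graph.add_edge(subject, obj, relation=relation)`.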
- Golden Set: 76 curated Q&A pairs across MSFT, AMZN, TSLA, GOOG, META, AAPL, NVDA
- LLM-as-Judge: minimax-m3:cloud scores faithfulness, answer relevancy, context precision, context recall
- Retrieval Metrics: NDCG@10, MRR, Recall@5, Recall@10, Precision@5, Precision@10, Hit Rate
- CI Gating: configurable thresholds that gate deployment based on evaluation scores
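
For reference, three of the retrieval metrics listed above can be computed as follows. This is a sketch with binary relevance judgments; the function names are illustrative and are not taken from src/eval/retrieval_metrics.py.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal DCG places all relevant documents at the top of the ranking.
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is the mean of `reciprocal_rank` over all queries in the golden set, and Hit Rate is the fraction of queries for which at least one relevant document is retrieved.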

    pip install -e ".[dev]"

    cp .env.example .env
    # Edit .env with your Ollama API key:
    # OLLAMA_API_KEY=your_key_here
    # ADMIN_USERNAME=admin
    # ADMIN_PASSWORD=your_password_here

    python scripts/ingest.py

    python -c "from src.ml.routing import train_classifier; train_classifier()"

    uvicorn api.main:app --reload --port 8001

The dashboard is available at http://localhost:8001/.

    docker compose up --build

This starts the API on port 8001 and Redis on port 6379.

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | System health + document/chunk counts |
| POST | /query | Execute a financial query |
| POST | /query/stream | SSE streaming response |
| GET | /cost/summary | Cost analytics summary |
| GET | /cost/budget | Budget check |
| GET | /documents | List all indexed 10-K filings |
| POST | /upload | Upload PDF for indexing |
| GET | /upload/status/{id} | Upload processing status |
| GET | /analytics/models | Model cost/performance comparison |
| GET | /analytics/routing | Routing efficiency breakdown |
| GET | /analytics/trend | Cost trend over time |
| POST | /feedback | Submit query feedback |
| GET | /feedback/stats | Aggregated feedback stats |
| POST | /eval/run | Run evaluation on Q&A pair |
| GET | /eval/averages | Average evaluation scores |
| GET | /knowledge/stats | Knowledge graph stats |
| POST | /knowledge/extract | Extract entities from text |
| GET | /suggestions | Query suggestions |
| GET | /anomalies | Anomaly detection |
| POST | /admin/login | Admin login |
| GET | /admin/users | List users |
| GET | /export/query | Export query to PDF |
| GET | /export/queries/csv | Export history to CSV |
| GET | /conversation/history | Conversation history |
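
The query endpoints can also be called from Python using only the standard library. This is a sketch: the `build_query_request` and `send` helpers are hypothetical, the request body mirrors the `{"query": ...}` payload used elsewhere in this README, and the response is returned as raw text because its schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # port used by the uvicorn command in this README


def build_query_request(query: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request for /query or /query/stream with a JSON body."""
    path = "/query/stream" if stream else "/query"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the API running locally, `send(build_query_request("What was Microsoft's total revenue in 2024?"))` returns the response body as a string.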
| Path | Page |
|---|---|
| / | Landing page |
| /app | Main query dashboard |
| /app/documents | Document browser |
| /app/analytics | Cost analytics |
| /app/comparison | Model comparison |
| /app/upload | Document upload |
| /app/latency | Latency dashboard |
| /app/cost-optimization | Cost optimization |
| /app/admin | Admin panel |

    cost-aware-agentic-rag/
    ├── api/
    │   ├── main.py  # FastAPI app + all routes
    │   └── models.py  # Pydantic request/response models
    ├── src/
    │   ├── config.py  # pydantic-settings config (all paths, models, etc.)
    │   ├── agents/
    │   │   ├── graph.py  # LangGraph state graph + orchestrator
    │   │   ├── memory.py  # Conversation memory (multi-turn)
    │   │   └── guardrails.py  # Input/output guardrails
    │   ├── retrieval/
    │   │   ├── vector_store.py  # ChromaDB + bge-small-en embeddings
    │   │   ├── bm25_index.py  # BM25 with stemming + stopwords
    │   │   ├── fusion.py  # RRF + weighted score fusion
    │   │   ├── hybrid.py  # HybridRetriever (vector + BM25 + RRF + rerank)
    │   │   └── reranker.py  # Cross-encoder reranking
    │   ├── generation/
    │   │   ├── llm_client.py  # Ollama Cloud client + cost estimation
    │   │   ├── cost_tracker.py  # Per-query cost tracking
    │   │   └── prompts.py  # Prompt templates
    │   ├── ml/
    │   │   ├── routing.py  # TF-IDF + LogisticRegression classifier
    │   │   ├── query_processor.py  # Rewriting, HyDE, multi-query expansion
    │   │   ├── cost_analytics.py  # Model/routing cost analytics
    │   │   ├── feedback.py  # User feedback storage
    │   │   ├── suggestions.py  # Query suggestion engine
    │   │   ├── anomaly.py  # Anomaly detection
    │   │   └── export.py  # PDF/CSV export
    │   ├── knowledge/
    │   │   └── graph.py  # NetworkX knowledge graph + entity extraction
    │   ├── multimodal/
    │   │   ├── vision.py  # Image understanding (gemma3:27b)
    │   │   ├── tables.py  # Table extraction from text
    │   │   └── images.py  # Image processing
    │   ├── eval/
    │   │   ├── golden_set.py  # Golden Q&A set (55+ pairs)
    │   │   ├── evaluator.py  # Evaluation runner
    │   │   ├── llm_judge.py  # LLM-as-Judge (faithfulness, relevancy, etc.)
    │   │   ├── retrieval_metrics.py  # NDCG, MRR, Recall, Precision, Hit Rate
    │   │   ├── pipeline.py  # EvalPipeline + CI Gating
    │   │   └── ragas_eval.py  # RAGAS integration
    │   ├── ingestion/
    │   │   ├── downloader.py  # SEC EDGAR document downloader
    │   │   ├── parser.py  # Document parser
    │   │   ├── chunker.py  # Text chunking
    │   │   ├── pipeline.py  # Ingestion pipeline
    │   │   └── upload_handler.py  # Upload processing
    │   ├── database/
    │   │   ├── admin_auth.py  # bcrypt auth, file-based sessions
    │   │   ├── cache.py  # Redis caching layer
    │   │   └── models.py  # SQLAlchemy models
    │   ├── observability/
    │   │   └── langfuse.py  # Langfuse integration
    │   └── tasks/
    │       └── celery_app.py  # Celery task queue
    ├── web/
    │   ├── templates/  # Jinja2 HTML (9 pages)
    │   │   ├── index.html  # Landing
    │   │   ├── app.html  # Query dashboard
    │   │   ├── documents.html  # Document browser
    │   │   ├── analytics.html  # Cost analytics
    │   │   ├── comparison.html  # Model comparison
    │   │   ├── upload.html  # Upload page
    │   │   ├── latency.html  # Latency dashboard
    │   │   ├── cost_optimization.html  # Cost optimization
    │   │   └── admin.html  # Admin panel
    │   └── static/  # CSS + JS
    ├── frontend-archived/  # Next.js frontend (archived)
    ├── dashboard-archived/  # Streamlit dashboard (archived)
    ├── scripts/
    │   ├── ingest.py  # Data ingestion CLI
    │   ├── evaluate.py  # Evaluation runner
    │   ├── evaluate_ml.py  # ML evaluation
    │   ├── eval_llm_judge.py  # LLM judge evaluation
    │   ├── eval_ragas.py  # RAGAS evaluation
    │   └── create_samples.py  # Sample data creation
    ├── tests/
    │   ├── conftest.py  # Test fixtures
    │   ├── test_comprehensive.py
    │   └── test_config.py
    ├── data/
    │   ├── raw/  # Raw SEC filings (per ticker/year)
    │   ├── processed/  # Parsed + chunked data
    │   ├── indexes/  # ChromaDB + BM25 indices
    │   └── eval/  # Golden set + eval results
    ├── docker-compose.yml  # api + redis
    ├── Dockerfile
    ├── pyproject.toml
    └── requirements.txt

ChromaDB runs fully embedded with zero infrastructure. PersistentClient gives us disk-backed storage with HNSW indexing — no server process needed. For a single-tenant RAG system analyzing SEC filings, the scaling limits of ChromaDB are irrelevant, and the operational simplicity is a major win. The cosine similarity search with metadata filtering covers our exact use case (filter by ticker + year).
Ollama Cloud provides access to open-weight models (gemma3:4b, gemma3:27b) with a unified API. The gemma3:4b model handles 80%+ of queries (simple factual lookups) at a fraction of the cost, while gemma3:27b provides strong reasoning and vision capabilities for complex multi-hop analysis. The Ollama client library (ollama pip package) is thin and well-maintained.
Reciprocal Rank Fusion operates on rank positions, not raw scores. This is critical when fusing heterogeneous retrieval methods (cosine similarity from vector DB vs. BM25 log-odds scores) whose score distributions are not comparable. RRF with k=60 is the standard from the original RRF paper and requires no weight tuning — a significant advantage over weighted fusion which needs per-dataset calibration.
BAAI/bge-small-en-v1.5 produces 384-dimensional embeddings locally with no API calls. At 33M parameters it loads in under 2 seconds and encodes batches in milliseconds. On MTEB benchmarks it punches well above its size class. For SEC 10-K retrieval where the query-document semantic gap is narrow (financial terminology is consistent), the quality difference vs. larger models is negligible, while the latency and cost savings are substantial.
A trained classifier on labeled query data (TRAINING_DATA in src/ml/routing.py) provides fast, deterministic routing with interpretable confidence scores. TF-IDF features capture the lexical patterns that distinguish simple lookups ("What is X's revenue?") from complex analysis ("Compare X and Y's risk factors across 3 years"). When confidence drops below 0.6, the system falls back to LLM-based classification — a hybrid approach that balances speed and accuracy.
LangGraph provides explicit state management, conditional edges, and checkpointing for the agent loop. The planner → tools → generator → reflector graph with a retry edge from reflector back to tools is natural to express as a state graph. MemorySaver checkpointing enables conversation continuity. The alternative (manually chaining LLM calls with if/else logic) would be harder to debug, extend, and observe.
Run the full evaluation suite:

    # LLM-as-Judge evaluation
    python scripts/eval_llm_judge.py
    # Retrieval metrics
    python scripts/evaluate_ml.py
    # RAGAS evaluation
    python scripts/eval_ragas.py

| Metric | Score |
|---|---|
| Faithfulness | 0.596 |
| Answer Relevancy | 0.918 |
| Context Precision | 0.975 |
| Context Recall | 1.000 |
| Overall (weighted) | 0.849 |
| Metric | Score |
|---|---|
| NDCG@10 | 0.710 |
| MRR | 0.611 |
| Hit Rate | 1.000 |
| Metric | Threshold |
|---|---|
| Overall Score | ≥ 0.6 |
| Retrieval Precision | ≥ 0.5 |
| Retrieval Recall | ≥ 0.5 |
| Answer Faithfulness | ≥ 0.6 |
| Answer Relevance | ≥ 0.5 |
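
A threshold gate of this kind reduces to a comparison per metric. The sketch below copies the thresholds from the table above; the metric key names and the `failed_gates` function are hypothetical and may not match src/eval/pipeline.py.

```python
from typing import Dict, List

# Thresholds copied from the table above; the metric key names are hypothetical.
THRESHOLDS = {
    "overall_score": 0.6,
    "retrieval_precision": 0.5,
    "retrieval_recall": 0.5,
    "answer_faithfulness": 0.6,
    "answer_relevance": 0.5,
}


def failed_gates(scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the metrics that fall below their threshold (missing metrics fail)."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, float("-inf")) < minimum
    ]
```

A CI job could call `failed_gates` on the latest evaluation results and exit with a non-zero status when the returned list is not empty, which blocks the deployment step.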

    # Simple factual lookup (routes to gemma3:4b)
    curl -X POST http://localhost:8001/query \
      -H "Content-Type: application/json" \
      -d '{"query": "What was Microsoft'\''s total revenue in 2024?"}'

    # Complex comparison (routes to gemma3:27b)
    curl -X POST http://localhost:8001/query \
      -H "Content-Type: application/json" \
      -d '{"query": "Compare Microsoft and Amazon revenue growth over the last 3 years"}'

    # Streaming response
    curl -X POST http://localhost:8001/query/stream \
      -H "Content-Type: application/json" \
      -d '{"query": "What are Tesla'\''s main risk factors?"}'

- Cost model is approximate: token costs use per-million-token rates from public Ollama pricing. Actual costs may vary by deployment.
- Upload status is in-memory: a server restart loses pending upload status. A production deployment would use a persistent queue.
- File-based auth: admin users and sessions are stored in JSON files. Suitable for a demo, not for enterprise multi-tenant use.
- Single-process retrieval: no distributed search or horizontal scaling. ChromaDB and BM25 are local.
- Eval harness is offline: golden set has 20 entries. Online eval with production traffic is not yet implemented.
- Agent loop is bounded: max 2 reflection iterations. No human-in-the-loop approval or tool budget enforcement.
- No persistent deployment: Docker builds locally. No cloud deployment, load balancer, or auto-scaling.
MIT