Cost-Aware Agentic RAG

A production-shaped agentic RAG prototype for analyzing SEC 10-K filings with cost-aware model routing. It uses a trained ML classifier to route queries by complexity, hybrid retrieval merged with Reciprocal Rank Fusion (RRF), cross-encoder reranking, and a LangGraph-based agentic loop with self-reflection.

Architecture

                            ┌──────────────────────────────┐
                            │        User Query             │
                            └──────────────┬───────────────┘
                                           │
                            ┌──────────────▼───────────────┐
                            │   Query Processor             │
                            │   Rewriting + HyDE + Multi-   │
                            │   Query Expansion             │
                            └──────────────┬───────────────┘
                                           │
                            ┌──────────────▼───────────────┐
                            │   Cost-Aware Router           │
                            │   TF-IDF + LogisticRegression │
                            │   ┌─────────┬───────────┐     │
                            │   │ simple  │ complex   │     │
                            │   └────┬────┴─────┬─────┘     │
                            │        │          │           │
                            └────────┼──────────┼───────────┘
                                     │          │
                        ┌────────────▼──┐  ┌────▼────────────┐
                        │ gemma3:4b     │  │ gemma3:27b       │
                        │ (fast + cheap)│  │ (capable + vision)│
                        └────────────┬──┘  └────┬────────────┘
                                     │          │
                                     └─────┬────┘
                                           │
                            ┌──────────────▼───────────────┐
                            │   LangGraph Agent Loop        │
                            │                              │
                            │  ┌─────────┐ ┌────────────┐  │
                            │  │ Planner │─▶│ Tool Exec  │  │
                            │  └─────────┘ └─────┬──────┘  │
                            │                    │         │
                            │  ┌─────────────────▼───────┐ │
                            │  │     Hybrid Retriever     │ │
                            │  │  ┌───────┐ ┌──────────┐ │ │
                            │  │  │Vector │ │  BM25     │ │ │
                            │  │  │Chroma │ │  +stem    │ │ │
                            │  │  └───┬───┘ └────┬─────┘ │ │
                            │  │      └────┬─────┘       │ │
                            │  │      RRF Fusion (k=60)  │ │
                            │  │           │             │ │
                            │  │  Cross-Encoder Rerank   │ │
                            │  │  (ms-marco-MiniLM)      │ │
                            │  └───────────┬─────────────┘ │
                            │              │               │
                            │  ┌───────────▼─────────────┐ │
                            │  │     Generator            │ │
                            │  │  (context + query → LLM) │ │
                            │  └───────────┬─────────────┘ │
                            │              │               │
                            │  ┌───────────▼─────────────┐ │
                            │  │     Reflector            │ │
                            │  │  (self-evaluate → retry) │ │
                            │  └───────────┬─────────────┘ │
                            └──────────────┬───────────────┘
                                           │
                            ┌──────────────▼───────────────┐
                            │   Response + Citations        │
                            └──────────────────────────────┘

Tech Stack

Component	Technology	Purpose
LLM (simple)	Ollama Cloud — `gemma3:4b`	Fast factual queries
LLM (complex)	Ollama Cloud — `gemma3:27b`	Multi-hop reasoning, vision
LLM (judge)	Ollama Cloud — `minimax-m3:cloud`	LLM-as-Judge evaluation
Embeddings	`BAAI/bge-small-en-v1.5` (384d, local)	Dense retrieval vectors
Reranker	`cross-encoder/ms-marco-MiniLM-L-6-v2` (local)	Precision reranking
Vector DB	ChromaDB (PersistentClient)	Document storage + cosine search
Sparse Retrieval	`rank_bm25` with stemming + stopwords	BM25 keyword matching
Fusion	Reciprocal Rank Fusion (RRF, k=60)	Score-agnostic list merging
Query Processing	Rule-based rewriting + HyDE + multi-query expansion	Recall improvement
Knowledge Graph	NetworkX	Entity/relation extraction
Routing Classifier	TF-IDF + LogisticRegression (sklearn)	Complexity classification
Agent Framework	LangGraph with self-reflection loop	Multi-step orchestration
API	FastAPI (30+ endpoints)	REST + SSE streaming
Frontend	Jinja2 HTML (9 pages) served by FastAPI	Web dashboard
Auth	bcrypt password hashing (file-based)	Admin access control
Cache	Redis (optional)	Query result + rate-limit caching
Observability	Langfuse	Per-query tracing
Evaluation	Golden set (55+ Q&A), LLM-as-Judge, retrieval metrics, CI gating	Quality assurance
Containerization	Docker Compose (api + redis)	Deployment

Archived Frontends

Note: The frontend-archived/ (Next.js) and dashboard-archived/ (Streamlit) directories contain previous frontend implementations that are no longer actively maintained. The Jinja2-based web dashboard (web/templates/) is the primary and only supported frontend.

Features

Cost-Aware Routing

A trained TF-IDF + LogisticRegression classifier determines query complexity. Simple factual queries ("What was Microsoft's revenue?") route to gemma3:4b for speed and low cost. Complex multi-hop queries ("Compare Microsoft and Amazon revenue growth") route to gemma3:27b. Falls back to LLM-based classification when classifier confidence is below 0.6.
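
The confidence gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/ml/routing.py: `route_query`, `fake_classifier`, and `fake_llm_fallback` are hypothetical names, and the classifier is assumed to return a label-to-probability mapping (similar in spirit to what a scikit-learn `predict_proba` call would give).

```python
from __future__ import annotations

from typing import Callable, Mapping

SIMPLE_MODEL = "gemma3:4b"
COMPLEX_MODEL = "gemma3:27b"
CONFIDENCE_THRESHOLD = 0.6  # fallback threshold stated in this README


def route_query(
    query: str,
    classify: Callable[[str], Mapping[str, float]],
    llm_fallback: Callable[[str], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> tuple[str, str]:
    """Return (model_name, decided_by) for a query."""
    probabilities = classify(query)
    # Take the most probable label and treat its probability as the confidence.
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    decided_by = "classifier"
    if confidence < threshold:
        # Low confidence: defer to the (assumed) LLM-based classifier.
        label = llm_fallback(query)
        decided_by = "llm_fallback"
    model = COMPLEX_MODEL if label == "complex" else SIMPLE_MODEL
    return model, decided_by


# Stand-ins so the sketch runs without a trained model or an LLM (assumptions).
def fake_classifier(query: str) -> Mapping[str, float]:
    text = query.lower()
    if "compare" in text:
        return {"simple": 0.10, "complex": 0.90}
    if "revenue" in text:
        return {"simple": 0.85, "complex": 0.15}
    return {"simple": 0.55, "complex": 0.45}


def fake_llm_fallback(query: str) -> str:
    return "complex"
```

With these stand-ins, a confident "simple" prediction goes to gemma3:4b, a confident "complex" prediction goes to gemma3:27b, and anything under the 0.6 threshold is decided by the fallback.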

Hybrid Retrieval

Parallel dense (ChromaDB cosine) and sparse (BM25) retrieval with automatic ticker/year filter extraction from the query. BM25 uses a custom Porter-like stemmer with English stopword removal.
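
One way the ticker/year filter extraction might look is sketched below. The alias table and the `extract_filters` name are assumptions made for illustration; the actual extraction rules in the repository may differ. The tickers are the ones listed for the golden set in this README.

```python
from __future__ import annotations

import re

# Hypothetical alias table mapping company names to tickers.
COMPANY_TO_TICKER = {
    "microsoft": "MSFT",
    "amazon": "AMZN",
    "tesla": "TSLA",
    "google": "GOOG",
    "alphabet": "GOOG",
    "meta": "META",
    "apple": "AAPL",
    "nvidia": "NVDA",
}
KNOWN_TICKERS = set(COMPANY_TO_TICKER.values())

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years 1900-2099
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_filters(query: str) -> dict[str, list]:
    """Pull ticker and fiscal-year filters out of a natural-language query."""
    tickers: list[str] = []
    for word in WORD_RE.findall(query):
        # Accept an upper-case ticker as-is, otherwise try the alias table.
        ticker = word if word in KNOWN_TICKERS else COMPANY_TO_TICKER.get(word.lower())
        if ticker and ticker not in tickers:
            tickers.append(ticker)
    years = sorted({int(year) for year in YEAR_RE.findall(query)})
    return {"tickers": tickers, "years": years}
```

The resulting dictionary could then be translated into a metadata filter for the vector search and a pre-filter for the BM25 corpus.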

Reciprocal Rank Fusion

Score-agnostic RRF combines vector and BM25 ranked lists using score = Σ 1/(k + rank) with k=60. This avoids the need to normalize scores between heterogeneous retrieval methods.
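
The formula above can be illustrated with a short sketch. It assumes 1-based ranks and breaks ties by document ID; the function name and the chunk IDs are made up for the example, and the code in src/retrieval/fusion.py may be organized differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1 (assumption)
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical dense-retrieval ranking
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['c1', 'c3', 'c9', 'c7']
```

Here c1 (ranks 2 and 1) edges out c3 (ranks 1 and 3) because it places well in both lists, while documents that appear in only one list fall to the bottom.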

Cross-Encoder Reranking

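After RRF fusion, a cross-encoder/ms-marco-MiniLM-L-6-v2 model reranks the top candidates for precision. A cross-encoder scores each query-passage pair jointly instead of comparing independently computed embeddings, which is typically more precise but costs more per pair, so it is applied only to the fused shortlist.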

Query Processing

Three-stage query transformation before retrieval:

Rewriting: pronoun resolution, abbreviation expansion, context addition
HyDE: generates a hypothetical answer paragraph and retrieves with the embedding of that paragraph
Multi-Query Expansion: generates 3 alternative phrasings for better recall
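
The rule-based rewriting stage could look something like the sketch below. HyDE and multi-query expansion need an LLM call and are left out. The abbreviation table, the pronoun rule, and the `rewrite_query` name are illustrative assumptions rather than the rules in src/ml/query_processor.py.

```python
import re

# Hypothetical abbreviation table (lower-case keys).
ABBREVIATIONS = {
    "yoy": "year-over-year",
    "eps": "earnings per share",
    "capex": "capital expenditures",
}
PRONOUN_RE = re.compile(r"\b(its|their)\b", re.IGNORECASE)


def rewrite_query(query, last_entity=None):
    """Expand known abbreviations and resolve possessive pronouns."""

    def expand(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    rewritten = re.sub(r"[A-Za-z]+", expand, query)
    if last_entity:
        # Naive pronoun resolution using the last entity from the conversation.
        rewritten = PRONOUN_RE.sub(lambda _: f"{last_entity}'s", rewritten)
    return rewritten


print(rewrite_query("What was its EPS and YoY growth?", last_entity="Microsoft"))
# What was Microsoft's earnings per share and year-over-year growth?
```

In a multi-turn session, `last_entity` would come from conversation memory; without it the pronouns are left untouched.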

LangGraph Agentic Loop

A four-node state graph (planner → tools → generator → reflector) with conditional edges. The reflector self-evaluates answer quality and can trigger a retry (up to 2 reflections).
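
The control flow of this loop can be sketched in plain Python, without the LangGraph API, to make the retry edge explicit. The node functions below are stand-ins, the state keys are hypothetical, and "up to 2 reflections" is read here as at most two retries; the graph in src/agents/graph.py may count differently.

```python
from __future__ import annotations

from typing import Callable

MAX_REFLECTIONS = 2  # interpreted as "at most two retries" (assumption)

State = dict  # keys used here: query, plan, context, answer, reflections


def run_agent(
    query: str,
    planner: Callable[[State], str],
    tools: Callable[[State], list[str]],
    generator: Callable[[State], str],
    reflector: Callable[[State], bool],
    max_reflections: int = MAX_REFLECTIONS,
) -> State:
    state: State = {"query": query, "plan": "", "context": [], "answer": "", "reflections": 0}
    state["plan"] = planner(state)
    while True:
        state["context"] = tools(state)      # retrieval and other tool calls
        state["answer"] = generator(state)   # draft an answer from the context
        accepted = reflector(state)          # self-evaluation of the draft
        if accepted or state["reflections"] >= max_reflections:
            return state
        state["reflections"] += 1            # conditional edge back to tools


# Stand-in nodes so the sketch runs without an LLM or a retriever.
def demo_planner(state):
    return f"retrieve evidence for: {state['query']}"


def demo_tools(state):
    # Pretend each retry widens the search by one chunk.
    return [f"chunk-{i}" for i in range(state["reflections"] + 1)]


def demo_generator(state):
    return f"answer from {len(state['context'])} chunk(s)"


def demo_reflector(state):
    return len(state["context"]) >= 2  # accept once two chunks back the answer


result = run_agent("What are Tesla's main risk factors?",
                   demo_planner, demo_tools, demo_generator, demo_reflector)
print(result["reflections"], result["answer"])  # 1 answer from 2 chunk(s)
```

In LangGraph the same shape would be expressed as nodes plus a conditional edge out of the reflector; the plain loop is only meant to show the bounded retry.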

Knowledge Graph

NetworkX-based entity and relation extraction from SEC filings. Extracts companies, monetary values, dates, metrics, and persons using regex patterns. Builds knowledge triples (e.g., Microsoft → REPORTED_REVENUE → $245.1 billion).
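
As a rough illustration of regex-based triple extraction, the sketch below handles a single relation. The pattern and the `extract_triples` name are assumptions, the second sentence in the sample text is made up, and the real extractor in src/knowledge/graph.py covers more entity types.

```python
import re

# One illustrative pattern: "<Company> reported [total] revenue of $<amount> billion|million".
REVENUE_RE = re.compile(
    r"(?P<company>[A-Z][A-Za-z]+) reported (?:total )?revenue of "
    r"(?P<amount>\$\d+(?:\.\d+)? (?:billion|million))"
)


def extract_triples(text):
    """Return (subject, predicate, object) triples found in the text."""
    return [
        (match["company"], "REPORTED_REVENUE", match["amount"])
        for match in REVENUE_RE.finditer(text)
    ]


sample = (
    "Microsoft reported total revenue of $245.1 billion. "
    "Acme reported revenue of $12 million."  # made-up sentence for the example
)
triples = extract_triples(sample)
print(triples[0])  # ('Microsoft', 'REPORTED_REVENUE', '$245.1 billion')
```

Each triple could then be stored as a directed edge in a NetworkX graph, for example with `graph.add_edge(subject, obj, relation=predicate)`.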

Evaluation Pipeline

Golden Set: 76 curated Q&A pairs across MSFT, AMZN, TSLA, GOOG, META, AAPL, NVDA
LLM-as-Judge: minimax-m3:cloud scores faithfulness, answer relevancy, context precision, context recall
Retrieval Metrics: NDCG@10, MRR, Recall@5, Recall@10, Precision@5, Precision@10, Hit Rate
CI Gating: configurable thresholds that gate deployment based on evaluation scores
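
For reference, the retrieval metrics listed above can be computed per query roughly as follows. The sketch assumes binary relevance and duplicate-free rankings; the function names and chunk IDs are illustrative and do not mirror src/eval/retrieval_metrics.py. MRR is the mean of the reciprocal rank over all queries.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k


def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["c4", "c1", "c8", "c2", "c5"]  # hypothetical retrieval order
relevant = {"c1", "c2"}                  # hypothetical gold chunks
print(reciprocal_rank(ranked, relevant))    # 0.5
print(precision_at_k(ranked, relevant, 5))  # 0.4
```

Averaging each function over the golden set gives the aggregate numbers reported by the evaluation scripts.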

Quick Start

1. Install Dependencies

pip install -e ".[dev]"

2. Configure Environment

cp .env.example .env
# Edit .env with your Ollama API key:
#   OLLAMA_API_KEY=your_key_here
#   ADMIN_USERNAME=admin
#   ADMIN_PASSWORD=your_password_here

3. Ingest SEC 10-K Data

python scripts/ingest.py

4. Train Query Classifier

python -c "from src.ml.routing import train_classifier; train_classifier()"

5. Run API Server

uvicorn api.main:app --reload --port 8001

The dashboard is available at http://localhost:8001/.

6. Docker Deployment

docker compose up --build

This starts the API on port 8001 and Redis on port 6379.

API Endpoints

Method	Endpoint	Description
GET	`/health`	System health + document/chunk counts
POST	`/query`	Execute a financial query
POST	`/query/stream`	SSE streaming response
GET	`/cost/summary`	Cost analytics summary
GET	`/cost/budget`	Budget check
GET	`/documents`	List all indexed 10-K filings
POST	`/upload`	Upload PDF for indexing
GET	`/upload/status/{id}`	Upload processing status
GET	`/analytics/models`	Model cost/performance comparison
GET	`/analytics/routing`	Routing efficiency breakdown
GET	`/analytics/trend`	Cost trend over time
POST	`/feedback`	Submit query feedback
GET	`/feedback/stats`	Aggregated feedback stats
POST	`/eval/run`	Run evaluation on Q&A pair
GET	`/eval/averages`	Average evaluation scores
GET	`/knowledge/stats`	Knowledge graph stats
POST	`/knowledge/extract`	Extract entities from text
GET	`/suggestions`	Query suggestions
GET	`/anomalies`	Anomaly detection
POST	`/admin/login`	Admin login
GET	`/admin/users`	List users
GET	`/export/query`	Export query to PDF
GET	`/export/queries/csv`	Export history to CSV
GET	`/conversation/history`	Conversation history

Web Pages

Path	Page
`/`	Landing page
`/app`	Main query dashboard
`/app/documents`	Document browser
`/app/analytics`	Cost analytics
`/app/comparison`	Model comparison
`/app/upload`	Document upload
`/app/latency`	Latency dashboard
`/app/cost-optimization`	Cost optimization
`/app/admin`	Admin panel

Project Structure

cost-aware-agentic-rag/
├── api/
│   ├── main.py              # FastAPI app + all routes
│   └── models.py            # Pydantic request/response models
├── src/
│   ├── config.py            # pydantic-settings config (all paths, models, etc.)
│   ├── agents/
│   │   ├── graph.py         # LangGraph state graph + orchestrator
│   │   ├── memory.py        # Conversation memory (multi-turn)
│   │   └── guardrails.py    # Input/output guardrails
│   ├── retrieval/
│   │   ├── vector_store.py  # ChromaDB + bge-small-en embeddings
│   │   ├── bm25_index.py    # BM25 with stemming + stopwords
│   │   ├── fusion.py        # RRF + weighted score fusion
│   │   ├── hybrid.py        # HybridRetriever (vector + BM25 + RRF + rerank)
│   │   └── reranker.py      # Cross-encoder reranking
│   ├── generation/
│   │   ├── llm_client.py    # Ollama Cloud client + cost estimation
│   │   ├── cost_tracker.py  # Per-query cost tracking
│   │   └── prompts.py       # Prompt templates
│   ├── ml/
│   │   ├── routing.py       # TF-IDF + LogisticRegression classifier
│   │   ├── query_processor.py # Rewriting, HyDE, multi-query expansion
│   │   ├── cost_analytics.py # Model/routing cost analytics
│   │   ├── feedback.py      # User feedback storage
│   │   ├── suggestions.py   # Query suggestion engine
│   │   ├── anomaly.py       # Anomaly detection
│   │   └── export.py        # PDF/CSV export
│   ├── knowledge/
│   │   └── graph.py         # NetworkX knowledge graph + entity extraction
│   ├── multimodal/
│   │   ├── vision.py        # Image understanding (gemma3:27b)
│   │   ├── tables.py        # Table extraction from text
│   │   └── images.py        # Image processing
│   ├── eval/
│   │   ├── golden_set.py    # Golden Q&A set (55+ pairs)
│   │   ├── evaluator.py     # Evaluation runner
│   │   ├── llm_judge.py     # LLM-as-Judge (faithfulness, relevancy, etc.)
│   │   ├── retrieval_metrics.py # NDCG, MRR, Recall, Precision, Hit Rate
│   │   ├── pipeline.py      # EvalPipeline + CI Gating
│   │   └── ragas_eval.py    # RAGAS integration
│   ├── ingestion/
│   │   ├── downloader.py    # SEC EDGAR document downloader
│   │   ├── parser.py        # Document parser
│   │   ├── chunker.py       # Text chunking
│   │   ├── pipeline.py      # Ingestion pipeline
│   │   └── upload_handler.py # Upload processing
│   ├── database/
│   │   ├── admin_auth.py    # bcrypt auth, file-based sessions
│   │   ├── cache.py         # Redis caching layer
│   │   └── models.py        # SQLAlchemy models
│   ├── observability/
│   │   └── langfuse.py      # Langfuse integration
│   └── tasks/
│       └── celery_app.py    # Celery task queue
├── web/
│   ├── templates/           # Jinja2 HTML (9 pages)
│   │   ├── index.html       # Landing
│   │   ├── app.html         # Query dashboard
│   │   ├── documents.html   # Document browser
│   │   ├── analytics.html   # Cost analytics
│   │   ├── comparison.html  # Model comparison
│   │   ├── upload.html      # Upload page
│   │   ├── latency.html     # Latency dashboard
│   │   ├── cost_optimization.html # Cost optimization
│   │   └── admin.html       # Admin panel
│   └── static/              # CSS + JS
├── frontend-archived/        # Next.js frontend (archived)
├── dashboard-archived/       # Streamlit dashboard (archived)
├── scripts/
│   ├── ingest.py            # Data ingestion CLI
│   ├── evaluate.py          # Evaluation runner
│   ├── evaluate_ml.py       # ML evaluation
│   ├── eval_llm_judge.py    # LLM judge evaluation
│   ├── eval_ragas.py        # RAGAS evaluation
│   └── create_samples.py    # Sample data creation
├── tests/
│   ├── conftest.py          # Test fixtures
│   ├── test_comprehensive.py
│   └── test_config.py
├── data/
│   ├── raw/                 # Raw SEC filings (per ticker/year)
│   ├── processed/           # Parsed + chunked data
│   ├── indexes/             # ChromaDB + BM25 indices
│   └── eval/                # Golden set + eval results
├── docker-compose.yml       # api + redis
├── Dockerfile
├── pyproject.toml
└── requirements.txt

Architecture Decision Records

Why ChromaDB over Pinecone/Weaviate?

ChromaDB runs fully embedded with zero infrastructure. PersistentClient gives us disk-backed storage with HNSW indexing — no server process needed. For a single-tenant RAG system analyzing SEC filings, the scaling limits of ChromaDB are irrelevant, and the operational simplicity is a major win. The cosine similarity search with metadata filtering covers our exact use case (filter by ticker + year).

Why Ollama Cloud over OpenAI/Anthropic?

Ollama Cloud provides access to open-weight models (gemma3:4b, gemma3:27b) with a unified API. The gemma3:4b model handles 80%+ of queries (simple factual lookups) at a fraction of the cost, while gemma3:27b provides strong reasoning and vision capabilities for complex multi-hop analysis. The Ollama client library (ollama pip package) is thin and well-maintained.
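
The cost argument can be made concrete with a little arithmetic. The sketch below uses made-up rates in arbitrary cost units, not real Ollama pricing, and the 80% share is the lower bound quoted above; `query_cost` and `blended_cost` are hypothetical helper names.

```python
def query_cost(prompt_tokens, completion_tokens, input_rate_per_m, output_rate_per_m):
    """Cost of one call given per-million-token rates."""
    return (prompt_tokens * input_rate_per_m + completion_tokens * output_rate_per_m) / 1_000_000


def blended_cost(simple_share, simple_cost, complex_cost):
    """Expected per-query cost when a share of traffic goes to the small model."""
    return simple_share * simple_cost + (1 - simple_share) * complex_cost


# PLACEHOLDER rates in arbitrary units per million tokens (not real pricing).
small = query_cost(2000, 500, input_rate_per_m=0.10, output_rate_per_m=0.20)
large = query_cost(2000, 500, input_rate_per_m=1.00, output_rate_per_m=2.00)

routed = blended_cost(0.80, small, large)
savings = 1 - routed / large  # fraction saved versus sending everything to the large model
print(f"{savings:.0%}")  # 72%
```

Under these placeholder rates, routing 80% of queries to the small model cuts the expected per-query cost by roughly 72% compared with using the large model for everything. The real figure depends on actual prices and token counts.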

Why RRF over Weighted Score Fusion?

Reciprocal Rank Fusion operates on rank positions, not raw scores. This is critical when fusing heterogeneous retrieval methods (cosine similarity from vector DB vs. BM25 log-odds scores) whose score distributions are not comparable. RRF with k=60 is the standard from the original RRF paper and requires no weight tuning — a significant advantage over weighted fusion which needs per-dataset calibration.

Why BGE-small-en-v1.5 over OpenAI Ada/large models?

BAAI/bge-small-en-v1.5 produces 384-dimensional embeddings locally with no API calls. At 33M parameters it loads in under 2 seconds and encodes batches in milliseconds. On MTEB benchmarks it punches well above its size class. For SEC 10-K retrieval where the query-document semantic gap is narrow (financial terminology is consistent), the quality difference vs. larger models is negligible, while the latency and cost savings are substantial.

Why TF-IDF + LogisticRegression for Routing?

A trained classifier on labeled query data (TRAINING_DATA in src/ml/routing.py) provides fast, deterministic routing with interpretable confidence scores. TF-IDF features capture the lexical patterns that distinguish simple lookups ("What is X's revenue?") from complex analysis ("Compare X and Y's risk factors across 3 years"). When confidence drops below 0.6, the system falls back to LLM-based classification — a hybrid approach that balances speed and accuracy.

Why LangGraph over Raw LLM Calls?

LangGraph provides explicit state management, conditional edges, and checkpointing for the agent loop. The planner → tools → generator → reflector graph with a retry edge from reflector back to tools is natural to express as a state graph. MemorySaver checkpointing enables conversation continuity. The alternative (manually chaining LLM calls with if/else logic) would be harder to debug, extend, and observe.

Evaluation Results

Run the full evaluation suite:

# LLM-as-Judge evaluation
python scripts/eval_llm_judge.py

# Retrieval metrics
python scripts/evaluate_ml.py

# RAGAS evaluation
python scripts/eval_ragas.py

LLM-as-Judge Metrics (`minimax-m3:cloud` judge, 55 samples)

Metric	Score
Faithfulness	0.596
Answer Relevancy	0.918
Context Precision	0.975
Context Recall	1.000
Overall (weighted)	0.849

Retrieval Metrics

Metric	Score
NDCG@10	0.710
MRR	0.611
Hit Rate	1.000

CI Gating Thresholds

Metric	Threshold
Overall Score	≥ 0.6
Retrieval Precision	≥ 0.5
Retrieval Recall	≥ 0.5
Answer Faithfulness	≥ 0.6
Answer Relevance	≥ 0.5
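
A gate over these thresholds might be checked as in the sketch below. The metric keys and the `check_gate` name are hypothetical, and the example scores are the LLM-as-Judge numbers reported in this README under an assumed mapping (context precision/recall standing in for retrieval precision/recall). The pipeline in src/eval/pipeline.py may map metrics differently.

```python
# Thresholds from the table above; the key names are assumptions.
THRESHOLDS = {
    "overall_score": 0.6,
    "retrieval_precision": 0.5,
    "retrieval_recall": 0.5,
    "answer_faithfulness": 0.6,
    "answer_relevance": 0.5,
}


def check_gate(scores, thresholds=THRESHOLDS):
    """Return (passed, failures) for a dict of metric scores."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.3f}")
    return not failures, failures


# Reported judge scores, mapped onto the gate's metric names (assumed mapping).
reported = {
    "overall_score": 0.849,
    "retrieval_precision": 0.975,
    "retrieval_recall": 1.000,
    "answer_faithfulness": 0.596,
    "answer_relevance": 0.918,
}
passed, failures = check_gate(reported)
print(passed, failures)  # False ['answer_faithfulness: 0.596 < 0.600']
```

Under this mapping, the reported faithfulness of 0.596 sits just below the 0.6 threshold, so the gate would fail on that metric alone. A CI job would typically turn a failed gate into a non-zero exit code.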

Example Queries

# Simple factual lookup (routes to gemma3:4b)
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What was Microsoft'\''s total revenue in 2024?"}'

# Complex comparison (routes to gemma3:27b)
curl -X POST http://localhost:8001/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Compare Microsoft and Amazon revenue growth over the last 3 years"}'

# Streaming response
curl -X POST http://localhost:8001/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "What are Tesla'\''s main risk factors?"}'
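
The same non-streaming request can be issued from Python with only the standard library. This is a minimal sketch: `build_query_request` and `ask` are hypothetical helpers, the response schema is not documented here so the decoded JSON is returned as-is, and the server is assumed to be running locally on port 8001.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"


def build_query_request(query, base_url=BASE_URL):
    """Build the POST /query request used in the curl examples above."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, base_url=BASE_URL, timeout=60.0):
    """Send the query and return the decoded JSON response."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("What was Microsoft's total revenue in 2024?")` against a running server should return the same payload as the first curl example.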

Known Limits

Cost model is approximate: token costs use per-million-token rates from public Ollama pricing. Actual costs may vary by deployment.
Upload status is in-memory: a server restart loses pending upload status. A production deployment would use a persistent queue.
File-based auth: admin users and sessions are stored in JSON files. Suitable for a demo, not for enterprise multi-tenant use.
Single-process retrieval: no distributed search or horizontal scaling. ChromaDB and BM25 are local.
Eval harness is offline: evaluation runs only against the curated golden set. Online eval with production traffic is not yet implemented.
Agent loop is bounded: at most 2 reflection iterations. No human-in-the-loop approval or tool budget enforcement.
No persistent deployment: Docker images are built locally. No cloud deployment, load balancer, or auto-scaling.

License

MIT
