A retrieval + reasoning system over City of Reno council memos (77 memos, January–April 2026). It is built to find patterns across memos — connections, tensions, and timelines — not to summarize individual ones. Every answer is traceable to the memos it cites.
The repo contains two things:
- The base system — a research-grounded RAG over the curated CityThread spreadsheet, with a FastAPI + React UI. (This is the system labelled A-research in the evaluation.)
- An evaluation harness (eval/): a blinded, expert-judged comparison of four systems across two experiments, testing whether grounding a RAG system in published research actually changes output quality.
See ARCHITECTURE.md for the base design, CLAUDE.md for the build constraints,
and the doc map at the bottom for everything else.

| Layer | Component | Built by |
|---|---|---|
| Knowledge | reno.db (SQLite + FTS5) | ingest.py |
| Retrieval | FTS5 keyword + graph walk + direct lookup (threads/flags/metrics) | query.py |
| Reasoning | Claude cross-document synthesis | query.py |
| Interface | FastAPI JSON API + React SPA | api.py, web/ |
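
To make the retrieval layer concrete, here is a minimal sketch of an FTS5 keyword search followed by a one-hop graph walk. The table names (memos_fts, edges), their columns, and the sample rows are assumptions made for illustration; they are not taken from reno.db or query.py, and the sketch needs a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema (not reno.db's)."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE VIRTUAL TABLE memos_fts USING fts5(memo_id UNINDEXED, body);
        CREATE TABLE edges (src TEXT, dst TEXT, relation TEXT);
        """
    )
    con.executemany(
        "INSERT INTO memos_fts (memo_id, body) VALUES (?, ?)",
        [
            ("M01", "DMV holds delay citation collection"),
            ("M02", "Parking revenue is below forecast"),
            ("M03", "Street tree planting schedule"),
        ],
    )
    con.execute("INSERT INTO edges VALUES ('M01', 'M02', 'affects')")
    con.commit()
    return con


def keyword_hits(con: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Return memo ids matching an FTS5 query, best match first."""
    rows = con.execute(
        "SELECT memo_id FROM memos_fts WHERE memos_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


def one_hop(con: sqlite3.Connection, seeds: list) -> list:
    """Expand seed memos by one edge in either direction; seeds stay first."""
    found = list(seeds)
    for seed in seeds:
        rows = con.execute(
            "SELECT dst FROM edges WHERE src = ? "
            "UNION SELECT src FROM edges WHERE dst = ?",
            (seed, seed),
        )
        for (neighbor,) in rows:
            if neighbor not in found:
                found.append(neighbor)
    return found


con = build_demo_db()
seeds = keyword_hits(con, "DMV")
print(one_hop(con, seeds))
```

The walk pulls in memos that share no keywords with the query but are linked to a hit, which is what lets a keyword search surface cross-memo connections.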
python3 -m venv .venv
./.venv/bin/pip install -r requirements.txt # includes the eval deps too
./.venv/bin/python ingest.py # build reno.db from the workbook
cd web && npm install && cd .. # frontend deps

Two processes. The API uses port 8077 (5000/8000 are often taken on macOS); the frontend uses 5180.
ANTHROPIC_API_KEY=sk-... ./.venv/bin/uvicorn api:app --reload --port 8077
cd web && npm run dev # separate terminal

Open http://localhost:5180. Vite proxies /api to the backend. Without a key,
direct lookups still work and queries still retrieve + cite — only the written
synthesis is disabled.
./.venv/bin/python query.py "how are DMV holds affecting parking revenue?"
./.venv/bin/python query.py --no-llm "what are the high priority flags?"

- flags question -> flags table (by priority)
- metric question -> metric series (when the query names one + asks quantitatively)
- issue-thread named -> thread + its memos
- otherwise -> FTS5 keyword search -> one-hop graph walk -> Claude synthesis

Direct-lookup routes never call the LLM. Reasoning is reserved for genuine cross-document questions.
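
The routing order above can be sketched as a small classifier. This is an illustration only: the metric and thread vocabularies, the quantitative cue list, and the function names are hypothetical, and the actual rules in query.py may differ.

```python
import re

# Hypothetical vocabularies; a real system would load these from its database.
KNOWN_METRICS = {"parking revenue", "citation count"}
KNOWN_THREADS = {"short-term rentals", "downtown parking garage"}
# Hypothetical cues that a question is asking for numbers.
QUANT_CUES = re.compile(r"\b(how much|how many|trend|total|over time)\b")


def route(question: str) -> str:
    """Pick a route, checking the direct lookups before falling back to reasoning."""
    q = question.lower()
    if re.search(r"\bflags?\b", q):
        return "flags"
    if any(metric in q for metric in KNOWN_METRICS) and QUANT_CUES.search(q):
        return "metric"
    if any(thread in q for thread in KNOWN_THREADS):
        return "thread"
    return "reasoning"


def calls_llm(route_name: str) -> bool:
    """Only the reasoning route reaches the LLM; direct lookups never do."""
    return route_name == "reasoning"


for question in (
    "what are the high priority flags?",
    "how are DMV holds affecting parking revenue?",
):
    chosen = route(question)
    print(question, "->", chosen, "| LLM:", calls_llm(chosen))
```

Checking the cheap, deterministic routes first keeps lookups fast and reproducible, and leaves synthesis for questions that span memos.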
A side-by-side, blinded, expert-judged comparison. There is no automated scoring and no LLM-as-judge anywhere — the expert is the only evaluator.
| System | Experiment | Data source | Grounding |
|---|---|---|---|
| A-research | A | spreadsheet (reno.db) | full (CLAUDE.md + ARCHITECTURE.md) |
| A-vanilla | A | spreadsheet rows, flattened | none (control) |
| B-research | B | raw council PDFs | full extraction pipeline |
| B-vanilla | B | raw PDF text | none (control) |
- Experiment A isolates grounding's effect on retrieval + reasoning (both systems start from clean, curated data).
- Experiment B tests grounding on the whole pipeline including extraction (both start from raw PDFs). B-research builds its own structured knowledge base from the PDFs; B-vanilla just chunks their text.
Within each pair, only research grounding varies — model, generation params
(claude-sonnet-4-6, max_tokens 2048, temperature 0), the prompt text, and the
document set are held constant.
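
One way to guard that invariant is a check that fails if the two systems in a pair differ on any shared generation parameter. The config layout below is hypothetical (the real eval/systems.config.json may be shaped differently); only the model name and parameter values come from this README.

```python
# Parameters that must match within each experiment pair.
SHARED_KEYS = ("model", "max_tokens", "temperature")

# Values stated in this README; the surrounding dict layout is an assumption.
BASE = {"model": "claude-sonnet-4-6", "max_tokens": 2048, "temperature": 0}

SYSTEMS = {
    "A-research": {"experiment": "A", **BASE},
    "A-vanilla": {"experiment": "A", **BASE},
    "B-research": {"experiment": "B", **BASE},
    "B-vanilla": {"experiment": "B", **BASE},
}


def check_pairs(systems: dict) -> None:
    """Raise ValueError if systems in the same experiment differ on a shared key."""
    by_experiment = {}
    for name, cfg in systems.items():
        by_experiment.setdefault(cfg["experiment"], []).append(name)
    for experiment, names in by_experiment.items():
        reference = systems[names[0]]
        for name in names[1:]:
            for key in SHARED_KEYS:
                if systems[name][key] != reference[key]:
                    raise ValueError(
                        f"Experiment {experiment}: {name} differs from "
                        f"{names[0]} on {key!r}"
                    )


check_pairs(SYSTEMS)  # passes silently when every pair is aligned
```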
- eval/prompts.json: the locked 15-prompt set. Each prompt's reference_key is internal-only; it is never sent to a system or shown to the expert.
- eval/systems.config.json: per-system config (model, chunk size, top-k, embedding model, generic prompt, Experiment B ingestion filters).
- eval/run-record.schema.json: the shape of each append-only run record.
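
Keeping reference_key internal can be enforced at a single choke point before a prompt leaves the harness. The sketch below assumes a prompt is a flat dict; the id and text field names and the sample values are hypothetical, and only reference_key is named in this README.

```python
# Fields that must never reach a system under test or the expert reviewer.
INTERNAL_FIELDS = frozenset({"reference_key"})


def public_prompt(prompt: dict) -> dict:
    """Return a copy of the prompt with internal-only fields removed."""
    return {k: v for k, v in prompt.items() if k not in INTERNAL_FIELDS}


# Hypothetical prompt record; the real eval/prompts.json layout may differ.
example = {
    "id": "P01",
    "text": "how are DMV holds affecting parking revenue?",
    "reference_key": "internal notes for the harness only",
}

print(public_prompt(example))
```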
# 1. Resolve the Experiment B document set (live PDFs, filtered + fetched + logged)
./.venv/bin/python eval/experiment_b_ingest.py # --dry-run to classify without fetching
# 2. Build indexes / knowledge bases
./.venv/bin/python eval/vanilla_rag.py --build A-vanilla
./.venv/bin/python eval/vanilla_rag.py --build B-vanilla
./.venv/bin/python eval/experiment_b_research.py --build # cached; --rebuild to force
# (A-research uses the existing reno.db via query.py — nothing to build)
# 3. Run every prompt through every system (append-only -> eval/runs/)
./.venv/bin/python eval/runner.py # or --prompt P01 / --system A-vanilla

cd eval && ../.venv/bin/uvicorn review_api:app --reload --port 8088 # backend
cd eval/review && npm install && npm run dev # frontend (separate terminal): http://localhost:5190

- Pick a prompt, toggle Experiment A / B, judge the two blinded outputs, Save.
- Left/right order is randomized once and persisted (eval/review/blind_map.json).
- Experiment B also shows the structured data each system built (entities / relationships / fields).
- Unblind is a separate, confirm-gated action; system identity is never shown by default.
- Answers save to eval/review/responses/{prompt}__{experiment}.json with the blind mapping attached.
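
The "randomized once and persisted" behavior can be sketched as follows. This is not the review app's code: the function name, the JSON layout, and the key format (borrowed from the response filename pattern) are assumptions.

```python
import json
import random
from pathlib import Path


def blind_order(path: Path, prompt_id: str, experiment: str, systems: list) -> dict:
    """Return the persisted left/right assignment, creating it on first use."""
    mapping = json.loads(path.read_text()) if path.exists() else {}
    key = f"{prompt_id}__{experiment}"  # hypothetical key format
    if key not in mapping:
        # Shuffle only when no assignment exists, so reloads never reshuffle.
        left, right = random.sample(systems, 2)
        mapping[key] = {"left": left, "right": right}
        path.write_text(json.dumps(mapping, indent=2))
    return mapping[key]
```

Calling blind_order twice with the same prompt and experiment returns the same assignment, so the expert sees a stable layout across sessions while the order still varies between prompts.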
- All four systems generate with the shared params above; the vanilla pair uses a local sentence-transformers/all-MiniLM-L6-v2 embedder (no API key needed).
- The two B systems consume the byte-identical document set; the runner asserts a single shared document_set_hash before any B run.
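
A document-set hash of this kind can be computed by digesting every file's name and bytes in a fixed order. The sketch below is one plausible construction, not the runner's actual one: the SHA-256 choice, the function names, and the input shape are assumptions.

```python
import hashlib


def document_set_hash(documents: dict) -> str:
    """Hash a {filename: bytes} mapping; the result ignores insertion order."""
    digest = hashlib.sha256()
    for name in sorted(documents):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so a name cannot blend into content
        digest.update(hashlib.sha256(documents[name]).digest())
    return digest.hexdigest()


def assert_shared_document_set(hashes_by_system: dict) -> str:
    """Return the single shared hash, or fail if the systems disagree."""
    unique = set(hashes_by_system.values())
    if len(unique) != 1:
        raise AssertionError(f"document sets differ: {hashes_by_system}")
    return unique.pop()


docs = {"memo_a.pdf": b"%PDF-1.7 a", "memo_b.pdf": b"%PDF-1.7 b"}
shared = assert_shared_document_set(
    {"B-research": document_set_hash(docs), "B-vanilla": document_set_hash(docs)}
)
print(shared[:12])
```

Hashing bytes rather than filenames alone means a re-fetched PDF that changed upstream is caught even if its name did not.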
This system is not a summarizer of single memos, a keyword search box, or a source of claims about the world beyond what the memos say. It distinguishes what the memos say (faithful) from what is true (true) and labels uncertainty. The evaluation is a qualitative demonstration over 15 prompts, not statistical proof.

| File | What it covers |
|---|---|
| README.md | this file: overview + how to run |
| ARCHITECTURE.md | base-system design and build spec |
| CLAUDE.md | build constraints, relation vocab, failure modes |
| EVALUATION_PLAN.md | the experiment design and honesty guardrails |
| EVALUATION_PROMPTS.md | the 15 prompts and their (internal) reference keys |
| EXPERIMENT_BUILD_SUMMARY.md | what was built, step by step, and the decisions made |
| TECHNICAL_DEEP_DIVE.md | code-level choices and differences across the four systems |
| eval/README.md | the harness quick-reference |