Skip to content

Latest commit

 

History

History
285 lines (201 loc) · 17.4 KB

File metadata and controls

285 lines (201 loc) · 17.4 KB

Search Mode Evaluation Methodology

How v0.32.3 measures the difference between conservative, balanced, and tokenmax. Written haters-immune: every claim is reproducible from the committed dataset + raw outputs.

1. What this measures and what it doesn't

Measures: retrieval quality and operational cost on fixed public datasets, under each named search mode, against the same brain content.

Does NOT measure:

  • Your specific brain content (this is a benchmark, not your bill).
  • Your specific query distribution.
  • End-user satisfaction or downstream task success.
  • Latency under concurrent load.
  • Production cost (the cost numbers are model-pricing estimates × dataset size, not your actual API spend).

If you want to know how a mode behaves on YOUR brain, run gbrain search stats --days 30 after a real usage window, then run gbrain search tune for actionable recommendations.

2. Datasets and sizes

  • LongMemEval — public split, n=500 questions. Downloaded from Hugging Face. The corpus + answer keys are pinned to a specific commit; recorded in every per-run record.
  • Replay captures — NDJSON from the sibling gbrain-evals repo, n=200 queries. Each query carries a retrieved_slugs baseline + a latency_ms measurement from the original production run.
  • BrainBench v1n=1240 documents / n=350 qrels (binary relevance judgments). Lives in the sibling gbrain-evals repo, SHA-pinned at every run.

No private brain content is used in any reported result. The committed NDJSON dumps under <repo>/.gbrain-evals/ contain only the LongMemEval question IDs + the rank-ordered retrieved session IDs.

3. Sample selection

  • Random seed: 42 throughout. Set via --seed N on gbrain eval run-all; recorded in every per-run record.
  • No per-question curation. Splits are taken whole; no question is filtered for reporting.
  • No mode-specific tuning. The same dataset + same seed feeds every mode. The mode is the only independent variable.
  • Stability across re-runs: with --seed 42 and the same dataset SHA, two runs of the same (mode, suite) produce identical retrieval orderings (modulo the optional Haiku expansion call, which is non-deterministic). Persisted in eval_results so anyone can re-score from the committed dumps.

4. Run procedure

The command is the doc. Anyone can reproduce.

# Setup: in your gbrain working tree, with OPENAI_API_KEY + ANTHROPIC_API_KEY exported.
git rev-parse HEAD  # record the commit for the methodology footer

# Sweep all 3 modes × 2 retrieval-focused suites with seed 42.
gbrain eval run-all \
  --modes conservative,balanced,tokenmax \
  --suites longmemeval,replay \
  --seed 42 \
  --limit 500 \
  --budget-usd-retrieval 5 \
  --budget-usd-answer 20 \
  --output docs/eval/results/v0.32.3/

# Render the comparison.
gbrain eval compare --md > docs/eval/results/v0.32.3/README.md
gbrain eval compare --json > docs/eval/results/v0.32.3/comparison.json

The orchestrator writes per-run records to <repo>/.gbrain-evals/eval-results.jsonl. Every record carries: run_id, ran_at, suite, mode, commit, seed, limit, params, status, duration_ms. The dumps under docs/eval/results/v0.32.3/ carry the raw question-level outputs so a reviewer can re-score with their own metric implementation.

5. Threats to validity

Honest list. We name what would let a critic dismiss the numbers.

  • LongMemEval skews English + technical. The questions are software-engineering and consumer-product flavored. Performance on a brain rich in non-English / non-technical content (writing, art history, etc.) may differ.
  • BrainBench is small (1240 docs) relative to a production brain (10K-100K pages). Absolute scores aren't predictive of your hit rate; the delta between modes is.
  • char/4 token heuristic. Token-budget enforcement and cost estimates use a character-count / 4 heuristic. Accurate within ~5-10% for English with the OpenAI tiktoken family; off worse for Voyage (we don't use Voyage in chat retrieval, so it doesn't bias the reported numbers, but if you do, your budget caps will be approximate).
  • Expansion's quality lift varies by query distribution. The eval data shows ~97.6% relative quality with LLM expansion vs without (i.e., barely measurable lift) on the LongMemEval corpus. On rarer-entity / longer-tail queries, the lift can be larger. We report the corpus we measured; YMMV.
  • Paired bootstrap assumes question-level independence. Multi-hop questions within the same conversation thread aren't independent; the bootstrap CI is slightly tighter than reality.
  • Single brain instance per benchmark. The benchmark spins up an in-memory PGLite per question. Cache hit rate measured here doesn't reflect a long-running production brain's cache state.

6. Per-question raw outputs

Every reported metric is reproducible from the NDJSON dumps committed at docs/eval/results/v0.32.3/. The commit SHA in the methodology footer pins the code version.

Examples per mode: the auto-generated README.md next to the dumps includes both winning and losing examples per mode, chosen by the deterministic rule:

  • Wins: the 3 questions where this mode's score exceeded the next-best mode by the largest margin.
  • Losses: the 3 questions where this mode's score fell short of the next-best mode by the largest margin.

Picked by the score delta, NOT cherry-picked by hand. The README documents the rule so a critic can verify.

7. Pre-registered expectations

Before running, we expect:

  1. tokenmax wins Recall@10 by 5-15 percentage points over conservative. LLM expansion + 50-result ceiling helps rare-entity surface forms.
  2. conservative wins cost-per-query by 5-15× over tokenmax. No Haiku expansion + tight 4K budget cap = single-digit-cent queries.
  3. balanced lands within 3pp of tokenmax on Recall@10. Intent weighting (zero-LLM cost) closes most of the expansion gap on common queries.
  4. No mode breaks nDCG@10 ≥ 0.65 — the published "ship it" threshold for hybrid retrieval on technical corpora.

Then we publish whether the data agrees. If a hypothesis fails, that's documented honestly in the release README, not buried. Pre-registration is what makes the comparison defensible — without it, a "we expected X and got X" outcome is observation, not prediction.

8. Re-run cadence

This document + the eval results are regenerated on every release that touches retrieval-affecting code. The gbrain doctor eval_drift check surfaces changes to the curated watch-list in src/core/eval/drift-watch.ts:

  • src/core/search/**
  • src/core/embedding.ts
  • src/core/chunkers/**
  • src/core/ai/recipes/anthropic.ts
  • src/core/ai/recipes/openai.ts
  • src/core/operations.ts

Additions to the watch-list require a CHANGELOG line.

Statistical-significance discipline

When gbrain eval compare --md reports a Δ between two modes, it computes:

  • Paired bootstrap with 10,000 resamples per metric. Each resample draws question-level pairs (same question, mode A vs mode B), so question-level variance is differenced out.
  • Bonferroni correction across the 12 comparisons (3 modes × 4 metrics). The reported p-value is the comparison's raw p-value × 12 (clamped at 1.0).
  • 95% confidence intervals computed from the bootstrap distribution.

If the CI for a Δ includes 0 OR the Bonferroni-adjusted p-value exceeds 0.05, the difference is not statistically significant. The MD report says "not significant" verbatim.

Glossary

Every metric the report prints has a plain-English entry in docs/eval/METRIC_GLOSSARY.md, auto-generated from src/core/eval/metric-glossary.ts. The CI guard at scripts/check-eval-glossary-fresh.sh regenerates and diffs against the committed file on every test run; a stale doc fails the build.

Cost anchors

The mode-picker prompt at gbrain init and the CLAUDE.md ## Search Mode table both surface these rough cost anchors. Working through the math so they're auditable:

Variables:

  • T = avg tokens per search-result chunk. The recursive chunker targets 300 words / chunk → ~400 tokens (English, OpenAI tiktoken approx).
  • N = chunks delivered per query (capped by the mode's searchLimit).
  • R = downstream model input rate. Sonnet 4.6 = $3/M. Opus 4.7 = $5/M. Haiku 4.5 = $1/M.
  • Q = queries per month.

Per-query input cost (downstream agent reads the chunks):

cost_per_query = T × N × R
Mode T (tokens) N (chunks) Sonnet ($3/M) Opus ($5/M) Haiku ($1/M)
conservative (4K cap, 10 max) ~400 10 (or fewer if budget hits) $0.012 $0.020 $0.004
balanced (12K cap, 25 max) ~400 ~25 $0.030 $0.050 $0.010
tokenmax (no cap, 50 max) ~400 ~50 $0.060 $0.100 $0.020

Monthly cost (Q × per-query):

Mode @ Sonnet 1K Q/mo 10K Q/mo 100K Q/mo
conservative $12 $120 $1,200
balanced $30 $300 $3,000
tokenmax $60 $600 $6,000
Mode @ Opus 1K Q/mo 10K Q/mo 100K Q/mo
conservative $20 $200 $2,000
balanced $50 $500 $5,000
tokenmax $100 $1,000 $10,000

gbrain's own cost on top:

  • Query embedding (text-embedding-3-large @ $0.13/M tokens): ~$0.00001 per query. Negligible at every scale.
  • Tokenmax Haiku expansion call ($1/M input, $5/M output, ~500 input + 200 output per call): ~$0.0015 per query, or $150/mo at 100K queries. Cache hits cut this in half.
  • Per-page indexing (one-time): bounded by your import volume, not query volume. Not modeled here.

Cache hit adjustment. A warmed brain typically sees 30-50% cache hits on repeat-query traffic. Cache hits skip the downstream input cost entirely (the cached result was already in the agent's context once). So real-world costs run ~50-70% of the table above on a busy brain.

Why these numbers DRIFT from your actual bill:

  • Your agent's system prompt + reasoning tokens add input that gbrain doesn't see.
  • Compaction reduces input over a long session.
  • Most agents make 1-5 searches per turn; cost-per-turn is what bills you, not cost-per-query.
  • The model price column drifts as providers reprice; pin the rate via src/core/model-pricing.ts (the canonical chat-pricing table) for a current snapshot.

The picker copy + CLAUDE.md table are the canonical user-facing source. Update them in lockstep when the underlying chunker size or default searchLimit changes.

Mode × Model matrix (the 25x spread)

The per-query math above assumes Sonnet 4.6 downstream. In reality, the downstream model tier is the BIGGER cost lever. Per-query cost at 10K queries/month (typical single-user volume), search payload only (no cache savings):

Mode (search tokens) Haiku 4.5 ($1/M) Sonnet 4.6 ($3/M) Opus 4.7 ($5/M)
conservative (~4K) $40/mo $120/mo $200/mo
balanced (~10K) $100/mo $300/mo $500/mo
tokenmax (~20K) $200/mo $600/mo $1,000/mo

Scales linearly: multiply by 10 for 100K/mo (heavy power user / multi-user fleet); divide by 10 for 1K/mo (light usage).

Natural pairings span ~4x (cheap model + tight mode → frontier model + loose mode). Mismatches waste capacity:

  • tokenmax + Haiku: Haiku gets 20K of search results stuffed into its context per query. Haiku's reasoning is weaker; more chunks = more noise, not more signal. You pay Haiku rates but get sub-Haiku quality. Wrong direction.
  • conservative + Opus: Opus has 200K context window and can synthesize across many chunks. Capping at 10 chunks / 4K tokens leaves Opus reasoning underfed. You pay Opus rates but get conservative-shape retrieval. Wasted spend.

Right-sizing rule: match the mode's searchLimit to the downstream model's "useful context depth":

  • Haiku struggles past ~5-10 chunks of cross-referenced content → conservative
  • Sonnet handles ~25-40 chunks well → balanced
  • Opus benefits from 50+ chunks for multi-hop reasoning → tokenmax

Realistic-scale anchor (single power-user agent loop)

The per-query math above is honest but theoretical: it treats each search as an isolated billable event. Real agent loops amortize a lot of context across turns via Anthropic prompt caching. Here's what one heavy power-user loop actually looks like in production, anonymized + scaled so the numbers represent a representative power user rather than any specific deployment.

Reference shape — tokenmax in production at a single-user scale:

Quantity Approximate value
30-day total agent spend ~$700/mo
30-day total tokens billed ~800M
Turns per month ~860 (~29/day; one active agent loop)
Average tokens per turn ~900K
Average cost per turn ~$0.85
Anthropic prompt-cache hit rate ~88%

A "turn" here is one agent loop iteration: read user message, plan, execute tool calls (including gbrain searches), generate response. Each turn typically includes 2-4 gbrain searches.

Per-mode scaling from the tokenmax anchor:

The cost difference between modes is concentrated in the search-attributable fraction of per-turn cost. System prompt, tool definitions, conversation history, and reasoning tokens don't change with mode — only the chunks gbrain delivers do. Assume 3 searches per turn at the mode's searchLimit:

Mode Search tokens/turn Search cost/turn (at $3/M effective) Search-attributable @ 860 turns Δ vs tokenmax
tokenmax ~60K (3 × 20K) ~$0.18 ~$155/mo
balanced ~30K (3 × 10K) ~$0.09 ~$77/mo -$78
conservative ~12K (3 × 4K) ~$0.036 ~$31/mo -$124

Implied total agent spend by NATURAL PAIRING (mode + matched downstream model). Per-turn cost scales with the downstream model's per-token rate, since the cached prefix + uncached portion + reasoning tokens all bill at that rate:

Pairing Per-turn cost Total @ 860 turns/mo
tokenmax + Opus (frontier, max quality) ~$0.85 ~$700/mo
balanced + Sonnet (the sweet spot) ~$0.50 ~$430/mo
conservative + Haiku (cost-sensitive) ~$0.20 ~$170/mo

4x spread across natural pairings. The model tier dominates because the per-token rate applies to the WHOLE per-turn payload (system + tools

  • history + reasoning + search), not just gbrain's chunks. Mode choice contributes ~10-20% on top of that base.

Mismatched pairings push you off the curve:

Pairing Per-turn estimate Total @ 860 turns/mo Compared to natural
tokenmax + Haiku ~$0.20 ~$170/mo Same cost as conservative+Haiku, worse quality
conservative + Opus ~$0.75 ~$640/mo 92% of tokenmax+Opus spend, conservative-shape retrieval

The mismatch math says: a tokenmax+Haiku user pays the same as conservative+Haiku but gets a noisier context (Haiku can't filter signal from 50 chunks). A conservative+Opus user pays nearly the same as tokenmax+Opus but starves Opus on retrieval depth. Both burn budget for no improvement.

What this anchor tells us that the per-query math doesn't:

  1. At realistic agent-loop scale with disciplined prompt caching, mode choice saves 10-20% of total agent spend — meaningful, but smaller than the per-query 5x ratio implies. Disciplined prompt-cache layouts blunt the mode delta because most of the per-turn cost is the cached prefix, not the search payload.

  2. Without that prompt-cache discipline, the per-query framing reasserts itself. Setups that churn the prompt prefix on every turn (frequent system-prompt edits, untemplated tool defs, no prompt-cache structuring) see search payload contribute a much larger fraction of total cost. Those setups should care about mode choice more, not less.

  3. The cache hit rate quoted here (~88%) is achievable but not automatic. It requires structuring the prompt so the cached prefix stays stable across turns: system prompt + tool defs first, history compacted but cache-aware, retrieved chunks appended LAST (where their volatility doesn't invalidate the prefix). Agents that interleave search results inside the cached region pay the prefix-rebuild tax on every turn.

Caveats stacked here:

  • The anchor represents ONE power-user loop. Multi-user fleets aggregate proportionally; the per-user shape doesn't change.
  • The "3 searches per turn" assumption varies wildly. A code-review agent might issue 10+ searches per turn; a chat-only loop might do 0.
  • The 88% cache hit rate is the high end of what's achievable. Half that is closer to a default agent without cache-aware prompt layout.
  • The "Δ vs tokenmax" math assumes the OTHER cost components (system, tools, history, reasoning) stay constant. In practice, conservative's smaller per-turn payload also leaves more room in the context window for history → which can change agent behavior in either direction.

This anchor + the per-query math both live in this doc on purpose. The per-query framing is what an isolated benchmark would measure (and what gbrain eval run-all will produce). The realistic-scale anchor is what an operator actually pays. Both are honest; neither is the whole truth.

Reproducibility footer

Every release that publishes eval numbers includes a footer with:

  • Code commit SHA
  • Dataset SHA (LongMemEval, BrainBench, Replay)
  • --seed N
  • Run commands verbatim
  • API model identifiers used (Anthropic + OpenAI + judge model)

Without these, the numbers are unfalsifiable. With them, anyone with API keys can re-score.