Search Mode Evaluation Methodology

How v0.32.3 measures the difference between conservative, balanced, and tokenmax. Written to be haters-immune: every claim is reproducible from the committed dataset + raw outputs.

1. What this measures and what it doesn't

Measures: retrieval quality and operational cost on fixed public datasets, under each named search mode, against the same brain content.

Does NOT measure:

Your specific brain content (this is a benchmark, not your bill).
Your specific query distribution.
End-user satisfaction or downstream task success.
Latency under concurrent load.
Production cost (the cost numbers are model-pricing estimates × dataset size, not your actual API spend).

If you want to know how a mode behaves on YOUR brain, run gbrain search stats --days 30 after a real usage window, then run gbrain search tune for actionable recommendations.

2. Datasets and sizes

LongMemEval — public split, n=500 questions. Downloaded from Hugging Face. The corpus + answer keys are pinned to a specific commit; recorded in every per-run record.
Replay captures — NDJSON from the sibling gbrain-evals repo, n=200 queries. Each query carries a retrieved_slugs baseline + a latency_ms measurement from the original production run.
BrainBench v1 — n=1240 documents / n=350 qrels (binary relevance judgments). Lives in the sibling gbrain-evals repo, SHA-pinned at every run.

No private brain content is used in any reported result. The committed NDJSON dumps under <repo>/.gbrain-evals/ contain only the LongMemEval question IDs + the rank-ordered retrieved session IDs.

3. Sample selection

Random seed: 42 throughout. Set via --seed N on gbrain eval run-all; recorded in every per-run record.
No per-question curation. Splits are taken whole; no question is filtered for reporting.
No mode-specific tuning. The same dataset + same seed feeds every mode. The mode is the only independent variable.
Stability across re-runs: with --seed 42 and the same dataset SHA, two runs of the same (mode, suite) produce identical retrieval orderings (modulo the optional Haiku expansion call, which is non-deterministic). The orderings are persisted in eval_results so anyone can re-score from the committed dumps.
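The stability claim can be checked mechanically. The sketch below assumes each dump line has already been parsed into an object holding a question ID and a rank-ordered list of retrieved IDs; the field names questionId and retrievedIds are hypothetical, not the committed dump's actual keys.

```typescript
// Hypothetical shape for one parsed dump line; the real keys may differ.
interface RetrievalRow {
  questionId: string;
  retrievedIds: string[]; // rank-ordered, best first
}

// Return the question IDs whose retrieval ordering differs between two runs.
// An empty result means the two runs are ordering-identical.
function diffOrderings(runA: RetrievalRow[], runB: RetrievalRow[]): string[] {
  const byId = new Map<string, string[]>();
  runA.forEach((row) => byId.set(row.questionId, row.retrievedIds));

  const differing: string[] = [];
  const seenInB = new Set<string>();
  runB.forEach((row) => {
    seenInB.add(row.questionId);
    const other = byId.get(row.questionId);
    // Joining on a newline assumes IDs never contain one.
    const same =
      other !== undefined && other.join("\n") === row.retrievedIds.join("\n");
    if (!same) differing.push(row.questionId);
  });

  // Questions present in run A but missing from run B also count as differences.
  runA.forEach((row) => {
    if (!seenInB.has(row.questionId)) differing.push(row.questionId);
  });
  return differing.sort();
}
```

With the Haiku expansion call enabled, a non-empty result is expected on some questions, so a check like this is most useful on runs where expansion is off.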

4. Run procedure

The command is the doc. Anyone can reproduce.

# Setup: in your gbrain working tree, with OPENAI_API_KEY + ANTHROPIC_API_KEY exported.
git rev-parse HEAD  # record the commit for the methodology footer

# Sweep all 3 modes × 2 retrieval-focused suites with seed 42.
gbrain eval run-all \
  --modes conservative,balanced,tokenmax \
  --suites longmemeval,replay \
  --seed 42 \
  --limit 500 \
  --budget-usd-retrieval 5 \
  --budget-usd-answer 20 \
  --output docs/eval/results/v0.32.3/

# Render the comparison.
gbrain eval compare --md > docs/eval/results/v0.32.3/README.md
gbrain eval compare --json > docs/eval/results/v0.32.3/comparison.json

The orchestrator writes per-run records to <repo>/.gbrain-evals/eval-results.jsonl. Every record carries: run_id, ran_at, suite, mode, commit, seed, limit, params, status, duration_ms. The dumps under docs/eval/results/v0.32.3/ carry the raw question-level outputs so a reviewer can re-score with their own metric implementation.
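A reviewer reading eval-results.jsonl might start from a typed parser like the one below. It is a sketch: the ten field names come from the list above, but every TypeScript type (string vs number, the shape of params) is an assumption rather than the orchestrator's actual schema, and both helper functions are hypothetical.

```typescript
// Field names are from the methodology text; every type here is an assumption.
interface EvalRunRecord {
  run_id: string;
  ran_at: string; // assumed to be an ISO-8601 timestamp
  suite: string;
  mode: string;
  commit: string;
  seed: number;
  limit: number;
  params: Record<string, unknown>;
  status: string;
  duration_ms: number;
}

const REQUIRED_FIELDS = [
  "run_id", "ran_at", "suite", "mode", "commit",
  "seed", "limit", "params", "status", "duration_ms",
] as const;

// Parse JSONL text: one JSON object per non-empty line.
// Throws if a line is missing any of the ten documented fields.
function parseRunRecords(jsonl: string): EvalRunRecord[] {
  const records: EvalRunRecord[] = [];
  jsonl.split("\n").forEach((raw, index) => {
    const line = raw.trim();
    if (line === "") return;
    const obj = JSON.parse(line) as Record<string, unknown>;
    for (const field of REQUIRED_FIELDS) {
      if (!(field in obj)) {
        throw new Error(`line ${index + 1}: missing field "${field}"`);
      }
    }
    records.push(obj as unknown as EvalRunRecord);
  });
  return records;
}

// Group records by "suite/mode" so re-runs of the same cell can be compared.
function groupBySuiteAndMode(records: EvalRunRecord[]): Map<string, EvalRunRecord[]> {
  const groups = new Map<string, EvalRunRecord[]>();
  records.forEach((r) => {
    const key = `${r.suite}/${r.mode}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(r);
    groups.set(key, bucket);
  });
  return groups;
}
```

The parser only checks that the fields are present; it does not validate their values, which keeps it tolerant of schema details this document does not specify.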

5. Threats to validity

Honest list. We name what would let a critic dismiss the numbers.

LongMemEval skews English + technical. The questions are software-engineering and consumer-product flavored. Performance on a brain rich in non-English / non-technical content (writing, art history, etc.) may differ.
BrainBench is small (1240 docs) relative to a production brain (10K-100K pages). Absolute scores aren't predictive of your hit rate; the delta between modes is.
char/4 token heuristic. Token-budget enforcement and cost estimates use a character-count / 4 heuristic. It is accurate within ~5-10% for English with the OpenAI tiktoken family and less accurate for Voyage (we don't use Voyage in chat retrieval, so it doesn't bias the reported numbers, but if you do, your budget caps will be approximate).
Expansion's quality lift varies by query distribution. The eval data shows ~97.6% relative quality with LLM expansion vs without (i.e., barely measurable lift) on the LongMemEval corpus. On rarer-entity / longer-tail queries, the lift can be larger. We report the corpus we measured; YMMV.
Paired bootstrap assumes question-level independence. Multi-hop questions within the same conversation thread aren't independent; the bootstrap CI is slightly tighter than reality.
Single brain instance per benchmark. The benchmark spins up an in-memory PGLite per question. Cache hit rate measured here doesn't reflect a long-running production brain's cache state.
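To make the char/4 threat concrete, here is a minimal sketch of the heuristic and of a budget cap built on it. Rounding up and the stop-at-first-overflow rule are assumptions; the text above only says "character-count / 4", and takeWithinBudget is a hypothetical helper, not gbrain's enforcement code.

```typescript
// Estimate tokens as ceil(characters / 4).
// Note: String.length counts UTF-16 code units, so non-ASCII text skews the estimate further.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep chunks in rank order until the next one would exceed the token budget.
// Assumption: enforcement stops at the first overflow instead of truncating a chunk.
function takeWithinBudget(chunks: string[], budgetTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > budgetTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Under this sketch a 1,600-character chunk counts as 400 tokens, so a 4K budget admits ten such chunks. If the real tokenizer runs 10% hotter than char/4, those same ten chunks are closer to 4.4K actual tokens.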

6. Per-question raw outputs

Every reported metric is reproducible from the NDJSON dumps committed at docs/eval/results/v0.32.3/. The commit SHA in the methodology footer pins the code version.

Examples per mode: the auto-generated README.md next to the dumps includes both winning and losing examples per mode, chosen by the deterministic rule:

Wins: the 3 questions where this mode's score exceeded the next-best mode by the largest margin.
Losses: the 3 questions where this mode's score fell short of the next-best mode by the largest margin.

Picked by the score delta, NOT cherry-picked by hand. The README documents the rule so a critic can verify.
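The selection rule is simple enough to sketch. The version below assumes "margin" means this mode's score minus the best score among the other two modes, and breaks ties by question ID so the output is stable; both choices are assumptions, and the type and function names are hypothetical. The auto-generated README remains the authority on the exact rule.

```typescript
type Mode = "conservative" | "balanced" | "tokenmax";
interface QuestionScores {
  questionId: string;
  scores: Record<Mode, number>;
}
interface Example {
  questionId: string;
  margin: number;
}

const MODES: Mode[] = ["conservative", "balanced", "tokenmax"];

// Margin of `mode` on one question: its score minus the best score among the other modes.
function marginFor(mode: Mode, q: QuestionScores): number {
  const others = MODES.filter((m) => m !== mode).map((m) => q.scores[m]);
  return q.scores[mode] - Math.max(...others);
}

// Plain code-unit comparison so the tie-break does not depend on locale.
function byQuestionId(a: Example, b: Example): number {
  return a.questionId < b.questionId ? -1 : a.questionId > b.questionId ? 1 : 0;
}

// Wins: the k largest positive margins. Losses: the k most negative margins.
function pickExamples(mode: Mode, questions: QuestionScores[], k = 3) {
  const margins: Example[] = questions.map((q) => ({
    questionId: q.questionId,
    margin: marginFor(mode, q),
  }));
  const wins = margins
    .filter((e) => e.margin > 0)
    .sort((a, b) => b.margin - a.margin || byQuestionId(a, b))
    .slice(0, k);
  const losses = margins
    .filter((e) => e.margin < 0)
    .sort((a, b) => a.margin - b.margin || byQuestionId(a, b))
    .slice(0, k);
  return { wins, losses };
}
```

Because the function is a pure sort over score deltas, re-running it on the committed dumps should reproduce the README's examples if the assumptions above match the real rule.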

7. Pre-registered expectations

Before running, we expect:

tokenmax wins Recall@10 by 5-15 percentage points over conservative. LLM expansion + 50-result ceiling helps rare-entity surface forms.
conservative wins cost-per-query by 5-15× over tokenmax. No Haiku expansion + tight 4K budget cap = single-digit-cent queries.
balanced lands within 3pp of tokenmax on Recall@10. Intent weighting (zero-LLM cost) closes most of the expansion gap on common queries.
No mode falls below nDCG@10 = 0.65, the published "ship it" threshold for hybrid retrieval on technical corpora.

Then we publish whether the data agrees. If a hypothesis fails, that's documented honestly in the release README, not buried. Pre-registration is what makes the comparison defensible — without it, a "we expected X and got X" outcome is observation, not prediction.
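One way to keep the pre-registration honest is to encode the four expectations as checks that run against the measured metrics. The sketch below does that; the ModeMetrics shape, the function name, and the reading of the nDCG expectation as "every mode stays at or above 0.65" are assumptions, and no numbers from a real run are included.

```typescript
type ModeName = "conservative" | "balanced" | "tokenmax";

interface ModeMetrics {
  recallAt10: number; // fraction in [0, 1]
  ndcgAt10: number; // fraction in [0, 1]
  costPerQueryUsd: number;
}

interface HypothesisOutcome {
  name: string;
  held: boolean;
  observed: string;
}

// Evaluate the four pre-registered expectations against measured per-mode metrics.
function checkExpectations(m: Record<ModeName, ModeMetrics>): HypothesisOutcome[] {
  const recallLeadPp = (m.tokenmax.recallAt10 - m.conservative.recallAt10) * 100;
  const costRatio = m.tokenmax.costPerQueryUsd / m.conservative.costPerQueryUsd;
  const balancedGapPp = Math.abs(m.tokenmax.recallAt10 - m.balanced.recallAt10) * 100;
  const minNdcg = Math.min(
    m.conservative.ndcgAt10,
    m.balanced.ndcgAt10,
    m.tokenmax.ndcgAt10,
  );
  return [
    {
      name: "tokenmax leads conservative on Recall@10 by 5-15pp",
      held: recallLeadPp >= 5 && recallLeadPp <= 15,
      observed: `${recallLeadPp.toFixed(1)}pp`,
    },
    {
      name: "conservative is 5-15x cheaper per query than tokenmax",
      held: costRatio >= 5 && costRatio <= 15,
      observed: `${costRatio.toFixed(1)}x`,
    },
    {
      name: "balanced is within 3pp of tokenmax on Recall@10",
      held: balancedGapPp <= 3,
      observed: `${balancedGapPp.toFixed(1)}pp`,
    },
    {
      name: "every mode keeps nDCG@10 >= 0.65",
      held: minNdcg >= 0.65,
      observed: `min nDCG@10 = ${minNdcg.toFixed(3)}`,
    },
  ];
}
```

Reporting the observed value next to each held/failed flag keeps a failed hypothesis visible in the release notes instead of collapsing it into a single pass/fail.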

8. Re-run cadence

This document + the eval results are regenerated on every release that touches retrieval-affecting code. The gbrain doctor eval_drift check surfaces changes to the curated watch-list in src/core/eval/drift-watch.ts:

src/core/search/**
src/core/embedding.ts
src/core/chunkers/**
src/core/ai/recipes/anthropic.ts
src/core/ai/recipes/openai.ts
src/core/operations.ts

Additions to the watch-list require a CHANGELOG line.

Statistical-significance discipline

When gbrain eval compare --md reports a Δ between two modes, it computes:

Paired bootstrap with 10,000 resamples per metric. Each resample draws question-level pairs (same question, mode A vs mode B), so question-level variance is differenced out.
Bonferroni correction across the 12 comparisons (3 mode pairs × 4 metrics). The reported p-value is the comparison's raw p-value × 12 (clamped at 1.0).
95% confidence intervals computed from the bootstrap distribution.

If the CI for a Δ includes 0 OR the Bonferroni-adjusted p-value exceeds 0.05, the difference is not statistically significant. The MD report says "not significant" verbatim.
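The procedure above can be sketched in a few dozen lines. The source does not say how the raw p-value or the interval endpoints are derived, so the two-sided bootstrap p-value and the percentile interval below are assumptions, as are the seeded mulberry32 generator and all function names. Treat it as an illustration of the discipline, not as gbrain eval compare's implementation.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is repeatable; an illustrative choice.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface DeltaReport {
  meanDelta: number;
  ciLow: number;
  ciHigh: number;
  pRaw: number;
  pAdjusted: number;
  significant: boolean;
}

// pairs[i] = [score of mode A, score of mode B] on the same question i.
function pairedBootstrap(
  pairs: Array<[number, number]>,
  resamples = 10000,
  comparisons = 12,
  seed = 42,
): DeltaReport {
  if (pairs.length === 0) throw new Error("need at least one question pair");
  const n = pairs.length;
  // Differencing within each pair removes question-level variance.
  const diffs = pairs.map(([a, b]) => a - b);
  const meanDelta = diffs.reduce((sum, d) => sum + d, 0) / n;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile 95% interval (assumption: the source only says "from the bootstrap distribution").
  const ciLow = means[Math.floor(0.025 * resamples)];
  const ciHigh = means[Math.ceil(0.975 * resamples) - 1];

  // Two-sided p-value from the share of resampled deltas on either side of zero (assumption).
  const atOrBelowZero = means.filter((m) => m <= 0).length / resamples;
  const atOrAboveZero = means.filter((m) => m >= 0).length / resamples;
  const pRaw = Math.min(1, 2 * Math.min(atOrBelowZero, atOrAboveZero));
  const pAdjusted = Math.min(1, pRaw * comparisons); // Bonferroni, clamped at 1.0

  const ciIncludesZero = ciLow <= 0 && ciHigh >= 0;
  const significant = !(ciIncludesZero || pAdjusted > 0.05);
  return { meanDelta, ciLow, ciHigh, pRaw, pAdjusted, significant };
}
```

Note that the significance flag follows the OR rule stated above: a Δ whose interval touches zero is reported as not significant even if its adjusted p-value happens to be small.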

Glossary

Every metric the report prints has a plain-English entry in docs/eval/METRIC_GLOSSARY.md, auto-generated from src/core/eval/metric-glossary.ts. The CI guard at scripts/check-eval-glossary-fresh.sh regenerates and diffs against the committed file on every test run; a stale doc fails the build.

Cost anchors

The mode-picker prompt at gbrain init and the CLAUDE.md ## Search Mode table both surface these rough cost anchors. Working through the math so they're auditable:

Variables:

T = avg tokens per search-result chunk. The recursive chunker targets 300 words / chunk → ~400 tokens (English, OpenAI tiktoken approx).
N = chunks delivered per query (capped by the mode's searchLimit).
R = downstream model input rate. Sonnet 4.6 = $3/M. Opus 4.7 = $5/M. Haiku 4.5 = $1/M.
Q = queries per month.

Per-query input cost (downstream agent reads the chunks):

cost_per_query = T × N × R

Mode	T (tokens)	N (chunks)	Sonnet ($3/M)	Opus ($5/M)	Haiku ($1/M)
conservative (4K cap, 10 max)	~400	10 (or fewer if budget hits)	$0.012	$0.020	$0.004
balanced (12K cap, 25 max)	~400	~25	$0.030	$0.050	$0.010
tokenmax (no cap, 50 max)	~400	~50	$0.060	$0.100	$0.020

Monthly cost (Q × per-query):

Mode @ Sonnet	1K Q/mo	10K Q/mo	100K Q/mo
conservative	$12	$120	$1,200
balanced	$30	$300	$3,000
tokenmax	$60	$600	$6,000

Mode @ Opus	1K Q/mo	10K Q/mo	100K Q/mo
conservative	$20	$200	$2,000
balanced	$50	$500	$5,000
tokenmax	$100	$1,000	$10,000
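The tables above are the formula applied at fixed inputs, which a short sketch makes auditable. The constants are copied from this section (T ≈ 400 tokens, N at each mode's maximum, R in USD per million input tokens); the function and constant names are illustrative only.

```typescript
type Mode = "conservative" | "balanced" | "tokenmax";
type Model = "haiku" | "sonnet" | "opus";

// Values copied from the tables above.
const TOKENS_PER_CHUNK = 400;
const CHUNKS_PER_QUERY: Record<Mode, number> = {
  conservative: 10,
  balanced: 25,
  tokenmax: 50,
};
const INPUT_RATE_USD_PER_MILLION: Record<Model, number> = {
  haiku: 1,
  sonnet: 3,
  opus: 5,
};

// cost_per_query = T × N × R, with R expressed in USD per million tokens.
function costPerQueryUsd(mode: Mode, model: Model): number {
  const tokens = TOKENS_PER_CHUNK * CHUNKS_PER_QUERY[mode];
  return (tokens * INPUT_RATE_USD_PER_MILLION[model]) / 1e6;
}

// Monthly cost = Q × per-query cost.
function monthlyCostUsd(mode: Mode, model: Model, queriesPerMonth: number): number {
  return costPerQueryUsd(mode, model) * queriesPerMonth;
}

console.log(costPerQueryUsd("conservative", "sonnet").toFixed(3)); // 0.012
console.log(monthlyCostUsd("tokenmax", "opus", 100000).toFixed(0)); // 10000
```

If the chunker target or a mode's searchLimit changes, only the two constant tables need updating for the anchors to be recomputed.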

gbrain's own cost on top:

Query embedding (text-embedding-3-large @ $0.13/M tokens): ~$0.00001 per query. Negligible at every scale.
Tokenmax Haiku expansion call ($1/M input, $5/M output, ~500 input + 200 output per call): ~$0.0015 per query, or $150/mo at 100K queries. Cache hits cut this in half.
Per-page indexing (one-time): bounded by your import volume, not query volume. Not modeled here.

Cache hit adjustment. A warmed brain typically sees 30-50% cache hits on repeat-query traffic. Cache hits skip the downstream input cost entirely (the cached result was already in the agent's context once). So real-world costs run ~50-70% of the table above on a busy brain.
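The expansion-call figure and the cache adjustment are both one-line calculations. The sketch below reproduces them from the numbers quoted above, assuming a cache hit costs nothing downstream (as the paragraph states); the helper names are hypothetical.

```typescript
// Cost of one LLM call given token counts and per-million-token rates.
function llmCallCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputRatePerMillion: number,
  outputRatePerMillion: number,
): number {
  return (inputTokens * inputRatePerMillion + outputTokens * outputRatePerMillion) / 1e6;
}

// Scale a cost by the share of queries that miss the cache.
// Assumes a hit is free, so the effective cost is cost × (1 - hitRate).
function withCacheHits(costUsd: number, hitRate: number): number {
  if (hitRate < 0 || hitRate > 1) throw new Error("hitRate must be in [0, 1]");
  return costUsd * (1 - hitRate);
}

// Haiku expansion: ~500 input tokens at $1/M plus ~200 output tokens at $5/M.
const expansionPerQuery = llmCallCostUsd(500, 200, 1, 5);
console.log(expansionPerQuery.toFixed(4)); // 0.0015

// tokenmax @ Sonnet at 100K queries/month ($6,000 in the table) with 30% and 50% cache hits.
console.log(withCacheHits(6000, 0.3).toFixed(0)); // 4200
console.log(withCacheHits(6000, 0.5).toFixed(0)); // 3000
```

The 30% and 50% cases bracket the "~50-70% of the table" range quoted above.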

Why these numbers DRIFT from your actual bill:

Your agent's system prompt + reasoning tokens add input that gbrain doesn't see.
Compaction reduces input over a long session.
Most agents make 1-5 searches per turn; cost-per-turn is what bills you, not cost-per-query.
The model price column drifts as providers reprice; pin the rate via src/core/model-pricing.ts (the canonical chat-pricing table) for a current snapshot.

The picker copy + CLAUDE.md table are the canonical user-facing source. Update them in lockstep when the underlying chunker size or default searchLimit changes.

Mode × Model matrix (the 25x spread)

The per-query math above assumes Sonnet 4.6 downstream. In reality, the downstream model tier is the BIGGER cost lever. Monthly cost at 10K queries/month (typical single-user volume), search payload only (no cache savings):

Mode (search tokens)	Haiku 4.5 ($1/M)	Sonnet 4.6 ($3/M)	Opus 4.7 ($5/M)
conservative (~4K)	$40/mo	$120/mo	$200/mo
balanced (~10K)	$100/mo	$300/mo	$500/mo
tokenmax (~20K)	$200/mo	$600/mo	$1,000/mo

Scales linearly: multiply by 10 for 100K/mo (heavy power user / multi-user fleet); divide by 10 for 1K/mo (light usage).
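The matrix and its 25x spread can be regenerated from the per-mode search-token figures and the per-model rates in the table above. This is a sketch with illustrative names; the token counts are the rounded values from the table header column.

```typescript
// [label, search tokens per query] and [label, USD per million input tokens], from the table.
const MODE_TOKENS: Array<[string, number]> = [
  ["conservative", 4000],
  ["balanced", 10000],
  ["tokenmax", 20000],
];
const MODEL_RATES: Array<[string, number]> = [
  ["Haiku 4.5", 1],
  ["Sonnet 4.6", 3],
  ["Opus 4.7", 5],
];

interface MatrixCell {
  mode: string;
  model: string;
  usdPerMonth: number;
}

// Monthly search-payload cost for every mode × model combination.
function buildMatrix(queriesPerMonth: number): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const [mode, tokens] of MODE_TOKENS) {
    for (const [model, rate] of MODEL_RATES) {
      cells.push({ mode, model, usdPerMonth: (tokens * rate * queriesPerMonth) / 1e6 });
    }
  }
  return cells;
}

// Ratio between the most and least expensive cells.
function spread(cells: MatrixCell[]): number {
  const costs = cells.map((c) => c.usdPerMonth);
  return Math.max(...costs) / Math.min(...costs);
}

const matrix = buildMatrix(10000);
console.log(spread(matrix)); // 25
```

Because every cell scales linearly with query volume, the spread stays at 25x for any value passed to buildMatrix.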

Natural pairings span the full ~25x in this table (cheap model + tight mode at $40/mo → frontier model + loose mode at $1,000/mo). Mismatches waste capacity:

tokenmax + Haiku: Haiku gets 20K of search results stuffed into its context per query. Haiku's reasoning is weaker; more chunks = more noise, not more signal. You pay Haiku rates but get sub-Haiku quality. Wrong direction.
conservative + Opus: Opus has a 200K context window and can synthesize across many chunks. Capping at 10 chunks / 4K tokens leaves Opus reasoning underfed. You pay Opus rates but get conservative-shape retrieval. Wasted spend.

Right-sizing rule: match the mode's searchLimit to the downstream model's "useful context depth":

Haiku struggles past ~5-10 chunks of cross-referenced content → conservative
Sonnet handles ~25-40 chunks well → balanced
Opus benefits from 50+ chunks for multi-hop reasoning → tokenmax

Realistic-scale anchor (single power-user agent loop)

The per-query math above is honest but theoretical: it treats each search as an isolated billable event. Real agent loops amortize a lot of context across turns via Anthropic prompt caching. Here's what one heavy power-user loop actually looks like in production, anonymized + scaled so the numbers describe a typical power user rather than any specific deployment.

Reference shape — tokenmax in production at a single-user scale:

Quantity	Approximate value
30-day total agent spend	~$700/mo
30-day total tokens billed	~800M
Turns per month	~860 (~29/day; one active agent loop)
Average tokens per turn	~900K
Average cost per turn	~$0.85
Anthropic prompt-cache hit rate	~88%

A "turn" here is one agent loop iteration: read user message, plan, execute tool calls (including gbrain searches), generate response. Each turn typically includes 2-4 gbrain searches.

Per-mode scaling from the tokenmax anchor:

The cost difference between modes is concentrated in the search-attributable fraction of per-turn cost. System prompt, tool definitions, conversation history, and reasoning tokens don't change with mode — only the chunks gbrain delivers do. Assume 3 searches per turn at the mode's searchLimit:

Mode	Search tokens/turn	Search cost/turn (at $3/M effective)	Search-attributable @ 860 turns	Δ vs tokenmax
tokenmax	~60K (3 × 20K)	~$0.18	~$155/mo	—
balanced	~30K (3 × 10K)	~$0.09	~$77/mo	-$78
conservative	~12K (3 × 4K)	~$0.036	~$31/mo	-$124
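The per-mode rows follow from three inputs: searches per turn, search tokens per search, and the effective rate. The sketch below recomputes them under the same assumptions as the table (3 searches per turn, $3/M effective, 860 turns per month); the function names are illustrative.

```typescript
// Search cost attributable to one turn, in USD.
function searchCostPerTurnUsd(
  searchesPerTurn: number,
  tokensPerSearch: number,
  ratePerMillion: number,
): number {
  return (searchesPerTurn * tokensPerSearch * ratePerMillion) / 1e6;
}

// Search-attributable spend over a month of turns, in USD.
function monthlySearchSpendUsd(perTurnUsd: number, turnsPerMonth: number): number {
  return perTurnUsd * turnsPerMonth;
}

const TURNS_PER_MONTH = 860;
const tokenmaxMonthly = monthlySearchSpendUsd(searchCostPerTurnUsd(3, 20000, 3), TURNS_PER_MONTH);
const balancedMonthly = monthlySearchSpendUsd(searchCostPerTurnUsd(3, 10000, 3), TURNS_PER_MONTH);
const conservativeMonthly = monthlySearchSpendUsd(searchCostPerTurnUsd(3, 4000, 3), TURNS_PER_MONTH);

console.log(tokenmaxMonthly.toFixed(2)); // 154.80
console.log((tokenmaxMonthly - balancedMonthly).toFixed(2)); // 77.40
console.log((tokenmaxMonthly - conservativeMonthly).toFixed(2)); // 123.84
```

The table rounds these to ~$155, -$78, and -$124. Changing the searches-per-turn input is the quickest way to see how sensitive the deltas are to that assumption.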

Implied total agent spend by NATURAL PAIRING (mode + matched downstream model). Per-turn cost scales with the downstream model's per-token rate, since the cached prefix + uncached portion + reasoning tokens all bill at that rate:

Pairing	Per-turn cost	Total @ 860 turns/mo
tokenmax + Opus (frontier, max quality)	~$0.85	~$700/mo
balanced + Sonnet (the sweet spot)	~$0.50	~$430/mo
conservative + Haiku (cost-sensitive)	~$0.20	~$170/mo

4x spread across natural pairings. The model tier dominates because the per-token rate applies to the WHOLE per-turn payload (system + tools + history + reasoning + search), not just gbrain's chunks. Mode choice contributes ~10-20% on top of that base.

Mismatched pairings push you off the curve:

Pairing	Per-turn estimate	Total @ 860 turns/mo	Compared to natural
tokenmax + Haiku	~$0.20	~$170/mo	Same cost as conservative+Haiku, worse quality
conservative + Opus	~$0.75	~$640/mo	92% of tokenmax+Opus spend, conservative-shape retrieval

The mismatch math says: a tokenmax+Haiku user pays the same as conservative+Haiku but gets a noisier context (Haiku can't filter signal from 50 chunks). A conservative+Opus user pays nearly the same as tokenmax+Opus but starves Opus on retrieval depth. Both burn budget for no improvement.

What this anchor tells us that the per-query math doesn't:

At realistic agent-loop scale with disciplined prompt caching, mode choice saves 10-20% of total agent spend — meaningful, but smaller than the per-query 5x ratio implies. Disciplined prompt-cache layouts blunt the mode delta because most of the per-turn cost is the cached prefix, not the search payload.
Without that prompt-cache discipline, the per-query framing reasserts itself. Setups that churn the prompt prefix on every turn (frequent system-prompt edits, untemplated tool defs, no prompt-cache structuring) see search payload contribute a much larger fraction of total cost. Those setups should care about mode choice more, not less.
The cache hit rate quoted here (~88%) is achievable but not automatic. It requires structuring the prompt so the cached prefix stays stable across turns: system prompt + tool defs first, history compacted but cache-aware, retrieved chunks appended LAST (where their volatility doesn't invalidate the prefix). Agents that interleave search results inside the cached region pay the prefix-rebuild tax on every turn.
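The layout advice in the last point can be illustrated with a toy model that treats a prompt as an ordered list of segments and counts how many leading segments survive unchanged from one turn to the next. It is only a sketch of the ordering idea: real prompt caching follows the provider's own prefix rules, and the TurnInputs shape and function names here are hypothetical.

```typescript
interface TurnInputs {
  systemPrompt: string;
  toolDefs: string;
  history: string[]; // prior messages, oldest first
  retrievedChunks: string[]; // changes on almost every turn
}

// Cache-aware layout: stable material first, volatile retrieved chunks last.
function layoutChunksLast(t: TurnInputs): string[] {
  return [t.systemPrompt, t.toolDefs, ...t.history, ...t.retrievedChunks];
}

// Anti-pattern: retrieved chunks placed ahead of the history.
function layoutChunksInterleaved(t: TurnInputs): string[] {
  return [t.systemPrompt, t.toolDefs, ...t.retrievedChunks, ...t.history];
}

// Number of leading segments that two consecutive turns have in common.
function sharedPrefixSegments(prev: string[], next: string[]): number {
  let i = 0;
  while (i < prev.length && i < next.length && prev[i] === next[i]) i++;
  return i;
}

const turn1: TurnInputs = {
  systemPrompt: "SYSTEM",
  toolDefs: "TOOLS",
  history: ["user: first question"],
  retrievedChunks: ["chunk A1", "chunk A2"],
};
const turn2: TurnInputs = {
  systemPrompt: "SYSTEM",
  toolDefs: "TOOLS",
  history: ["user: first question", "assistant: first answer"],
  retrievedChunks: ["chunk B1", "chunk B2"],
};

console.log(sharedPrefixSegments(layoutChunksLast(turn1), layoutChunksLast(turn2))); // 3
console.log(sharedPrefixSegments(layoutChunksInterleaved(turn1), layoutChunksInterleaved(turn2))); // 2
```

With chunks last, the shared prefix grows as history accumulates; with chunks interleaved, it never extends past the system prompt and tool definitions, which is the prefix-rebuild tax described above.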

Caveats stacked here:

The anchor represents ONE power-user loop. Multi-user fleets aggregate proportionally; the per-user shape doesn't change.
The "3 searches per turn" assumption varies wildly. A code-review agent might issue 10+ searches per turn; a chat-only loop might do 0.
The 88% cache hit rate is the high end of what's achievable. Half that is closer to a default agent without cache-aware prompt layout.
The "Δ vs tokenmax" math assumes the OTHER cost components (system, tools, history, reasoning) stay constant. In practice, conservative's smaller per-turn payload also leaves more room in the context window for history → which can change agent behavior in either direction.

This anchor + the per-query math both live in this doc on purpose. The per-query framing is what an isolated benchmark would measure (and what gbrain eval run-all will produce). The realistic-scale anchor is what an operator actually pays. Both are honest; neither is the whole truth.

Reproducibility footer

Every release that publishes eval numbers includes a footer with:

Code commit SHA
Dataset SHA (LongMemEval, BrainBench, Replay)
--seed N
Run commands verbatim
API model identifiers used (Anthropic + OpenAI + judge model)

Without these, the numbers are unfalsifiable. With them, anyone with API keys can re-score.
