Commit: e9dc710
Bench: coding-agent-life-v1 (15 sessions, 15 queries)
N: 15
K: 5
Hardware: macOS 15 (Apple Silicon)
agentmemory: v0.9.21
iii-engine: v0.11.2
Embedding provider: local default
Sandbox: isolated data dir at /tmp/agentmemory-eval-sandbox/, ports 3411/3412
agentmemory-hybrid hits 100% top-5 hit rate, R@5 = 0.967, P@5 = 0.578.
Same corpus, grep baseline: R@5 = 0.967, P@5 = 0.267 — same recall, but 2.2× worse precision. Hybrid's top-5 is mostly gold; grep's top-5 is half noise.
| Adapter | P@5 | R@5 | Hit rate | p50 latency |
|---|---|---|---|---|
| grep (tokenized substring) | 0.267 | 0.967 | 15 / 15 | 0 ms |
agentmemory-hybrid |
0.578 | 0.967 | 15 / 15 | 14 ms |
agentmemory-hybrid runs through the production smart-search endpoint (POST /agentmemory/smart-search) so it exercises the full BM25 + embedding + reranker stack.
P@5, grep vs agentmemory-hybrid:
| Type | grep | hybrid | hybrid lift |
|---|---|---|---|
| single-session-bug | 0.20 | 0.33 | 1.7× |
| single-session-infra (n=2) | 0.20 | 0.50 | 2.5× |
| single-session-refactor | 0.20 | 0.50 | 2.5× |
| single-session-feature | 0.50 | 0.50 | tie |
| single-session-test | 0.20 | 0.33 | 1.7× |
| single-session-perf | 0.20 | 0.50 | 2.5× |
| single-session-api | 0.20 | 0.50 | 2.5× |
| single-session-db | 0.20 | 0.50 | 2.5× |
| single-session-release | 0.20 | 0.33 | 1.7× |
| multi-session-causal | 0.40 | 0.40 | tie |
| preference (n=2) | 0.20 | 0.42 | 2.1× |
| multi-session-review | 0.40 | 0.67 | 1.7× |
| temporal (R@5 = 0.50 grep / 1.00 hybrid) | 0.50 | 0.67 | 1.3× |
Temporal queries (What was shipped on April 8th 2026?) need both gold sessions to score full recall. grep finds 1/2; hybrid finds 2/2.
- 15 fictional Claude Code sessions across a 10-day stretch of a Rust CLI project (
shipctl) — bug fixes, refactors, infra, perf, schema migrations, preferences, post-mortem - 15 hand-graded queries with
goldSessionIds[]covering single-session, multi-session causal, multi-session review, preference, temporal - Each session ingested via
POST /agentmemory/rememberwithtype=eval-sessionandconcepts=[session_id] - Each query hits
POST /agentmemory/smart-searchwithlimit=50; dedupe by session ID; truncate to K=5 - No LLM in the retrieval loop
- Sandbox: clean
~/.agentmemoryviaHOMEoverride + alt ports (3411/3412) so no cross-contamination from a user's real store
git checkout e9dc710
npm install --legacy-peer-deps
npm run build
source eval/scripts/sandbox.sh
npm run eval:coding-life -- --adapters grep,agentmemoryOutputs land in eval/reports/coding-life/: scores.ndjson (per-query rows) and summary.json (per-adapter and per-type aggregates).
- The single-session-feature tie (
Which PR introduced helm chart support?) is interesting: query saysPR introduced helm chartand gold session hashelm chartliterally — grep wins on lexical exactness, hybrid matches but doesn't outperform. - The corpus is intentionally small for fast iteration. Hardening targets: paraphrased queries, synonym substitution, in-corpus distractors with shared keywords, longer multi-session chains.
- Vector adapter not measured here — requires
OPENAI_API_KEY; will be added in a follow-up scorecard alongside LongMemEval_s.