Commit 411cda2

bench: clean v0.24.0 board — 6/6, recent 3/3 sources at 1.00 support (#94)
The promised unconfounded re-run: fresh npm ci + build in an isolated worktree at the release commit, only the shipped change set. The news adapter alone accounts for recent's recovery (1/3 sources at v0.23.0 -> 3/3 at 1.00 support).
1 parent e73c377 commit 411cda2

1 file changed

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
# deepdive bench — 2026-06-12 (v0.24.0 release tag, clean run)

model: `claude-sonnet-4-6 (default)` · base-url: `http://localhost:3456 (default)` · deepdive v0.24.0 (`e73c377`)

Context: the pre-release 6/6 board for the news adapter carried a disclosed
confound — a concurrent planner-prompt edit shared the build. This run is the
clean reading promised there: fresh `npm ci` + build in an isolated worktree
at the release commit, containing ONLY the shipped 0.24.0 change set.
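
The clean-run setup described above can be sketched as a short command sequence. This is a minimal illustration, not the project's actual bench script: the worktree directory name is hypothetical, and the build script name `build` is an assumption (the report only says "build"). The function only assembles the commands; `run_clean_build` executes them.

```python
import subprocess
from pathlib import Path


def clean_build_commands(commit: str, worktree: Path) -> list[list[str]]:
    """Return the argv lists for a clean build of `commit` in a fresh worktree."""
    return [
        # Detached checkout of the exact commit in its own directory, so
        # uncommitted edits in the main checkout cannot leak into the build.
        ["git", "worktree", "add", "--detach", str(worktree), commit],
        # Install exactly what the lockfile pins; npm ci removes any
        # existing node_modules before installing.
        ["npm", "ci"],
        # ASSUMPTION: the project's build script is named "build".
        ["npm", "run", "build"],
    ]


def run_clean_build(commit: str, worktree: Path) -> None:
    """Execute the commands; the npm steps run inside the new worktree."""
    first, *rest = clean_build_commands(commit, worktree)
    subprocess.run(first, check=True)  # run from the main checkout
    for argv in rest:
        subprocess.run(argv, check=True, cwd=worktree)
```

Calling `run_clean_build("e73c377", Path("../deepdive-clean"))` from the main checkout would produce a build that contains only what is committed at `e73c377`, which is the property the clean run relies on.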

| question | run completed | enough sources | citation support | answer length | topical keywords | under cost ceiling | verdict | time |
|---|---|---|---|---|---|---|---|---|
| factual-lookup | ✅ exit 0 | ✅ 12/4 | ✅ 1/1 (1.00) | ✅ 138 words | ✅ all present | ✅ $0.133 / $2.00 | **PASS** | 29s |
| technical-deep | ✅ exit 0 | ✅ 12/5 | ✅ 8/8 (1.00) | ✅ 399 words | ✅ all present | ✅ $0.150 / $2.00 | **PASS** | 55s |
| academic | ✅ exit 0 | ✅ 11/5 | ✅ 12/17 (0.71) | ✅ 682 words | ✅ all present | ✅ $0.078 / $2.00 | **PASS** | 97s |
| comparison | ✅ exit 0 | ✅ 12/4 | ✅ 27/31 (0.87) | ✅ 746 words | ✅ all present | ✅ $0.162 / $2.00 | **PASS** | 75s |
| recent | ✅ exit 0 | ✅ 3/3 | ✅ 4/4 (1.00) | ✅ 257 words | ✅ all present | ✅ $0.053 / $2.00 | **PASS** | 54s |
| niche-ops | ✅ exit 0 | ✅ 10/3 | ✅ 2/4 (0.50) | ✅ 138 words | ✅ all present | ✅ $0.089 / $2.00 | **PASS** | 33s |
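
The ratios in the table can be recomputed from the raw counts. The sketch below assumes that "enough sources" reads as found/required and that "citation support" reads as supported/total citations; the field names are hypothetical, and no support threshold is modelled because the report does not state one.

```python
from dataclasses import dataclass

COST_CEILING_USD = 2.00  # ceiling shown in the "under cost ceiling" column


@dataclass(frozen=True)
class Row:
    question: str
    sources_found: int     # ASSUMPTION: left side of "enough sources"
    sources_required: int  # ASSUMPTION: right side of "enough sources"
    supported: int         # citations judged supported
    citations: int         # total citations checked
    cost_usd: float
    seconds: int


ROWS = [
    Row("factual-lookup", 12, 4, 1, 1, 0.133, 29),
    Row("technical-deep", 12, 5, 8, 8, 0.150, 55),
    Row("academic", 11, 5, 12, 17, 0.078, 97),
    Row("comparison", 12, 4, 27, 31, 0.162, 75),
    Row("recent", 3, 3, 4, 4, 0.053, 54),
    Row("niche-ops", 10, 3, 2, 4, 0.089, 33),
]


def support(row: Row) -> float:
    """Fraction of citations that were judged supported."""
    return row.supported / row.citations


def sources_and_cost_ok(row: Row) -> bool:
    """Check only the two gates whose limits appear in the table."""
    return (row.sources_found >= row.sources_required
            and row.cost_usd <= COST_CEILING_USD)


total_cost = sum(r.cost_usd for r in ROWS)
total_seconds = sum(r.seconds for r in ROWS)
```

Rounded to two decimals, `support` reproduces the parenthesised values (0.71 for `academic`, 0.87 for `comparison`, 0.50 for `niche-ops`). The whole board comes to about $0.665 and 343 seconds across the six questions.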

**6/6 passed.** Structural gates only — read the answers before trusting a comparison.

The perfect board reproduces without the confound: the news-adapter change
set alone takes `recent` from 1/3 sources (v0.23.0 validation) to 3/3 at
1.00 citation support. Scoreboard trajectory: 1/6 (v0.21.0, throttled DDG)
→ 4/6 (v0.22.0) → 5/6 (v0.23.0) → 6/6 (v0.24.0).
