What happened
Attempted the roadmap's synthesize.ts prompt tuning (2026-06-12 evening) with full bench discipline. Three variants, four full boards, five spot-checks — and a control experiment that invalidated the whole comparison environment.
Variants tried (all on branch feat/synth-grounding, parked at 958be2c, NOT merged):
- v1: + verbatim-source-terms rule, + one-claim-per-cited-sentence rule, + "leave your own analytic sentences uncited", + Today's date line with "…not your training data" recency phrasing. Board: 4/6; factual-lookup collapsed to a 97-word zero-citation answer; support down on 4 questions.
- v2: dropped the uncited-analysis rule and the training-data phrasing; kept date line + 2 grounding rules. Spot-checks passed; full board: 4/6 — factual-lookup zero-citation AGAIN, academic 0.25.
- v3: dropped the date line too (grounding rules only). Spots: 1 PASS, 1 zero-citation FAIL.
Control (the decisive bit): the UNMODIFIED prompt also produced the zero-citation hedge — 1 of 3 runs, same signature (≤97 words, no [N], hedged "current X" answer). Every variant's failures are within the noise of that environmental flake.
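The failure signature above can be checked mechanically. The following is a minimal sketch, not code from the bench or from src/synthesize.ts: the function names are hypothetical, and it tests only the two measurable parts of the signature (word count at or below 97, no [N] marker). The hedged "current X" wording is not detected and would still need a manual read.

```typescript
// Hypothetical helper, not part of the bench or src/synthesize.ts.
// Thresholds come from the signature described above: at most 97 words, no [N] marker.
const CITATION_MARKER = /\[\d+\]/;

// Count whitespace-separated words; an empty or blank answer counts as 0.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when the answer is short AND carries no numeric citation marker.
// Does not inspect the wording, so a hedged tone is not verified here.
function isZeroCitationHedge(answer: string, maxWords = 97): boolean {
  const hasCitation = CITATION_MARKER.test(answer);
  return !hasCitation && countWords(answer) <= maxWords;
}
```

Running this over every stock and variant answer from the same evening would give a per-prompt flake count to compare, rather than relying on eyeballing individual boards.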
Conclusions
- The zero-citation hedge on factual-lookup is environmental, not prompt-caused. It appeared only this evening (every morning/afternoon board cited at 0.86–1.00) and hits stock and tuned prompts alike at ~1/3 frequency. Most plausible class: upstream serving-side behavioral flux on claude-sonnet-4-6 (same family as the 2026-06-12 morning client-system-prompt drift, fixed in dario 4.8.66; this one is subtler: structure obeyed, content hedged).
- The bench cannot resolve prompt deltas while this flake is live. Any synth-prompt comparison needs a stock factual-lookup baseline of 3/3 first — treat that as the precondition for resuming this lane.
- Two genuinely-rejected rules (real signal, not flake): "leave analytic sentences uncited" licenses fully uncited answers; date grounding IN THE SYNTH PROMPT correlates with hedge-mode even though the same line in the PLANNER (v0.25.0) measurably helps. Documented in src/synthesize.ts comments on the parked branch.
- The two surviving rules (verbatim source terms, one claim per cited sentence) showed weak positive signal (technical-deep 0.89→1.00 twice, niche-ops 14/14 @ 1.00) but cannot be confirmed under the flake.
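A rough back-of-envelope sketch of why a flake at this rate swamps the comparison. It assumes runs are independent and share one flake rate, and it takes the rate from the 1-of-3 stock control, which is a very small sample; both are assumptions, not measured properties of the bench. The function name is hypothetical.

```typescript
// Assumption: runs are independent and share a single flake rate p.
// Returns the probability that at least one of `runs` runs shows the flake.
function probAtLeastOneFlake(p: number, runs: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be in [0, 1]");
  if (!Number.isInteger(runs) || runs < 0) {
    throw new RangeError("runs must be a non-negative integer");
  }
  return 1 - Math.pow(1 - p, runs);
}

const observedRate = 1 / 3; // 1 of 3 stock control runs hedged (small sample)
console.log(probAtLeastOneFlake(observedRate, 1).toFixed(2)); // 0.33
console.log(probAtLeastOneFlake(observedRate, 3).toFixed(2)); // 0.70
```

Under these assumptions, a single board fails factual-lookup about a third of the time regardless of prompt, so one zero-citation result per variant says little about the variant itself.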
Resume protocol
- Run --only=factual-lookup ×3 on stock master. All 3 must cite (support ≥ 0.5). If not, the environment is still polluted; do not tune.
- Rebase feat/synth-grounding, re-run the full board, compare against bench/results/2026-06-12-v0.25.0-validation.md (the clean stock baseline: 5/6, support 0.83–1.00).
- Boards on the parked branch: 2026-06-12-synth-grounding-v2-rejected.md shows what rejection looks like.
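The precondition in the first step could be encoded as a small gate. This is a sketch only: the RunResult shape and the function name are hypothetical, since the bench's real output format is not shown in this note. The thresholds (3 runs, support ≥ 0.5) are the ones stated above.

```typescript
// Hypothetical result shape; the real bench output format is an assumption here.
interface RunResult {
  question: string; // bench question id, e.g. "factual-lookup"
  support: number;  // citation support score in [0, 1]
}

// True only if at least `requiredRuns` factual-lookup runs exist
// and every one of them meets the minimum support score.
function environmentIsClean(
  runs: RunResult[],
  requiredRuns = 3,
  minSupport = 0.5,
): boolean {
  const lookups = runs.filter((r) => r.question === "factual-lookup");
  return (
    lookups.length >= requiredRuns &&
    lookups.every((r) => r.support >= minSupport)
  );
}
```

A caller would feed it the three stock-master runs and proceed to the rebase only on `true`; any zero-citation run (support 0) fails the gate.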