synth prompt tuning PARKED: bench can't resolve prompt deltas while factual-lookup hedges nondeterministically (stock 1/3)

## What happened

Attempted the roadmap's synthesize.ts prompt tuning (2026-06-12 evening) with full bench discipline. Three variants, four full boards, five spot-checks — and a control experiment that invalidated the whole comparison environment.

**Variants tried** (all on branch `feat/synth-grounding`, parked at `958be2c`, NOT merged):

- **v1**: + verbatim-source-terms rule, + one-claim-per-cited-sentence rule, + "leave your own analytic sentences uncited", + `Today's date` line with "…not your training data" recency phrasing. Board: 4/6 — factual-lookup collapsed to a 97-word zero-citation answer; support down on 4 questions.
- **v2**: dropped the uncited-analysis rule and the training-data phrasing; kept date line + 2 grounding rules. Spot-checks passed; full board: 4/6 — factual-lookup zero-citation AGAIN, academic 0.25.
- **v3**: dropped the date line too (grounding rules only). Spots: 1 PASS, 1 zero-citation FAIL.

**Control (the decisive bit): the UNMODIFIED prompt also produced the zero-citation hedge — 1 of 3 runs**, same signature (≤97 words, no [N], hedged "current X" answer). Every variant's failures are within the noise of that environmental flake.

## Conclusions

1. **The zero-citation hedge on `factual-lookup` is environmental, not prompt-caused.** It appeared only this evening (every morning/afternoon board cited at 0.86–1.00) and hits stock and tuned prompts alike at ~1/3 frequency. Most plausible class: upstream serving-side behavioral flux on `claude-sonnet-4-6` (same family as the 2026-06-12 morning client-system-prompt drift, fixed in dario 4.8.66 — this one is subtler: structure obeyed, content hedged).
2. **The bench cannot resolve prompt deltas while this flake is live.** Any synth-prompt comparison needs a stock factual-lookup baseline of 3/3 first — treat that as the precondition for resuming this lane.
3. **Two genuinely-rejected rules** (real signal, not flake): "leave analytic sentences uncited" licenses fully uncited answers; date grounding IN THE SYNTH PROMPT correlates with hedge-mode even though the same line in the PLANNER (v0.25.0) measurably helps. Documented in `src/synthesize.ts` comments on the parked branch.
4. The two surviving rules (verbatim source terms, one claim per cited sentence) showed weak positive signal (technical-deep 0.89→1.00 twice, niche-ops 14/14 @ 1.00) but cannot be confirmed under the flake.

## Resume protocol

1. Run `--only=factual-lookup` ×3 on stock master. All 3 must cite (support ≥ 0.5). If not, the environment is still polluted — do not tune.
2. Rebase `feat/synth-grounding`, re-run the full board, compare against `bench/results/2026-06-12-v0.25.0-validation.md` (the clean stock baseline: 5/6, support 0.83–1.00).
3. Boards on the parked branch: `2026-06-12-synth-grounding-v2-rejected.md` shows what rejection looks like.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

synth prompt tuning PARKED: bench can't resolve prompt deltas while factual-lookup hedges nondeterministically (stock 1/3) #97

What happened

Conclusions

Resume protocol

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

synth prompt tuning PARKED: bench can't resolve prompt deltas while factual-lookup hedges nondeterministically (stock 1/3) #97

Description

What happened

Conclusions

Resume protocol

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions