synth prompt tuning PARKED: bench can't resolve prompt deltas while factual-lookup hedges nondeterministically (stock 1/3) #97

@askalf

What happened

Attempted the roadmap's synthesize.ts prompt tuning (2026-06-12 evening) with full bench discipline. Three variants, four full boards, five spot-checks — and a control experiment that invalidated the whole comparison environment.

Variants tried (all on branch feat/synth-grounding, parked at 958be2c, NOT merged):

  • v1: + verbatim-source-terms rule, + one-claim-per-cited-sentence rule, + "leave your own analytic sentences uncited", + Today's date line with "…not your training data" recency phrasing. Board: 4/6 — factual-lookup collapsed to a 97-word zero-citation answer; support down on 4 questions.
  • v2: dropped the uncited-analysis rule and the training-data phrasing; kept date line + 2 grounding rules. Spot-checks passed; full board: 4/6 — factual-lookup zero-citation AGAIN, academic 0.25.
  • v3: dropped the date line too (grounding rules only). Spots: 1 PASS, 1 zero-citation FAIL.

Control (the decisive bit): the UNMODIFIED prompt also produced the zero-citation hedge — 1 of 3 runs, same signature (≤97 words, no [N], hedged "current X" answer). Every variant's failures are within the noise of that environmental flake.
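A minimal sketch of how the flake signature above could be checked mechanically, assuming an answer is available as a plain string. The function and field names are hypothetical and not taken from the bench; only the two measurable parts of the signature (at most 97 words, no `[N]` markers) are tested, and the hedged "current X" wording is not detected.

```typescript
// Hypothetical helper, not part of the repo: flags answers matching the
// measurable part of the zero-citation hedge signature described above.
interface HedgeCheck {
  wordCount: number;
  citationCount: number;
  isZeroCitationHedge: boolean;
}

// Threshold taken from the observed signature (<= 97 words).
const MAX_HEDGE_WORDS = 97;

function checkZeroCitationHedge(answer: string): HedgeCheck {
  const trimmed = answer.trim();
  // Count whitespace-separated tokens; an empty answer has zero words.
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Count numeric citation markers such as [1] or [12].
  const citationCount = (answer.match(/\[\d+\]/g) ?? []).length;
  return {
    wordCount,
    citationCount,
    // Note: an empty answer also matches, since it is short and uncited.
    isZeroCitationHedge: citationCount === 0 && wordCount <= MAX_HEDGE_WORDS,
  };
}
```

Applied to every factual-lookup answer, stock or tuned, a check like this would let flaked runs be counted the same way across boards.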

Conclusions

  1. The zero-citation hedge on factual-lookup is environmental, not prompt-caused. It appeared only on the evening of 2026-06-12 (every morning/afternoon board cited at 0.86–1.00) and hits stock and tuned prompts alike at ~1/3 frequency. Most plausible class: upstream serving-side behavioral flux on claude-sonnet-4-6. This is the same family as the 2026-06-12 morning client-system-prompt drift (fixed in dario 4.8.66), but subtler: the structure is obeyed while the content is hedged.
  2. The bench cannot resolve prompt deltas while this flake is live. Any synth-prompt comparison needs a stock factual-lookup baseline of 3/3 first — treat that as the precondition for resuming this lane.
  3. Two genuinely-rejected rules (real signal, not flake): "leave analytic sentences uncited" licenses fully uncited answers; date grounding IN THE SYNTH PROMPT correlates with hedge-mode even though the same line in the PLANNER (v0.25.0) measurably helps. Documented in src/synthesize.ts comments on the parked branch.
  4. The two surviving rules (verbatim source terms, one claim per cited sentence) showed weak positive signal (technical-deep 0.89→1.00 twice, niche-ops 14/14 @ 1.00) but cannot be confirmed under the flake.
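The "within the noise" argument can be made concrete with a little arithmetic. The sketch below assumes the flake hits each factual-lookup run independently at a fixed rate of 1/3, which is an estimate from only three stock runs and not a measured constant; the function names are illustrative.

```typescript
// Assumption: each run flakes independently with the same probability.
// The 1/3 rate is the rough stock estimate from this issue, not a measurement.

// Probability that all `runs` runs avoid the flake.
function probAllClean(flakeRate: number, runs: number): number {
  return Math.pow(1 - flakeRate, runs);
}

// Probability that at least one of `runs` runs shows the flake.
function probAtLeastOneFlake(flakeRate: number, runs: number): number {
  return 1 - probAllClean(flakeRate, runs);
}

const flakeRate = 1 / 3;
// Three runs of any prompt, stock or tuned: chance of seeing the hedge at least once.
console.log(probAtLeastOneFlake(flakeRate, 3).toFixed(3)); // prints "0.704"
// Chance that three runs come back clean even though the flake is still live.
console.log(probAllClean(flakeRate, 3).toFixed(3)); // prints "0.296"
```

Under these assumptions, a variant that shows the hedge once across a few runs is indistinguishable from stock, and a clean 3/3 stock baseline is evidence that the flake has cleared but not proof of it.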

Resume protocol

  1. Run the bench with --only=factual-lookup three times on stock master. All 3 runs must cite (support ≥ 0.5). If any run does not, the environment is still polluted; do not tune.
  2. Rebase feat/synth-grounding, re-run the full board, compare against bench/results/2026-06-12-v0.25.0-validation.md (the clean stock baseline: 5/6, support 0.83–1.00).
  3. For reference, the rejected boards are kept on the parked branch; 2026-06-12-synth-grounding-v2-rejected.md shows what a rejection looks like.
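The precondition in step 1 could be expressed as a small gate, sketched below. The `RunResult` shape and the function name are assumptions for illustration; the bench's real output format is not shown in this issue, so the fields would need mapping from whatever it actually emits.

```typescript
// Hypothetical shape for one factual-lookup run on stock master.
interface RunResult {
  support: number;       // support score for the run, 0..1
  citationCount: number; // number of [N] markers in the answer
}

// Returns true only if enough runs were made and every one of them cited
// with support at or above the threshold (3 runs, support >= 0.5 per step 1).
function environmentIsClean(
  runs: RunResult[],
  requiredRuns = 3,
  minSupport = 0.5,
): boolean {
  if (runs.length < requiredRuns) return false;
  return runs.every((r) => r.citationCount > 0 && r.support >= minSupport);
}

// Example: one zero-citation run out of three blocks tuning.
const stockRuns: RunResult[] = [
  { support: 0.86, citationCount: 5 },
  { support: 0, citationCount: 0 },
  { support: 1.0, citationCount: 7 },
];
console.log(environmentIsClean(stockRuns)); // prints false
```

Requiring every run to pass, not a majority, matches the "All 3 must cite" wording: a single hedged stock run is enough to keep the lane parked.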
