fix: synthesizer closes broad-scope subgoals too aggressively (terminates overnight runs early) #159

@bradtaylorsf

Summary

The synthesizer closes subgoals as confirmed too aggressively on broad-scope goals. In the Project 2025 overnight test (2026-05-07), it marked all 4 subgoals confirmed after 45 tasks, ending what was supposed to be an all-night run after only about 13 minutes.

This is the calibration follow-up to #119: subgoal-done tracking is wired correctly, but the prompt's threshold for declaring a subgoal confirmed is too generous when the plan's scope_class is broad or comprehensive.

Reproducer

uv run research start --skip-intake --local \
  --goal "Project 2025 implementation tracker: identify which specific policy proposals from the Heritage Foundation's Project 2025 document have been adopted, attempted, withdrawn, or remain pending under the current Trump administration. Organize by federal department (DOJ, DOI, EPA, DHS, State, etc.). For each tracked proposal, surface news coverage, public statements, and any pushback or legal challenges. Prioritize primary sources and date-stamp every finding." \
  --max-tasks 1000 --time-cap 10

Observed:

  • Initial plan: scope_class=broad, 4 subgoals, 16 search tasks ✓
  • 45 tasks executed, 1 drain_replan fired ✓
  • Synthesis pass: closed=[1,2,3,4]. Gemma confidently declared every subgoal done after seeing only partial evidence on each.
  • Loop terminated. Run ended in ~13 minutes.

The closed subgoals were:

  1. "Identify core policy pillars and specific proposals from the Project 2025 document"
  2. "Map identified policies to their respective federal departments and check for implementation status"
  3. "Collect evidence of legal challenges, news coverage, and official public statements"
  4. (a fourth one in the same vein)

For a 920-page policy document with proposals across 20+ federal departments, none of these should be closeable after 45 tasks of crawling.

Root cause

prompts/synthesizer.md (the v5 trailing-JSON contract from #119) instructs:

  • confirmed — the findings affirmatively answer the subgoal. Closes it.
  • inconclusive — findings are insufficient, contradictory, or absent.

The prompt does NOT differentiate by scope. Gemma, which favors decisiveness, defaults to confirmed whenever it can write any affirmative answer, regardless of completeness. For a narrow scope this is fine; for a broad or comprehensive scope it terminates the run prematurely.

Acceptance Criteria

  • prompts/synthesizer.md adds a scope-aware closure rule: when the plan's scope_class is broad or comprehensive, default subgoal status to inconclusive unless ALL of:
    • At least 5 distinct sources cited per subgoal in the corpus collected so far, AND
    • Synthesizer can articulate at least 2 specific examples per subgoal that resolve the question, AND
    • Findings span at least 3 distinct domains/entities the subgoal references (e.g., for "policies across federal departments", findings must touch ≥3 departments)
  • The synthesis context includes the plan's scope_class so the synthesizer knows which threshold to apply.
  • Add a unit test: feed a synthesized 45-task broad-scope corpus into the synthesizer and assert subgoal_status map values are predominantly inconclusive, not confirmed.
  • After the fix: re-run the Project 2025 goal locally; expect ≥1 subgoal to remain inconclusive after the first synthesis, so that drain_replan fires ≥3 times instead of the run stopping after 1.
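
The three thresholds in the first criterion can be written down as a small deterministic check. The issue proposes enforcing them in the prompt, so this is only a sketch of the rule itself, possibly useful as a test oracle or a post-hoc guard. SubgoalEvidence, closure_status, and BROAD_SCOPES are hypothetical names and do not come from the repo.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Scope classes that trigger the stricter rule (values taken from the issue text).
BROAD_SCOPES = {"broad", "comprehensive"}


@dataclass
class SubgoalEvidence:
    """Hypothetical per-subgoal tally of what the corpus holds so far."""
    sources: set[str] = field(default_factory=set)     # distinct sources cited
    examples: list[str] = field(default_factory=list)  # specific resolving examples
    entities: set[str] = field(default_factory=set)    # distinct domains/entities touched


def closure_status(scope_class: str, proposed: str, ev: SubgoalEvidence) -> str:
    """Downgrade a proposed 'confirmed' to 'inconclusive' when a broad-scope
    subgoal does not meet all three thresholds from the acceptance criteria."""
    if proposed != "confirmed":
        return proposed            # never upgrade an inconclusive subgoal
    if scope_class not in BROAD_SCOPES:
        return "confirmed"         # other scopes keep the existing behavior
    meets_all = (
        len(ev.sources) >= 5
        and len(ev.examples) >= 2
        and len(ev.entities) >= 3
    )
    return "confirmed" if meets_all else "inconclusive"
```

With this shape, a subgoal backed by two sources and findings from a single department stays inconclusive under scope_class broad, while the same evidence under any other scope class still closes.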

Files

  • src/research_agent/prompts/synthesizer.md (the closure-status rules)
  • src/research_agent/orchestrator/synth.py (pass scope_class into the synthesis context)
  • tests/test_orchestrator_synth.py (new test for the scope-aware threshold)
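
For the new test, "predominantly inconclusive" needs a concrete definition before it can be asserted. A minimal sketch, assuming it means a strict majority of subgoals and that subgoal_status maps subgoal number to status string (both are assumptions, as is the helper name predominantly):

```python
from collections import Counter


def predominantly(status_map: dict[int, str], expected: str) -> bool:
    """True when `expected` is the status of a strict majority of subgoals.

    Assumption: "predominantly" means more than half. The issue does not
    define the term, so the real test may choose a different cutoff.
    """
    if not status_map:
        return False
    counts = Counter(status_map.values())
    return counts[expected] > len(status_map) / 2


# Status map implied by the failing run: closed=[1,2,3,4].
observed = {1: "confirmed", 2: "confirmed", 3: "confirmed", 4: "confirmed"}

# Illustrative only: one outcome that would satisfy the criterion after the fix.
hypothetical_fixed = {
    1: "inconclusive",
    2: "inconclusive",
    3: "confirmed",
    4: "inconclusive",
}
```

In the real test the map would come from running the synthesizer on the 45-task corpus; the literal dictionaries here stand in for that output.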

Why this matters

Without this fix, every overnight run on a broad goal terminates early — exactly the problem #117/#118/#119 were supposed to solve. The architecture works; the prompt-level calibration is the last gate to "actually runs all night."

Metadata

Labels

bug (Something isn't working), enhancement (New feature or request), in-review (PR created, awaiting review)
