Summary
The synthesizer is closing subgoals as confirmed too aggressively on broad-scope goals. Project 2025 overnight test (2026-05-07): synthesizer marked all 4 subgoals confirmed after 45 tasks, terminating what was supposed to be an all-night run after only 13 minutes.
This is the calibration follow-up to #119: subgoal-done tracking is wired correctly, but the prompt threshold for declaring confirmed is too generous when the goal's scope_class is broad or comprehensive.
Reproducer
uv run research start --skip-intake --local \
--goal "Project 2025 implementation tracker: identify which specific policy proposals from the Heritage Foundation's Project 2025 document have been adopted, attempted, withdrawn, or remain pending under the current Trump administration. Organize by federal department (DOJ, DOI, EPA, DHS, State, etc.). For each tracked proposal, surface news coverage, public statements, and any pushback or legal challenges. Prioritize primary sources and date-stamp every finding." \
--max-tasks 1000 --time-cap 10
Observed:
- Initial plan: scope_class=broad, 4 subgoals, 16 search tasks ✓
- 45 tasks executed, 1 drain_replan fired ✓
- Synthesis pass: closed=[1,2,3,4]. Gemma confidently declared every subgoal done after seeing only partial evidence on each.
- Loop terminated. Run ended in ~13 minutes.
The closed subgoals were:
- "Identify core policy pillars and specific proposals from the Project 2025 document"
- "Map identified policies to their respective federal departments and check for implementation status"
- "Collect evidence of legal challenges, news coverage, and official public statements"
- (a fourth one in the same vein)
For a 920-page policy document with proposals across 20+ federal departments, none of these should be closeable after 45 tasks of crawling.
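As a rough sanity check on that claim, the figures quoted above can be turned into per-unit budgets. This is back-of-the-envelope arithmetic only; it assumes tasks are spread evenly, which the run did not necessarily do.

```python
# Figures taken from the text above; even spreading is an assumption.
tasks_executed = 45
subgoals = 4
departments = 20  # the text says "20+", so the real per-department share is smaller
pages = 920

print(tasks_executed / subgoals)          # 11.25 tasks per subgoal
print(tasks_executed / departments)       # 2.25 tasks per department, at most
print(round(pages / tasks_executed, 1))   # 20.4 pages of source document per task
```

Roughly two tasks per department is far too little evidence to treat any department-spanning subgoal as complete.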
Root cause
prompts/synthesizer.md (the v5 trailing-JSON contract from #119) instructs:
confirmed — the findings affirmatively answer the subgoal. Closes it.
inconclusive — findings are insufficient, contradictory, or absent.
The prompt does NOT differentiate by scope. Gemma, which favors decisiveness, defaults to confirmed whenever it can write any affirmative answer, regardless of completeness. For a narrow scope this is fine; for a broad or comprehensive scope it terminates the run prematurely.
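The missing scope distinction can be pictured as a small decision rule. The sketch below is illustrative only: the issue places the rule in the prompt, not in code, and every name here (BROAD_SCOPES, resolve_subgoal_status, closure_bar_met) is hypothetical rather than taken from the research_agent codebase. The completeness conditions themselves are not defined in this text, so the caller supplies them as a single boolean.

```python
# Hypothetical names throughout; nothing here is the project's actual API.
BROAD_SCOPES = frozenset({"broad", "comprehensive"})


def resolve_subgoal_status(scope_class: str, model_status: str, closure_bar_met: bool) -> str:
    """Downgrade a model-reported 'confirmed' to 'inconclusive' on broad scopes.

    closure_bar_met stands in for whatever completeness conditions the fix
    settles on; they are not specified here, so the caller computes them.
    """
    if model_status != "confirmed":
        return model_status  # only 'confirmed' closes a subgoal, so pass the rest through
    if scope_class in BROAD_SCOPES and not closure_bar_met:
        return "inconclusive"
    return "confirmed"


if __name__ == "__main__":
    # The overnight run: scope_class=broad, all four subgoals reported as confirmed.
    reported = {1: "confirmed", 2: "confirmed", 3: "confirmed", 4: "confirmed"}
    resolved = {
        subgoal: resolve_subgoal_status("broad", status, closure_bar_met=False)
        for subgoal, status in reported.items()
    }
    print(resolved)
```

Under this rule a narrow goal keeps today's behavior, while a broad goal stays open until the closure bar is explicitly met.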
Acceptance Criteria
- prompts/synthesizer.md adds a scope-aware closure rule: when the plan's scope_class is broad or comprehensive, default subgoal status to inconclusive unless ALL of: […]
- […] broad-scope corpus into the synthesizer and assert subgoal_status map values are predominantly inconclusive, not confirmed.
- […] inconclusive after the first synthesis, driving drain-replan to fire ≥3 times instead of stopping after 1.
Files
src/research_agent/prompts/synthesizer.md (the closure-status rules)
src/research_agent/orchestrator/synth.py (pass scope_class into the synthesis context)
tests/test_orchestrator_synth.py (new test for the scope-aware threshold)
Why this matters
Without this fix, every overnight run on a broad goal terminates early, which is exactly the problem #117/#118/#119 were supposed to solve. The architecture works; the prompt-level calibration is the last gate to "actually runs all night."
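For the new test, the "predominantly inconclusive" assertion on the subgoal_status map needs a concrete definition. One possible reading is sketched below, assuming the map goes from subgoal id to status string and that "predominantly" means strictly more than half. Both are assumptions, and the helper name is hypothetical.

```python
from collections import Counter


def predominantly_inconclusive(subgoal_status: dict[int, str]) -> bool:
    """True when strictly more than half of the subgoals are 'inconclusive'."""
    if not subgoal_status:
        return False  # an empty map cannot be predominantly anything
    counts = Counter(subgoal_status.values())
    return counts["inconclusive"] * 2 > len(subgoal_status)


if __name__ == "__main__":
    # Outcome observed in the overnight run: every subgoal closed as confirmed.
    observed = {1: "confirmed", 2: "confirmed", 3: "confirmed", 4: "confirmed"}
    print(predominantly_inconclusive(observed))
```

The observed run fails this check, so a test built on it would have caught the premature closure.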