feat(plan): date-ground the planner and critic prompts #95
Merged
The planner had no notion of the current date, so recency-sensitive questions got queries anchored to the model's training-time sense of "recent", and with --since set, the recency filter then culled most of what those stale queries returned. Bench evidence: the "recent" golden question (--since=180d) kept 1 source against a 3-source minimum on the v0.23.0 validation run.

- plannerSystem()/criticSystem() now carry "Today's date: YYYY-MM-DD" (PlanContext.now, injectable for tests).
- Event-shaped sub-queries (releases, announcements, news, versions) are told to use absolute dates instead of "latest"/"recent"; conceptual and scholarly sub-queries are told to stay timeless. Blanket year-anchoring measurably hurt the scholarly bench question (citation support 0.68 -> 0.44, reproduced twice) because bare year tokens distort keyword-matched sources like arXiv/OpenAlex.
- When --since is set, the prompts disclose the cutoff date and that older sources will be dropped, so queries are shaped to the surviving window.
- agent.ts passes sinceMs at both planQueries/critique call sites.

Spot-checked live: "recent" 1/3 -> 3/3 and 4/3 sources (1.00 citation support), "academic" recovered to PASS after the rule was scoped. Full before/after scoreboard to follow once DDG rate-limiting on this IP cools; the after-run hit HTTP 202 throttling mid-bench.
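The prompt changes listed above can be pictured with a minimal sketch. This is not the project's code: only `plannerSystem()`, `PlanContext.now`, and `sinceMs` are named in the commit message, while the `PlanContext` shape, the `isoDay` helper, and all prompt wording here are assumptions made for illustration.

```typescript
// Hypothetical shape: the source names PlanContext.now and sinceMs,
// but the exact interface is an assumption.
interface PlanContext {
  now?: Date;       // injectable clock, so tests can pin the date
  sinceMs?: number; // width of the --since window in milliseconds, if set
}

// Format a Date as YYYY-MM-DD (UTC).
function isoDay(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Build a date-grounded system prompt. The wording is illustrative only.
function plannerSystem(ctx: PlanContext = {}): string {
  const now = ctx.now ?? new Date();
  const lines: string[] = [
    "You plan web search sub-queries for a research question.",
    `Today's date: ${isoDay(now)}`,
    'For event-shaped sub-queries (releases, announcements, news, versions), ' +
      'use absolute dates instead of "latest" or "recent".',
    "Keep conceptual and scholarly sub-queries timeless; do not add bare year tokens.",
  ];
  if (ctx.sinceMs !== undefined) {
    // Disclose the cutoff so queries target the window that survives filtering.
    const cutoff = new Date(now.getTime() - ctx.sinceMs);
    lines.push(
      `Sources published before ${isoDay(cutoff)} will be dropped; ` +
        "shape queries to that window."
    );
  }
  return lines.join("\n");
}

const DAY_MS = 24 * 60 * 60 * 1000;
const prompt = plannerSystem({
  now: new Date("2026-06-12T00:00:00Z"),
  sinceMs: 180 * DAY_MS,
});
console.log(prompt);
```

Injecting `now` keeps the prompt deterministic under test. Omitting `sinceMs` leaves the cutoff sentence out entirely, so runs without --since are not told about a window that does not exist.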
…recent 1/3->2-4/3 (bistable at gate)
Why
The planner had no notion of the current date. Recency-sensitive questions got sub-queries anchored to the model's training-time sense of "recent", and with `--since` set, the post-fetch recency filter then culled most of what those stale queries returned. Bench evidence: the `recent` golden question (`--since=180d`) kept 1 source against a 3-source minimum on the v0.23.0 validation board, deterministically.
What
- `plannerSystem()`/`criticSystem()` now carry `Today's date: YYYY-MM-DD` (injectable `PlanContext.now` for tests).
- With `--since` set, the prompts disclose the cutoff date and that older sources get dropped, so queries are shaped to the surviving window.
- `agent.ts` passes `sinceMs` at both `planQueries`/`critique` call sites.
- Tests: `test/plan-prompt.test.mjs`.
Bench (before → after, healthy DDG both)
Before = committed v0.23.0 validation; after = `bench/results/2026-06-12-v0.25.0-date-grounding.md`, run from an isolated worktree at this branch. The after-run does NOT include 0.24.0's news fallback; the two changes are complementary and independently measured.
Honest read on `recent`: clearly improved (never 1 source again, 0.75–1.00 citation support) but still flaky against its 3-source minimum on this branch alone. v0.24.0's recency-aware news fallback attacks the same weakness from the source side (its board shows `recent` 3/3 @ 1.00); merged together they should hold the gate. If `recent` still flaps on a post-merge board, the next lever is raising the planner's query count for `--since` runs.
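For readers unfamiliar with the gate, the interaction between the post-fetch recency filter and the source minimum can be sketched as follows. The `Source` type, both function names, and the sample records are invented for illustration and are not taken from the project; only the 180-day window and the 3-source minimum come from the description above.

```typescript
// Hypothetical source record; field names are assumptions.
interface Source {
  url: string;
  publishedAt: Date;
}

// Keep only sources published within sinceMs of `now`.
function filterRecent(sources: Source[], now: Date, sinceMs: number): Source[] {
  const cutoff = now.getTime() - sinceMs;
  return sources.filter((s) => s.publishedAt.getTime() >= cutoff);
}

// Gate: the question passes only if enough sources survive the filter.
function meetsMinimum(kept: Source[], minSources: number): boolean {
  return kept.length >= minSources;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const benchNow = new Date("2026-06-12T00:00:00Z");

// Made-up results of the kind a stale, undated query might return.
const fetched: Source[] = [
  { url: "https://example.com/a", publishedAt: new Date("2026-05-01T00:00:00Z") },
  { url: "https://example.com/b", publishedAt: new Date("2025-03-10T00:00:00Z") },
  { url: "https://example.com/c", publishedAt: new Date("2024-11-02T00:00:00Z") },
  { url: "https://example.com/d", publishedAt: new Date("2023-07-19T00:00:00Z") },
];

const kept = filterRecent(fetched, benchNow, 180 * MS_PER_DAY);
console.log(kept.length, meetsMinimum(kept, 3));
```

With these sample dates only one source falls inside the 180-day window, so the gate fails even though four sources were fetched. That is the shape of the failure the date-grounded prompts aim to avoid by steering queries toward the surviving window.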