Commit 31dba97

feat: --since recency filter (drop sources older than a cutoff) (#62)
Builds on v0.14's published-date extraction. --since=<date|duration> (env DEEPDIVE_SINCE) drops a fetched web source whose detected publication date precedes the cutoff — absolute date (2024, 2024-06-15) or duration meaning "that long ago" (30d, 2w). Dateless sources are kept (no penalty for missing metadata); --include / continue sources are exempt. Emits a new `stale` fetch.skipped reason. A supplied-but-unparseable value is a hard error (exit 2). New pure resolveSince (exported); persistable as `since` in the config file. 5 new tests (547 total green).
1 parent 967e2b9 commit 31dba97

10 files changed

Lines changed: 156 additions & 2 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 
 ## [Unreleased]
 
+### Added — `--since` recency filter
+
+- **`--since=<date|duration>`** (env `DEEPDIVE_SINCE`) — drop fetched sources published before a cutoff, building on v0.14's published-date extraction. Accepts an absolute date (`2024`, `2024-06`, `2024-06-15`) or a relative duration meaning "that long ago" (`30d`, `12h`, `2w`). A web source whose detected publication date precedes the cutoff is skipped (new `stale` `fetch.skipped` reason); sources with no detectable date are kept (no penalty for missing metadata). Doesn't apply to `--include` / `continue` sources. New pure `resolveSince` (exported); persistable as `since` in the config file. A supplied-but-unparseable `--since` is a hard error (exit 2), not a silent no-op.
+
 ## [0.14.0] - 2026-06-09
 
 ### Added — config file, named profiles, shell completion

README.md

Lines changed: 6 additions & 0 deletions
@@ -188,6 +188,12 @@ Two signals that help you read a report at a glance.
 
 **Published dates.** When deepdive fetches a page, it tries to recover the page's publication date from the rendered HTML — JSON-LD `datePublished`, `<meta property="article:published_time">` and friends, or a `<time datetime>` element. When it finds one, the source row shows it (`fetched 2026-05-07 · published 2024-03-15`), the HTML export shows it, the JSON carries it as `publishedAt`, **and** the synthesizer sees it — so when sources disagree it can prefer the more recent one and flag claims that come from an older page. Pages that don't expose a date (many SPAs) simply don't get the annotation; nothing breaks.
 
+**Recency filter.** Pass `--since` (or `DEEPDIVE_SINCE`) to drop stale sources outright — an absolute date (`--since=2024`, `--since=2024-06-15`) or a duration meaning "that long ago" (`--since=30d`, `--since=2w`). A fetched page whose detected publication date is before the cutoff is skipped (`stale` in `--verbose`); pages with no detectable date are kept, so a missing-metadata page is never penalized. Useful for fast-moving topics where a 2019 blog post is worse than no answer.
+
+```bash
+deepdive "best way to deploy a node app in 2026" --since=365d --deep
+```
+
 **Confidence.** After each run, alongside the cost line, deepdive prints a one-line coverage read:
 
 ```

src/agent.ts

Lines changed: 22 additions & 1 deletion
@@ -106,6 +106,11 @@ export interface AgentConfig {
   // v0.14.0 — when true, the synthesizer leads with a one-paragraph TL;DR.
   // Opt-in (CLI --tldr); default off keeps output identical to v0.13.
   tldr?: boolean;
+  // v0.15.0 — recency filter. When set, a fetched web source whose extracted
+  // publication date is older than this epoch-ms cutoff is dropped (emits a
+  // `stale` skip). Sources with no extractable date are kept (not penalized
+  // for missing metadata). Does not apply to include[]/preKept sources.
+  sinceMs?: number;
   onEvent?: (event: AgentEvent) => void;
   // Fires for each SSE token emitted by the synthesizer. When set, the agent
   // uses the streaming LLM path for synthesize() calls. CLI callers enable
@@ -137,7 +142,8 @@ export type AgentEvent =
         | "pdf-no-extractor"
         | "pdf-extract-error"
         | "domain-deny"
-        | "domain-not-allowed";
+        | "domain-not-allowed"
+        | "stale";
     }
   | { type: "include.done"; ingested: number; skipped: number }
   | { type: "synthesize.start"; sourceCount: number; round: number }
@@ -443,6 +449,21 @@ export async function runAgent(
     // simply yield undefined — additive, never blocks keeping the source.
     const publishedAt = isPdf ? undefined : extractPublishedDate(f.page.html);
 
+    // Recency filter (--since): drop a source dated before the cutoff.
+    // Dateless sources pass (we don't penalize missing metadata).
+    if (
+      config.sinceMs !== undefined &&
+      publishedAt !== undefined &&
+      publishedAt < config.sinceMs
+    ) {
+      emit(config, {
+        type: "fetch.skipped",
+        url: f.page.finalUrl || f.page.url,
+        reason: "stale",
+      });
+      continue;
+    }
+
     keptSources.push({
       id: keptSources.length + 1,
       url: f.page.finalUrl || f.page.url,
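
The recency check in this hunk reduces to a three-way rule: with no cutoff everything is kept, a source with no date is kept, and only a date strictly earlier than the cutoff is dropped. A minimal standalone sketch of that rule follows. `DatedSource` and `partitionBySince` are hypothetical names used for illustration and are not part of the commit, and the sketch assumes `publishedAt` is epoch milliseconds, as the `<` comparison against `sinceMs` suggests.

```typescript
// Hypothetical shape for illustration; the real agent tracks more fields.
interface DatedSource {
  url: string;
  publishedAt?: number; // assumed epoch ms; absent when no date was detected
}

// Split sources into kept and stale using the same condition as the hunk:
// a source is stale only if a cutoff is set AND a date exists AND it is older.
function partitionBySince(
  sources: DatedSource[],
  sinceMs?: number,
): { kept: DatedSource[]; stale: DatedSource[] } {
  const kept: DatedSource[] = [];
  const stale: DatedSource[] = [];
  for (const s of sources) {
    const isStale =
      sinceMs !== undefined &&
      s.publishedAt !== undefined &&
      s.publishedAt < sinceMs;
    (isStale ? stale : kept).push(s);
  }
  return { kept, stale };
}

const cutoff = Date.UTC(2024, 0, 1);
const { kept, stale } = partitionBySince(
  [
    { url: "https://ex.com/old", publishedAt: Date.UTC(2020, 2, 1) },
    { url: "https://ex.com/fresh", publishedAt: Date.UTC(2026, 4, 1) },
    { url: "https://ex.com/nodate" },
  ],
  cutoff,
);
console.log(kept.map((s) => s.url), stale.map((s) => s.url));
```

Because the comparison is strict, a source dated exactly at the cutoff is kept in this sketch.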

src/cli.ts

Lines changed: 18 additions & 1 deletion
@@ -132,6 +132,10 @@ Flags:
                            exclusively (e.g. github.com,docs.anthropic.com).
   --deny-domain=<list>     Comma-separated hostname suffixes to drop
                            (e.g. pinterest.com,quora.com).
+  --since=<date|duration>  Drop sources published before this — an absolute
+                           date (2024, 2024-06, 2024-06-15) or a duration
+                           (30d, 12h, 2w = that long ago). Sources with no
+                           detectable date are kept. Env: DEEPDIVE_SINCE.
   --api-format=<anthropic|openai>
                            Wire format for the LLM endpoint. Default:
                            auto-detected from --base-url (api.openai.com,
@@ -163,7 +167,7 @@ Environment:
   DEEPDIVE_NO_VERIFY_CITES, DEEPDIVE_STRICT_CITES, DEEPDIVE_CITE_MIN_RECALL,
   DEEPDIVE_NO_COST, DEEPDIVE_PRICE_INPUT_PER_MTOK, DEEPDIVE_PRICE_OUTPUT_PER_MTOK,
   DEEPDIVE_INCLUDE, DEEPDIVE_PDF_MAX_PAGES,
-  DEEPDIVE_ALLOW_DOMAIN, DEEPDIVE_DENY_DOMAIN, DEEPDIVE_API_FORMAT,
+  DEEPDIVE_ALLOW_DOMAIN, DEEPDIVE_DENY_DOMAIN, DEEPDIVE_SINCE, DEEPDIVE_API_FORMAT,
   DEEPDIVE_NO_SESSIONS, DEEPDIVE_SESSIONS_DIR, DEEPDIVE_CONFIG
 
 Config file:
@@ -373,6 +377,9 @@ export function parseArgs(argv: string[]): ParsedArgs {
       case "profile":
         flags.profile = value;
         break;
+      case "since":
+        flags.since = value;
+        break;
       case "format":
         flags.format = value.toLowerCase();
         break;
@@ -676,6 +683,15 @@ interface RunResearchOptions {
 
 async function runResearch(opts: RunResearchOptions): Promise<number> {
   const { question, parsed, config, preKept, parentId } = opts;
+  // A --since value that was supplied but didn't parse is a user error — fail
+  // loud rather than silently running with no recency filter.
+  if (config.sinceRaw && config.sinceMs === undefined) {
+    process.stderr.write(
+      `deepdive: --since must be a date (2024, 2024-06, 2024-06-15) or a duration ` +
+        `(30d, 12h, 2w); got: ${config.sinceRaw}\n`,
+    );
+    return 2;
+  }
   const search = await resolveSearchAdapter(config.searchAdapter, process.env);
   const cache = config.cache.enabled
     ? createCache({ dir: config.cache.dir, ttlMs: config.cache.ttlMs })
@@ -725,6 +741,7 @@ async function runResearch(opts: RunResearchOptions): Promise<number> {
     include: config.include,
     domainFilter: config.domainFilter,
     tldr: config.tldr,
+    sinceMs: config.sinceMs,
     env: process.env,
     onEvent: (e) => {
       if (config.verbose) process.stderr.write(renderEvent(e) + "\n");
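
The guard at the top of `runResearch` separates three states: `--since` not supplied, supplied and parsed, and supplied but unparseable. Only the last one is an error. A minimal sketch of that decision is below; `sinceExitCode` is a hypothetical helper name (the commit inlines the check rather than factoring it out), and the error sink is injected so the sketch can run without touching `process.stderr`.

```typescript
// Hypothetical helper mirroring the inline guard: returns 2 when a value was
// supplied but did not resolve to a cutoff, otherwise 0 (continue the run).
function sinceExitCode(
  sinceRaw: string | undefined,
  sinceMs: number | undefined,
  writeErr: (msg: string) => void = (msg) => console.error(msg),
): 0 | 2 {
  if (sinceRaw && sinceMs === undefined) {
    writeErr(
      "deepdive: --since must be a date (2024, 2024-06, 2024-06-15) or a duration " +
        `(30d, 12h, 2w); got: ${sinceRaw}`,
    );
    return 2;
  }
  return 0;
}

const errors: string[] = [];
const collect = (msg: string): void => {
  errors.push(msg);
};
const codeUnset = sinceExitCode(undefined, undefined, collect);
const codeParsed = sinceExitCode("2024", Date.UTC(2024, 0, 1), collect);
const codeJunk = sinceExitCode("whenever", undefined, collect);
console.log(codeUnset, codeParsed, codeJunk, errors.length);
```

Since the guard tests truthiness, an empty `sinceRaw` behaves like an unset one in this sketch.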

src/config-file.ts

Lines changed: 1 addition & 0 deletions
@@ -67,6 +67,7 @@ const KEY_MAP: Record<string, { env: string; kind: Kind }> = {
   sessionsDir: { env: "DEEPDIVE_SESSIONS_DIR", kind: "string" },
   allowDomain: { env: "DEEPDIVE_ALLOW_DOMAIN", kind: "list" },
   denyDomain: { env: "DEEPDIVE_DENY_DOMAIN", kind: "list" },
+  since: { env: "DEEPDIVE_SINCE", kind: "string" },
   include: { env: "DEEPDIVE_INCLUDE", kind: "list" },
   tldr: { env: "DEEPDIVE_TLDR", kind: "bool" },
   strictCites: { env: "DEEPDIVE_STRICT_CITES", kind: "bool" },

src/config.ts

Lines changed: 12 additions & 0 deletions
@@ -9,6 +9,7 @@ import { parseDomainList, type DomainFilter } from "./domain-filter.js";
 import { detectApiFormat, type ApiFormat } from "./llm-format.js";
 import { defaultSessionsDir } from "./sessions.js";
 import { parseMaxCost } from "./budget.js";
+import { resolveSince } from "./dates.js";
 
 export interface RuntimeConfig {
   llm: LLMConfig;
@@ -41,6 +42,12 @@ export interface RuntimeConfig {
   maxCostUsd?: number;
   // v0.14.0 — lead the answer with a one-paragraph TL;DR. Opt-in.
   tldr: boolean;
+  // v0.15.0 — recency cutoff (epoch ms). Sources dated before this are
+  // dropped. Undefined = no recency filter. Set via --since / DEEPDIVE_SINCE.
+  sinceMs?: number;
+  // The raw --since value (if any), so the CLI can distinguish "not set" from
+  // "set but unparseable" and error on the latter.
+  sinceRaw?: string;
 }
 
 export interface CLIFlags {
@@ -79,6 +86,7 @@ export interface CLIFlags {
   noStream?: boolean;
   verbose?: boolean;
   tldr?: boolean;
+  since?: string;
   // v0.11.0 — already-parsed budget cap in USD. CLI parser converts
   // "--max-cost=$0.50" / "$5" / "0.25" into a number; resolveConfig
   // accepts the parsed value (parseMaxCost lives in budget.ts and the
@@ -251,6 +259,8 @@ export function resolveConfig(
   const streamEnabled = !streamOptOut && !jsonOutput;
   const verbose = flags.verbose ?? env.DEEPDIVE_VERBOSE === "1";
   const tldr = flags.tldr ?? env.DEEPDIVE_TLDR === "1";
+  const sinceRaw = flags.since ?? env.DEEPDIVE_SINCE;
+  const sinceMs = sinceRaw ? resolveSince(sinceRaw) : undefined;
 
   // v0.11.0 — budget cap. Flag takes a pre-parsed number from cli.ts
   // (which uses parseMaxCost on the raw string). Env var is parsed here.
@@ -298,6 +308,8 @@ export function resolveConfig(
     verbose,
     maxCostUsd,
     tldr,
+    sinceMs,
+    sinceRaw,
   };
 }
 
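
`resolveConfig` takes the flag when present and falls back to `DEEPDIVE_SINCE`, then carries both the raw string and the parsed cutoff so the CLI can tell "unset" from "unparseable". The sketch below isolates that step under stated assumptions: `pickSince`, `SinceResolution`, and the toy `yearOnly` parser are hypothetical stand-ins, whereas the real code calls `resolveSince` directly.

```typescript
// Hypothetical result shape; the commit stores these two fields on RuntimeConfig.
interface SinceResolution {
  sinceRaw: string | undefined;
  sinceMs: number | undefined;
}

// Flag wins over env; the parser only runs when some non-empty value exists.
function pickSince(
  flagValue: string | undefined,
  envValue: string | undefined,
  parse: (raw: string) => number | undefined,
): SinceResolution {
  const sinceRaw = flagValue ?? envValue;
  const sinceMs = sinceRaw ? parse(sinceRaw) : undefined;
  return { sinceRaw, sinceMs };
}

// Toy parser for the demo only: understands a bare 4-digit year and nothing else.
const yearOnly = (raw: string): number | undefined =>
  /^\d{4}$/.test(raw) ? Date.UTC(Number(raw), 0, 1) : undefined;

const fromFlag = pickSince("2024", "2020", yearOnly);
const fromEnv = pickSince(undefined, "2020", yearOnly);
const unset = pickSince(undefined, undefined, yearOnly);
const junk = pickSince("whenever", undefined, yearOnly);
console.log(fromFlag, fromEnv, unset, junk);
```

Keeping `sinceRaw` alongside `sinceMs` is what lets the `junk` case be reported as an error later instead of silently disabling the filter.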
303315

src/dates.ts

Lines changed: 23 additions & 0 deletions
@@ -180,6 +180,29 @@ function walk(node: unknown, out: { published?: string; modified?: string }): vo
   }
 }
 
+// Exported for unit tests. Resolve a `--since` value to an absolute epoch-ms
+// cutoff. Accepts a relative duration with an explicit unit (`30d`, `2w`,
+// `12h`) — interpreted as "now minus that" — or an absolute date (`2024`,
+// `2024-06`, `2024-06-15`). A bare 4-digit number is treated as a YEAR, not a
+// day count, since `--since=2024` overwhelmingly means "since 2024". Returns
+// undefined for unparseable input.
+export function resolveSince(value: string, now: number = Date.now()): number | undefined {
+  const v = value.trim().toLowerCase();
+  const dur = /^(\d+)\s*(w|d|h|m|s)$/.exec(v);
+  if (dur) {
+    const mult: Record<string, number> = {
+      s: 1000,
+      m: 60_000,
+      h: 3_600_000,
+      d: 86_400_000,
+      w: 604_800_000,
+    };
+    return now - Number(dur[1]) * mult[dur[2]];
+  }
+  if (/^\d{4}$/.test(v)) return toEpoch(`${v}-01-01`, now);
+  return toEpoch(v, now);
+}
+
 // Exported for unit tests. Parse a date string to epoch ms, rejecting values
 // outside [1990-01-01, now + 2 days]. Bare YYYY / YYYY-MM are accepted.
 export function toEpoch(s: string, now: number = Date.now()): number | undefined {
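
To make the resolution order concrete (duration first, then bare year, then any other absolute date), here is a self-contained sketch evaluated against a fixed `now`. `toEpochSketch` is a hypothetical stand-in for the project's `toEpoch`, whose body lies outside this hunk. It is assumed to accept only `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` in UTC and to enforce the `[1990-01-01, now + 2 days]` window stated in the comment; the real function may accept more formats.

```typescript
const DAY_MS = 86_400_000;

// Hypothetical stand-in for toEpoch (assumption: ISO-style dates only, UTC).
function toEpochSketch(s: string, now: number): number | undefined {
  const m = /^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$/.exec(s);
  if (!m) return undefined;
  const month = Number(m[2] ?? "01");
  const day = Number(m[3] ?? "01");
  if (month < 1 || month > 12 || day < 1 || day > 31) return undefined;
  const t = Date.UTC(Number(m[1]), month - 1, day);
  const inRange = t >= Date.UTC(1990, 0, 1) && t <= now + 2 * DAY_MS;
  return inRange ? t : undefined;
}

// Same unit table as the hunk: note that "m" is minutes, not months.
const UNIT_MS = new Map<string, number>([
  ["s", 1000],
  ["m", 60_000],
  ["h", 3_600_000],
  ["d", DAY_MS],
  ["w", 7 * DAY_MS],
]);

// Sketch of the resolution order used by resolveSince.
function resolveSinceSketch(value: string, now: number): number | undefined {
  const v = value.trim().toLowerCase();
  const dur = /^(\d+)\s*(w|d|h|m|s)$/.exec(v);
  if (dur) {
    const unitMs = UNIT_MS.get(dur[2] ?? "");
    return unitMs === undefined ? undefined : now - Number(dur[1]) * unitMs;
  }
  if (/^\d{4}$/.test(v)) return toEpochSketch(`${v}-01-01`, now);
  return toEpochSketch(v, now);
}

const NOW = Date.UTC(2026, 5, 1);
for (const input of ["30d", "2w", "2024", "2024-06-15", "whenever", "3999"]) {
  const ms = resolveSinceSketch(input, NOW);
  console.log(input, "->", ms === undefined ? "undefined" : new Date(ms).toISOString());
}
```

Two details are easy to miss: the regex accepts `m` and `s` units even though the help text lists only `d`, `h`, and `w`, and `6m` therefore means six minutes ago rather than six months.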

src/index.ts

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ export {
   metaTags,
   jsonLdDates,
   toEpoch,
+  resolveSince,
 } from "./dates.js";
 export {
   assessConfidence,

test/agent-loop.test.mjs

Lines changed: 47 additions & 0 deletions
@@ -1013,6 +1013,53 @@ test("agent: maxCostUsd aborts after the call that crosses the cap", async () =>
   }
 });
 
+test("agent: --since drops sources dated before the cutoff, keeps fresh + dateless", async () => {
+  const planJson = '{"queries":["q1"]}';
+  const synthText = "Answer [1][2].";
+  const { server } = makeLLMServer([planJson, synthText]);
+  const baseUrl = await startServer(server);
+
+  const meta = (d) =>
+    `<html><head><meta property="article:published_time" content="${d}"></head><body>${LOREM}</body></html>`;
+  const search = mockSearch({
+    q1: [
+      { url: "https://ex.com/old", title: "Old", snippet: "" },
+      { url: "https://ex.com/fresh", title: "Fresh", snippet: "" },
+      { url: "https://ex.com/nodate", title: "NoDate", snippet: "" },
+    ],
+  });
+  const pages = {
+    "https://ex.com/old": { text: LOREM, title: "Old", html: meta("2020-03-01") },
+    "https://ex.com/fresh": { text: LOREM, title: "Fresh", html: meta("2026-05-01") },
+    "https://ex.com/nodate": { text: LOREM, title: "NoDate" }, // html has no date
+  };
+  const skipped = [];
+  try {
+    const result = await runAgent("q", {
+      llm: { baseUrl, apiKey: "t", model: "test", maxTokens: 512 },
+      search,
+      browser: { headless: true, timeoutMs: 5000, maxBytes: 1_000_000 },
+      resultsPerQuery: 5,
+      maxSources: 12,
+      maxWordsPerSource: 2000,
+      deepRounds: 0,
+      concurrency: 2,
+      sinceMs: Date.UTC(2024, 0, 1),
+      browserFactory: mockBrowserFactory(pages),
+      onEvent: (e) => {
+        if (e.type === "fetch.skipped") skipped.push(e);
+      },
+    });
+    const urls = result.sources.map((s) => s.url).sort();
+    assert.deepEqual(urls, ["https://ex.com/fresh", "https://ex.com/nodate"]);
+    assert.equal(skipped.length, 1);
+    assert.equal(skipped[0].reason, "stale");
+    assert.equal(skipped[0].url, "https://ex.com/old");
+  } finally {
+    await stopServer(server);
+  }
+});
+
 test("agent: undefined maxCostUsd means no cap, run completes", async () => {
   // Same setup as above but no cap — should finish.
   const planJson =

test/dates.test.mjs

Lines changed: 22 additions & 0 deletions
@@ -7,6 +7,7 @@ import {
   metaTags,
   jsonLdDates,
   toEpoch,
+  resolveSince,
 } from "../dist/dates.js";
 
 const NOW = Date.UTC(2026, 5, 1);
@@ -93,3 +94,24 @@ test("extractPublishedDate: a future/garbage meta date is rejected, not returned
   const html = `<meta name="date" content="3999-01-01">`;
   assert.equal(extractPublishedDate(html, NOW), undefined);
 });
+
+// ── resolveSince ─────────────────────────────────────────────────────────────
+
+test("resolveSince: a duration with a unit is relative to now", () => {
+  assert.equal(resolveSince("30d", NOW), NOW - 30 * 86_400_000);
+  assert.equal(resolveSince("2w", NOW), NOW - 14 * 86_400_000);
+  assert.equal(resolveSince("12h", NOW), NOW - 12 * 3_600_000);
+});
+
+test("resolveSince: a bare 4-digit value is a YEAR, not a day count", () => {
+  assert.equal(iso(resolveSince("2024", NOW)), "2024-01-01");
+});
+
+test("resolveSince: absolute dates parse", () => {
+  assert.equal(iso(resolveSince("2024-06-15", NOW)), "2024-06-15");
+});
+
+test("resolveSince: junk returns undefined", () => {
+  assert.equal(resolveSince("whenever", NOW), undefined);
+  assert.equal(resolveSince("3999", NOW), undefined); // out of range year
+});
