feat: respect robots.txt before fetching pages by askalf · Pull Request #8 · askalf/deepdive

askalf · 2026-04-23T01:30:24Z

Summary

Production-grade crawlers check robots.txt. deepdive's per-query fetch volume is low (~12 URLs) but it's still the polite thing; sites with explicit scraper deny rules shouldn't be surprised.

Behavior

Before every agent.fetchOne, check <scheme>://<host>/robots.txt with User-Agent deepdive-bot:

deny → skip URL, emit fetch.skipped event
allow / unknown → proceed
Network error → fall back to unknown (err on the side of fetching; publishers who care have working robots.txt)

robots.txt content is cached in-memory per run (one GET per origin).
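
The decision flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `makeRobotsGate`, `FetchText`, and `RuleCheck` are hypothetical names, the robots.txt fetcher and the rule evaluator are injected so the sketch stays self-contained, and the handling of 404, non-http, and malformed URLs follows the cases listed in the test plan.

```typescript
// Hypothetical sketch; these names are NOT deepdive's actual exports.
type Verdict = "allow" | "deny" | "unknown";

// Assumed contract: resolves to the robots.txt body, resolves to null when the
// file is missing (e.g. a 404), and rejects on a network error.
type FetchText = (robotsUrl: string) => Promise<string | null>;

// Assumed contract: true when the parsed rules permit the given path.
type RuleCheck = (robotsBody: string, path: string) => boolean;

function makeRobotsGate(fetchText: FetchText, isAllowed: RuleCheck) {
  // One pending-or-settled lookup per origin, so each origin costs one GET per run.
  // `undefined` inside the promise marks a network error.
  const cache = new Map<string, Promise<string | null | undefined>>();

  return async function check(rawUrl: string): Promise<Verdict> {
    let url: URL;
    try {
      url = new URL(rawUrl);
    } catch {
      return "allow"; // malformed URL: there is no origin to consult
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      return "allow"; // robots.txt only applies to http(s) targets
    }

    let pending = cache.get(url.origin);
    if (pending === undefined) {
      pending = fetchText(`${url.origin}/robots.txt`).catch(() => undefined);
      cache.set(url.origin, pending);
    }

    const body = await pending;
    if (body === undefined) return "unknown"; // network error: caller proceeds
    if (body === null) return "allow"; // no robots.txt published
    return isAllowed(body, url.pathname + url.search) ? "allow" : "deny";
  };
}
```

A caller would skip the URL and emit `fetch.skipped` only on `"deny"`; both `"allow"` and `"unknown"` fall through to the normal fetch. Caching the promise rather than the resolved body means concurrent checks against the same origin share a single request.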

Opt-out

--ignore-robots / DEEPDIVE_IGNORE_ROBOTS=1 bypasses entirely — for operators with their own relationship to the target.

Parser

User-agent blocks with case-insensitive substring match (exact agent beats *)
Disallow + Allow with longest-prefix wins, ties go to Allow (RFC 9309)
Empty Disallow: grants everything
Wildcard * in paths, $ end-anchor
Crawl-delay field captured (not yet enforced — would need per-host pacing in the fetch worker-pool)
# comments stripped
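
The path-matching rules in this list can be illustrated with a short sketch. It assumes the rules for the selected User-agent group have already been parsed into `{ allow, pattern }` pairs; `Rule`, `patternToRegExp`, and `isPathAllowedSketch` are hypothetical names for illustration and do not reflect the actual signature of the exported `isPathAllowed`.

```typescript
// Hypothetical sketch of longest-match evaluation; not the PR's implementation.
interface Rule {
  allow: boolean; // true for an Allow line, false for a Disallow line
  pattern: string; // path pattern as written in robots.txt
}

// "*" matches any run of characters; a trailing "$" anchors the end of the path.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const core = anchored ? pattern.slice(0, -1) : pattern;
  const source = core
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

function isPathAllowedSketch(rules: Rule[], path: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (rule.pattern === "") continue; // empty Disallow restricts nothing
    if (!patternToRegExp(rule.pattern).test(path)) continue;
    const longer = best === undefined || rule.pattern.length > best.pattern.length;
    const tieForAllow =
      best !== undefined && rule.pattern.length === best.pattern.length && rule.allow;
    if (longer || tieForAllow) best = rule;
  }
  // No matching rule means the path is allowed by default.
  return best === undefined ? true : best.allow;
}

const rules: Rule[] = [
  { allow: false, pattern: "/private" },
  { allow: true, pattern: "/private/public" },
  { allow: false, pattern: "/*.pdf$" },
];
console.log(isPathAllowedSketch(rules, "/private/notes")); // false
console.log(isPathAllowedSketch(rules, "/private/public/page")); // true
console.log(isPathAllowedSketch(rules, "/docs/file.pdf")); // false
console.log(isPathAllowedSketch(rules, "/docs/file.pdf?x=1")); // true
```

Compiling patterns to regular expressions keeps the sketch short, but a pattern with many wildcards can backtrack heavily, so a production matcher may prefer a hand-written linear scan.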

New library exports

canFetch, createRobotsCache, parseRobotsTxt, isPathAllowed, DEFAULT_USER_AGENT, and types RobotsCache, ParsedRobots, RobotsCheckResult, CanFetchOptions.

Test plan

npm run build — clean under strict: true
npm test — 181 pass (up from 164), 0 fail
17 new assertions:
- Parser (11): empty, disallow-all, path prefix, allow-overrides-disallow, UA-specific beats *, UA substring+case-insensitive, comments, empty-Disallow grants, wildcard+$ patterns, Crawl-delay captured, malformed lines skipped
- canFetch integration (6): 200+Disallow→deny, 404→allow, network-error→unknown, cache prevents duplicate fetches, non-http allowed, malformed URL allowed

Production-grade crawlers check robots.txt. deepdive's per-query fetch volume is low (~12 URLs) but it's still the polite thing; sites with explicit scraper deny rules shouldn't be surprised.

Behavior:
- Before every agent.fetchOne, we check <scheme>://<host>/robots.txt with User-Agent "deepdive-bot" (configurable via AgentConfig.robotsUserAgent).
- On "deny", skip the URL + emit a new fetch.skipped event so --verbose output shows the skip reason.
- On "allow" or "unknown", proceed as before.
- robots.txt content is cached in-memory per run (one GET per origin).
- Network errors fetching robots.txt err on the side of "fetch" rather than "deny": publishers who care have working robots.txt.

Opt-out: --ignore-robots / DEEPDIVE_IGNORE_ROBOTS=1 bypasses the check entirely (for operators with their own relationship to the target).

Parser supports: User-agent blocks (case-insensitive substring match, exact agent beats *), Disallow + Allow with longest-prefix wins (ties go to Allow per RFC 9309), empty Disallow = allow everything, wildcard * in paths, $ end-anchor, Crawl-delay field, # comments.

Tests: 17 new assertions (12 parser unit, 5 canFetch integration, 2 CLI). 198 total.

askalf enabled auto-merge (squash) April 23, 2026 01:30

Merge branch 'master' into feat/robots-txt

b88ff18

askalf merged commit bb48b68 into master Apr 23, 2026
4 checks passed

askalf deleted the feat/robots-txt branch April 23, 2026 21:43

This was referenced Apr 23, 2026

fix: 8th ReDoS — doctor.ts uses trimTrailingSlashes helper #9

Merged

ci: foundation parity with dario / claude-bridge (actionlint, dependabot, stale, typecheck) #10

Merged

askalf merged 2 commits into master from feat/robots-txt
