feat: respect robots.txt before fetching pages #8

Merged: askalf merged 2 commits into master from feat/robots-txt on Apr 23, 2026

Conversation

@askalf (Owner) commented Apr 23, 2026

Summary

Production-grade crawlers check robots.txt. deepdive's per-query fetch volume is low (~12 URLs), but checking is still the polite thing to do; sites with explicit scraper deny rules shouldn't be surprised.

Behavior

Before every agent.fetchOne, check <scheme>://<host>/robots.txt with User-Agent deepdive-bot:

  • deny → skip URL, emit fetch.skipped event
  • allow / unknown → proceed
  • Network error → fall back to unknown (err on the side of fetching; publishers who care have working robots.txt)

robots.txt content is cached in-memory per run (one GET per origin).
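
For reviewers who want the control flow at a glance, here is a minimal sketch of the check-then-fetch decision with a per-origin cache. It is not the code in this PR: `createOriginCache`, `FetchText`, `Judge`, and `RobotsState` are hypothetical names, the `judge` callback stands in for the real parser, and treating non-200/non-404 statuses as "unknown" is an assumption.

```typescript
// Hypothetical names throughout; a sketch, not deepdive's actual implementation.
type Verdict = "allow" | "deny" | "unknown";

// Stand-in for a real robots.txt parser and matcher.
type Judge = (robotsTxt: string, path: string) => "allow" | "deny";

type FetchText = (url: string) => Promise<{ status: number; body: string }>;

type RobotsState =
  | { kind: "rules"; body: string }
  | { kind: "missing" }
  | { kind: "unreachable" };

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null; // malformed input
  }
}

function createOriginCache(fetchText: FetchText, judge: Judge) {
  // One lookup per origin for the lifetime of the cache (i.e. one run).
  // Storing the promise also dedupes concurrent checks for the same origin.
  const byOrigin = new Map<string, Promise<RobotsState>>();

  async function load(origin: string): Promise<RobotsState> {
    try {
      const res = await fetchText(`${origin}/robots.txt`);
      if (res.status === 200) return { kind: "rules", body: res.body };
      if (res.status === 404) return { kind: "missing" };
      return { kind: "unreachable" }; // assumption: other statuses count as unknown
    } catch {
      return { kind: "unreachable" }; // network error
    }
  }

  async function check(rawUrl: string): Promise<Verdict> {
    const url = parseUrl(rawUrl);
    if (url === null) return "allow"; // malformed URL: nothing to look up
    if (url.protocol !== "http:" && url.protocol !== "https:") return "allow";

    let pending = byOrigin.get(url.origin);
    if (pending === undefined) {
      pending = load(url.origin);
      byOrigin.set(url.origin, pending);
    }

    const state = await pending;
    if (state.kind === "unreachable") return "unknown";
    if (state.kind === "missing") return "allow";
    return judge(state.body, url.pathname + url.search);
  }

  return { check };
}
```

A caller would skip the URL only when `check` resolves to `"deny"`; both `"allow"` and `"unknown"` proceed to the fetch.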

Opt-out

--ignore-robots / DEEPDIVE_IGNORE_ROBOTS=1 bypasses entirely — for operators with their own relationship to the target.
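
As a rough illustration of how the two opt-out switches could be combined, the sketch below uses a hypothetical helper, `shouldIgnoreRobots`. Only the flag name and the environment variable name come from this PR; it assumes that only the value `1` enables the bypass, which may not match the real CLI's handling of other values.

```typescript
// Hypothetical helper: the flag and variable names come from the PR, the function does not.
function shouldIgnoreRobots(
  argv: readonly string[],
  env: Record<string, string | undefined>,
): boolean {
  // Assumption: only the literal value "1" turns the bypass on.
  return argv.includes("--ignore-robots") || env["DEEPDIVE_IGNORE_ROBOTS"] === "1";
}
```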

Parser

  • User-agent blocks with case-insensitive substring match (exact agent beats *)
  • Disallow + Allow with longest-prefix wins, ties go to Allow (RFC 9309)
  • Empty Disallow: grants everything
  • Wildcard * in paths, $ end-anchor
  • Crawl-delay field captured (not yet enforced — would need per-host pacing in the fetch worker-pool)
  • # comments stripped
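
The path-matching rules above can be sketched roughly as follows. This is an independent illustration of longest-match-wins with `*` and `$` support, not the parser shipped in this PR; `Rule`, `patternToRegExp`, and `isAllowedSketch` are hypothetical names, and pattern length is used as the measure of specificity.

```typescript
// Hypothetical types and names; a sketch of the matching rules, not the PR's parser.
interface Rule {
  type: "allow" | "disallow";
  pattern: string;
}

// Compile a robots.txt path pattern: `*` matches any run of characters,
// a trailing `$` anchors the end; otherwise the pattern is a prefix match.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

function isAllowedSketch(rules: readonly Rule[], path: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (rule.pattern === "") continue; // an empty Disallow restricts nothing
    if (!patternToRegExp(rule.pattern).test(path)) continue;
    const longer = best === undefined || rule.pattern.length > best.pattern.length;
    const tieToAllow =
      best !== undefined &&
      rule.pattern.length === best.pattern.length &&
      rule.type === "allow";
    if (longer || tieToAllow) best = rule;
  }
  // No matching rule means the path is allowed.
  return best === undefined || best.type === "allow";
}
```

For example, with `Disallow: /private` and `Allow: /private/public`, the path `/private/public/x` is allowed because the Allow pattern is longer, while `/private/x` is denied.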

New library exports

canFetch, createRobotsCache, parseRobotsTxt, isPathAllowed, DEFAULT_USER_AGENT, and types RobotsCache, ParsedRobots, RobotsCheckResult, CanFetchOptions.

Test plan

  • npm run build — clean under strict: true
  • npm test — 181 pass (up from 164), 0 fail
  • 17 new assertions:
    • Parser (11): empty, disallow-all, path prefix, allow-overrides-disallow, UA-specific beats *, UA substring+case-insensitive, comments, empty-Disallow grants, wildcard+$ patterns, Crawl-delay captured, malformed lines skipped
    • canFetch integration (6): 200+Disallow→deny, 404→allow, network-error→unknown, cache prevents duplicate fetches, non-http allowed, malformed URL allowed

Production-grade crawlers check robots.txt. deepdive's per-query fetch
volume is low (~12 URLs) but it's still the polite thing; sites with
explicit scraper deny rules shouldn't be surprised.

Behavior:
- Before every agent.fetchOne, we check <scheme>://<host>/robots.txt
  with User-Agent "deepdive-bot" (configurable via
  AgentConfig.robotsUserAgent).
- On "deny", skip the URL + emit a new fetch.skipped event so --verbose
  output shows the skip reason.
- On "allow" or "unknown", proceed as before.
- robots.txt content is cached in-memory per run (one GET per origin).
- Network errors fetching robots.txt err on the side of "fetch" rather
  than "deny" — publishers who care have working robots.txt.

Opt-out: --ignore-robots / DEEPDIVE_IGNORE_ROBOTS=1 bypasses the check
entirely (for operators with their own relationship to the target).

Parser supports: User-agent blocks (case-insensitive substring match,
exact agent beats *), Disallow + Allow with longest-prefix wins (ties
go to Allow per RFC 9309), empty Disallow = allow everything,
wildcard * in paths, $ end-anchor, Crawl-delay field, # comments.
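
A possible reading of the User-agent block selection, sketched with hypothetical names (`Group`, `selectGroup`). It assumes the match direction is "the robots.txt token is a substring of the crawler's user agent", which this message does not state explicitly, and it picks the first matching block in file order.

```typescript
// Hypothetical shape for a parsed User-agent block; not the PR's ParsedRobots type.
interface Group {
  agents: string[]; // values of the User-agent lines that open the block
  label: string;    // stand-in for the block's rules
}

function selectGroup(groups: readonly Group[], userAgent: string): Group | undefined {
  const ua = userAgent.toLowerCase();
  let wildcard: Group | undefined;
  let specific: Group | undefined;
  for (const group of groups) {
    for (const agent of group.agents) {
      const token = agent.trim().toLowerCase();
      if (token === "*") {
        if (wildcard === undefined) wildcard = group;
      } else if (token !== "" && ua.includes(token)) {
        // Assumption: the robots.txt token must appear inside our user agent.
        if (specific === undefined) specific = group;
      }
    }
  }
  // A block naming our agent beats the catch-all `*` block.
  return specific !== undefined ? specific : wildcard;
}
```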

Tests: 17 new assertions (12 parser unit, 5 canFetch integration, 2
CLI). 198 total.
@askalf askalf enabled auto-merge (squash) April 23, 2026 01:30
@askalf askalf merged commit bb48b68 into master Apr 23, 2026
4 checks passed
@askalf askalf deleted the feat/robots-txt branch April 23, 2026 21:43