DOAB harvest gap: ~20.6k active records missing (and ~7.3k stale records served) — coverage audit findings #1151

@rdhyee

Why

Direct ID-level set-diff between unglue.it's Identifier(type='doab') rows and DOAB's authoritative OAI ListIdentifiers feed (run 2026-05-14, post-deploy of #1146/#1148/#1150) shows the rate-limited harvest period left a real coverage gap that the cron's 3-day rolling window cannot self-heal. The "we're caught up" signal from the cron log (clean runs since the deploy) is true going forward but doesn't reflect what was missed during the rate-limit window.

Related context: #1143, #1144, #1146, #1147, #1148, #1149, #1150 (the OAI 429 visibility / Retry-After / fork-pin / circuit-breaker series).

Measurement

Pulled all 125,585 <header> entries from https://directory.doabooks.org/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc with status tracking (manual paginator, ~17 min, 0.3s pacing, Retry-After-respecting). Compared against Identifier.objects.filter(type='doab').values_list('value', flat=True) on prod.

unglue.it DOAB identifiers:        99,393
DOAB active (OAI, status != deleted):  110,792
DOAB deleted (OAI status = deleted):    14,793

=== ACTUAL GAP ===
Active in DOAB, missing from us:   20,608   (18.6% of active DOAB)
In us but DOAB marks deleted:       7,267   (stale records still served to users)
In us but absent from DOAB OAI:     1,942   (orphans — predate OAI feed?)

Note: the doab-check website's "99,451 books and book chapters" count and our 99,393 looked deceptively close in a totals-only comparison. They were close because the OAI feed includes ~15k deleted records that mask the real active-record gap. ID-level set-diff is the only reliable check.
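For reference, the three buckets above reduce to plain set operations. This is a minimal sketch, not the script at /tmp/doab_finaldiff.py; the function name and the toy IDs are made up for illustration, and it assumes both sides have already been normalized to the same ID form.

```python
def classify_coverage(local_ids, oai_active, oai_deleted):
    """Split IDs into the three gap buckets used in this audit.

    Assumes all three inputs use the same identifier form (hypothetical
    helper; any OAI prefix stripping must happen before this call).
    """
    local = set(local_ids)
    active = set(oai_active)
    deleted = set(oai_deleted)
    return {
        # Active in DOAB, missing from us
        "missing": active - local,
        # In us, but DOAB marks the record deleted
        "stale": local & deleted,
        # In us, but absent from the OAI feed entirely
        "orphans": local - active - deleted,
    }


# Toy data, not real DOAB IDs
report = classify_coverage(
    local_ids=["a", "b", "c", "x"],
    oai_active=["a", "b", "d", "e"],
    oai_deleted=["c"],
)
for bucket, ids in report.items():
    print(bucket, sorted(ids))
```

Comparing totals alone would call the toy case "4 local vs 4 active" and miss all three buckets, which is the same trap as the 99,451 vs 99,393 comparison.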

Distribution — where the loss happened

DOAB IDs are roughly chronological. Missing-records-as-percent-of-active by ID range:

ID range      Missing    % gap
20k–60k       35–443     0.4–5.8% (background noise)
60k–70k       371        4.4%
70k–80k       1,104      13.9%
80k–100k      4,935      22–32%
100k–120k     495        6–10%
120k–140k     3,061      25–27%
140k–150k     804        12%
150k–180k     9,095      34–48% (most recent records)

Pattern is unambiguous: gap is concentrated in newer record ranges. The recovery spikes visible in /var/log/regluit/doab-harvest.log (2026-04-23: 464 new, 2026-05-06: 478 new) helped but didn't fully catch up. The cron looks back 3 days on each run, so a record whose datestamp fell in a window where every run covering it hit a 429 was dropped from the incremental harvest and stays missing until DOAB modifies it again.
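A per-range breakdown like the one above can be recomputed from the two ID lists. A rough sketch, assuming each DOAB ID ends in a numeric suffix that can be bucketed; the helper names and the "handle/NNNNN" sample IDs are illustrative, not the format used in the audit files.

```python
import re
from collections import Counter


def bucket_counts(ids, width=10_000):
    """Count IDs per numeric range of `width`, keyed by the range start.

    Assumes the trailing digits of each ID are its sequence number;
    IDs without a numeric suffix are skipped.
    """
    counts = Counter()
    for doab_id in ids:
        match = re.search(r"(\d+)$", doab_id)
        if match is None:
            continue
        number = int(match.group(1))
        counts[(number // width) * width] += 1
    return counts


def gap_by_bucket(missing_ids, active_ids, width=10_000):
    """Return {range_start: (missing_count, percent_of_active)}."""
    missing = bucket_counts(missing_ids, width)
    active = bucket_counts(active_ids, width)
    return {
        start: (missing[start], 100.0 * missing[start] / total)
        for start, total in sorted(active.items())
    }


# Toy data only
active_ids = ["handle/25000", "handle/150001", "handle/150002", "handle/160001"]
missing_ids = ["handle/150001", "handle/160001"]
table = gap_by_bucket(missing_ids, active_ids)
for start, (count, pct) in table.items():
    print(f"{start:>7}  missing={count}  gap={pct:.1f}%")
```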

Three distinct cleanup tasks

1. Backfill ~20,608 active records missing from us

Concentrated in recent IDs, but spans 20k–179k. Per established preference for slow & gentle DOAB harvesting (we serve DOAB by directing users to DOAB-hosted content; looking like a misbehaving client undermines that), this should not be a single 20k-record blast.

Proposed approach (open to alternatives):

  • One-off command that reads a list of missing DOAB IDs, fetches each via getRecord (not listRecords), small batches, Retry-After-respecting
  • Pace at e.g. 200 records/day via cron → ~100 days to complete; or a one-time gentle run at e.g. 1 req/3sec → ~17h spread across off-peak hours
  • Optionally backfill high-value subsets first (e.g., post-2024 records only)
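To make the pacing options concrete, here is a sketch of the loop shape and the run-time arithmetic. Everything in it is hypothetical: `fetch_one` stands in for whatever per-record fetch the real command uses, and `RateLimited` is an invented exception for "the server answered 429 with a Retry-After value". It is not the regluit loader API.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal: server answered 429 with a Retry-After value."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def paced_backfill(ids, fetch_one, interval=3.0, max_retries=3, sleep=time.sleep):
    """Fetch records one at a time, pausing `interval` seconds between them.

    On RateLimited, wait at least the advertised Retry-After before retrying;
    give up on a record after `max_retries` retries.
    """
    done, failed = [], []
    for doab_id in ids:
        for attempt in range(max_retries + 1):
            try:
                fetch_one(doab_id)
            except RateLimited as exc:
                if attempt == max_retries:
                    failed.append(doab_id)
                    break
                sleep(max(exc.retry_after, interval))
            else:
                done.append(doab_id)
                break
        sleep(interval)
    return done, failed


def estimated_hours(n_records, interval):
    """Lower bound on run time: pacing only, no retries or fetch latency."""
    return n_records * interval / 3600


# Demo with a fake fetcher and a recording sleep, so nothing actually waits
seen = set()
sleeps = []


def fake_fetch(doab_id):
    # First request for "b" is rate limited, the retry succeeds
    if doab_id == "b" and doab_id not in seen:
        seen.add(doab_id)
        raise RateLimited(retry_after=10)


done, failed = paced_backfill(["a", "b"], fake_fetch, sleep=sleeps.append)
print(done, failed, sleeps)
print(round(estimated_hours(20_608, 3), 1), "hours at 1 req/3sec")
```

A very long Retry-After (such as the 86400 seen in doab-check#2) probably should end the run rather than sleep in place; that policy is left out of the sketch.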

2. Retire ~7,267 stale (DOAB-deleted) records

These point to ebooks DOAB has withdrawn — we're showing users links to removed content. Purely local cleanup, no OAI traffic. Likely warrants its own decision (do we soft-delete? mark as withdrawn? keep history?). Probably belongs in a separate issue once approach is decided.

3. Investigate 1,942 orphans

Identifier(type='doab') rows whose value never appears in DOAB OAI. May predate the OAI feed, may be data-entry artifacts. Lower priority — may not be actionable.

Follow-up: doab-check coverage is unmeasured (worth checking, not assumed similar)

EbookFoundation/doab-check runs on different infrastructure (DigitalOcean, separate IP from unglue.it) and has its own incident history independent of regluit's. The two services may have different gap shapes:

  • 2026-03-11 (doab-check#2): DOAB issued a 24-hour IP-ban (Retry-After: 86400) against the doab-check harvester. The fix in that issue only logged the 429 — it didn't retry. Records modified during that 24-hour window were lost from the incremental harvest.
  • 2026-05-05 (doab-check#12): silent cursor-stall bug — not a 429. The OAI resumption-token loop terminated after one page; cron returned ~98 records/run for 3 consecutive days while the cursor stayed pinned at 2026-05-01 06:05:25. Manual catch-up loaded 1,520 records (146 new) for just May 1–5 alone. Workaround merged 2026-05-13 (commit d680efd).

Two distinct loss mechanisms, both unmeasured. The doab-check website's "65,620 links being checked" number is a derived count (links extracted from harvested records, not all records) and doesn't directly reveal coverage. Recommend filing a parallel issue in EbookFoundation/doab-check to run the same set-diff — the manual paginator script in this audit is reusable as-is. Don't assume the gap shape matches regluit's; measure separately.
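For the second mechanism, the property worth checking is that the paging loop keeps following resumption tokens until the server stops returning one. A minimal sketch of that loop, with guards against a token that never advances; `fetch_page` is a hypothetical callable returning `(records, next_token)`, and this is not the workaround merged in doab-check.

```python
def harvest_all(fetch_page, max_pages=10_000):
    """Follow resumption tokens until the feed is exhausted.

    `fetch_page(token)` is assumed to return (records, next_token), with
    token=None for the first page and a falsy next_token on the last page.
    """
    records = []
    token = None
    for _ in range(max_pages):
        page_records, next_token = fetch_page(token)
        records.extend(page_records)
        if not next_token:
            return records
        if next_token == token:
            # Same token twice in a row: fail loudly rather than loop forever
            raise RuntimeError("resumption token did not advance")
        token = next_token
    raise RuntimeError("max_pages reached before the feed was exhausted")


# Fake three-page feed for illustration
pages = {
    None: (["r1", "r2"], "t1"),
    "t1": (["r3"], "t2"),
    "t2": (["r4"], None),
}
harvested = harvest_all(lambda token: pages[token])
print(harvested)
```

A loop that returned after the first page here would report 2 records and no error, which is the silent shape described in doab-check#12.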

Reproducer

Manual paginator with state file (resumes on transient errors), saved at /tmp/doab_diff3.py on prod. Outputs:

  • /tmp/doab_oai_active.txt — 110,792 active DOAB IDs
  • /tmp/doab_oai_deleted.txt — 14,793 deleted DOAB IDs
  • Set-diff script at /tmp/doab_finaldiff.py

Files are still on prod (unglue.it) for verification.
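For anyone re-running this elsewhere, the core of the paginator is splitting each ListIdentifiers page into active and deleted headers and pulling out the resumption token. The sketch below shows that step using only the standard library. It is an illustration written for this issue, not the contents of /tmp/doab_diff3.py, and the sample XML uses made-up identifiers.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_identifiers_page(xml_text):
    """Return (active_ids, deleted_ids, resumption_token_or_None) for one page."""
    root = ET.fromstring(xml_text)
    active, deleted = [], []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        if identifier is None:
            continue
        # Deleted records carry status="deleted" on the <header> element
        bucket = deleted if header.get("status") == "deleted" else active
        bucket.append(identifier.strip())
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = (token_el.text or "").strip() if token_el is not None else ""
    # An empty or absent resumptionToken means this was the last page
    return active, deleted, token or None


sample_page = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2026-05-01</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
      <datestamp>2026-05-02</datestamp>
    </header>
    <resumptionToken>page2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
""".strip()

active, deleted, token = parse_identifiers_page(sample_page)
print(active, deleted, token)
```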

Checklist

  • Confirm gap with Eric; agree on backfill pacing
  • Implement backfill command (separate PR)
  • Decide treatment of 7,267 stale-deleted records (separate issue)
  • File parallel issue in EbookFoundation/doab-check and measure independently (don't extrapolate from regluit)
  • After backfill: re-run set-diff to confirm closure
