Why
Direct ID-level set-diff between unglue.it's Identifier(type='doab') rows and DOAB's authoritative OAI ListIdentifiers feed (run 2026-05-14, post-deploy of #1146/#1148/#1150) shows the rate-limited harvest period left a real coverage gap that the cron's 3-day rolling window cannot self-heal. The "we're caught up" signal from the cron log (clean runs since the deploy) is true going forward but doesn't reflect what was missed during the rate-limit window.
Related context: #1143, #1144, #1146, #1147, #1148, #1149, #1150 (the OAI 429 visibility / Retry-After / fork-pin / circuit-breaker series).
Measurement
Pulled all 125,585 <header> entries from https://directory.doabooks.org/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc with status tracking (manual paginator, ~17 min, 0.3s pacing, Retry-After-respecting). Compared against Identifier.objects.filter(type='doab').values_list('value', flat=True) on prod.
unglue.it DOAB identifiers: 99,393
DOAB active (OAI, status != deleted): 110,792
DOAB deleted (OAI status = deleted): 14,793
=== ACTUAL GAP ===
Active in DOAB, missing from us: 20,608 (18.6% of active DOAB)
In us but DOAB marks deleted: 7,267 (stale records still served to users)
In us but absent from DOAB OAI: 1,942 (orphans — predate OAI feed?)
Note: the doab-check website's "99,451 books and book chapters" count and our 99,393 looked deceptively close in a totals-only comparison. They were close because the OAI feed includes ~15k deleted records that mask the real active-record gap. ID-level set-diff is the only reliable check.
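The classification behind the numbers above comes down to three set operations. The sketch below is illustrative only: it is not the contents of `/tmp/doab_finaldiff.py`, and the function name, category keys, and toy IDs are assumptions made for the example.

```python
def classify(ours, doab_active, doab_deleted):
    """Split identifiers into the audit's categories using plain set operations."""
    ours = set(ours)
    doab_active = set(doab_active)
    doab_deleted = set(doab_deleted)
    return {
        # Active in DOAB, missing from us
        "missing_from_us": doab_active - ours,
        # In us but DOAB marks deleted
        "stale_in_us": ours & doab_deleted,
        # In us but absent from the OAI feed entirely
        "orphans": ours - doab_active - doab_deleted,
        # Present on both sides and still active
        "in_sync": ours & doab_active,
    }


# Toy data (hypothetical IDs): both sides hold 4 "live" records,
# so totals match even though two active records are missing.
report = classify(
    ours={"1", "2", "3", "9"},
    doab_active={"1", "2", "4", "5"},
    doab_deleted={"3", "6"},
)
for name, ids in report.items():
    print(name, sorted(ids))
```

In the toy data our total (4) equals the active DOAB total (4), yet two active records are missing, which is the same masking effect the totals-only comparison suffered from.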
Distribution — where the loss happened
DOAB IDs are roughly chronological. Missing-records-as-percent-of-active by ID range:
| ID range | Missing | % gap |
| --- | --- | --- |
| 20k–60k | 35–443 | 0.4–5.8% (background noise) |
| 60k–70k | 371 | 4.4% |
| 70k–80k | 1,104 | 13.9% |
| 80k–100k | 4,935 | 22–32% |
| 100k–120k | 495 | 6–10% |
| 120k–140k | 3,061 | 25–27% |
| 140k–150k | 804 | 12% |
| 150k–180k | 9,095 | 34–48% (most recent records) |
Pattern is unambiguous: gap is concentrated in newer record ranges. The recovery spikes visible in /var/log/regluit/doab-harvest.log (2026-04-23: 464 new, 2026-05-06: 478 new) helped but didn't fully catch up. Records dated in windows where every covering 3-day cron hit a 429 were lost from the incremental harvest until DOAB modifies them again.
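A per-range breakdown like the table above can be produced by bucketing IDs. This is a minimal sketch that assumes the DOAB IDs have already been reduced to integers; the 10k bucket width, the function name, and the sample IDs are illustrative choices, not taken from the audit scripts.

```python
from collections import Counter


def gap_by_bucket(active_ids, missing_ids, width=10_000):
    """Return {bucket_start: (missing_count, percent_of_active)} per ID range."""
    active = Counter((i // width) * width for i in active_ids)
    missing = Counter((i // width) * width for i in missing_ids)
    return {
        start: (missing[start], 100.0 * missing[start] / active[start])
        for start in sorted(active)
    }


# Hypothetical sample: missing_ids must be a subset of active_ids.
sample_active = [20001, 20002, 20003, 20004, 150001, 150002]
sample_missing = [20001, 150001]
buckets = gap_by_bucket(sample_active, sample_missing)
for start, (count, pct) in buckets.items():
    print(f"{start:>7}: {count} missing ({pct:.1f}% of active)")
```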
Three distinct cleanup tasks
1. Backfill ~20,608 active records missing from us
Concentrated in recent IDs, but spans 20k–179k. Per established preference for slow & gentle DOAB harvesting (we serve DOAB by directing users to DOAB-hosted content; looking like a misbehaving client undermines that), this should not be a single 20k-record blast.
Proposed approach (open to alternatives):
- One-off command that reads a list of missing DOAB IDs and fetches each via getRecord (not listRecords), in small batches, Retry-After-respecting
- Pace at e.g. 200 records/day via cron → ~100 days to complete; or a one-time gentle run at e.g. 1 req/3sec → ~17h spread across off-peak hours
- Optionally prioritize high-value subsets first (e.g., post-2024 records only)
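The pacing and Retry-After handling proposed above could look roughly like the loop below. It is a sketch under stated assumptions, not an existing regluit command: `fetch` is a hypothetical callable returning `(http_status, retry_after_seconds_or_None)` for one DOAB ID, and the `max_wait` and `max_retries` cut-offs are illustrative values.

```python
import time


def paced_backfill(ids, fetch, delay=3.0, max_wait=3600, max_retries=3,
                   sleep=time.sleep):
    """Fetch ids one at a time; return (done, failed, pending).

    fetch(doab_id) is a hypothetical callable that returns
    (http_status, retry_after_seconds_or_None).
    """
    done, failed = [], []
    pending = list(ids)
    retries = 0
    while pending:
        doab_id = pending[0]
        status, retry_after = fetch(doab_id)
        if status == 429:
            retries += 1
            # Honor Retry-After; fall back to a longer pause if it is absent.
            wait = retry_after if retry_after is not None else delay * 10
            if wait > max_wait or retries > max_retries:
                break  # long ban or repeated 429s: stop, resume on a later run
            sleep(wait)
            continue  # retry the same ID
        retries = 0
        pending.pop(0)
        (done if status == 200 else failed).append(doab_id)
        sleep(delay)  # steady pacing between requests
    return done, failed, pending


# Scripted responses stand in for the network; sleeps are recorded, not slept.
scripted = iter([(200, None), (429, 5), (200, None), (429, 86400)])
waits = []
done, failed, pending = paced_backfill(
    ["a", "b", "c"], lambda _id: next(scripted), sleep=waits.append
)
print(done, failed, pending, waits)
```

Returning `pending` lets a cron-driven variant persist the remaining IDs and pick up where it stopped. For scale: 20,608 records at 200/day is about 103 days, and at one request every 3 seconds about 17.2 hours of request time.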
2. Retire ~7,267 stale (DOAB-deleted) records
These point to ebooks DOAB has withdrawn — we're showing users links to removed content. Purely local cleanup, no OAI traffic. Likely warrants its own decision (do we soft-delete? mark as withdrawn? keep history?). Probably belongs in a separate issue once approach is decided.
3. Investigate 1,942 orphans
Identifier(type='doab') rows whose value never appears in DOAB OAI. May predate the OAI feed, may be data-entry artifacts. Lower priority — may not be actionable.
Follow-up: doab-check coverage is unmeasured (worth checking, not assumed similar)
EbookFoundation/doab-check runs on different infrastructure (DigitalOcean, separate IP from unglue.it) and has its own incident history independent of regluit's. The two services may have different gap shapes:
- 2026-03-11 (doab-check#2): DOAB issued a 24-hour IP-ban (Retry-After: 86400) against the doab-check harvester. The fix in that issue only logged the 429; it didn't retry. Records modified during that 24-hour window were lost from the incremental harvest.
- 2026-05-05 (doab-check#12): silent cursor-stall bug, not a 429. The OAI resumption-token loop terminated after one page; cron returned ~98 records/run for 3 consecutive days while the cursor stayed pinned at 2026-05-01 06:05:25. Manual catch-up loaded 1,520 records (146 new) for just May 1–5 alone. Workaround merged 2026-05-13 (commit d680efd).
Two distinct loss mechanisms, both unmeasured. The doab-check website's "65,620 links being checked" number is a derived count (links extracted from harvested records, not all records) and doesn't directly reveal coverage. Recommend filing a parallel issue in EbookFoundation/doab-check to run the same set-diff — the manual paginator script in this audit is reusable as-is. Don't assume the gap shape matches regluit's; measure separately.
Reproducer
Manual paginator with state file (resumes on transient errors), saved at /tmp/doab_diff3.py on prod. Outputs:
- /tmp/doab_oai_active.txt: 110,792 active DOAB IDs
- /tmp/doab_oai_deleted.txt: 14,793 deleted DOAB IDs
- Set-diff script at /tmp/doab_finaldiff.py
Files are still on prod (unglue.it) for verification.
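For readers without access to prod, the core of a status-tracking paginator can be sketched as follows. This is not the contents of `/tmp/doab_diff3.py`: it omits the state file, pacing, and Retry-After handling, and `fetch_page` is a hypothetical callable that returns the XML for a given resumption token (`None` for the first page). The namespace, the `status="deleted"` header attribute, and `resumptionToken` come from OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def parse_page(xml_text):
    """Return (active_ids, deleted_ids, next_token_or_None) for one page."""
    root = ET.fromstring(xml_text.strip())
    active, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        target = deleted if header.get("status") == "deleted" else active
        target.append(identifier)
    token_el = root.find(".//" + OAI + "resumptionToken")
    token = (token_el.text or "").strip() if token_el is not None else ""
    return active, deleted, token or None


def collect(fetch_page):
    """Follow resumption tokens until the feed reports no further page."""
    active, deleted, token = [], [], None
    while True:
        page_active, page_deleted, token = parse_page(fetch_page(token))
        active += page_active
        deleted += page_deleted
        if token is None:
            return active, deleted


# Two canned pages (example.org IDs are placeholders) instead of live requests.
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          '<header><identifier>oai:example.org:1</identifier></header>'
          '<header status="deleted"><identifier>oai:example.org:2</identifier></header>'
          '<resumptionToken>page2</resumptionToken>'
          '</ListIdentifiers></OAI-PMH>',
    "page2": '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
             '<header><identifier>oai:example.org:3</identifier></header>'
             '<resumptionToken/>'
             '</ListIdentifiers></OAI-PMH>',
}
active_ids, deleted_ids = collect(lambda token: PAGES[token])
print(active_ids, deleted_ids)
```

Looping until the token is empty, rather than stopping after the first response, is what the doab-check cursor-stall incident failed to do.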
Checklist