bin4re/WaybackSiteDump

wayback-site-dump

Download successful Wayback Machine captures for one or more URL prefixes, with resumable state and retry support.

What This Tool Does

  • Discovers captures from the Wayback CDX API in prefix scope.
  • Downloads replay content for each capture (.body) plus metadata (.meta.json).
  • Persists state in SQLite so runs can resume safely.
  • Supports query filtering via repeatable --query-equals KEY=VALUE.
  • Includes a helper command to convert downloaded .body files to UTF-8 in a new directory.

Requirements

  • Python >=3.12
  • Network access to https://web.archive.org

Install

uv sync

Quick Start

wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --seed http://hi.baidu.com/xxxx/ \
  --output ./out

The following two ways of invoking the main CLI are equivalent:

  • wayback_site_dump ...
  • uv run -m wayback_site_dump.cli ...

Main CLI Reference

Required Arguments

| Option | Description |
| --- | --- |
| --seed URL | Seed URL prefix. Repeatable. |
| --output PATH | Output root directory. |

Optional Arguments

| Option | Default | Description |
| --- | --- | --- |
| --from YYYYMMDDhhmmss | none | Start timestamp (inclusive). |
| --to YYYYMMDDhhmmss | none | End timestamp (inclusive). |
| --workers N | 6 | Concurrent download workers. Must be > 0. |
| --timeout SECONDS | 30 | Per-request timeout. Must be > 0. |
| --retries N | 4 | Retry count for retryable failures. Must be >= 0. |
| --retry-backoff X | 1.8 | Exponential backoff base. Must be >= 1.0. |
| --user-agent TEXT | wayback-site-dump/0.1 (+contact) | HTTP User-Agent header value. |
| --max-captures N | none | Optional cap on discovered captures per seed. Must be > 0 when set. |
| --cdx-page-size N | 5000 | CDX page size. Must be > 0. |
| --progress-every N | 200 | Force a detailed progress line every N processed attempts. Must be > 0. |
| --progress-interval SECONDS | 10 | Heartbeat progress interval. Must be > 0. |
| --query-equals KEY=VALUE | none | Keep only captures whose original URL query contains the key/value pair. Repeatable; multiple filters are combined with AND semantics. |
| --retry-failed | off | Requeue historical failed captures before downloading. |
| --resume | on | Resume using the existing state DB and CDX cursor. |
| --no-resume | off | Ignore prior state by removing the existing seed state DB before the run. |
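
The --from and --to options take 14-digit timestamps in the YYYYMMDDhhmmss form, which Wayback expresses in UTC. A minimal Python sketch for producing such a value from a datetime is shown here; the helper name to_wayback_timestamp is hypothetical and is not part of this tool.

```python
from datetime import datetime, timezone


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as a 14-digit YYYYMMDDhhmmss string (hypothetical helper)."""
    if moment.tzinfo is None:
        # Assumption: a naive datetime is already expressed in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


# Start of December 2004, suitable as a --from value.
print(to_wayback_timestamp(datetime(2004, 12, 1, tzinfo=timezone.utc)))  # 20041201000000
```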

Behavior Notes

  • CDX discovery requests are filtered to successful snapshots (statuscode:200).
  • URL scope is host + path-prefix constrained by each --seed.
  • Query filters are applied after URL scope filtering.
  • Duplicate seed IDs are skipped within a single run.
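
To illustrate the first two notes, here is a rough Python sketch of what a prefix-scoped, status-filtered CDX query URL can look like. It uses publicly documented CDX server parameters (url, matchType, filter, output, limit); the exact request this tool sends is not shown in this README, so the function build_cdx_url and its parameter choices are assumptions.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(seed: str, page_size: int = 5000) -> str:
    """Build a CDX query URL for one seed prefix (illustrative, not the tool's code)."""
    params = [
        ("url", seed),
        ("matchType", "prefix"),       # host + path-prefix scope
        ("filter", "statuscode:200"),  # successful snapshots only
        ("output", "json"),
        ("limit", str(page_size)),     # mirrors --cdx-page-size
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("http://blog.xfocus.net/"))
```

Note that urlencode percent-encodes the colon in the filter value, so it appears as statuscode%3A200 in the final URL.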

Query Filter Example

wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9

This matches URLs such as:

  • http://blog.xfocus.net/index.php?op=Default&Date=200412&blogId=9
  • http://blog.xfocus.net/index.php?blogId=9

Repeat --query-equals to require multiple conditions:

wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9 \
  --query-equals op=Default
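
The AND semantics of repeated --query-equals filters can be sketched in Python as follows. This is only an illustration of the matching rule described above; the function name matches_query is hypothetical, and the tool's own implementation may treat edge cases (repeated keys, blank values, encoding) differently.

```python
from urllib.parse import parse_qs, urlsplit


def matches_query(url: str, conditions: list[str]) -> bool:
    """Return True when every KEY=VALUE condition appears in the URL query."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    for condition in conditions:
        key, _, value = condition.partition("=")
        # A key may repeat in a query string; any matching value satisfies it.
        if value not in query.get(key, []):
            return False
    return True


full = "http://blog.xfocus.net/index.php?op=Default&Date=200412&blogId=9"
short = "http://blog.xfocus.net/index.php?blogId=9"
print(matches_query(full, ["blogId=9", "op=Default"]))   # True
print(matches_query(short, ["blogId=9", "op=Default"]))  # False
```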

Resume And Retry Flow

  • State is stored per seed in state/state.sqlite.
  • Default (--resume) keeps previously discovered records and CDX cursor.
  • --no-resume removes that seed DB first, effectively restarting state for the seed.
  • --retry-failed moves previously failed records back to queued state before downloading.
  • Maximum attempts for a capture is retries + 1.
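
As a worked example of the attempt limit and the exponential backoff base, the sketch below computes a possible wait schedule. The README only states that --retry-backoff is an exponential base, so the formula used here (wait of backoff ** n seconds before retry n) is an assumption, as are the function names.

```python
def max_attempts(retries: int) -> int:
    """One initial attempt plus the configured number of retries."""
    return retries + 1


def retry_delays(retries: int = 4, backoff: float = 1.8) -> list[float]:
    """Seconds to wait before each retry.

    Assumption: the wait before retry n (1-based) is backoff ** n seconds;
    the tool's real schedule may add jitter or use a different exponent.
    """
    return [round(backoff ** n, 3) for n in range(1, retries + 1)]


print(max_attempts(4))  # 5
print(retry_delays())   # four increasing waits, starting at 1.8 seconds
```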

Example retry run:

wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9 \
  --retry-failed \
  --workers 2 \
  --timeout 90 \
  --retries 6

Output Layout

For each normalized seed ID, output is written under:

<output>/<seed_id>/
  captures/
    <url_key>/
      <timestamp>.body
      <timestamp>.meta.json
  manifests/
    run_report.json
    url_index.jsonl
  logs/
    failures.jsonl
  state/
    state.sqlite

Notes:

  • seed_id is generated from normalized host + path_prefix.
  • url_key is a SHA-1 key derived from canonicalized original URL.
  • run_report.json summarizes discovery/download stats and bytes written.
  • failures.jsonl contains unresolved failures after retries.
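
The layout above can be sketched in Python as follows, to show how a capture maps to its two files. The canonicalization step here (stripping surrounding whitespace only) is a placeholder assumption, so the resulting key will not necessarily match the one this tool produces; capture_paths is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def capture_paths(output: str, seed_id: str, original_url: str, timestamp: str) -> tuple[Path, Path]:
    """Return the (.body, .meta.json) paths for one capture (illustrative only)."""
    # Assumption: real canonicalization is more involved than strip().
    canonical = original_url.strip()
    url_key = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    base = Path(output) / seed_id / "captures" / url_key
    return base / f"{timestamp}.body", base / f"{timestamp}.meta.json"


body, meta = capture_paths(
    "./out", "hi.baidu.com_xxxx", "http://hi.baidu.com/xxxx/", "20041201000000"
)
print(body)
print(meta)
```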

Exit Codes (Main CLI)

  • 0: run completed and no failed captures remain.
  • 1: argument/runtime error.
  • 2: run completed but one or more captures failed.
  • 130: interrupted by user (Ctrl+C).
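
When the tool is driven from a script, the exit codes above can be mapped to messages. The following is a small sketch; describe_exit and the message wording are hypothetical and only restate the list above.

```python
EXIT_MEANINGS = {
    0: "run completed and no failed captures remain",
    1: "argument/runtime error",
    2: "run completed but one or more captures failed",
    130: "interrupted by user (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Translate a main CLI exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


print(describe_exit(2))
```

A caller could pass the returncode attribute of a subprocess.run result to describe_exit, and rerun with --retry-failed when the code is 2.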

Convert .body Files To UTF-8

This helper writes a UTF-8 copy of a dump into a separate output directory. Only .body files are re-encoded, and the input directory is left untouched (non-destructive).

uv run -m wayback_site_dump.convert_utf8 \
  --input out/hi.baidu.com_xxxx \
  --output out/hi.baidu.com_xxxx_utf8

Conversion CLI Arguments

| Option | Default | Description |
| --- | --- | --- |
| --input PATH | required | Input directory from a previous dump run. |
| --output PATH | sibling <input>_utf8 | Output directory for the converted tree. |
| --from-encoding NAME | auto-detect | Force the source encoding for .body files (for example gb18030). |
| --force | off | Allow writing into an existing output directory. |

Encoding aliases include gb2312, gbk, x-gbk, and cp936 (all mapped to gb18030), and utf8 (mapped to utf-8).

Conversion Notes

  • Only .body files are decoded and rewritten.
  • Non-.body files are copied as-is.
  • Input directory is never modified.
  • A conversion report is written to conversion_report.json in the output directory.
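
The core of the conversion, alias normalization followed by decode and re-encode, can be sketched in Python as follows. The alias table restates the aliases listed above; the function names are hypothetical, and auto-detection and error reporting are left out of this sketch.

```python
ENCODING_ALIASES = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
    "x-gbk": "gb18030",
    "cp936": "gb18030",
    "utf8": "utf-8",
}


def normalize_encoding(name: str) -> str:
    """Map a user-supplied encoding name to the codec used for decoding."""
    key = name.strip().lower()
    return ENCODING_ALIASES.get(key, key)


def to_utf8(raw: bytes, from_encoding: str) -> bytes:
    """Decode raw .body bytes from the source encoding and re-encode as UTF-8."""
    return raw.decode(normalize_encoding(from_encoding)).encode("utf-8")


sample = "中文".encode("gbk")
print(to_utf8(sample, "GBK").decode("utf-8"))  # 中文
```

Mapping the older GB aliases to gb18030 is a sensible choice because gb18030 is a superset of GBK and GB2312, so bytes that are valid in those encodings decode the same way.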

Exit Codes (Conversion CLI)

  • 0: all .body files converted successfully.
  • 1: argument/runtime error.
  • 2: run completed but one or more .body files failed to decode.
  • 130: interrupted by user (Ctrl+C).

Development

Run tests:

uv run -m unittest discover -s tests -p "test_*.py"
