Download successful Wayback Machine captures for one or more URL prefixes, with resumable state and retry support.
- Discovers captures from the Wayback CDX API in prefix scope.
- Downloads replay content for each capture (`.body`) plus metadata (`.meta.json`).
- Persists state in SQLite so runs can resume safely.
- Supports query filtering via repeatable `--query-equals KEY=VALUE`.
- Includes a helper command to convert downloaded `.body` files to UTF-8 in a new directory.
- Python >= 3.12
- Network access to https://web.archive.org
```
uv sync
wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --seed http://hi.baidu.com/xxxx/ \
  --output ./out
```

The main CLI entry points are equivalent:

```
wayback_site_dump ...
uv run -m wayback_site_dump.cli ...
```
| Option | Description |
|---|---|
| `--seed URL` | Seed URL prefix. Repeatable. |
| `--output PATH` | Output root directory. |
| Option | Default | Description |
|---|---|---|
| `--from YYYYMMDDhhmmss` | none | Start timestamp (inclusive). |
| `--to YYYYMMDDhhmmss` | none | End timestamp (inclusive). |
| `--workers N` | 6 | Concurrent download workers. Must be > 0. |
| `--timeout SECONDS` | 30 | Per-request timeout. Must be > 0. |
| `--retries N` | 4 | Retry count for retryable failures. Must be >= 0. |
| `--retry-backoff X` | 1.8 | Exponential backoff base. Must be >= 1.0. |
| `--user-agent TEXT` | `wayback-site-dump/0.1 (+contact)` | HTTP User-Agent header value. |
| `--max-captures N` | none | Optional cap on discovered captures per seed. Must be > 0 when set. |
| `--cdx-page-size N` | 5000 | CDX page size. Must be > 0. |
| `--progress-every N` | 200 | Force a detailed progress line every N processed attempts. Must be > 0. |
| `--progress-interval SECONDS` | 10 | Heartbeat progress interval. Must be > 0. |
| `--query-equals KEY=VALUE` | none | Keep only captures whose original URL query contains the key/value pair. Repeatable; combined with AND semantics. |
| `--retry-failed` | off | Requeue historical failed captures before downloading. |
| `--resume` | on | Resume using the existing state DB and CDX cursor. |
| `--no-resume` | off | Ignore prior state by removing the existing seed state DB before the run. |
- CDX discovery requests are filtered to successful snapshots (`statuscode:200`).
- URL scope is host + path prefix, constrained by each `--seed`.
- Query filters are applied after URL scope filtering.
- Duplicate seed IDs are skipped within a single run.
```
wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9
```

This matches URLs such as:

```
http://blog.xfocus.net/index.php?op=Default&Date=200412&blogId=9
http://blog.xfocus.net/index.php?blogId=9
```
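The match rule can be sketched as follows. This is an illustration of the documented semantics (every required key/value pair must appear in the URL's query string), not the tool's own code.

```python
from urllib.parse import urlsplit, parse_qs

def matches_query(url: str, required: dict[str, str]) -> bool:
    """True when the URL's query contains every required key=value pair (AND)."""
    query = parse_qs(urlsplit(url).query)
    return all(value in query.get(key, []) for key, value in required.items())

print(matches_query(
    "http://blog.xfocus.net/index.php?op=Default&Date=200412&blogId=9",
    {"blogId": "9"},
))
```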
Repeat `--query-equals` to require multiple conditions:

```
wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9 \
  --query-equals op=Default
```

- State is stored per seed in `state/state.sqlite`.
- The default (`--resume`) keeps previously discovered records and the CDX cursor.
- `--no-resume` removes that seed DB first, effectively restarting state for the seed.
- `--retry-failed` moves previously failed records back to queued state before downloading.
- The maximum number of attempts for a capture is `retries + 1`.
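A minimal sketch of the retry schedule these options imply, assuming the delay before retry attempt *n* is `backoff ** n`; the real tool may add jitter or a delay cap:

```python
def retry_delays(retries: int, backoff: float) -> list[float]:
    """Delays (seconds) before each retry; a capture gets retries + 1 attempts."""
    return [backoff ** attempt for attempt in range(1, retries + 1)]

# Defaults: --retries 4, --retry-backoff 1.8 -> 4 retries, 5 attempts total.
print(retry_delays(4, 1.8))
```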
Example retry run:
```
wayback_site_dump \
  --seed http://blog.xfocus.net/ \
  --output ./out \
  --query-equals blogId=9 \
  --retry-failed \
  --workers 2 \
  --timeout 90 \
  --retries 6
```

For each normalized seed ID, output is written under:
```
<output>/<seed_id>/
  captures/
    <url_key>/
      <timestamp>.body
      <timestamp>.meta.json
  manifests/
    run_report.json
    url_index.jsonl
  logs/
    failures.jsonl
  state/
    state.sqlite
```
Notes:
- `seed_id` is generated from the normalized `host + path_prefix`.
- `url_key` is a SHA-1 key derived from the canonicalized original URL.
- `run_report.json` summarizes discovery/download stats and bytes written.
- `failures.jsonl` contains unresolved failures after retries.
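A hypothetical sketch of how a `url_key` could be derived. The SHA-1-over-canonicalized-URL scheme matches the note above, but the exact canonicalization rules shown here (lowercasing scheme/host, defaulting the path, dropping fragments) are assumptions, not the tool's implementation:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_key(original_url: str) -> str:
    """SHA-1 hex digest of a lightly canonicalized URL (illustrative only)."""
    parts = urlsplit(original_url.strip())
    canonical = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as "/"
        parts.query,
        "",                  # drop the fragment
    ))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

print(url_key("http://blog.xfocus.net/index.php?blogId=9"))
```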
- `0`: run completed and no failed captures remain.
- `1`: argument/runtime error.
- `2`: run completed but one or more captures failed.
- `130`: interrupted by user (Ctrl+C).
This helper rewrites only `.body` files as UTF-8 into a separate output directory (non-destructive).

```
uv run -m wayback_site_dump.convert_utf8 \
  --input out/hi.baidu.com_xxxx \
  --output out/hi.baidu.com_xxxx_utf8
```

| Option | Default | Description |
|---|---|---|
| `--input PATH` | required | Input directory from a previous dump run. |
| `--output PATH` | sibling `<input>_utf8` | Output directory for the converted tree. |
| `--from-encoding NAME` | auto-detect | Force the source encoding for `.body` files (for example `gb18030`). |
| `--force` | off | Allow writing into an existing output directory. |
Encoding aliases are normalized: `gb2312`, `gbk`, `x-gbk`, and `cp936` map to `gb18030`, and `utf8` maps to `utf-8`.
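The alias normalization can be sketched directly from that table; the mapping below mirrors this README, not the tool's source:

```python
# Alias table as documented: legacy GB names collapse to gb18030,
# and the common "utf8" spelling is normalized to "utf-8".
ENCODING_ALIASES = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
    "x-gbk": "gb18030",
    "cp936": "gb18030",
    "utf8": "utf-8",
}

def normalize_encoding(name: str) -> str:
    """Map a user-supplied encoding name to its canonical form."""
    key = name.strip().lower()
    return ENCODING_ALIASES.get(key, key)

print(normalize_encoding("GB2312"))
```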
- Only `.body` files are decoded and rewritten.
- Non-`.body` files are copied as-is.
- The input directory is never modified.
- A conversion report is written to `conversion_report.json` in the output directory.
- `0`: all `.body` files converted successfully.
- `1`: argument/runtime error.
- `2`: run completed but one or more `.body` files failed to decode.
- `130`: interrupted by user (Ctrl+C).
Run tests:

```
uv run -m unittest discover -s tests -p "test_*.py"
```