kronikier recovers emails and phone numbers from the
web.archive.org history of a domain. The tool is driven by a timeout:
you tell it how long you're willing to wait, and it picks
how many snapshots to scan so the run fits.
For the rationale behind the design choices, see ARCHITECTURE.md.
pip install -e . # runtime only
pip install -e ".[dev]" # plus pytest, responses
Runtime dependencies: requests, beautifulsoup4, phonenumbers, rich.
Python 3.10+.
# Default 5-minute timeout: fast, contact-URL filter on.
kronikier theranos.com
# Longer timeout on a harder case:
kronikier mysite.ru --timeout 900
# Scan every URL (not just contact pages), still inside the default timeout:
kronikier wirecard.com --all
# Unlimited time, every URL: for small defunct sites where you want everything.
kronikier old-defunct-corp.com --exhaustive
# Pipe machine-readable output to another tool:
kronikier mysite.ru --json > mysite.json
# Batch: scan a list of domains/URLs from a file (one per line):
kronikier --targets-file targets.txt
# Single-URL mode: examine one specific page's history across snapshots.
kronikier --single-url https://www.theranos.com/contact-us
The very first invocation runs a one-time calibration (~3-5 sec) to
measure your machine's archive-fetch latency, cached at
~/.cache/kronikier/calibration.json and reused for 14 days.
When you run kronikier example.com --timeout 300, the tool:
- Pulls a cheap size signal from CDX (showNumPages, ~1 sec).
- Reads the cached calibration (e.g. 0.42 s/snapshot).
- Computes how many snapshots fit in your timeout: capacity = timeout × concurrency / avg_latency.
- If the whole site fits in that capacity, scans every URL (no contact-URL filter, so you can see everything).
- If it doesn't fit, keeps the contact-URL filter on and fetches the top-capacity ranked snapshots.
- Stops when the deadline hits. In-flight fetches finish so no data is lost mid-extraction.
- If the first pass found nothing, automatically retries with a broader strategy (drops the URL filter, or doubles the timeout), one shot only.
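The capacity arithmetic in the steps above can be sketched in a few lines. This is an illustration only, not the tool's actual code: the function names are hypothetical, rounding down is an assumption, and the unlimited case (--timeout 0) is deliberately left out.

```python
def snapshot_capacity(timeout_s: float, workers: int, avg_latency_s: float) -> int:
    """capacity = timeout x concurrency / avg_latency, rounded down (assumed)."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return int(timeout_s * workers / avg_latency_s)


def filter_stays_on(estimated_snapshots: int, capacity: int) -> bool:
    """The contact-URL filter stays on only when the site does not fit."""
    return estimated_snapshots > capacity


# 300 s timeout, 4 workers, 0.42 s per snapshot from the calibration
capacity = snapshot_capacity(300, 4, 0.42)
print(capacity)                               # 2857
print(filter_stays_on(12_350_000, capacity))  # True: too big, keep the filter
print(filter_stays_on(500, capacity))         # False: small site, scan every URL
```

With the default 4 workers and a 0.42 s calibration, a 5-minute timeout buys roughly 2,900 fetches, which is why large sites stay on the filtered path.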
The named flags are exact shortcuts. They show the same Plan: … line
when they run.
| Flag | Equivalent to | Use when… | Typical time |
|---|---|---|---|
| --default | --timeout 300 | Quick triage on an unknown domain. The default. | up to ~5 min |
| --auto | --timeout 300 | Same as --default; kept for muscle memory. | up to ~5 min |
| --deep | --timeout 900 | Filtered scan didn't find enough, or known to be a hard target. | up to ~15 min |
| --exhaustive | --timeout 0 --all | Defunct site, you want every snapshot of every URL ever archived. | minutes to hours |
| --all | (modifier, not a timeout) | Scan every URL the site has; drops the contact-URL filter. | depends on timeout |
--all is additive: --deep --all means "15-min timeout, no URL filter".
If a scan finishes with zero contacts and there was still time left in
the timeout, the tool auto-escalates one step (drop the filter, then
extend the timeout). Pass --no-escalate to disable that.
On its very first invocation the tool runs a one-time latency calibration: it fetches 8 canonical Wayback snapshots and averages the latency. The result is cached for 14 days.
# Refresh the calibration without running a scan:
kronikier --calibrate
# Run a scan *and* refresh the calibration first:
kronikier example.com --recalibrate
A stale or wrong calibration only makes the timeout-to-capacity estimate slightly off; the deadline check is the real cutoff. The diagnostic line at the end of every scan shows the observed vs. cached latency; recalibrate if they drift significantly.
kronikier <domain> [options]
kronikier --targets-file PATH [options] # batch
kronikier --single-url URL [options] # one specific page across time
kronikier --calibrate # one-off, no scan
kronikier --clear-cache # wipe the snapshot cache
| Flag | Effect |
|---|---|
| --timeout SECONDS | Wall-clock timeout for the scan. 0 = unlimited. Default 300. |
| --auto | Alias for --timeout 300. Default when no flag is given. |
| --default | Alias for --timeout 300. |
| --deep | Alias for --timeout 900. |
| --exhaustive | Alias for --timeout 0 --all. |
| Flag | Effect |
|---|---|
| --all | Disable the contact-URL CDX filter. Independent of --timeout; the two combine. |
| --no-escalate | Disable the one-shot zero-result escalation (broaden filter / extend timeout). |
| Flag | Effect |
|---|---|
| --calibrate | Refresh the latency cache and exit (no scan). |
| --recalibrate | Refresh the calibration before this run. |
| Flag | Effect |
|---|---|
| <domain> (positional) | Default: scan all the contact-bearing pages of a host. |
| --targets-file PATH | Read a list of targets from PATH: one per line, # for comments, blanks ignored. Each entry is a bare domain or a full URL; URL paths extend that domain's well-known probe list. Multiple entries sharing a host are merged. |
| --single-url URL | Scan only the captures of one exact URL (CDX matchType=exact). Probe and contact-URL filter are skipped, since we already know what we're looking at. Useful for "how did this one page change?" timeline questions. |
| Flag | Effect |
|---|---|
| --max-snapshots N | Optional hard ceiling on fetched pages (layered on top of timeout-derived capacity). |
| --from-year YYYY | Limit CDX query to >= year. |
| --to-year YYYY | Limit CDX query to <= year. |
| --no-subdomains | Restrict to exact host instead of domain + subdomains. |
| --no-probe | Skip the well-known-paths availability probe. The path list itself lives in kronikier/data/well_known_paths.txt (one path per line, # for comments); edit it to add domain-specific guesses without touching the code. |
| --min-score N | Drop URLs whose path-classifier score is below N. |
| Flag | Effect |
|---|---|
| --workers N | Concurrent fetches (default 4). |
| --rate F | Max requests/sec to wayback (default 4). Be polite to IA. |
| --cdx-timeout N | Read-timeout for the CDX query in seconds (default 300). Raise for very large domains where IA's filter scan is slow. |
| Flag | Effect |
|---|---|
| --regions LIST | Phone regions to try for bare local numbers without + (default RU,BY,UA,KZ,US,GB,DE,FR). The region mapped from the domain's TLD is automatically prepended to this list: 70+ ccTLDs are mapped (.by → BY, .fr → FR, .au → AU, …), and generic TLDs (.com / .org / .net / .io / .co / .app / .ai) default to US since those domains are overwhelmingly US-anchored in practice. Override --regions when investigating a .com site that isn't. Numbers in international (+-prefixed) format don't need this. |
| Flag | Effect |
|---|---|
| --csv PATH | Write the CSV report to PATH (default <domain>_<YYYYMMDD_HHMMSS>.csv in CWD). |
| --no-csv | Don't write a CSV. |
| --json | Print a JSON object on stdout instead of the human-readable table. CSV (if not --no-csv) is still written and the path is reported on stderr. |
| --no-progress | Disable progress bars and the live contact feed. Auto-disabled when stderr is not a TTY. |
| -v | Verbose contact feed: date + full snapshot URL under each found contact. |
| -d | DEBUG-level logs (probe / fetch / cache internals). Independent of -v. |
| Flag | Effect |
|---|---|
| --no-cache | Don't read from or write to the local snapshot cache during this run; every fetch goes to wayback. |
| --clear-cache | Delete every cached snapshot file and exit. Use to free disk space; doesn't run a scan. |
The cache root is $XDG_CACHE_HOME/kronikier/snapshots/ (default
~/.cache/kronikier/snapshots/), overridable with the KRONIEKER_CACHE_DIR
env var. See the dedicated section below for how cache files are laid out.
Domain: theranos.com
Timeout: 300s | Elapsed: 88.4s | Considered: 4,201 snapshots | Fetched: 60 | Distinct contacts: 18
Plan: contact-URL filter on; will fetch top 800 of ~12,350,000 snapshots
Contact                              First seen  Last seen
───────────────────────────────────  ──────────  ──────────
email  info@theranos.com             2007-11-22  2016-04-03
phone  +1 650 838 9292               2014-09-02  2016-04-03
phone  +1 855 843 7200               2014-09-02  2015-11-18
…
CSV saved: ./theranos.com_20260520_134522.csv
Hint: Contact-URL filter was on. For obscure custom contact pages, try
`--all` (scan every URL) or `--exhaustive`.
Empty results don't print an empty table; the tool prints "No contacts found."
and skips the CSV file entirely.
Columns:
kind, value, value_human, value_raw, first_seen, last_seen, sightings_count,
first_archive_url, last_archive_url
- value is the canonical form (E.164 phone, lowercased email).
- value_human is the INTERNATIONAL-formatted phone or the email as-is.
- value_raw is the as-seen literal text from the page(s), with distinct renderings joined by |. E.g. a Belarusian landline shown as 8-0162-51-12-54 on one snapshot and +375 162 51-12-54 on another lands in CSV as 8-0162-51-12-54 | +375 162 51-12-54.

Phone reconstruction caveat. The value column is a normalised E.164 form produced by libphonenumber, so a guess about country code, trunk prefix, and grouping is involved. Most of the time it's right, but on weird formats (truncated numbers, ambiguous regions, OCR typos) it can land on a plausible-but-wrong number. If a phone in value looks off, cross-check value_raw against the original page via first_archive_url. That's the ground truth.

- first_archive_url / last_archive_url link to the playback page on web.archive.org so a reviewer can verify the source by clicking.
- UTF-8 with BOM, so Excel opens it without garbled Cyrillic.
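For downstream processing, the CSV can be read with the standard library. The sketch below assumes only what is stated above: the column names, the UTF-8 BOM, and the " | " separator in value_raw. The helper name and the extra raw_variants key are hypothetical, and a rendering that itself contains a | would be split incorrectly.

```python
import csv


def load_contacts(path):
    """Read a kronikier CSV report and split value_raw into its renderings."""
    # utf-8-sig strips the BOM so the first column is "kind", not "\ufeffkind".
    with open(path, newline="", encoding="utf-8-sig") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        row["raw_variants"] = [
            part.strip() for part in row["value_raw"].split("|") if part.strip()
        ]
    return rows
```

Typical use is to flag phones whose canonical value deserves a manual look, for example rows where raw_variants has more than one entry.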
{
"domain": "theranos.com",
"timeout_seconds": 300.0,
"elapsed_seconds": 88.4,
"timeout_exhausted": false,
"plan_rationale": "contact-URL filter on; will fetch top 800 of ~12,350,000 snapshots",
"url_filter_active": true,
"snapshots_considered": 4201,
"snapshots_fetched": 60,
"hint": "Contact-URL filter was on. For obscure custom contact pages, try `--all` (scan every URL) or `--exhaustive`.",
"csv_path": "./theranos.com_20260520_134522.csv",
"contacts": [
{
"kind": "email",
"value": "info@theranos.com",
"value_human": "info@theranos.com",
"value_raw": "info@theranos.com",
"first_seen": "2007-11-22",
"last_seen": "2016-04-03",
"sightings": 12,
"first_archive_url": "https://web.archive.org/web/20071122…",
"last_archive_url": "https://web.archive.org/web/20160403…"
}
],
"errors": []
}
Wayback snapshots are immutable, so a rerun of the same scan can be answered from disk without downloading anything from the Internet Archive (IA) again. The CLI keeps an on-disk cache of fetched HTML, enabled by default.
Layout: one HTML file per (timestamp, url) pair, grouped by host:
~/.cache/kronikier/snapshots/
├── theranos.com/
│   ├── 20140902120000__contact-us__a3f9d4e1.html
│   └── 20120101000000__index.html__b2c01f9a.html
└── wirecard.com/
    └── 20151204093015__about-us__7e2d1a08.html
The filename is {timestamp}__{sanitized-path}__{url-hash}.html so each
file is browsable on disk: open it in a browser, grep it, diff
historical versions with code -d. The 8-char hash disambiguates URLs
that collapse to the same path after fs-sanitisation.
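The naming scheme can be illustrated as follows. Everything beyond the {timestamp}__{sanitized-path}__{url-hash}.html pattern is an assumption made for the sketch: the hash function (SHA-1 here), the sanitisation rule, and the index.html fallback for the root path are not confirmed by this README, so the output will not necessarily match the tool's real filenames.

```python
import hashlib
import re
from urllib.parse import urlsplit


def cache_filename(timestamp: str, url: str) -> str:
    """Build a browsable cache filename for one (timestamp, url) pair."""
    path = urlsplit(url).path.strip("/") or "index.html"  # assumed fallback
    # Assumed rule: anything outside a conservative whitelist becomes "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", path)
    # Assumed hash: first 8 hex chars of SHA-1 over the full URL.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{timestamp}__{safe}__{digest}.html"


# Two URLs that sanitise to the same path still get distinct files.
a = cache_filename("20140902120000", "https://example.com/a/b")
b = cache_filename("20140902120000", "https://example.com/a_b")
print(a != b)  # True
```

The point of the hash suffix is visible in the last lines: /a/b and /a_b both sanitise to a_b, and only the digest keeps them apart.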
Only successful 200-OK HTML responses are cached. Errors, redirects, and 404s are intentionally not persisted (they could be transient). A scan prints a one-line summary at the end:
[*] Cache: 152 hit / 47 miss (saved 152 fetches on web.archive.org)
Overrides:
- --no-cache: disable read+write for this run.
- --clear-cache: wipe and exit.
- KRONIEKER_CACHE_DIR=/some/path: use a different root.
The cache itself is best-effort: any filesystem error is treated as a miss and never blocks the scan.
--single-url URL switches the scanner from "find contacts on this host"
to "show me every archived snapshot of this exact URL". The CDX query
becomes matchType=exact, the well-known-paths probe is skipped, and
the contact-URL filter is bypassed, because the analyst already chose the
page.
# Every archived version of theranos.com's leadership page:
kronikier --single-url https://www.theranos.com/leadership
# How did one specific subdomain page change over time?
kronikier --single-url https://news.theranos.com/2018/05/...
When to reach for it:
- A specific URL surfaced in another investigation and you want its archive timeline.
- The site has unusual structure and you want to inspect one page without paying for a host-wide scan.
- Rebuilding the chain of contact changes on a single contact page.
The URL must be an absolute http:// or https:// URL with a host.
--single-url is mutually exclusive with both the positional <domain>
and --targets-file.
kronikier theranos.com --exhaustive --max-snapshots 500
Use --exhaustive (= --timeout 0 --all) when you want every snapshot
of every page (e.g. a contact page that changed numbers every quarter)
and you don't mind waiting. --max-snapshots caps the upper end if
the site is unexpectedly huge.
kronikier mysite.ru
Default timeout (300 s) with the contact-URL filter on. If the first
pass finds nothing, the tool auto-broadens to --all semantics for
the same timeout. A final Hint: line suggests --exhaustive if
that's still not enough.
kronikier --targets-file targets.txt --no-progress
targets.txt is plain text, one entry per line: bare domains or full
URLs. # starts a comment; blank lines are ignored. Example:
# Suspects from case 2026-04
theranos.com
https://theranos.com/leadership # extra probe path
https://theranos.com/contact-us
enron.com
https://www.theranos.com/leadership # subdomain: separate scan
Mechanics:
- Multiple lines sharing a host merge into one scan. Their URL paths are added to the well-known probe list for that scan, so /leadership and /contact-us become bonus probe targets in addition to the bundled WELL_KNOWN_PATHS.
- Different hosts (e.g. theranos.com and www.theranos.com) run as separate scans; subdomain semantics belong to CDX's matchType=domain and we don't second-guess.
- Calibration and the HTTPS connection pool are shared across the whole batch.
- Each scan writes its own CSV (default per-domain naming); --csv PATH is rejected in batch mode to avoid clobbering.
- --targets-file is mutually exclusive with the positional domain.
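The merge-by-host behaviour described above can be pictured with a small parser. This is a sketch, not the tool's implementation: the function name is hypothetical, and it assumes that everything after a # is a comment (so URLs with fragments would be truncated).

```python
from urllib.parse import urlsplit


def group_targets(lines):
    """Map each host to the extra probe paths collected from its URL entries."""
    groups = {}
    for line in lines:
        entry = line.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not entry:
            continue
        if "://" in entry:
            parts = urlsplit(entry)
            host, path = parts.hostname, parts.path
        else:
            host, path = entry.lower(), ""  # bare domain: no extra path
        if not host:
            continue
        paths = groups.setdefault(host, [])
        if path not in ("", "/") and path not in paths:
            paths.append(path)
    return groups


example = [
    "# Suspects from case 2026-04",
    "theranos.com",
    "https://theranos.com/leadership # extra probe path",
    "https://theranos.com/contact-us",
    "enron.com",
    "https://www.theranos.com/leadership",
]
print(group_targets(example))
# {'theranos.com': ['/leadership', '/contact-us'], 'enron.com': [], 'www.theranos.com': ['/leadership']}
```

Three scans result from six lines: theranos.com with two bonus paths, enron.com with none, and www.theranos.com as its own host.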
Old-school shell loop still works if you prefer per-target JSON files in a directory:
for d in $(cat domains.txt); do
kronikier "$d" --json --no-progress > "results/$d.json"
done
--no-progress + --json gives clean machine-readable output; CSV
files still land in the working directory.
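Once the loop has filled results/ with one JSON file per domain, the reports can be flattened into a single list. The sketch relies only on the JSON keys shown in the sample output earlier (domain, contacts, kind, value, first_seen, last_seen); the function name is hypothetical, and files that are empty or not valid JSON (for example from a failed run) are skipped.

```python
import json
from pathlib import Path


def summarize(results_dir):
    """Collect (domain, kind, value, first_seen, last_seen) from every report."""
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            report = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # empty or truncated output from a failed scan
        for contact in report.get("contacts", []):
            rows.append((
                report["domain"],
                contact["kind"],
                contact["value"],
                contact["first_seen"],
                contact["last_seen"],
            ))
    return rows


if __name__ == "__main__":
    for row in summarize("results"):
        print(*row, sep="\t")
```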
kronikier old-corp.com --timeout 900 --from-year 2008 --to-year 2012
Useful when a known event (a sale, a rebrand, a scandal) gives you a specific window of interest. This affects the CDX query only; well-known-paths probing still samples five decades.
Some giant sites (marketplace-scale, 5,000+ CDX pages) need longer than the default 300 s CDX read-timeout. Raise it:
kronikier huge-marketplace.example --cdx-timeout 900
The error message also prints this hint.
Three things to check, in order:
- Did the timeout cap the scan? If Timeout exhausted shows in the header, raise it: --timeout 900 or --exhaustive.
- Is the missing contact in a non-typical URL? The default scan filters CDX by a regex of contact-y slugs (/contact, /about, etc.). Custom paths like /get-in-touch don't match the regex and are dropped by the filter; add --all to scan every URL.
- Is the calibration stale? Run kronikier --calibrate to refresh. A wrong calibration leads to a wrong capacity estimate (slower or faster than your machine really is).
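The second check is easier to see with a toy version of the filter. The pattern below is a hypothetical stand-in built from the two slugs named above (/contact, /about); the tool's real regex is longer and is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny slug list; the real filter covers more.
CONTACT_SLUGS = re.compile(r"/(contact|about)", re.IGNORECASE)


def passes_contact_filter(url_path: str) -> bool:
    """True if the path looks like a contact-bearing page to this toy filter."""
    return CONTACT_SLUGS.search(url_path) is not None


for path in ("/contact-us", "/about", "/get-in-touch"):
    print(path, passes_contact_filter(path))
# /contact-us True
# /about True
# /get-in-touch False
```

A page under /get-in-touch never reaches the fetch stage with such a filter, which is the situation --all is meant for.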
The tool keeps numbers that fail libphonenumber's strict validation, as
long as they're written with a leading +. Without + (bare local
form) the matcher stays strict; it's the only way to keep postal codes,
tax IDs, and order numbers out of the result.
For numbers written without a +, the parser tries libphonenumber
region-by-region and the first valid interpretation wins. The list is
prioritised by the domain's TLD (see --regions above), so:
- theranos.com → US first → (855) 843-7200 parses as +1 855-843-7200.
- avito.ru → RU first → 8(863)-218-22-22 parses as +7 863 218-22-22.
If your target's site is on a TLD that doesn't match its real country
(e.g. a Russian SMB hosted on .com), override the order:
kronikier russian-smb.example.com --regions RU,US,GB
The CSV value_raw column always holds the as-seen original; pivot off
that when the canonical value is suspicious.
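The TLD-driven ordering can be sketched as below. The mapping is an illustrative subset (the README mentions 70+ ccTLDs), the function name is hypothetical, and removing the duplicate when the TLD's region is already in the list is an assumption about how the prepending works.

```python
DEFAULT_REGIONS = ("RU", "BY", "UA", "KZ", "US", "GB", "DE", "FR")

# Illustrative subset only; the real table covers 70+ ccTLDs.
CCTLD_TO_REGION = {"by": "BY", "fr": "FR", "au": "AU", "ru": "RU"}
GENERIC_TLDS = {"com", "org", "net", "io", "co", "app", "ai"}


def region_order(domain: str, regions=DEFAULT_REGIONS) -> list[str]:
    """Return the regions to try, with the TLD's region moved to the front."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    first = CCTLD_TO_REGION.get(tld)
    if first is None and tld in GENERIC_TLDS:
        first = "US"  # generic TLDs default to US
    ordered = list(regions)
    if first is not None:
        ordered = [first] + [r for r in ordered if r != first]
    return ordered


print(region_order("theranos.com"))  # ['US', 'RU', 'BY', 'UA', 'KZ', 'GB', 'DE', 'FR']
print(region_order("avito.ru"))      # ['RU', 'BY', 'UA', 'KZ', 'US', 'GB', 'DE', 'FR']
```

Passing an explicit tuple as regions mirrors what --regions does on the command line.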
kronikier example.com -d --no-progress
-d (or --debug) raises the log level to DEBUG. Useful when a fetch is
stalled and you want to see the URL it's stuck on, or when the cache
hits/misses don't match your mental model. Independent from -v, which
only controls the live contact feed verbosity.
Yes. The tool does no domain-based filtering of emails. For OSINT a
contact email on a free provider is usually the strongest signal;
small-business sites use them everywhere. Cross-reference the
first_archive_url to verify on the actual snapshot.
The CLI is a thin wrapper around kronikier.scan_domain. Library
callers don't have to go through the CLI:
from kronikier import scan_domain
result = scan_domain(
"theranos.com",
timeout_seconds=300, # or 900, or 0 (= unlimited)
force_all=False, # equivalent of --all
from_year=2007,
to_year=2016,
default_phone_regions=("US",),
)
for timestamp, kind, value, source_url in result.timeline():
print(timestamp, kind, value, source_url)
print(
f"Timeout: {result.timeout_seconds}s, elapsed: {result.elapsed_seconds:.1f}s, "
f"exhausted={result.timeout_exhausted}"
)
Calibration is not auto-run from library code (it's a CLI-side UX
feature). If you want an accurate capacity estimate, pass your own
Calibration:
from kronikier import scan_domain
from kronikier.calibration import ensure_calibration
cal = ensure_calibration(announce=False)
result = scan_domain("theranos.com", timeout_seconds=300, calibration=cal)
Other scan_domain keyword arguments map directly to CLI flags:
| Library kwarg | CLI equivalent |
|---|---|
| timeout_seconds | --timeout |
| force_all | --all |
| no_escalate | --no-escalate |
| max_snapshots | --max-snapshots |
| from_year / to_year | --from-year / --to-year |
| include_subdomains | inverse of --no-subdomains |
| probe_well_known | inverse of --no-probe |
| min_score | --min-score |
| max_workers | --workers |
| rate_limit_per_sec | --rate |
| cdx_timeout | --cdx-timeout |
| default_phone_regions | --regions |
| extra_well_known_paths | URLs in --targets-file |
| single_url | --single-url |
| cache | (off if --no-cache) |
The on-disk snapshot cache is not attached automatically when
scan_domain is called from a library; instantiate it yourself if you
want it:
from kronikier.cache import SnapshotCache, default_cache_dir
cache = SnapshotCache(default_cache_dir())
result = scan_domain("theranos.com", cache=cache)
- Default --rate 4 (4 RPS to wayback) is polite. Don't raise it for routine work; IA serves the archive for everyone.
- All data extracted is already public: IA published it. The tool doesn't bypass paywalls, crawler-blocks, or any access control.
- Contacts in archives may belong to people no longer associated with the domain. The timeline (first_seen / last_seen) is there so you can judge what was current when.
- Use only within the scope of legal investigations.
Wayback-for-contact-recovery is a well-documented technique. Publicly referenced uses:
- Bellingcat: Skripal / GRU (2018-2019). Recovering archived versions of Russian military unit pages to obtain personnel contacts scrubbed after Salisbury.
- Brian Krebs (KrebsOnSecurity). Multiple investigations of malware and fraud site operators identified via archived contact emails.
- OCCRP shell-company investigations. "Contact" pages of shell sites routinely name the human beneficiary before being scrubbed.
- John Carreyrou / WSJ: Theranos. Material from archived theranos.com used as evidence after the company shut down.
The end-to-end test suite (pytest -m e2e) hits the live wayback for
theranos.com and enron.com and asserts that the canonical contacts
of both still surface: the OSINT use case is the regression test.