feat(datasets): add local: source handler and fix url: gzip truncation check#9
feat(datasets): add local: source handler and fix url: gzip truncation check#9Mea1Ma wants to merge 3 commits into
Conversation
…ged as truncated downloads The truncation check in url.py compared bytes *written to disk* against Content-Length. That breaks for Content-Encoding responses: when the server sends gzip, Content-Length is the compressed size while iter_bytes() yields the larger decompressed body, so written bytes always exceed Content-Length and the download is falsely rejected. Compare r.num_bytes_downloaded (raw on-the-wire bytes, before httpx decompresses) against Content-Length instead — the two are now measured on the same, compressed scale, so gzip-served files pass while genuinely truncated transfers still fail. Regression: SQuAD's train-v2.0.json is gzip-served; Content-Length=9551051 but the decoded body is ~42MB, which the old written-bytes check rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(datasets): add LocalHandler to stage local:<relpath> assets from sieval/datasets/_data/ into dest_root/<dataset_name>/ feat(datasets): enforce normalized package-relative paths and reject absolute / traversal inputs in _bundled_path() feat(datasets): align output filename behavior with URL downloader via shared url_path_basename fallback (download) test(datasets): cover scheme validation, traversal guards, copy/skip/force behavior, and is_downloaded() checks for local downloader Co-authored-by: Claude Sonnet 4.6 (Anthropic) noreply@anthropic.com
ethan-scitix
left a comment
There was a problem hiding this comment.
Request changes — two correctness issues in the local: handler; the gzip fix is correct.
The copy-into-{data_dir}/<name>/ staging model is the right call — it keeps both the sieval dataset download UX and the loader read-path uniform across hf:/url:/local:. The two findings below are about making that copy safe, not changing the approach.
Bug: non-atomic copy — sieval/datasets/downloaders/local.py:39
shutil.copyfile(bundled, target) writes straight to the final path. An interrupted copy (SIGKILL/ENOSPC) leaves a truncated file that is_downloaded (local.py:43) reports as ready → silent corrupt load, no re-stage. Mirror the url: sibling (url.py:33, url.py:52-55): copy to .partial, replace on success, unlink(missing_ok=True) on error.
Bug (latent): collision guard skips local: — sieval/core/datasets/meta.py:196
_validate_url_basenames_unique only scans url: sources, but local: stages to the same {dest}/{name}/{basename} (local.py:36). Two colliding basenames (local/local or local/url in one dataset) silently overwrite, defeating the guard's documented guarantee. Include local: sources too — this touches meta.py (outside this diff), but this PR is what makes local: a live staging target. RULER itself likely won't trip it, so this is lower urgency than the atomicity fix.
gzip fix: confirmed correct. Verified against httpx 0.28.1 — iter_bytes() delegates to iter_raw(), which increments num_bytes_downloaded on the raw on-the-wire bytes before decoding. So it equals Content-Length for a gzip response and still falls short on a truncated stream.
…ocal: Address review on PR scitix#9 (two correctness issues in the local: handler): - local: now copies bundled files via a .partial temp + atomic replace, with cleanup on error. An interrupted copy (SIGKILL/ENOSPC) no longer leaves a truncated file that is_downloaded reports as ready. Mirrors the url: handler's staging contract. - The basename collision guard previously scanned only url: sources, but local: stages to the same <dest>/<name>/<basename> layout. Rename _validate_url_basenames_unique -> _validate_staged_basenames_unique and scan both url: and local:, so url/url, local/local, and url/local collisions are all rejected at @sieval_dataset registration. Tests: add local/local and url/local collision cases to test_meta.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocal: Address review on PR scitix#9 (two correctness issues in the local: handler): - local: now copies bundled files via a .partial temp + atomic replace, with cleanup on error. An interrupted copy (SIGKILL/ENOSPC) no longer leaves a truncated file that is_downloaded reports as ready. Mirrors the url: handler's staging contract. - The basename collision guard previously scanned only url: sources, but local: stages to the same <dest>/<name>/<basename> layout. Rename _validate_url_basenames_unique -> _validate_staged_basenames_unique and scan both url: and local:, so url/url, local/local, and url/local collisions are all rejected at @sieval_dataset registration. Tests: add local/local and url/local collision cases to test_meta.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Type
Summary
Two independent dataset-download infrastructure changes, both prerequisites for
bundling generated corpora (used next by the RULER benchmark, sent separately):
local:source handler — stages a package-bundled file fromsieval/datasets/_data/into the same{dest_root}/<name>/layout thehf:/url:handlers use, so a dataset whose corpus is generated once andcommitted in-tree resolves through the identical
load(name_or_path)path.Registered alongside
HFHandler/URLHandler; noSourceHandlerProtocolchange. Path traversal (
../absolute paths escaping_data/) is rejected.url:gzip truncation fix — the download verifier compared byteswritten against
Content-Length, but when the server sendsContent-Encoding: gzip,Content-Lengthis the compressed size while thedecoded body is larger, so valid downloads were falsely flagged as truncated.
Now compares against
num_bytes_downloaded(on-the-wire bytes).Related Issues
Test Plan
Automated
ruff check && ruff format --check)ty check)pdm run pytest) — 1945 passed; 58 intests/unit/datasets/downloaders/(newtest_local.py, expandedtest_url.pyfor the gzip case)Checklist
Required (all PRs)
type(scope): description)AI-Generated Code - <model> (<provider>)in module docstringcore/