Skip to content

feat(datasets): add local: source handler and fix url: gzip truncation check#9

Open
Mea1Ma wants to merge 3 commits into
scitix:mainfrom
Mea1Ma:feat/local-downloader
Open

feat(datasets): add local: source handler and fix url: gzip truncation check#9
Mea1Ma wants to merge 3 commits into
scitix:mainfrom
Mea1Ma:feat/local-downloader

Conversation

@Mea1Ma

@Mea1Ma Mea1Ma commented Jun 22, 2026

Copy link
Copy Markdown

Type

  • feature — new benchmark, task, or capability

Summary

Two independent dataset-download infrastructure changes, both prerequisites for
bundling generated corpora (used next by the RULER benchmark, sent separately):

  • local: source handler — stages a package-bundled file from
    sieval/datasets/_data/ into the same {dest_root}/<name>/ layout the
    hf:/url: handlers use, so a dataset whose corpus is generated once and
    committed in-tree resolves through the identical load(name_or_path) path.
    Registered alongside HFHandler/URLHandler; no SourceHandler Protocol
    change. Path traversal (../absolute paths escaping _data/) is rejected.
  • url: gzip truncation fix — the download verifier compared bytes
    written against Content-Length, but when the server sends
    Content-Encoding: gzip, Content-Length is the compressed size while the
    decoded body is larger, so valid downloads were falsely flagged as truncated.
    Now compares against num_bytes_downloaded (on-the-wire bytes).

Related Issues

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check)
  • Unit tests pass (pdm run pytest) — 1945 passed; 58 in
    tests/unit/datasets/downloaders/ (new test_local.py, expanded
    test_url.py for the gzip case)

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/
  • Deleted code verified — no remaining call sites depend on it

Mea1Ma and others added 2 commits June 22, 2026 11:09
…ged as truncated downloads

The truncation check in url.py compared bytes *written to disk* against
Content-Length. That breaks for Content-Encoding responses: when the
server sends gzip, Content-Length is the compressed size while
iter_bytes() yields the larger decompressed body, so written bytes
always exceed Content-Length and the download is falsely rejected.

Compare r.num_bytes_downloaded (raw on-the-wire bytes, before httpx
decompresses) against Content-Length instead — the two are now measured
on the same, compressed scale, so gzip-served files pass while genuinely
truncated transfers still fail.

Regression: SQuAD's train-v2.0.json is gzip-served; Content-Length=9551051
but the decoded body is ~42MB, which the old written-bytes check rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(datasets): add LocalHandler to stage local:<relpath> assets from sieval/datasets/_data/ into dest_root/<dataset_name>/
feat(datasets): enforce normalized package-relative paths and reject absolute / traversal inputs in _bundled_path()
feat(datasets): align output filename behavior with URL downloader via shared url_path_basename fallback (download)
test(datasets): cover scheme validation, traversal guards, copy/skip/force behavior, and is_downloaded() checks for local downloader
Co-authored-by: Claude Sonnet 4.6 (Anthropic) noreply@anthropic.com

@ethan-scitix ethan-scitix left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request changes — two correctness issues in the local: handler; the gzip fix is correct.

The copy-into-{data_dir}/<name>/ staging model is the right call — it keeps both the sieval dataset download UX and the loader read-path uniform across hf:/url:/local:. The two findings below are about making that copy safe, not changing the approach.

Bug: non-atomic copysieval/datasets/downloaders/local.py:39
shutil.copyfile(bundled, target) writes straight to the final path. An interrupted copy (SIGKILL/ENOSPC) leaves a truncated file that is_downloaded (local.py:43) reports as ready → silent corrupt load, no re-stage. Mirror the url: sibling (url.py:33, url.py:52-55): copy to .partial, replace on success, unlink(missing_ok=True) on error.

Bug (latent): collision guard skips local:sieval/core/datasets/meta.py:196
_validate_url_basenames_unique only scans url: sources, but local: stages to the same {dest}/{name}/{basename} (local.py:36). Two colliding basenames (local/local or local/url in one dataset) silently overwrite, defeating the guard's documented guarantee. Include local: sources too — this touches meta.py (outside this diff), but this PR is what makes local: a live staging target. RULER itself likely won't trip it, so this is lower urgency than the atomicity fix.

gzip fix: confirmed correct. Verified against httpx 0.28.1 — iter_bytes() delegates to iter_raw(), which increments num_bytes_downloaded on the raw on-the-wire bytes before decoding. So it equals Content-Length for a gzip response and still falls short on a truncated stream.

…ocal:

Address review on PR scitix#9 (two correctness issues in the local: handler):

- local: now copies bundled files via a .partial temp + atomic replace,
  with cleanup on error. An interrupted copy (SIGKILL/ENOSPC) no longer
  leaves a truncated file that is_downloaded reports as ready. Mirrors
  the url: handler's staging contract.

- The basename collision guard previously scanned only url: sources, but
  local: stages to the same <dest>/<name>/<basename> layout. Rename
  _validate_url_basenames_unique -> _validate_staged_basenames_unique and
  scan both url: and local:, so url/url, local/local, and url/local
  collisions are all rejected at @sieval_dataset registration.

Tests: add local/local and url/local collision cases to test_meta.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mea1Ma added a commit to Mea1Ma/sieval that referenced this pull request Jun 25, 2026
…ocal:

Address review on PR scitix#9 (two correctness issues in the local: handler):

- local: now copies bundled files via a .partial temp + atomic replace,
  with cleanup on error. An interrupted copy (SIGKILL/ENOSPC) no longer
  leaves a truncated file that is_downloaded reports as ready. Mirrors
  the url: handler's staging contract.

- The basename collision guard previously scanned only url: sources, but
  local: stages to the same <dest>/<name>/<basename> layout. Rename
  _validate_url_basenames_unique -> _validate_staged_basenames_unique and
  scan both url: and local:, so url/url, local/local, and url/local
  collisions are all rejected at @sieval_dataset registration.

Tests: add local/local and url/local collision cases to test_meta.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants