Phase 3 follow-ups for auto-decompression (#1417): sftp, dc: derived caches, local-file streaming, dual-path unification #3988

@jqnatividad

Follow-up to #1417 (Phases 1 & 2 shipped in #3986 and #3987 — local & remote auto-decompression of zip/gz/zlib/zst/snappy for the Config reader, luau/validate/describegpt lookup tables, and get/dc: ingest).

These items were explicitly scoped out of #1417 and are tracked here for a future pass. None are regressions.

Remote / get sources

  • sftp:// sources (behind a get_sftp sub-feature).
  • HTTP/3 for downloads.
  • Derived stats/frequency caches for dc: inputs — persist computed stats/frequency alongside cached resources.
  • Streaming decompression for local compressed files: diskcache::ingest_local still fs::reads the whole file and then decompresses it in memory (decompress_source). The remote path already streams (gz/zlib/zst via IngestSink::Decode), so large local compressed ingests could run out of memory where the equivalent remote ingest would not. ✅ Done in #3990 ("fix: stream local compressed ingests instead of buffering whole file (#3988)"): ingest_local now streams .gz/.zlib/.zst into BlobSink via the same IngestSink abstraction the remote paths use (bounded memory); zip/sz are still fully buffered, per IngestSink's per-format policy.
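
The difference between buffering and streaming in the local-ingest item above can be illustrated with a minimal sketch. It is not qsv's implementation: `stream_into_sink` and its `chunk` parameter are hypothetical names, the sink is any `Write` rather than the real BlobSink, and the decoder is stood in for by any `Read` (a real pipeline would wrap the file in a gzip/zlib/zstd decoder from an external crate). The point is only that memory use is bounded by the chunk size rather than by the file size.

```rust
use std::io::{self, Read, Write};

/// Copy everything from `decoded` into `sink` through a fixed-size buffer.
/// `decoded` is assumed to be a decompressing reader; here any `Read` will do.
/// Returns the number of bytes written.
fn stream_into_sink<R: Read, W: Write>(
    mut decoded: R,
    sink: &mut W,
    chunk: usize,
) -> io::Result<u64> {
    // At most `chunk` bytes of payload are held in memory at a time.
    let mut buf = vec![0u8; chunk.max(1)];
    let mut total = 0u64;
    loop {
        let n = match decoded.read(&mut buf) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        sink.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

By contrast, the pre-fix shape described above (read the whole file, then decompress in memory) needs room for the compressed bytes and the decompressed bytes at the same time.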

Decompression semantics (shared)

  • Unify the two zip/compressed-input paths. util::process_input (command-level: extracts ALL entries) and Config special-format (reader-level: first tabular entry) disagree on multi-entry zip semantics. Decide on a single multi-entry policy and converge. ✅ Done in #3995 ("refactor: unify zip-input handling into one shared module (#3988)", option D): both paths now share one zip module with a single selection rule. Entries are returned tabular-first, so a single-input command and a Config-only command pick the same first entry from a mixed multi-entry zip (previously they could read different entries). Multi-input commands (cat/sqlp/to/validate/scoresql) still receive every entry, nested special formats are preserved, and a zip with no supported entry now fails with a clear error. QSV_SKIP_FORMAT_CHECK is honored for zip members.
  • avro/jsonl conversion-error swallow. A polars-native special-format conversion failure currently falls back to reading raw bytes (preserved historical behavior). slice_from_avro/slice_from_jsonl_* "pass" only because the asserted substring is embedded in the binary. Decide whether to surface these errors (and fix/replace the fragile fixtures) or keep the per-format swallow. ✅ Done in #3989 ("fix: surface special-format conversion errors instead of swallowing them (#3988)"): conversion failures are now surfaced as hard errors for all special formats (escape hatch: QSV_SKIP_FORMAT_CHECK). The unreadable Avro fixture was regenerated, and the two slice Decimal-pschema tests (which only passed via the swallow) were reworked onto the compressed-CSV path, which genuinely applies a Decimal pschema.
  • Nested special formats (e.g. a .parquet inside a .zip) are unsupported by design — document or support. ✅ Decision: won't support — documented. Parquet/Avro/Arrow are already compressed, so nesting them in a .zip is not a real-world workflow; provide such files directly (qsv reads them natively). Documented in the README "Extended Input Support" section and in select_zip_entry's doc comment.
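
The "tabular-first" selection rule from the zip-unification item above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shared zip module itself: `TABULAR_EXTS`, `tabular_first`, and `select_single` are hypothetical names, the extension list is a guess, and the sketch treats "supported" as "tabular" for simplicity.

```rust
use std::path::Path;

// Assumption: which extensions count as tabular is illustrative only.
const TABULAR_EXTS: &[&str] = &["csv", "tsv", "tab", "ssv"];

fn is_tabular(name: &str) -> bool {
    Path::new(name)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| TABULAR_EXTS.contains(&e.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

/// Return every entry, tabular ones first, keeping archive order within
/// each group. A multi-input command would consume the whole list.
fn tabular_first(names: &[&str]) -> Vec<String> {
    let (mut tabular, other): (Vec<&str>, Vec<&str>) =
        names.iter().copied().partition(|n| is_tabular(n));
    tabular.extend(other);
    tabular.into_iter().map(String::from).collect()
}

/// What a single-input reader would pick: the first entry of the ordered
/// list, or an error when the archive holds no tabular entry at all.
fn select_single(names: &[&str]) -> Result<String, String> {
    tabular_first(names)
        .into_iter()
        .next()
        .filter(|n| is_tabular(n))
        .ok_or_else(|| "zip archive contains no supported tabular entry".to_string())
}
```

Because both callers go through the same ordering, a single-input command and a reader-level caller cannot disagree on which entry comes first, which is the property the unification was after.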

Lookup-table caching

  • A remote .zip whose inner tabular file is not comma-delimited can't have its delimiter inferred from the URL (the cache file defaults to .csv). gz/zlib/zst/sz already carry the inner extension via the URL stem. ✅ Done in #3991 ("fix: infer remote .zip lookup table's inner delimiter from its entry (#3988)"): the downloader now names the cache file from the inner entry's extension discovered during extraction (resetting to .csv on a later refresh whose inner file is CSV), and the cache-hit path probes the tabular extensions to find it. Generalized to ckan:// too (its resolved data URL's extension isn't knowable up front); dathere:// and explicit-extension URLs stay deterministic.
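
A rough sketch of the naming and probing logic described in the item above, assuming the cache file is identified by a stem plus an extension. `cache_file_name`, `probe_cache_hit`, and the extension list are hypothetical and do not come from the qsv source; the existence check is passed in as a closure so the sketch stays independent of the filesystem.

```rust
use std::path::Path;

// Assumption: the set of tabular extensions to probe is illustrative only.
const TABULAR_EXTS: &[&str] = &["csv", "tsv", "tab", "ssv"];

/// Name the cache file after the inner zip entry's extension, falling back
/// to .csv when the entry is unknown or its extension is not tabular.
fn cache_file_name(cache_stem: &str, inner_entry: Option<&str>) -> String {
    let ext = inner_entry
        .and_then(|name| Path::new(name).extension())
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .filter(|e| TABULAR_EXTS.contains(&e.as_str()))
        .unwrap_or_else(|| "csv".to_string());
    format!("{cache_stem}.{ext}")
}

/// On a cache lookup the inner extension is not known from the URL, so try
/// each tabular extension in turn and return the first that exists.
fn probe_cache_hit(cache_stem: &str, exists: impl Fn(&str) -> bool) -> Option<String> {
    TABULAR_EXTS
        .iter()
        .map(|ext| format!("{cache_stem}.{ext}"))
        .find(|candidate| exists(candidate))
}
```

A caller backed by the real cache directory could pass something like `|p| Path::new(p).exists()` as the `exists` closure.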

🤖 Generated with Claude Code

Labels: enhancement
