Fast partial-read path for sharding_indexed codec #264
The current ShardingCodec read path always decodes the full outer
chunk: a slice that touches one inner shard still pays decompression
for every other inner shard in the outer chunk. On a 740-shard layout
(4y of seconds, daily inner shards) that's a ~700x decompression tax
on surgical slices.
This change closes the gap to zarr-python with two layered fast paths:
In-memory partial decode (src/Codecs/V3/V3.jl):
- sharding_codec(p) detects a "pure" sharding pipeline (no
array->array codecs before, no bytes->bytes codecs after) so
readblock! can opt into the fast path conservatively.
- read_shard_partial_with_source! takes a byte-source closure and
only decodes inner chunks intersecting the requested slice.
- read_shard_partial! is the in-memory wrapper: caller has the
whole outer chunk in memory, we slice it for index + inner-chunk
bytes.
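The byte-source idea above can be illustrated with a toy 1-D sketch. It is not the PR's code: build_shard, read_partial! and EMPTY are hypothetical names, inner chunks are stored as raw uncompressed Float64 bytes, the index is assumed to sit at the end of the shard as (offset, nbytes) UInt64 pairs, the CRC32C trailer that the default index codecs append is left out, and a little-endian host is assumed.

```julia
# Hypothetical 1-D sketch; names and layout choices are illustrative only.
const EMPTY = typemax(UInt64)  # sharding spec: a missing inner chunk has offset = nbytes = 2^64-1

# Build a toy shard: raw inner chunks followed by an index of (offset, nbytes) pairs.
function build_shard(data::Vector{Float64}, inner_len::Int)
    @assert length(data) % inner_len == 0
    body, index = UInt8[], UInt64[]
    for k in 1:(length(data) ÷ inner_len)
        lo = (k - 1) * inner_len + 1
        hi = k * inner_len
        bytes = collect(reinterpret(UInt8, data[lo:hi]))
        push!(index, UInt64(length(body)), UInt64(length(bytes)))
        append!(body, bytes)
    end
    append!(body, reinterpret(UInt8, index))  # index goes last (assumed "end" location)
    return body
end

# Fill `out` with the elements in `rng`, touching the shard only through
# `source(byte_range)`. Returns how many inner chunks were actually decoded.
function read_partial!(out::Vector{Float64}, source, shard_nbytes::Int,
                       n_inner::Int, inner_len::Int, rng::UnitRange{Int})
    index_range = (shard_nbytes - 16 * n_inner + 1):shard_nbytes
    index = reinterpret(UInt64, source(index_range))
    decoded = 0
    for k in fld1(first(rng), inner_len):fld1(last(rng), inner_len)
        offset, nbytes = index[2k - 1], index[2k]
        (offset == EMPTY && nbytes == EMPTY) && continue  # keep the fill value in `out`
        byte_range = (Int(offset) + 1):Int(offset + nbytes)
        chunk = reinterpret(Float64, source(byte_range))
        chunk_lo = (k - 1) * inner_len + 1
        lo = max(first(rng), chunk_lo)
        hi = min(last(rng), chunk_lo + inner_len - 1)
        dst = (lo - first(rng) + 1):(hi - first(rng) + 1)
        src = (lo - chunk_lo + 1):(hi - chunk_lo + 1)
        out[dst] = chunk[src]
        decoded += 1
    end
    return decoded
end

data = collect(1.0:40.0)
shard = build_shard(data, 10)                 # 4 inner chunks of 10 elements each
bytes_read = Ref(0)
source = r -> (bytes_read[] += length(r); shard[r])  # in-memory byte source
out = zeros(5)
n_decoded = read_partial!(out, source, length(shard), 4, 10, 12:16)
```

With an in-memory source the closure just slices the outer chunk. A storage-backed source could issue byte-range reads instead without changing read_partial!.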
Storage-aware partial reads:
- supports_partial_reads / read_range / getsize on AbstractStore
(defaults preserve existing behavior — fall back to full read +
in-memory slice).
- DirectoryStore opts in: open + seek + read for read_range,
filesize for getsize. Other store types are unchanged until
someone implements byte-range reads for them.
- _readblock_sharded_partial! in ZArray.jl wires it together: per
outer chunk, full reads stay on the existing path; partial reads
fetch only the index + intersecting inner chunks via byte-range
reads.
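A minimal sketch of how such an opt-in interface might look, using toy types rather than Zarr.jl's own. ToyStore, ToyDictStore and ToyDirStore are hypothetical; the three function names mirror the ones listed above, but the bodies are assumptions based on the defaults this PR describes.

```julia
# Hypothetical toy stores; only the three method names come from the PR text.
abstract type ToyStore end

struct ToyDictStore <: ToyStore
    data::Dict{String,Vector{UInt8}}
end

struct ToyDirStore <: ToyStore
    root::String
end

Base.getindex(s::ToyDictStore, key::String) = get(s.data, key, nothing)

# Defaults: no opt-in, fall back to a full read plus an in-memory slice.
supports_partial_reads(::ToyStore) = false
function read_range(s::ToyStore, key::String, byte_range::UnitRange{Int})
    v = s[key]
    return v === nothing ? nothing : v[byte_range]
end
getsize(s::ToyStore, key::String) = length(s[key])

# Directory-backed override: seek to the first byte and read only what is needed.
supports_partial_reads(::ToyDirStore) = true
function read_range(s::ToyDirStore, key::String, byte_range::UnitRange{Int})
    path = joinpath(s.root, key)
    isfile(path) || return nothing
    open(path, "r") do io
        seek(io, first(byte_range) - 1)   # seek is 0-based, Julia ranges are 1-based
        read(io, length(byte_range))
    end
end
getsize(s::ToyDirStore, key::String) = filesize(joinpath(s.root, key))

dir = mktempdir()
write(joinpath(dir, "c0"), UInt8.(1:100))
dir_store = ToyDirStore(dir)
mem_store = ToyDictStore(Dict("c0" => UInt8.(1:100)))
from_disk = read_range(dir_store, "c0", 11:20)
from_mem = read_range(mem_store, "c0", 11:20)
```

Both stores return the same bytes for the same range. Only the directory-backed one avoids reading the whole file, which is why callers can check supports_partial_reads before choosing a path.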
Toggle:
- Zarr.enable_partial_shard_storage_reads[] (Ref{Bool}, default
true), modeled on the existing concurrent_io_tasks Ref. Set to
false to fall back to the in-memory partial-decode path for A/B
debugging.
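For A/B runs it can help to flip a Ref{Bool} toggle only for the duration of one measurement. The sketch below uses a stand-in Ref and hypothetical helpers (enable_fast_reads, choose_path, with_toggle) rather than the real Zarr.enable_partial_shard_storage_reads, so it runs without the package.

```julia
# Hypothetical stand-in for a module-level Ref{Bool} toggle.
const enable_fast_reads = Ref(true)

# Pick a read path the way a toggle-gated dispatcher might (assumed logic).
choose_path(store_supports_ranges::Bool) =
    (enable_fast_reads[] && store_supports_ranges) ? :storage_aware : :in_memory

# Run `f` with the toggle set to `value`, restoring the old value even on error.
function with_toggle(f, value::Bool)
    old = enable_fast_reads[]
    enable_fast_reads[] = value
    try
        return f()
    finally
        enable_fast_reads[] = old
    end
end

path_on = choose_path(true)
path_off = with_toggle(false) do
    choose_path(true)
end
```

Restoring the value in a finally block keeps one failed benchmark from leaving the process in the slow configuration.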
Measured speedups on a (67, 127M) sharded float64 archive:
partial 1-day query: 665s -> 1.0s (665x)
partial week query: 97s -> 1.0s (97x)
Existing test suite: 2482/2482 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests
- test/v3_codecs.jl: new "sharding partial-read fast path" testset
with 5 sub-testsets covering:
* sharding_codec() detection on pure / wrapped / non-sharded
pipelines and on a non-V3Pipeline argument
* in-memory partial path via DictStore (single-inner-chunk,
cross-inner, full-chunk, cross-outer, tail, whole-array slices)
* storage-aware partial path via DirectoryStore (same slice
patterns; results must match the in-memory path byte-for-byte)
* Zarr.enable_partial_shard_storage_reads[] toggle preserves
correctness in both states
* fill_value over a partial slice of an empty/never-written
outer chunk
- test/storage.jl: new "Partial-read storage interface" testset with
sub-testsets covering the AbstractStore defaults (DictStore) and
the DirectoryStore overrides (supports_partial_reads, read_range,
getsize, missing-key fallthrough).
Docs
- docs/src/UserGuide/partial_shard_reads.md: explains when the fast
path applies, the storage-interface methods stores opt into, and
the toggle. References the existing module-level docstrings.
No production code changed in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required by the upstream changelog-enforcer CI check. Describes the new in-memory and storage-aware partial-decode paths, the optional AbstractStore methods (supports_partial_reads, read_range, getsize), DirectoryStore opt-in, and the enable_partial_shard_storage_reads[] toggle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pushed a follow-up commit (e382675) to handle CI feedback:
This looks promising. Unfortunately, reviewing such a long PR is beyond me at the moment, but we should get it in, in some form.
What this changes
The current ShardingCodec decode path always materializes the entire outer chunk, regardless of how small the user's slice is. A request that touches one inner shard pays decompression for every other inner shard in the outer chunk. On a typical layout (e.g. 740 daily inner shards in one outer chunk) that's roughly a 700× decompression tax on partial reads, and zarr-python doesn't pay it.
This PR adds two layered fast paths for the sharding_indexed codec, both transparent to user code (arr[a:b, c] is unchanged):
- In-memory partial decode (Codecs/V3/V3.jl). When the chunk's bytes are already in memory, only decode the inner chunks intersecting the user's slice, copying each intersection straight into the output. Skipped inner chunks pay nothing.
- Storage-aware partial reads (ZArray.jl + Storage/Storage.jl). When the storage backend supports byte-range reads, fetch only the shard index plus the bytes of the intersecting inner chunks. Skipped inner chunks aren't read off disk either.
Storage interface
Three new optional methods on AbstractStore, each with a safe default that falls back to the existing full-read path:
- supports_partial_reads(::AbstractStore) -> Bool (default false)
- read_range(s, key, byte_range) -> Union{Vector{UInt8}, Nothing} (default: s[key][byte_range])
- getsize(s, key) -> Int (default: length(s[key]))
DirectoryStore opts in (using seek + readbytes! and filesize). Other built-in backends inherit the defaults, so they automatically use the in-memory partial-decode path with no behavior change.
Toggle
Zarr.enable_partial_shard_storage_reads[] (Ref{Bool}, default true), modeled on the existing concurrent_io_tasks::Ref. Set it to false to fall back to the in-memory partial-decode path even on stores that opt in, which is useful for A/B comparisons.
Pure-pipeline detection
The fast path is gated by a small helper, sharding_codec(p::V3Pipeline), that returns the inner ShardingCodec only when the pipeline is "pure" (no array→array codecs before, no bytes→bytes codecs after). Compound pipelines (e.g. transpose + sharding, or sharding wrapped by an outer compressor) keep the existing decode path unchanged.
Tests
Two new testsets, 39 new test cases:
- test/v3_codecs.jl :: "sharding partial-read fast path": pure-vs-impure pipeline detection, in-memory partial path via DictStore, storage-aware partial path via DirectoryStore, the toggle round-trip, and fill_value over partial slices of unwritten outer chunks. Slice patterns include single-inner-chunk, cross-inner-chunk, full-chunk, cross-outer-chunk, and whole-array reads.
- test/storage.jl :: "Partial-read storage interface": AbstractStore defaults via DictStore (including missing-key handling) and the DirectoryStore overrides.
Existing test suite stays at 2482 passes; this PR adds 39 → 2521 / 2521 passing.
Docs
docs/src/UserGuide/partial_shard_reads.md describes when the fast path applies, the storage-interface methods stores opt into, and the toggle.
Performance
Measured on a (67 syms × 127M seconds) float64 sharded archive (the layout zarr-python writes by default for 4-year second-resolution archives). For a one-day single-symbol query, the commit message above reports 665s -> 1.0s (665x).
That's the in-memory + storage-aware paths combined for DirectoryStore. Other partial-read patterns see similar 50-100× speedups; full-shard reads are unchanged.
Compatibility
The toggle defaults to true for the perf win; setting it to false recovers the previous in-memory behavior exactly.
🤖 Generated with Claude Code