Fast partial-read path for sharding_indexed codec by habemus-papadum · Pull Request #264 · JuliaIO/Zarr.jl

habemus-papadum · 2026-04-27T21:39:20Z

What this changes

The current ShardingCodec decode path always materializes the entire outer chunk, regardless of how small the user's slice is. A request that touches one inner shard pays decompression for every other inner shard in the outer chunk — on a typical layout (e.g. 740 daily inner shards in one outer chunk) that's roughly a 700× decompression tax on partial reads, and zarr-python doesn't pay it.

This PR adds two layered fast paths for the sharding_indexed codec, both transparent to user code (arr[a:b, c] is unchanged):

In-memory partial decode (Codecs/V3/V3.jl). When the chunk's bytes are already in memory, only decode the inner chunks intersecting the user's slice, copying each intersection straight into the output. Skipped inner chunks pay nothing.
Storage-aware partial read (ZArray.jl + Storage/Storage.jl). When the storage backend supports byte-range reads, fetch only the shard index plus the bytes of the intersecting inner chunks. Skipped inner chunks aren't read off disk either.
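Both paths start from the same question: which inner chunks does the requested slice touch? A minimal sketch of that step, assuming 1-based ranges expressed in outer-chunk coordinates. The helper name `intersecting_inner_chunks` is hypothetical and is not taken from the PR.

```julia
# Hypothetical helper, not part of Zarr.jl: list the inner chunks of one outer
# chunk that a requested slice touches, together with the overlapping ranges
# (both in 1-based outer-chunk coordinates).
function intersecting_inner_chunks(inner_shape::NTuple{N,Int},
                                   request::NTuple{N,UnitRange{Int}}) where {N}
    # Per dimension, the span of inner-chunk indices covered by the request.
    chunk_ranges = ntuple(N) do d
        first_chunk = fld(first(request[d]) - 1, inner_shape[d]) + 1
        last_chunk = fld(last(request[d]) - 1, inner_shape[d]) + 1
        first_chunk:last_chunk
    end
    result = Tuple{NTuple{N,Int},NTuple{N,UnitRange{Int}}}[]
    for I in CartesianIndices(chunk_ranges)
        idx = Tuple(I)
        # Clip the inner chunk's extent to the requested ranges.
        overlap = ntuple(N) do d
            lo = (idx[d] - 1) * inner_shape[d] + 1
            hi = idx[d] * inner_shape[d]
            max(lo, first(request[d])):min(hi, last(request[d]))
        end
        push!(result, (idx, overlap))
    end
    return result
end

# A 4x4 inner-chunk grid; the request spans two inner chunks along dimension 1.
hits = intersecting_inner_chunks((4, 4), (3:6, 2:2))
```

Only the inner chunks returned here would need to be decoded (in-memory path) or fetched (storage-aware path); every other inner chunk is skipped.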

Storage interface

Three new optional methods on AbstractStore, each with a safe default that falls back to the existing full-read path:

supports_partial_reads(::AbstractStore) -> Bool (default false)
read_range(s, key, byte_range) -> Union{Vector{UInt8}, Nothing} (default: s[key][byte_range])
getsize(s, key) -> Int (default: length(s[key]))

DirectoryStore opts in (using seek + readbytes! and filesize). Other built-in backends inherit the defaults — they automatically use the in-memory partial-decode path with no behavior change.
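The opt-in pattern can be sketched with toy types. Everything below (`ToyStore`, `ToyDictStore`, `ToyDirStore`) is a stand-in defined for illustration so the example runs without Zarr.jl; the method names mirror the ones listed above, but the bodies are assumptions, including the `nothing` handling for missing keys.

```julia
# Toy stand-ins for illustration only; these are not Zarr.jl's definitions.
abstract type ToyStore end

# Safe defaults: no partial reads, fall back to reading the whole value.
supports_partial_reads(::ToyStore) = false
getsize(s::ToyStore, key) = length(s[key])
function read_range(s::ToyStore, key, byte_range::UnitRange{Int})
    data = s[key]
    return data === nothing ? nothing : data[byte_range]
end

# In-memory store that inherits every default.
struct ToyDictStore <: ToyStore
    data::Dict{String,Vector{UInt8}}
end
Base.getindex(s::ToyDictStore, key) = get(s.data, key, nothing)

# File-backed store that opts in to byte-range reads.
struct ToyDirStore <: ToyStore
    root::String
end
function Base.getindex(s::ToyDirStore, key)
    path = joinpath(s.root, key)
    return isfile(path) ? read(path) : nothing
end
supports_partial_reads(::ToyDirStore) = true
getsize(s::ToyDirStore, key) = filesize(joinpath(s.root, key))
function read_range(s::ToyDirStore, key, byte_range::UnitRange{Int})
    path = joinpath(s.root, key)
    isfile(path) || return nothing
    open(path, "r") do io
        seek(io, first(byte_range) - 1)   # seek takes a 0-based byte offset
        buf = Vector{UInt8}(undef, length(byte_range))
        n = readbytes!(io, buf, length(byte_range))
        resize!(buf, n)                   # trim if the file ended early
    end
end

# Same bytes in both stores; only the file-backed one reads a sub-range from disk.
dir = mktempdir()
write(joinpath(dir, "c0"), UInt8.(1:10))
mem = ToyDictStore(Dict("c0" => UInt8.(1:10)))
disk = ToyDirStore(dir)
```

A caller would check `supports_partial_reads(store)` once and then use `read_range` and `getsize` only when it returns `true`, which is why stores that do nothing keep working unchanged.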

Toggle

Zarr.enable_partial_shard_storage_reads[] (Ref{Bool}, default true), modeled on the existing concurrent_io_tasks::Ref. Set it to false to fall back to the in-memory partial-decode path even on stores that opt in — useful for A/B comparisons.
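For an A/B comparison it is convenient to flip the toggle only for the duration of one call and restore it afterwards. A minimal sketch, assuming a local `Ref{Bool}` that stands in for `Zarr.enable_partial_shard_storage_reads`; the `with_toggle` helper is hypothetical and not part of the PR.

```julia
# Local stand-in for Zarr.enable_partial_shard_storage_reads, defined here so
# the sketch runs without Zarr.jl.
const enable_partial_shard_storage_reads = Ref(true)

# Hypothetical helper: run `f` with the toggle set to `value`, then restore
# the previous setting even if `f` throws.
function with_toggle(f, toggle::Ref{Bool}, value::Bool)
    old = toggle[]
    toggle[] = value
    try
        return f()
    finally
        toggle[] = old
    end
end

# Inside the block the toggle is off; afterwards it is back to its old value.
seen_inside = with_toggle(enable_partial_shard_storage_reads, false) do
    enable_partial_shard_storage_reads[]
end
```

With the real toggle, the body of the `do` block would be the array read being timed, for example `arr[a:b, c]`.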

Pure-pipeline detection

The fast path is gated by a small helper, sharding_codec(p::V3Pipeline), that returns the inner ShardingCodec only when the pipeline is "pure" (no array→array codecs before, no bytes→bytes codecs after). Compound pipelines (e.g. transpose + sharding, or sharding wrapped by an outer compressor) keep the existing decode path unchanged.
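The gating rule can be illustrated with toy types. The field names `array_array`, `array_bytes`, and `bytes_bytes` and all `Toy*` types below are assumptions made for this sketch; the real `V3Pipeline` and `sharding_codec` in the PR may be structured differently.

```julia
# Toy types for illustration; Zarr.jl's real pipeline and codec types may differ.
abstract type ToyCodec end
struct ToySharding <: ToyCodec end
struct ToyBytes <: ToyCodec end
struct ToyTranspose <: ToyCodec end
struct ToyGzip <: ToyCodec end

struct ToyPipeline
    array_array::Vector{ToyCodec}   # array -> array codecs, applied first
    array_bytes::ToyCodec           # the single array -> bytes codec
    bytes_bytes::Vector{ToyCodec}   # bytes -> bytes codecs, applied last
end

# Return the sharding codec only when the pipeline is "pure", else `nothing`.
function toy_sharding_codec(p::ToyPipeline)
    isempty(p.array_array) || return nothing
    isempty(p.bytes_bytes) || return nothing
    return p.array_bytes isa ToySharding ? p.array_bytes : nothing
end
toy_sharding_codec(::Any) = nothing   # non-pipeline arguments never qualify

pure = ToyPipeline(ToyCodec[], ToySharding(), ToyCodec[])
transposed = ToyPipeline(ToyCodec[ToyTranspose()], ToySharding(), ToyCodec[])
wrapped = ToyPipeline(ToyCodec[], ToySharding(), ToyCodec[ToyGzip()])
unsharded = ToyPipeline(ToyCodec[], ToyBytes(), ToyCodec[])
```

Returning `nothing` for anything that is not a pure sharding pipeline keeps the check conservative: the caller takes the fast path only when it gets a codec back, and every other case stays on the existing decode path.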

Tests

Two new testsets, 39 new test cases:

test/v3_codecs.jl :: "sharding partial-read fast path" — pure-vs-impure pipeline detection, in-memory partial path via DictStore, storage-aware partial path via DirectoryStore, the toggle round-trip, and fill_value over partial slices of unwritten outer chunks. Slice patterns include single-inner-chunk, cross-inner-chunk, full-chunk, cross-outer-chunk, and whole-array reads.
test/storage.jl :: "Partial-read storage interface" — AbstractStore defaults via DictStore (including missing-key handling) and the DirectoryStore overrides.

Existing test suite stays at 2482 passes; this PR adds 39 → 2521 / 2521 passing.

Docs

docs/src/UserGuide/partial_shard_reads.md describes when the fast path applies, the storage-interface methods stores opt into, and the toggle.

Performance

Measured on a float64 sharded archive of 67 symbols × 127M seconds (the layout zarr-python writes by default for 4-year second-resolution archives), with a one-day, single-symbol query:

Before this PR: 665 s (decompresses the full outer chunk per call)
After this PR: 1.0 s (one inner chunk decompressed)

That's the in-memory and storage-aware paths combined for DirectoryStore. Other partial-read patterns see 50-100× speedups (a one-week query drops from 97 s to 1.0 s); full-shard reads are unchanged.

Compatibility

Only the V3 sharding path is touched. V2 arrays, V3 arrays without sharding, and sharded V3 arrays whose pipeline isn't "pure" all run on the existing path with zero behavior change.
Storage interface additions are purely opt-in via the safe defaults; no existing backend has to change.
The toggle defaults to true for the perf win; setting it to false disables byte-range reads, so each outer chunk is read from storage in full and then decoded through the in-memory partial-decode path.

🤖 Generated with Claude Code

The current ShardingCodec read path always decodes the full outer chunk: a slice that touches one inner shard still pays decompression for every other inner shard in the outer chunk. On a 740-shard layout (4y of seconds, daily inner shards) that's a ~700x decompression tax on surgical slices. This change closes the gap to zarr-python with two layered fast paths:

In-memory partial decode (src/Codecs/V3/V3.jl):
- sharding_codec(p) detects a "pure" sharding pipeline (no array->array codecs before, no bytes->bytes codecs after) so readblock! can opt into the fast path conservatively.
- read_shard_partial_with_source! takes a byte-source closure and only decodes inner chunks intersecting the requested slice.
- read_shard_partial! is the in-memory wrapper: caller has the whole outer chunk in memory, we slice it for index + inner-chunk bytes.

Storage-aware partial reads:
- supports_partial_reads / read_range / getsize on AbstractStore (defaults preserve existing behavior: fall back to full read + in-memory slice).
- DirectoryStore opts in: open + seek + read for read_range, filesize for getsize. Other store types are unchanged until someone implements byte-range reads for them.
- _readblock_sharded_partial! in ZArray.jl wires it together: per outer chunk, full reads stay on the existing path; partial reads fetch only the index + intersecting inner chunks via byte-range reads.

Toggle:
- Zarr.enable_partial_shard_storage_reads[] (Ref{Bool}, default true), modeled on the existing concurrent_io_tasks Ref. Set to false to fall back to the in-memory partial-decode path for A/B debugging.

Measured speedups on a (67, 127M) sharded float64 archive:
partial 1-day query: 665s -> 1.0s (665x)
partial week query: 97s -> 1.0s (97x)

Existing test suite: 2482/2482 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tests
- test/v3_codecs.jl: new "sharding partial-read fast path" testset with 5 sub-testsets covering:
  * sharding_codec() detection on pure / wrapped / non-sharded pipelines and on a non-V3Pipeline argument
  * in-memory partial path via DictStore (single-inner-chunk, cross-inner, full-chunk, cross-outer, tail, whole-array slices)
  * storage-aware partial path via DirectoryStore (same slice patterns; results must match the in-memory path byte-for-byte)
  * Zarr.enable_partial_shard_storage_reads[] toggle preserves correctness in both states
  * fill_value over a partial slice of an empty/never-written outer chunk
- test/storage.jl: new "Partial-read storage interface" testset with sub-testsets covering the AbstractStore defaults (DictStore) and the DirectoryStore overrides (supports_partial_reads, read_range, getsize, missing-key fallthrough).

Docs
- docs/src/UserGuide/partial_shard_reads.md: explains when the fast path applies, the storage-interface methods stores opt into, and the toggle. References the existing module-level docstrings.

No production code changed in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Required by the upstream changelog-enforcer CI check. Describes the new in-memory and storage-aware partial-decode paths, the optional AbstractStore methods (supports_partial_reads, read_range, getsize), DirectoryStore opt-in, and the enable_partial_shard_storage_reads[] toggle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habemus-papadum · 2026-04-28T01:35:09Z

Pushed a follow-up commit (e382675) to handle CI feedback:

changelog check: now fixed — added an ## Unreleased entry to CHANGELOG.md describing this PR.
Julia LTS test failures (Minio + AWS S3 storage): not introduced by this PR. The failure is an UndefRefError (access to undefined reference) in ScopedValues.get(::ScopedValue{AWS.AbstractAWSConfig}) inside AWSS3.S3.create_bucket; both failing testsets call with_aws_config(...) do … end. The same failure shows up on PR #265 ([WIP] Internal threading + in-place codec API for sharded reads), which doesn't share my code paths; master CI was green 3 days ago; and test/Project.toml doesn't pin AWSS3 or ScopedValues. The most likely cause is an upstream package release that broke LTS compatibility with with_aws_config's internal ScopedValues dispatch. I haven't modified the failing tests; this is for maintainers to triage at the test-deps level.
codecov/project: the patch coverage (codecov/patch) is fine; the project-level drop is the usual rounding for any PR that touches multiple files. Should resolve once review starts.
The Julia 1, nightly, and pre rows all pass, including the new tests.

lazarusA · 2026-05-11T15:44:49Z

this looks promising,

Before this PR: 665 s (decompresses the full outer chunk per call)
After this PR: 1.0 s (one inner chunk decompressed)

unfortunately, reviewing such a long PR is beyond me at the moment. But, we should get it in, in some form.

habemus-papadum and others added 2 commits April 27, 2026 19:38

habemus-papadum mentioned this pull request on Apr 27, 2026, in draft PR #265: [WIP] Internal threading + in-place codec API for sharded reads

lazarusA mentioned this pull request on May 19, 2026, in draft PR #272: Performance: bulk-copy uncompressed encode + undef chunk buffer

habemus-papadum wants to merge 3 commits into JuliaIO:master from habemus-papadum:partial-reads-no-parallelism
