[WIP] Internal threading + in-place codec API for sharded reads #265
Draft
habemus-papadum wants to merge 8 commits into
The current ShardingCodec read path always decodes the full outer
chunk: a slice that touches one inner shard still pays decompression
for every other inner shard in the outer chunk. On a 740-shard layout
(4y of seconds, daily inner shards) that's a ~700x decompression tax
on surgical slices.
This change closes the gap to zarr-python with two layered fast paths:
In-memory partial decode (src/Codecs/V3/V3.jl):
- sharding_codec(p) detects a "pure" sharding pipeline (no
array->array codecs before, no bytes->bytes codecs after) so
readblock! can opt into the fast path conservatively.
- read_shard_partial_with_source! takes a byte-source closure and
only decodes inner chunks intersecting the requested slice.
- read_shard_partial! is the in-memory wrapper: caller has the
whole outer chunk in memory, we slice it for index + inner-chunk
bytes.
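
A minimal sketch of the "only decode intersecting inner chunks" step, assuming 1-based ranges local to one outer chunk. `intersecting_inner_chunks` is a hypothetical helper written for illustration, not the function added in V3.jl:

```julia
# Hypothetical helper: list the inner chunks of one outer chunk that a
# requested slice touches.
#   inner  : inner-chunk shape, e.g. (4, 4)
#   ranges : requested index ranges, 1-based, local to the outer chunk
function intersecting_inner_chunks(inner::NTuple{N,Int},
                                   ranges::NTuple{N,UnitRange{Int}}) where {N}
    # Per dimension, the span of inner-chunk indices covered by the range.
    spans = ntuple(N) do d
        cld(first(ranges[d]), inner[d]):cld(last(ranges[d]), inner[d])
    end
    return vec(collect(CartesianIndices(spans)))
end

# An 8x8 outer chunk split into 4x4 inner chunks holds 4 inner chunks.
# A slice inside rows 1:2, columns 5:6 touches only inner chunk (1, 2).
hit = intersecting_inner_chunks((4, 4), (1:2, 5:6))
everything = intersecting_inner_chunks((4, 4), (1:8, 1:8))
```

Here `hit` holds one inner chunk and `everything` holds all four, which is the difference between the partial path and a full outer-chunk decode.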
Storage-aware partial reads:
- supports_partial_reads / read_range / getsize on AbstractStore
(defaults preserve existing behavior — fall back to full read +
in-memory slice).
- DirectoryStore opts in: open + seek + read for read_range,
filesize for getsize. Other store types are unchanged until
someone implements byte-range reads for them.
- _readblock_sharded_partial! in ZArray.jl wires it together: per
outer chunk, full reads stay on the existing path; partial reads
fetch only the index + intersecting inner chunks via byte-range
reads.
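
The open + seek + read pattern and the "index first, then inner chunks" order can be sketched as below. `read_range_sketch` and `parse_shard_index` are illustrative names, not the methods added here. The index layout is an assumption: one little-endian (offset, nbytes) UInt64 pair per inner chunk, followed by a 4-byte checksum, stored at the end of the shard.

```julia
# Read `len` bytes starting at 0-based byte `offset` without loading the file.
function read_range_sketch(path::AbstractString, offset::Integer, len::Integer)
    open(path, "r") do io
        seek(io, Int(offset))
        read(io, Int(len))
    end
end

# Assumed index size: 16 bytes per inner chunk plus a 4-byte checksum.
index_nbytes(n_inner::Integer) = 16 * n_inner + 4

# Turn the raw index bytes into (offset, nbytes) pairs; checksum is ignored.
function parse_shard_index(raw::Vector{UInt8}, n_inner::Integer)
    words = ltoh.(reinterpret(UInt64, raw[1:16*n_inner]))
    return [(words[2i-1], words[2i]) for i in 1:n_inner]
end

# Build a fake two-chunk shard on disk to exercise the sketch.
path, io = mktemp()
chunk_a = UInt8[1, 2, 3]
chunk_b = UInt8[4, 5, 6, 7]
index = htol.(UInt64[0, 3, 3, 4])          # (offset, nbytes) for each chunk
write(io, chunk_a, chunk_b, index, UInt32(0))  # dummy zero checksum
close(io)

# Step 1: fetch only the index from the tail of the file.
n = index_nbytes(2)
tail = read_range_sketch(path, filesize(path) - n, n)
entries = parse_shard_index(tail, 2)

# Step 2: fetch only the inner chunk the slice needs.
off, nb = entries[2]
second = read_range_sketch(path, off, nb)
```

Only the index bytes and the second chunk's bytes are read; `chunk_a` is never touched.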
Toggle:
- Zarr.enable_partial_shard_storage_reads[] (Ref{Bool}, default
true), modeled on the existing concurrent_io_tasks Ref. Set to
false to fall back to the in-memory partial-decode path for A/B
debugging.
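
For A/B debugging, a Ref-based flag can be flipped and restored around a call. A sketch under stated assumptions: `use_fast_path` stands in for a module-level flag such as Zarr.enable_partial_shard_storage_reads, and `read_path` and `with_flag` are hypothetical helpers defined only so the snippet runs on its own.

```julia
# Stand-in for a module-level Ref{Bool} flag (hypothetical name).
const use_fast_path = Ref(true)

# Hypothetical reader that picks a code path from the flag.
read_path() = use_fast_path[] ? :storage_range_reads : :in_memory_partial_decode

# Run `f` with the flag set to `value`, restoring the previous state after.
function with_flag(f, flag::Ref{Bool}, value::Bool)
    old = flag[]
    flag[] = value
    try
        return f()
    finally
        flag[] = old
    end
end

a = read_path()                                   # default state
b = with_flag(read_path, use_fast_path, false)    # forced fallback
```

Restoring the flag in a `finally` block keeps a failed comparison run from leaving the process in the fallback state.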
Measured speedups on a (67, 127M) sharded float64 archive:
partial 1-day query: 665s -> 1.0s (665x)
partial week query: 97s -> 1.0s (97x)
Existing test suite: 2482/2482 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests
- test/v3_codecs.jl: new "sharding partial-read fast path" testset
with 5 sub-testsets covering:
* sharding_codec() detection on pure / wrapped / non-sharded
pipelines and on a non-V3Pipeline argument
* in-memory partial path via DictStore (single-inner-chunk,
cross-inner, full-chunk, cross-outer, tail, whole-array slices)
* storage-aware partial path via DirectoryStore (same slice
patterns; results must match the in-memory path byte-for-byte)
* Zarr.enable_partial_shard_storage_reads[] toggle preserves
correctness in both states
* fill_value over a partial slice of an empty/never-written
outer chunk
- test/storage.jl: new "Partial-read storage interface" testset with
sub-testsets covering the AbstractStore defaults (DictStore) and
the DirectoryStore overrides (supports_partial_reads, read_range,
getsize, missing-key fallthrough).
Docs
- docs/src/UserGuide/partial_shard_reads.md: explains when the fast
path applies, the storage-interface methods stores opt into, and
the toggle. References the existing module-level docstrings.
No production code changed in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sharding partial-read fast paths were single-threaded: a request that
touched many inner chunks of one shard, or many outer chunks across
the array, decoded them serially. Adds two layers of internal
threading so one readblock! call scales with available cores.
read_shard_partial_with_source! (Codecs/V3/V3.jl)
- precomputes the work list of intersecting inner chunks (each
writes to a disjoint region of `aout`, so safe to parallelize)
- dispatches the decode loop with @sync / Threads.@spawn
- bounded buffer pool (Channel of nthreads buffers) caps memory
instead of allocating one full-shard buffer per task
_readblock_sharded_partial! (ZArray.jl)
- splits incoming blockr into full vs partial outer-chunk reads
- full-chunk reads keep the existing serial path with one shared
chunk buffer (already saturates one core)
- partial-chunk reads dispatch with @sync / Threads.@spawn: each
task reads its shard index + intersecting inner chunks
- inner-chunk threading inside each task still applies, so
multi-symbol multi-day queries get both axes of parallelism
Toggle:
- Zarr.enable_threaded_shard_decode[] (Ref{Bool}, default true).
Falls back to sequential when nthreads()==1, |work|==1, or the
flag is off.
Threading is on by default and opt-out via the flag. Sessions started
with a single thread always take the sequential path, so they see no
behavior change. Both flag declarations moved above include() so
nested modules can import them.
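
The dispatch pattern described above (precomputed work list, disjoint output regions, @sync / Threads.@spawn, bounded buffer pool) can be sketched as follows. This is not the code in V3.jl: `decode_into!` and `decode_all!` are hypothetical, and the "decode" is a placeholder that fills each region with its work-item id.

```julia
# Placeholder for the real inner-chunk decode: uses a scratch buffer and
# writes into its own region of the output.
function decode_into!(out::AbstractVector{Float64}, scratch::Vector{UInt8}, id::Int)
    fill!(scratch, id % UInt8)   # pretend to decompress into scratch
    out .= id                    # pretend to copy decoded values out
    return nothing
end

function decode_all!(aout::Vector{Float64}, work::Vector{UnitRange{Int}};
                     scratch_bytes::Int = 1024, cap::Int = 8)
    # Bounded pool: never more buffers than threads, work items, or the cap.
    npool = max(1, min(Threads.nthreads(), length(work), cap))
    pool = Channel{Vector{UInt8}}(npool)
    for _ in 1:npool
        put!(pool, Vector{UInt8}(undef, scratch_bytes))
    end
    # Regions are disjoint, so tasks can write to `aout` without locking.
    @sync for (id, region) in enumerate(work)
        Threads.@spawn begin
            buf = take!(pool)    # blocks until a buffer is free
            try
                decode_into!(view(aout, region), buf, id)
            finally
                put!(pool, buf)  # hand the buffer to the next task
            end
        end
    end
    return aout
end

aout = zeros(12)
decode_all!(aout, [1:4, 5:8, 9:12])
```

Taking a buffer from the Channel is what bounds memory: a task that cannot get a buffer waits instead of allocating its own.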
Measured on a (129 syms × 1.48 yr) symbol-major archive, BTC × full
quotes history, all 6 vars:
user-serial path: 23.9s → 10.3s (2.3x from internal threading)
user-level @threads over 6 vars: 12.7s → 5.2s
Python (zarr-python 3, internal parallelism) baseline: 4.0s.
Existing test suite: 2482/2482 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every inner-chunk decode in the V3 partial-shard path made two
chunk-sized transient allocations: one in zstd's `cc_decode` (returns
a fresh Vector{UInt8}) and one in BytesCodec's
`collect(reinterpret(...))`. pipeline_decode! then did a final
`copyto!(output, arr)`, a third copy. For a 187 MB inner chunk that's
GBs of waste per query.
Patch A: in-place codec_decode! API (Codecs/V3/V3.jl)
- Generic fallbacks dispatched on V3Codec{In,Out} so any codec gets
a working method (allocates + copies, but no caller-visible API
change).
- Specialized BytesCodec: reinterprets the output array as a UInt8
view and `copyto!`s the encoded bytes straight in. No fresh Vector{T}.
- Specialized ZstdV3Codec — calls ChunkCodecCore.decode! into the
caller's buffer directly. No fresh decoded Vector{UInt8}.
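
A sketch of the in-place idea for the matching-endian BytesCodec case, assuming an isbits element type. `bytes_decode!` is an illustrative stand-in, not the `codec_decode!` method itself:

```julia
# Copy encoded bytes straight into the caller's typed array through a
# UInt8 view of that array, instead of building a fresh Vector{T}.
function bytes_decode!(out::Array{T}, encoded::AbstractVector{UInt8}) where {T}
    length(encoded) == sizeof(out) ||
        throw(DimensionMismatch("expected $(sizeof(out)) bytes, got $(length(encoded))"))
    # vec(out) shares memory with out, so the copy lands in `out` itself.
    copyto!(reinterpret(UInt8, vec(out)), encoded)
    return out
end

src = Float64[1.5, -2.0, 3.25, 4.0]
encoded = collect(reinterpret(UInt8, src))   # native-endian byte image
out = Matrix{Float64}(undef, 2, 2)
bytes_decode!(out, encoded)
```

The size check mirrors the dimension-mismatch error case: a byte count that does not match the output is rejected before anything is written.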
Patch B: pipeline_decode! threads in-place dispatch (pipeline.jl)
- For the common case (no array_array codecs, ≥1 bytes_bytes codec),
sizes a single intermediate buffer to the array-bytes step's
expected output and reuses it across the chain via codec_decode!.
- Final array-bytes step writes straight into `output` — eliminates
the `arr` allocation + redundant copy.
- Falls back to the old path when transpose codecs or other
array-array steps are present.
Patch C: cap inner-decode buffer pool (Zarr.jl + V3.jl)
- New Zarr.max_concurrent_inner_decodes Ref{Int} (default 8),
modeled on zarr-python's async.concurrency = 10.
- read_shard_partial_with_source! pool sized to
min(nthreads, |work|, max_concurrent_inner_decodes[]).
- On a `-t 32` run with ~187 MB chunks the pool no longer reserves
~6 GB upfront for buffers most tasks never use.
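
The arithmetic behind the cap, as a small sketch. `pool_size` is a hypothetical name for the min(...) rule above, and the work-list length of 100 is an arbitrary illustrative value:

```julia
# Pool size rule from Patch C: bounded by threads, work items, and the cap.
pool_size(nthreads::Int, nwork::Int, cap::Int) = min(nthreads, nwork, cap)

chunk_mb = 187                                   # approximate inner-chunk size
uncapped_mb = 32 * chunk_mb                      # one buffer per thread at -t 32
capped_mb = pool_size(32, 100, 8) * chunk_mb     # default cap of 8
```

With one buffer per thread the pool reserves 5984 MB, which is the ~6 GB quoted above; with the default cap it reserves 1496 MB.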
Measured (HL BTC quotes full history × 6 vars on
hyperliquid_1s_symmajor.zarr, -t 32):
user-serial: 9.6s → 7.2s (-25%)
user @threads-over-vars: 5.1s → 3.3s (-35%, beats Python's 4.0s)
Allocations dropped 35-40% across partial-read benches.
Existing test suite: 2482/2482 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pipeline_decode! was still allocating a chunk-sized Vector{UInt8} on
every inner-chunk decode for the bytes-bytes step's output. For our
prod shape that's ~187 MB per inner chunk × ~7 outer chunks per query
= ~1.3 GB of transient allocations per arr[:, si] call.
The dominant pipeline shape is `[BytesCodec, ZstdV3Codec]` —
BytesCodec just reinterprets bytes as the array's element type, so
the bytes-bytes output IS the byte view of the typed output array.
Decoding zstd straight into `reinterpret(UInt8, vec(output))`
eliminates the scratch entirely.
Two new fast paths added:
- matching-endian common case: zstd into output's byte view, return.
- mismatched-endian variant: zstd into output's byte view, then
in-place bswap via codec_decode!(::BytesCodec, ...).
For multi-step bytes-bytes chains or pipelines with array_array codecs
(transpose etc.), keep the existing scratch-buffer / fallback paths
unchanged.
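
Both fast paths can be sketched without a zstd dependency. Assumptions: `fake_decode!` is a stand-in for a bytes-to-bytes decoder that fills a caller-provided destination (here a byte-wise XOR), and `decode_direct!` is a hypothetical name, not the rewritten pipeline_decode!.

```julia
# Stand-in decoder: writes the "decompressed" bytes into `dst` in place.
function fake_decode!(dst::AbstractVector{UInt8}, src::AbstractVector{UInt8})
    length(dst) == length(src) || throw(DimensionMismatch("size mismatch"))
    dst .= src .⊻ 0xff
    return dst
end

function decode_direct!(output::Array{T}, compressed::Vector{UInt8};
                        swap::Bool = false) where {T}
    # Decode into the byte view of the typed output: no scratch buffer.
    fake_decode!(reinterpret(UInt8, vec(output)), compressed)
    # Mismatched-endian variant: fix byte order in place afterwards.
    swap && map!(bswap, output, output)
    return output
end

vals = UInt32[1, 2, 0x01020304]
stored = collect(reinterpret(UInt8, hton.(vals)))   # big-endian byte image
compressed = stored .⊻ 0xff                         # "compress" with the XOR

out = Vector{UInt32}(undef, 3)
little_endian_host = Base.ENDIAN_BOM == 0x04030201
decode_direct!(out, compressed; swap = little_endian_host)
```

The data is stored big-endian here, so a little-endian host takes the swap branch and a big-endian host takes the plain branch; either way the only buffer written is `out`.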
Measured (HL BTC quotes full history × 6 vars, -t 32):
user-serial: 7.2s → 3.2s (Python 4.0s)
user @threads-over-vars: 3.3s → 1.1s
per-call alloc on f64 vars: 2.9 GB → 1.7 GB
Existing test suite: 2482/2482 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests
- test/v3_codecs.jl: three new testsets, 34 cases:
* "codec_decode! in-place API" — BytesCodec little-endian and
big-endian (bswap) round-trips, dimension-mismatch error,
ZstdV3Codec in-place decode, and the generic V3Codec{:bytes,
:bytes} fallback via CRC32cV3Codec.
* "pipeline_decode! V3 paths" — exercises each branch:
matching-endian fast path (BytesCodec :little + Zstd),
endian-mismatch variant (BytesCodec :big), no-bytes-bytes path
(BytesCodec only), multi bytes-bytes scratch-buffer branch
(Zstd + CRC32c), and the array_array fallback (TransposeCodec).
* "threading flags preserve correctness" — flips
enable_threaded_shard_decode[] and max_concurrent_inner_decodes[]
across realistic combinations, verifies reads still match.
Docs
- docs/src/UserGuide/partial_shard_reads.md: new "Threading" and
"In-place codec API" sections describing the user-transparent
parallelism, the two new toggles, and how downstream codecs can
opt into in-place dispatch.
No production code changed in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required by the upstream changelog-enforcer CI check. Describes the
new in-memory and storage-aware partial-decode paths, the optional
AbstractStore methods (supports_partial_reads, read_range, getsize),
DirectoryStore opt-in, and the enable_partial_shard_storage_reads[]
toggle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Describes the internal Threads.@spawn dispatch in
read_shard_partial_with_source! and _readblock_sharded_partial!, the
new enable_threaded_shard_decode[] and max_concurrent_inner_decodes[]
toggles, the in-place codec_decode! API, and the rewritten V3
pipeline_decode! that eliminates per-inner-chunk scratch allocations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What this changes

Builds on the partial-read fast path from #264. Two follow-up improvements that close the throughput gap to `zarr-python`:

- Internal threading for the sharded partial-read path. When Julia is started with `-t > 1`, inner-chunk decodes within one outer chunk and outer-chunk reads across one request are dispatched to `Threads.@spawn`, the same way `zarr-python` parallelizes inner chunks inside one `__getitem__`. User code is unchanged; the parallelism is transparent.
- In-place codec API to remove transient allocations on the hot path. `pipeline_decode!` for V3 now decodes zstd directly into the byte view of the caller's typed output array for the dominant `[BytesCodec, ZstdV3Codec]` pipeline shape, eliminating the chunk-sized scratch buffer + `copyto!` that ran on every inner-chunk decode.

Total of 4 commits on top of #264:

- `Parallelize sharded partial reads with @spawn`: threading inside `read_shard_partial_with_source!` (over inner chunks) and `_readblock_sharded_partial!` (over outer chunks). New `Zarr.enable_threaded_shard_decode[]` toggle (default `true`).
- `In-place codec_decode! + cap inner-decode pool`: adds the `codec_decode!` API with specializations for `BytesCodec` (zero-copy reinterpret + bulk byte copy) and `ZstdV3Codec` (`ChunkCodecCore.decode!` straight into the caller's buffer), plus generic fallbacks for the three V3Codec In/Out tag pairs. Adds `Zarr.max_concurrent_inner_decodes[]` (`Ref{Int}`, default 8) capping the buffer pool independently of `nthreads()`.
- `Skip the bytes scratch buffer when array_bytes is BytesCodec`: `pipeline_decode!` for V3 detects `[BytesCodec, one bytes_bytes codec]` and decodes directly into `reinterpret(UInt8, vec(output))`, removing the chunk-sized scratch allocation entirely. Multi-step bytes-bytes pipelines and pipelines with array→array codecs keep the existing buffered fallback.
- `Test coverage + docs for the threading & in-place codec work`: non-production. 34 new test cases and a Threading + In-place codec section in `docs/src/UserGuide/partial_shard_reads.md`.

Toggles

Three flags, all `Ref` types modeled on the existing `Zarr.concurrent_io_tasks::Ref{Int}`:

- `Zarr.enable_partial_shard_storage_reads[]` (from #264, "Fast partial-read path for sharding_indexed codec")
- `Zarr.enable_threaded_shard_decode[]`: `Ref{Bool}`, default `true`. Forces the sequential decode path even with `-t > 1` when set to `false`.
- `Zarr.max_concurrent_inner_decodes[]`: `Ref{Int}`, default `8`. Mirrors `zarr-python`'s `async.concurrency = 10`.

Tests

Three new testsets covering the new code (`test/v3_codecs.jl`, +34 cases):

- `codec_decode!` in-place API: BytesCodec little-endian and big-endian (bswap) round-trips, dimension-mismatch error, ZstdV3Codec in-place decode, and the generic `V3Codec{:bytes,:bytes}` fallback via `CRC32cV3Codec`.
- `pipeline_decode!` V3 paths: exercises every branch of the rewritten function: matching-endian fast path, endian-mismatch variant, no-bytes-bytes path, multi-bytes-bytes scratch-buffer branch, and the array_array fallback (TransposeCodec).
- Threading flags preserve correctness: flips `enable_threaded_shard_decode[]` and `max_concurrent_inner_decodes[]` across realistic combinations and verifies reads still match the written data on a sharded `DirectoryStore`-backed array.

Combined with #264's coverage, the suite is at 2555 / 2555 passing.

Performance

Measured on a `(129 syms × 1.48 yr)` symbol-major sharded archive, BTC × full quotes history × 6 vars:

[Timing table not recovered; the only surviving row label is `@threads`-over-vars.]

Per-call allocations on the f64 quote variables dropped roughly 35-40% versus #264 alone, with chunk-sized transient allocations eliminated for the dominant pipeline shape.
🤖 Generated with Claude Code