Performance improvements (v2 only for now) #280
Merged
`zcompress(data, NoCompressor())` returns a lazy `reinterpret(UInt8, data)`
view. The old `empty!` + `append!` fallback walked that view element by
element through `_growend!` / `push!`, materialising bytes one at a time —
roughly 67% of CPU time on uncompressed full-chunk V2 writes in profiling.
Replace with `resize!` + `copyto!` so the same path issues a single SIMD /
memcpy bulk copy. Real compressors (Blosc, Zlib, Zstd) are unaffected: they
already return a freshly-allocated `Vector{UInt8}` and `copyto!` handles
that case identically.
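The difference between the two fallbacks can be sketched in a few lines. This is a minimal illustration, not the package's code: `fill_by_append!` and `fill_by_copyto!` are hypothetical names, and the sketch only shows that both routes produce the same bytes; the speed difference is the one reported in the commit message, not something this snippet measures.

```julia
# Hypothetical helpers; they imitate the old and new fallback bodies.
function fill_by_append!(out::Vector{UInt8}, data::Array)
    bytes = reinterpret(UInt8, vec(data))   # lazy view over `data`, no copy
    empty!(out)
    append!(out, bytes)                     # old style: grow `out` while iterating the view
    return out
end

function fill_by_copyto!(out::Vector{UInt8}, data::Array)
    bytes = reinterpret(UInt8, vec(data))   # same lazy view
    resize!(out, length(bytes))             # size the destination once
    copyto!(out, bytes)                     # new style: one copy into the pre-sized buffer
    return out
end

data = rand(Float32, 64, 64, 4)
a = fill_by_append!(UInt8[], data)
b = fill_by_copyto!(UInt8[], data)
```

Both calls return a `Vector{UInt8}` of length `sizeof(data)` with identical contents, which is the "bit-identical bytes" property the commit relies on.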
Bytes-on-disk are bit-identical; full test suite (2499 tests) passes.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1):
| size | baseline | patched | speed-up |
|---------------------|---------:|---------:|---------:|
| 128×128×16×50 50M | 702 MB/s | 995 MB/s | 1.42× |
| 256×256×32×50 400M | 799 MB/s |1530 MB/s | 1.92× |
| 512×512×32×50 1.5G | 776 MB/s |1104 MB/s | 1.42× |
Reads are unchanged (this fallback is not on the read path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coveralls coverage report for CI Build 26687704356: coverage decreased (-0.005%) to 89.477%.
Uncovered changes: none found. Coverage regressions: 4 previously-covered lines in 3 files lost coverage.
The generic `zuncompress!` fallback at the top of `Compressors.jl`
ends up at `copyto!(::Array{T}, ::ReinterpretArray)` when
`c isa NoCompressor`, since `zuncompress(bytes, ::NoCompressor, T)`
returns a lazy `reinterpret(T, bytes)` view. That `copyto!` walks
element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for
`unsafe_copyto!` on the same memory, a 5× gap. End-to-end V2 reads
absorb most of it via DiskArrays slicing, channel ferry, and storage
I/O, leaving a ~26-99% throughput improvement depending on chunk size.
Add a `::NoCompressor`-dispatched `zuncompress!` method alongside
the existing per-compressor versions for Zstd / Zlib / Blosc. Mirror
image of the encode-side bulk copy that landed in the post-A1
`zcompress!`. Guards on `data::Array{T}` and `isbitstype(T)` keep
the fast path off `SenMissArray`, `MaxLengthString`, ragged
`Vector{T}` eltypes, and any non-contiguous input — those fall back
to the existing generic `copyto!` path.
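A rough sketch of the guarded decode fast path described above, under stated assumptions: `DemoNoCompressor`, `demo_uncompress!`, and `demo_uncompress_fast!` are hypothetical stand-ins, and the real method signatures in `Compressors.jl` may differ. The guards mirror the ones named in the commit message (`Array{T}` destination, `isbitstype(T)`).

```julia
# Hypothetical stand-in for the package's NoCompressor type.
struct DemoNoCompressor end

# Generic route: copy through a lazy reinterpret view of the raw bytes.
function demo_uncompress!(out::AbstractArray{T}, bytes::Vector{UInt8}, ::DemoNoCompressor) where {T}
    copyto!(out, reinterpret(T, bytes))
    return out
end

# Fast route: raw byte copy, only for dense isbits arrays of matching byte size.
function demo_uncompress_fast!(out::Array{T}, bytes::Vector{UInt8}, c::DemoNoCompressor) where {T}
    if isbitstype(T) && sizeof(out) == length(bytes)
        GC.@preserve out bytes begin
            dst = convert(Ptr{UInt8}, pointer(out))
            unsafe_copyto!(dst, pointer(bytes), length(bytes))   # memcpy-style copy
        end
        return out
    end
    return demo_uncompress!(out, bytes, c)   # anything else takes the generic route
end

src = rand(Float64, 8, 4)
bytes = collect(reinterpret(UInt8, vec(src)))
out = Array{Float64}(undef, 8, 4)
demo_uncompress_fast!(out, bytes, DemoNoCompressor())
```

The `GC.@preserve` block keeps both arrays rooted while raw pointers are in use; the size check keeps the unsafe copy from running past either buffer.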
Tests: 2499/2499 pass.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash
on the same wall-clock:
| size | A1 only read | A1 + B1 read | speed-up |
|---------------------|-------------:|-------------:|---------:|
| 128×128×16×50 50M | 3080 MB/s | 3884 MB/s | +26% |
| 256×256×32×50 400M | 3155 MB/s | 4130 MB/s | +31% |
| 512×512×32×50 1.5G | 2243 MB/s | 4455 MB/s | +99% |
Writes are unchanged (B1 doesn't touch the encode path).
Reference: Zarrs.jl on the same V2-uncompressed workload reads at
5166 / 4672 / 5418 MB/s, so B1 closes most of the read gap; the
1.5 GiB case goes from ~41% of Zarrs.jl to ~82%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch
buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`,
which zero-fills the chunk before any other work touches it. On the
read side that fill is always clobbered: `pipeline_decode!` writes
every element, or the fill-value branch in `uncompress_raw!` calls
`fill!` itself. On the write side it's clobbered for full-chunk
overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and
re-initialised by `resetbuffer!` for partial-chunk RMW.
Add `getchunkarray_undef(z::ZArray{T})` which returns
`Array{T}(undef, …)` for plain isbits eltypes and falls back to
`getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc
and friends reject `Union{Missing,T}` directly, so the inner buffer
has to be a real `Array{T}`).
- `readblock!`: always use `getchunkarray_undef`. Decode-into and
fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when
`z.metadata.fill_value !== nothing` (resetbuffer! handles init).
Keep the legacy zero-fill when `fill_value === nothing` so partial
writes to a fresh chunk still default un-written cells to zero.
Port of the V2-side hunks from #272 (verbatim, with
the same `Missing <: T` carve-out the PR added in commit 75bea38 to
fix a Blosc rejection regression).
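The allocation change can be sketched as follows. This is an illustration only: `zeroed_chunk` and `undef_chunk` are hypothetical names, `chunks` is a plain tuple rather than `z.metadata.chunks`, and the `Missing <: T` branch is an assumption about what the zero-fill fallback produces.

```julia
# Hypothetical analogue of the zero-filling allocator.
zeroed_chunk(::Type{T}, chunks::Dims) where {T} = fill(zero(T), chunks)

# Hypothetical analogue of the undef allocator with the Missing carve-out.
function undef_chunk(::Type{T}, chunks::Dims) where {T}
    # Assumption: Union{Missing,...} eltypes keep a zero-filled isbits buffer.
    Missing <: T && return zeroed_chunk(nonmissingtype(T), chunks)
    return Array{T}(undef, chunks)          # contents are arbitrary until written
end

buf = undef_chunk(Float32, (4, 3))
fill!(buf, 1f0)                             # every element must be written before it is read
mbuf = undef_chunk(Union{Missing,Float32}, (2, 2))
```

The undef buffer is only safe because, as argued above, every caller overwrites all elements (decode, `fill!`, or `resetbuffer!`) before reading any of them.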
Tests: 2499/2499 pass — including the SenMissArray-exercising
`Fillvalue as missing`, `getindex/setindex` (amiss),
`MaxLengthString large-chunk read path`, and `ragged arrays` tests.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:
| size | B1 only (R) | B1+A2 (R) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 3466 MB/s | 3866 MB/s | +12% |
| 256×256×32×50 400M | 3589 MB/s | 4169 MB/s | +16% |
| 512×512×32×50 1.5G | ~3000 MB/s | ~3500 MB/s | noisy |
Write side smaller (most write time is storage-bound, not memset):
| size | B1 only (W) | B1+A2 (W) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 1148 MB/s | 1203 MB/s | +5% |
| 256×256×32×50 400M | 1327 MB/s | 1354 MB/s | +2% |
| 512×512×32×50 1.5G | 1132 MB/s | 1329 MB/s | +17% |
The 1.5 GiB row is the noisiest across runs because chunks no longer
fit in L3 — exact numbers vary with cache state at run start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One bullet per commit on the PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The multi-paragraph rationale comments are now stale post-landing; one line each is enough to flag why bulk copy beats append! / generic ReinterpretArray copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6b4551a to
2e4d09b
lazarusA
reviewed
May 20, 2026
The slow writeblock! path spawns a readtask and a writetask connected by
0-buffered channels for every call. Profiling the 400 MiB headline workload
showed 93% of write CPU pinned in __psynch_cvwait — the channel ferry
synchronisation cost — for the dominant case of writing one full chunk at
a time (`z[:,:,:,t] = buf`).
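To make the "channel ferry" concrete, here is a toy comparison of handing work between two tasks through a 0-buffered `Channel` versus doing the same work in a direct call. The function names are hypothetical and the workload (summing chunks) is a placeholder; the sketch shows the structure, not the measured cost.

```julia
# Ferry style: a producer task hands each chunk to the consumer through an
# unbuffered channel, so every put! blocks until the matching take!.
function sum_via_channel(chunks)
    ch = Channel{Vector{Float64}}(0)
    producer = @async begin
        for c in chunks
            put!(ch, c)
        end
        close(ch)
    end
    total = 0.0
    for c in ch                 # each iteration is a task handoff
        total += sum(c)
    end
    wait(producer)
    return total
end

# Direct style: same work, no task switch per chunk.
sum_direct(chunks) = sum(sum, chunks)

chunks = [fill(1.0, 4) for _ in 1:3]
ferried = sum_via_channel(chunks)
direct = sum_direct(chunks)
```

With one chunk per call, the ferry pays a full task handoff for a single item, which is the synchronisation wait the profile attributes to `__psynch_cvwait`.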
Add a fast path at the top of writeblock! that, when the call touches
exactly one chunk and `ain` is a plain Array matching the chunk shape and
eltype, encodes `ain` directly via `compress_raw` and calls
`store_writechunk`/`store_deletechunk` synchronously. Fill-value elision is
preserved (delete-if-initialised when encode returns `nothing`).
Restricted to non-Missing element types so the SenMissArray indirection
required by the codec pipeline still goes through the slow path.
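The shape of such a synchronous fast path, including the fill-value elision, might look like the sketch below. Everything here is a toy: `DemoStore`, `demo_encode`, and `demo_write_fast!` are hypothetical, the store is a `Dict`, and the encoder just copies bytes. It is not the package's `writeblock!`.

```julia
# Toy in-memory store keyed by chunk index; hypothetical.
const DemoStore = Dict{NTuple{2,Int},Vector{UInt8}}

# Returns `nothing` when the whole chunk equals the fill value (elision).
function demo_encode(chunk::Array, fill_value)
    fill_value !== nothing && all(isequal(fill_value), chunk) && return nothing
    return collect(reinterpret(UInt8, vec(chunk)))
end

# Returns true if the fast path handled the write, false if the caller
# should fall through to the slow path.
function demo_write_fast!(store::DemoStore, key, chunk::Array, chunkshape, fill_value)
    size(chunk) == chunkshape || return false      # must cover exactly one chunk
    Missing <: eltype(chunk) && return false       # Missing eltypes stay on the slow path
    bytes = demo_encode(chunk, fill_value)
    if bytes === nothing
        delete!(store, key)                        # all-fill chunk: drop any stored copy
    else
        store[key] = bytes                         # synchronous write, no tasks involved
    end
    return true
end

store = DemoStore()
wrote = demo_write_fast!(store, (1, 1), ones(Float32, 2, 2), (2, 2), 0f0)
```

Returning a flag keeps the slow path untouched: ineligible calls simply continue into the existing channel-based code.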
Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, 50 timesteps):
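| size | write before (MiB/s) | write after (MiB/s) | read before (MiB/s) | read after (MiB/s) | write speed-up |
|------------|-----:|-----:|-----:|------:|-----:|
| 128²×16×50 | 1235 | 2228 | 3864 | ~3500 | +80% |
| 256²×32×50 | 1392 | 2160 | 3870 | ~3300 | +55% |
| 512²×32×50 | 1339 | 2240 | 4651 | ~3500 | +67% |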
Read variance in the combined bench is system noise; the isolated
read-only bench shows reads unchanged.
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the writeblock! fast path: for the common case of reading exactly one full chunk into an Array of matching shape and eltype, bypass the readtask + 0-buffered channel and the chunk-shaped scratch buffer.
Read the compressed bytes synchronously via `store_readchunk` and decode directly into `aout`, eliminating both the channel-ferry sync wait (80% of read CPU per profiling) and the per-call scratch allocation + final copy (the source of the 13% GC pressure).
Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, isolated read-only):
| size | before | after | speed-up |
|------------|-----------:|------------:|------:|
| 128²×16×50 | 1528 MiB/s | ~5300 MiB/s | ~3.5× |
| 256²×32×50 | 3128 MiB/s | ~4900 MiB/s | ~1.6× |
| 512²×32×50 | 3248 MiB/s | ~4800 MiB/s | ~1.5× |
Same guards as the write fast path: single chunk touched, `aout isa Array`, non-Missing eltype, eltype and shape match the chunk. Otherwise falls through to the existing channel-based path.
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two A3 fast paths share the same eligibility check (single chunk, plain Array of matching shape and eltype, non-Missing T). Extract that into a helper that returns the chunk index or `nothing`, so each caller's fast-path body shrinks to the I/O calls.
The previous version also re-derived `indranges` to verify full-chunk coverage; that check is redundant. If `size(arr) == chunks` and `length(blockr) == 1`, the slice extent equals the chunk extent within one chunk, which forces it to align with that chunk's range. Otherwise the slice would straddle into a neighbouring chunk and bump `length(blockr)`. The helper's docstring states this so the elision is not mysterious.
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
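The alignment argument can be checked mechanically in one dimension. The sketch below uses hypothetical helpers (`touched`, `aligned`) and a 1-based chunk grid; it is a sanity check of the reasoning, not the helper from the PR.

```julia
# Chunk indices (1-based) touched by index range `r` when chunks have length `c`.
touched(r::UnitRange{Int}, c::Int) = fld1(first(r), c):fld1(last(r), c)

# Aligned: starts on a chunk boundary and spans exactly one chunk.
aligned(r::UnitRange{Int}, c::Int) = mod1(first(r), c) == 1 && length(r) == c

c = 4
for start in 1:20
    r = start:(start + c - 1)              # slice extent equals chunk extent
    single = length(touched(r, c)) == 1    # analogue of length(blockr) == 1
    @assert single == aligned(r, c)        # one chunk touched exactly when aligned
end
```

A chunk-sized slice that starts anywhere other than a chunk boundary necessarily spills into the next chunk, so "one chunk touched" already implies alignment; applying the same argument per dimension gives the N-d case.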
The single-chunk fastpath in writeblock! went through compress_raw →
pipeline_encode, which allocates a chunk-sized Vector{UInt8} and
copyto!'s the bytes in. For the V2-uncompressed-no-filters case the
encoded bytes are just the host-endian bit representation of the input,
so reinterpret(UInt8, ain) is a valid byte source — pass that straight
to the store and skip the allocation + memcpy.
Factor the fastpath body into write_singlechunk_fastpath! so the
specialization can dispatch on metadata type (V2 + NoCompressor +
Nothing filters) without touching the slow path. The slow path
continues to materialize an owned Vector{UInt8} since its scratch
buffer is reused across iterations and aliasing would race the
writetask.
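The aliasing trade-off behind this change can be seen in a few lines of plain Julia. This is a generic illustration with an `IOBuffer` standing in for the store; it does not use the package's store API.

```julia
ain = Float32[1, 2, 3, 4]
byte_view = reinterpret(UInt8, ain)   # lazy view, shares memory with `ain`
owned = collect(byte_view)            # independent Vector{UInt8} copy

io = IOBuffer()                       # stand-in for a store write
write(io, byte_view)                  # bytes are consumed before the call returns
written = take!(io)

ain[1] = 0f0                          # caller mutates the array afterwards
```

After the mutation, `byte_view` reflects the new value while `owned` and `written` keep the old bytes. That is why handing the view to a synchronous write is safe, whereas a buffer that another task reads later needs its own copy, as the slow path keeps doing.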
ZarrBenchmarks v2 uncompressed write throughput (256x256x32x50 chunks,
NVMe on Btrfs):
- sequential: 1045 → 2513 MB/s (+141%), 95-101% of raw write() ceiling
- 8 threads: 1294 → 2444 MB/s (+89%)
- 512^2 chunk-size regression eliminated (the prior allocation
dominated at large chunks).
Tests cover dispatch (which() inspection), round-trip across rank 0-4
and multiple bitstypes, all-fill-value chunk elision, and caller-array
aliasing safety.
Collaborator
There is still a lot of AI slop in between functions; maybe another pass to clean up? Or change them to docstrings (making sure the statements are accurate) if you want to keep them. Other than that, it does the job!
lazarusA
approved these changes
May 30, 2026
Collaborator
Looking forward to the v3 improvements 😄. I will merge now, and if there are issues we fix them later in smaller PRs.
This PR works through @glwagner's list of V2 performance issues and lands the biggest wins as separate commits. V3 is out of scope here (covered separately).
Commits
- 59644e4 `zcompress!` fallback
- 50fbc16 `NoCompressor` `zuncompress!`
- c046338 `getchunkarray_undef` skips dead zero-fill
- ebc3ea1 `writeblock!` single-chunk fast path
- d846986 `readblock!` single-chunk fast path
- 3839b95 `singlechunk_fastpath` guard helper
A1: `zcompress!` fallback (59644e4)
Replace `empty!` + `append!` with `resize!` + `copyto!` in the generic `zcompress!` fallback in `Compressors.jl`. For `NoCompressor`, `zcompress` returns a lazy `reinterpret(UInt8, data)` view that the old path walked element-by-element through `_growend!` / `push!`, about 67% of CPU on uncompressed V2 write profiles. The new path is a single SIMD/memcpy bulk copy. Real compressors (Blosc/Zstd/Zlib) already return a `Vector{UInt8}` and `copyto!` handles them identically.
B1: `NoCompressor` `zuncompress!` (50fbc16)
Mirror image for the decode side. The generic fallback ends up at `copyto!(::Array{T}, ::ReinterpretArray)`, which walks element-by-element at ~17 GB/s on Apple silicon vs. ~85 GB/s for `unsafe_copyto!` on the same memory. Add a `::NoCompressor`-dispatched `zuncompress!` method alongside the existing per-compressor versions (Zstd / Zlib / Blosc already had their own). Guards on `data::Array{T}` and `isbitstype(T)` keep the fast path off `SenMissArray`, `MaxLengthString`, ragged `Vector{T}` eltypes, and non-contiguous inputs.
A2: `getchunkarray_undef` (c046338)
Both `readblock!` and `writeblock!` allocated their chunk-shaped scratch buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`, then immediately overwrote it (decode writes every element on reads; `resetbuffer!` re-initialises on writes). Add `getchunkarray_undef(z::ZArray{T})` returning `Array{T}(undef, chunks)`, used by `readblock!` always and by `writeblock!` when `fill_value !== nothing`. Falls back to the zero-fill when `Missing <: T` (the `SenMissArray` codec path needs the isbits inner buffer; Blosc rejects `Union{Missing,T}` directly). Port of the V2-side hunks from #272 verbatim, including the `Missing <: T` carve-out that PR added in `75bea38` to fix a Blosc rejection regression.
A3-write: `writeblock!` fast path (ebc3ea1)
PProf of A1+B1+A2 showed 93% of write CPU pinned in `__psynch_cvwait`, the readtask/writetask 0-buffered channel ferry sync wait. For the dominant single-chunk full-overwrite case (`z[:,:,:,t] = buf`), add a fast path that bypasses the channels entirely: encode `ain` straight through `compress_raw` and call `store_writechunk` / `store_deletechunk` synchronously. Fill-value elision is preserved. Anything else falls through to the channel-based path. Skips `Missing <: T` so the `SenMissArray` codec contract stays intact.
A3-read: `readblock!` fast path (d846986)
Same idea on the read side. Profile showed 80% read CPU in `__psynch_cvwait` plus 13% GC pressure from the per-call chunk-shaped scratch + final copy. The fast path reads the compressed bytes synchronously via `store_readchunk` and decodes directly into `aout`, eliminating both the channel ferry and the scratch buffer.
Cleanup: `singlechunk_fastpath` helper (3839b95)
Both fast paths share the same eligibility check. Factor it into `singlechunk_fastpath(arr, z, blockr) -> Union{CartesianIndex, Nothing}` so each caller's fast-path body is just the I/O. Also drops the redundant `indranges` coverage check: when `size(arr) == chunks` and `length(blockr) == 1`, the slice must align with that chunk's range (otherwise it would straddle into a neighbouring chunk and bump `length(blockr)`).
Combined effect: master vs HEAD
Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + `chunks=(Nx,Ny,Nz,1)`, 1 warm-up + 7 timed repeats per size, matched A/B (`git checkout 03270b2 -- src/` then back) on the same wall-clock so cache state and power profile are identical:
Writes
Reads
vs Zarrs.jl
For reference, Zarrs.jl (the Rust-backed implementation) on the same workload:
Zarr.jl is now faster than Zarrs.jl on uncompressed V2 across all three sizes, on both writes (1.10–1.96×) and reads (1.30–2.17×).
Tests
2499/2499 pass after each commit. Paths exercised:
- `SenMissArray` / `Missing <: T`: `Fillvalue as missing` and the `getindex/setindex` amiss tests
- `MaxLengthString large-chunk read path`
- Ragged arrays (`Vector{Float64}` eltype)
- […] (`v3_codecs.jl:336`): V3 path is unchanged
Not in this PR
- `BytesCodec` encode/decode bulk-copy and the exact-full-chunk overwrite path are still open; covered by upstream "Performance: bulk-copy uncompressed encode + undef chunk buffer" #272 (V3 portion) and a separate write-up.
- […] `Vector{UInt8}` through `pipeline_encode` (main benefit is for compressed codecs, not `NoCompressor`).
- `DirectoryStore` `mkdir` cache (bigger win for many-small-chunks workloads).
- […] `all(isequal(fill_value), data)`. Confirmed no measurable win on dense bench data (`all` short-circuits at element 1 in nanoseconds) and has real semantic cost: it implements Zarr's fill-value chunk elision, which would change the on-disk shape for sparse arrays and re-writes-with-fill, plus break 3 tests.