Performance improvements (v2 only for now) by asinghvi17 · Pull Request #280 · JuliaIO/Zarr.jl

asinghvi17 · 2026-05-20T17:56:43Z

This PR works through @glwagner's list of V2 performance issues and lands the biggest wins as separate commits. V3 is out of scope here (covered separately).

Commits

#	sha	summary
1	`59644e4`	A1 — bulk-copy `zcompress!` fallback
2	`50fbc16`	B1 — bulk-copy `NoCompressor` `zuncompress!`
3	`c046338`	A2 — `getchunkarray_undef` skips dead zero-fill
4	`ebc3ea1`	A3-write — `writeblock!` single-chunk fast path
5	`d846986`	A3-read — `readblock!` single-chunk fast path
6	`3839b95`	factor `singlechunk_fastpath` guard helper

A1 — `zcompress!` fallback (`59644e4`)

Replace `empty!` + `append!` with `resize!` + `copyto!` in the generic `zcompress!` fallback in `Compressors.jl`. For `NoCompressor`, `zcompress` returns a lazy `reinterpret(UInt8, data)` view, which the old path walked element by element through `_growend!` / `push!`; that accounted for about 67% of CPU time in uncompressed V2 write profiles. The new path is a single SIMD/memcpy bulk copy. Real compressors (Blosc/Zstd/Zlib) already return a `Vector{UInt8}`, and `copyto!` handles them identically.
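The two fallback shapes can be sketched as follows. This is a minimal illustration, not the code in `Compressors.jl`: `fallback_append!` and `fallback_bulk!` are hypothetical names, and `view_bytes` only plays the role of the lazy view that `zcompress` returns for `NoCompressor`.

```julia
# Hypothetical stand-ins for the old and new fallback bodies; not Zarr.jl's code.

# Old shape: empty the output, then append from the lazy byte view.
function fallback_append!(out::Vector{UInt8}, src::AbstractVector{UInt8})
    empty!(out)
    append!(out, src)
    return out
end

# New shape: size the output once, then copy everything in one call.
function fallback_bulk!(out::Vector{UInt8}, src::AbstractVector{UInt8})
    resize!(out, length(src))
    copyto!(out, src)
    return out
end

data = rand(Float32, 128, 128, 16)          # one chunk of isbits data
view_bytes = reinterpret(UInt8, vec(data))  # lazy byte view, no copy yet

a = fallback_append!(UInt8[], view_bytes)
b = fallback_bulk!(UInt8[], view_bytes)
```

Both functions produce the same bytes; the sketch only shows the change in call shape and makes no claim about timing.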

B1 — `NoCompressor` `zuncompress!` (`50fbc16`)

Mirror image for the decode side. The generic fallback ends up at `copyto!(::Array{T}, ::ReinterpretArray)`, which walks element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for `unsafe_copyto!` on the same memory. Add a `::NoCompressor`-dispatched `zuncompress!` method alongside the existing per-compressor versions (Zstd / Zlib / Blosc already had their own). Guards on `data::Array{T}` and `isbitstype(T)` keep the fast path off `SenMissArray`, `MaxLengthString`, ragged `Vector{T}` eltypes, and non-contiguous inputs; those still take the generic `copyto!` path.
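A guarded bulk decode of this kind might look like the sketch below. `decode_uncompressed!` is a hypothetical helper, not the `zuncompress!` method added in the commit, and it assumes the raw bytes are already in host byte order.

```julia
# Hypothetical helper illustrating the guard; not Zarr.jl's zuncompress!.
function decode_uncompressed!(dest::AbstractArray{T}, bytes::Vector{UInt8}) where {T}
    isbitstype(T) || throw(ArgumentError("sketch only handles isbits element types"))
    length(bytes) == sizeof(T) * length(dest) ||
        throw(DimensionMismatch("byte count does not match destination"))
    if dest isa Array
        # Dense destination: one bulk copy of the raw bytes.
        GC.@preserve dest bytes begin
            unsafe_copyto!(convert(Ptr{UInt8}, pointer(dest)), pointer(bytes), length(bytes))
        end
    else
        # Anything else (for example a strided view): element-wise copy through a lazy view.
        copyto!(dest, reinterpret(T, bytes))
    end
    return dest
end

src = rand(Float64, 4, 5)
bytes = collect(reinterpret(UInt8, vec(src)))

dense = Array{Float64}(undef, 4, 5)
decode_uncompressed!(dense, bytes)

backing = zeros(Float64, 4, 10)
strided = view(backing, :, 1:2:9)   # non-contiguous 4×5 destination
decode_uncompressed!(strided, bytes)
```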

A2 — `getchunkarray_undef` (`c046338`)

Both `readblock!` and `writeblock!` allocated their chunk-shaped scratch buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`, then immediately overwrote it (decode writes every element on reads; `resetbuffer!` re-initialises on writes). Add `getchunkarray_undef(z::ZArray{T})` returning `Array{T}(undef, chunks)`, used by `readblock!` always and by `writeblock!` when `fill_value !== nothing`. It falls back to the zero-fill when `Missing <: T` (the `SenMissArray` codec path needs the isbits inner buffer; Blosc rejects `Union{Missing,T}` directly). Port of the V2-side hunks from #272 verbatim, including the `Missing <: T` carve-out that PR added in 75bea38 to fix a Blosc rejection regression.
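The allocation choice can be sketched as below. `init_chunk` and `undef_chunk` are hypothetical stand-ins that take an element type and chunk shape directly; the real `getchunkarray` / `getchunkarray_undef` take a `ZArray`, and the initialised fallback here uses `missing` purely for illustration.

```julia
# Hypothetical stand-ins; not Zarr.jl's getchunkarray / getchunkarray_undef.

# Initialised buffer: every element is written once before any real work.
init_chunk(::Type{T}, chunks::Dims) where {T} = fill(zero(T), chunks)

# Uninitialised buffer for plain element types; the caller must overwrite
# every element before reading any of them.
function undef_chunk(::Type{T}, chunks::Dims) where {T}
    if Missing <: T
        # Carve-out: keep an initialised buffer for Union{Missing,S} eltypes.
        return fill!(Array{T}(undef, chunks), missing)
    end
    return Array{T}(undef, chunks)
end

buf = undef_chunk(Float32, (4, 4, 2))
fill!(buf, 1f0)                      # stands in for the decode that writes every element
m = undef_chunk(Union{Missing,Float32}, (2, 2))
z0 = init_chunk(Float32, (4, 4, 2))
```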

A3-write — `writeblock!` fast path (`ebc3ea1`)

PProf of A1+B1+A2 showed 93% of write CPU pinned in `__psynch_cvwait`, the sync wait of the 0-buffered channel ferry between the readtask and writetask. For the dominant single-chunk full-overwrite case (`z[:,:,:,t] = buf`), add a fast path that bypasses the channels entirely: encode `ain` straight through `compress_raw` and call `store_writechunk` / `store_deletechunk` synchronously. Fill-value elision is preserved. Anything else falls through to the channel-based path. The fast path skips `Missing <: T` so the `SenMissArray` codec contract stays intact.

A3-read — `readblock!` fast path (`d846986`)

Same idea on the read side. Profiling showed 80% of read CPU in `__psynch_cvwait` plus 13% GC pressure from the per-call chunk-shaped scratch buffer and final copy. The fast path reads the compressed bytes synchronously via `store_readchunk` and decodes directly into `aout`, eliminating both the channel ferry and the scratch buffer.
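The control-flow difference behind both A3 fast paths can be illustrated with a toy example. This is a sketch under simplifying assumptions: `sink` stands in for a store, and `write_via_channel!` / `write_direct!` are hypothetical names rather than Zarr.jl functions.

```julia
# Hypothetical illustration of the two control flows; `sink` stands in for a store.

# Channel ferry: a consumer task receives the chunk over an unbuffered channel,
# so every hand-off involves a task switch and a wait.
function write_via_channel!(sink::Vector{Vector{UInt8}}, chunk::Vector{UInt8})
    ch = Channel{Vector{UInt8}}(0)           # 0-buffered: put! blocks until take!
    writer = @async push!(sink, take!(ch))   # consumer task
    put!(ch, chunk)                          # producer hands the chunk over
    wait(writer)
    close(ch)
    return sink
end

# Fast path: the caller performs the write itself, synchronously.
write_direct!(sink::Vector{Vector{UInt8}}, chunk::Vector{UInt8}) = push!(sink, chunk)

s1 = write_via_channel!(Vector{UInt8}[], UInt8[1, 2, 3])
s2 = write_direct!(Vector{UInt8}[], UInt8[1, 2, 3])
```

Both routes store the same bytes; the direct call simply has no task hand-off to wait on.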

Cleanup — `singlechunk_fastpath` helper (`3839b95`)

Both fast paths share the same eligibility check. Factor it into `singlechunk_fastpath(arr, z, blockr) -> Union{CartesianIndex, Nothing}` so each caller's fast-path body is just the I/O. Also drops the redundant `indranges` coverage check: when `size(arr) == chunks` and `length(blockr) == 1`, the slice must align with that chunk's range (otherwise it would straddle into a neighbouring chunk and bump `length(blockr)`).
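A sketch of such a guard, together with the alignment argument along one dimension, is shown below. `singlechunk_sketch` and `chunks_touched` are hypothetical: the sketch takes the element type and chunk shape directly instead of a `ZArray`, and it assumes `blockr` is the `CartesianIndices` of the chunks a call touches.

```julia
# Hypothetical eligibility check modelled on the description; not the Zarr.jl helper.
function singlechunk_sketch(arr, ::Type{T}, chunks::Dims{N},
                            blockr::CartesianIndices{N}) where {T,N}
    Missing <: T && return nothing          # Missing eltypes stay on the slow path
    arr isa Array{T,N} || return nothing    # plain dense array of the chunk's eltype
    size(arr) == chunks || return nothing   # exactly chunk-shaped
    length(blockr) == 1 || return nothing   # touches exactly one chunk
    return first(blockr)                    # index of that chunk
end

# Chunk indices touched along one dimension by a 1-based slice of length `len`
# starting at `start`, for chunk size `c`.
chunks_touched(start, len, c) = (fld(start - 1, c) + 1):(fld(start + len - 2, c) + 1)

hit = singlechunk_sketch(zeros(Float32, 4, 4, 1), Float32, (4, 4, 1),
                         CartesianIndices((2:2, 1:1, 3:3)))
miss_shape = singlechunk_sketch(zeros(Float32, 4, 4, 2), Float32, (4, 4, 1),
                                CartesianIndices((1:1, 1:1, 1:2)))
miss_eltype = singlechunk_sketch(zeros(Float64, 4, 4, 1), Float32, (4, 4, 1),
                                 CartesianIndices((1:1, 1:1, 1:1)))
```

With `len == c`, `chunks_touched` returns a single index only when `start - 1` is a multiple of `c`, which is the alignment property the dropped `indranges` check used to re-verify.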

Combined effect — master vs HEAD

Apple M2 Ultra, Julia 1.12.6, V2 + `NoCompressor` + `chunks=(Nx,Ny,Nz,1)`, 1 warm-up + 7 timed repeats per size, matched A/B (`git checkout 03270b2 -- src/` then back) within the same wall-clock window so cache state and power profile are comparable:

Writes

size	master (`03270b2`)	this PR	speed-up
128×128×16×50 (50 MiB)	710 MB/s	2172 MB/s	3.06×
256×256×32×50 (400 MiB)	740 MB/s	2313 MB/s	3.12×
512×512×32×50 (1.56 GiB)	757 MB/s	2184 MB/s	2.88×

Reads

size	master (`03270b2`)	this PR	speed-up
128×128×16×50 (50 MiB)	3261 MB/s	7061 MB/s	2.16×
256×256×32×50 (400 MiB)	3008 MB/s	6746 MB/s	2.24×
512×512×32×50 (1.56 GiB)	3304 MB/s	8010 MB/s	2.42×

vs Zarrs.jl

For reference, Zarrs.jl (the Rust-backed implementation) on the same workload:

size	this PR write (MB/s)	Zarrs.jl write (MB/s)	this PR read (MB/s)	Zarrs.jl read (MB/s)
50 MiB	2172	1109	7061	5435
400 MiB	2313	1889	6746	5053
1.56 GiB	2184	1991	8010	3697

Zarr.jl is now faster than Zarrs.jl on uncompressed V2 across all three sizes, on both writes (1.10–1.96×) and reads (1.30–2.17×).
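The quoted ranges follow directly from the table; a quick check of the arithmetic, using only the throughput figures reported above:

```julia
# Throughputs in MB/s, copied from the comparison table (50 MiB, 400 MiB, 1.56 GiB).
pr_write    = [2172, 2313, 2184]
zarrs_write = [1109, 1889, 1991]
pr_read     = [7061, 6746, 8010]
zarrs_read  = [5435, 5053, 3697]

write_ratio = round.(pr_write ./ zarrs_write; digits = 2)
read_ratio  = round.(pr_read ./ zarrs_read; digits = 2)

println("write: ", extrema(write_ratio))
println("read:  ", extrema(read_ratio))
```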

Tests

2499/2499 pass after each commit. Paths exercised:

SenMissArray / Missing <: T — Fillvalue as missing and the getindex/setindex amiss tests
MaxLengthString large-chunk read path
Ragged arrays (Vector{Float64} eltype)
Python round-trip via PythonCall
V3 fill-value elision (v3_codecs.jl:336) — V3 path is unchanged

Not in this PR

V3 path. BytesCodec encode/decode bulk-copy and the exact-full-chunk overwrite path are still open; covered by upstream Performance: bulk-copy uncompressed encode + undef chunk buffer #272 (V3 portion) and a separate write-up.
A5 — thread a scratch Vector{UInt8} through pipeline_encode (main benefit is for compressed codecs, not NoCompressor).
A6 — DirectoryStore mkdir cache (bigger win for many-small-chunks workloads).
Dropping `all(isequal(fill_value), data)`. Confirmed no measurable win on dense bench data (`all` short-circuits at element 1 in nanoseconds), and it has a real semantic cost: the check implements Zarr's fill-value chunk elision, so dropping it would change the on-disk shape for sparse arrays and re-writes-with-fill, and break 3 tests.
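The elision check in the last item can be sketched as follows. `is_all_fill` is a hypothetical name; only the `all(isequal(fill_value), data)` expression comes from the PR text.

```julia
# Hypothetical wrapper around the check named above; not Zarr.jl's code.
# A chunk made up entirely of the fill value does not need to be stored.
is_all_fill(data, fill_value) =
    fill_value !== nothing && all(isequal(fill_value), data)

dense  = rand(Float32, 8, 8) .+ 1f0   # every element is at least 1, so none equals 0
sparse = zeros(Float32, 8, 8)         # every element equals the fill value

dense_elided  = is_all_fill(dense, 0f0)      # `all` stops at the first element
sparse_elided = is_all_fill(sparse, 0f0)     # has to scan the whole chunk
no_fill       = is_all_fill(sparse, nothing)
```

On dense data the scan ends at the first non-fill element, which is why removing the check buys nothing in the benchmark.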

`zcompress(data, NoCompressor())` returns a lazy `reinterpret(UInt8, data)` view. The old `empty!` + `append!` fallback walked that view element by element through `_growend!` / `push!`, materialising bytes one at a time; this was roughly 67% of CPU time on uncompressed full-chunk V2 writes in profiling.

Replace with `resize!` + `copyto!` so the same path issues a single SIMD / memcpy bulk copy. Real compressors (Blosc, Zlib, Zstd) are unaffected: they already return a freshly-allocated `Vector{UInt8}` and `copyto!` handles that case identically.

Bytes-on-disk are bit-identical; full test suite (2499 tests) passes.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1):

| size | baseline | patched | speed-up |
|---------------------|---------:|---------:|---------:|
| 128×128×16×50 50M | 702 MB/s | 995 MB/s | 1.42× |
| 256×256×32×50 400M | 799 MB/s | 1530 MB/s | 1.92× |
| 512×512×32×50 1.5G | 776 MB/s | 1104 MB/s | 1.42× |

Reads are unchanged (this fallback is not on the read path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coveralls · 2026-05-20T18:12:13Z

Coverage Report for CI Build 26687704356

Coverage decreased (-0.005%) to 89.477%

Details

Coverage decreased (-0.005%) from the base build.
Patch coverage: 42 of 42 lines across 2 files are fully covered (100%).
4 coverage regressions across 3 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

4 previously-covered lines in 3 files lost coverage.

File	Lines Losing Coverage	Coverage
src/metadata.jl	2	91.75%
ext/s3store.jl	1	92.65%
src/ZArray.jl	1	94.68%

Coverage Stats


Relevant Lines:	1834
Covered Lines:	1641
Line Coverage:	89.48%
Coverage Strength:	18922.87 hits per line

💛 - Coveralls

The generic `zuncompress!` fallback at the top of `Compressors.jl` ends up at `copyto!(::Array{T}, ::ReinterpretArray)` when `c isa NoCompressor`, since `zuncompress(bytes, ::NoCompressor, T)` returns a lazy `reinterpret(T, bytes)` view. That `copyto!` walks element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for `unsafe_copyto!` on the same memory, a 5× gap. End-to-end V2 reads absorb most of it via DiskArrays slicing, channel ferry, and storage I/O, leaving ~25-99% throughput improvement depending on chunk size.

Add a `::NoCompressor`-dispatched `zuncompress!` method alongside the existing per-compressor versions for Zstd / Zlib / Blosc. Mirror image of the encode-side bulk copy that landed in the post-A1 `zcompress!`.

Guards on `data::Array{T}` and `isbitstype(T)` keep the fast path off `SenMissArray`, `MaxLengthString`, ragged `Vector{T}` eltypes, and any non-contiguous input; those fall back to the existing generic `copyto!` path.

Tests: 2499/2499 pass.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash on the same wall-clock:

| size | A1 only read | A1 + B1 read | speed-up |
|---------------------|-------------:|-------------:|---------:|
| 128×128×16×50 50M | 3080 MB/s | 3884 MB/s | +26% |
| 256×256×32×50 400M | 3155 MB/s | 4130 MB/s | +31% |
| 512×512×32×50 1.5G | 2243 MB/s | 4455 MB/s | +99% |

Writes are unchanged (B1 doesn't touch the encode path).

Reference: Zarrs.jl on the same V2-uncompressed workload reads at 5166 / 4672 / 5418 MB/s, so B1 closes most of the read gap; the largest (512×512×32×50) case goes from ~58% of Zarrs.jl to ~82%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`, which zero-fills the chunk before any other work touches it. On the read side that fill is always clobbered: `pipeline_decode!` writes every element, or the fill-value branch in `uncompress_raw!` calls `fill!` itself. On the write side it's clobbered for full-chunk overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and re-initialised by `resetbuffer!` for partial-chunk RMW.

Add `getchunkarray_undef(z::ZArray{T})` which returns `Array{T}(undef, …)` for plain isbits eltypes and falls back to `getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc and friends reject `Union{Missing,T}` directly, so the inner buffer has to be a real `Array{T}`).

- `readblock!`: always use `getchunkarray_undef`. Decode-into and fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when `z.metadata.fill_value !== nothing` (`resetbuffer!` handles init). Keep the legacy zero-fill when `fill_value === nothing` so partial writes to a fresh chunk still default un-written cells to zero.

Port of the V2-side hunks from #272 (verbatim, with the same `Missing <: T` carve-out the PR added in commit 75bea38 to fix a Blosc rejection regression).

Tests: 2499/2499 pass, including the SenMissArray-exercising `Fillvalue as missing`, `getindex/setindex` (amiss), `MaxLengthString large-chunk read path`, and `ragged arrays` tests.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:

| size | B1 only (R) | B1+A2 (R) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 3466 MB/s | 3866 MB/s | +12% |
| 256×256×32×50 400M | 3589 MB/s | 4169 MB/s | +16% |
| 512×512×32×50 1.5G | ~3000 MB/s | ~3500 MB/s | noisy |

Write side smaller (most write time is storage-bound, not memset):

| size | B1 only (W) | B1+A2 (W) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 1148 MB/s | 1203 MB/s | +5% |
| 256×256×32×50 400M | 1327 MB/s | 1354 MB/s | +2% |
| 512×512×32×50 1.5G | 1132 MB/s | 1329 MB/s | +17% |

The 1.5 GiB row is the noisiest across runs because chunks no longer fit in L3; exact numbers vary with cache state at run start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One bullet per commit on the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The multi-paragraph rationale comments are now stale post-landing; one line each is enough to flag why bulk copy beats append! / generic ReinterpretArray copy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The slow `writeblock!` path spawns a readtask and a writetask connected by 0-buffered channels for every call. Profiling the 400 MiB headline workload showed 93% of write CPU pinned in `__psynch_cvwait` (the channel ferry synchronisation cost) for the dominant case of writing one full chunk at a time (`z[:,:,:,t] = buf`).

Add a fast path at the top of `writeblock!` that, when the call touches exactly one chunk and `ain` is a plain Array matching the chunk shape and eltype, encodes `ain` directly via `compress_raw` and calls `store_writechunk`/`store_deletechunk` synchronously. Fill-value elision is preserved (delete-if-initialised when encode returns `nothing`). Restricted to non-Missing element types so the SenMissArray indirection required by the codec pipeline still goes through the slow path.

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, 50 timesteps):

| size | write before (MiB/s) | write after (MiB/s) | read before (MiB/s) | read after (MiB/s) | write speed-up |
|-------------|-----:|-----:|-----:|------:|-----:|
| 128²×16×50 | 1235 | 2228 | 3864 | ~3500 | +80% |
| 256²×32×50 | 1392 | 2160 | 3870 | ~3300 | +55% |
| 512²×32×50 | 1339 | 2240 | 4651 | ~3500 | +67% |

Read variance in the combined bench is system noise; the isolated read-only bench shows reads unchanged.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the `writeblock!` fast path: for the common case of reading exactly one full chunk into an Array of matching shape and eltype, bypass the readtask + 0-buffered channel and the chunk-shaped scratch buffer. Read the compressed bytes synchronously via `store_readchunk` and decode directly into `aout`, eliminating both the channel-ferry sync wait (80% of read CPU per profiling) and the per-call scratch allocation + final copy (the 13% GC pressure source).

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, isolated read-only):

| size | before | after | speedup |
|-------------|-----------:|------------:|------:|
| 128²×16×50 | 1528 MiB/s | ~5300 MiB/s | ~3.5x |
| 256²×32×50 | 3128 MiB/s | ~4900 MiB/s | ~1.6x |
| 512²×32×50 | 3248 MiB/s | ~4800 MiB/s | ~1.5x |

Same guards as the write fast path: single chunk touched, `aout isa Array`, non-Missing eltype, eltype and shape match the chunk. Otherwise falls through to the existing channel-based path.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two A3 fast paths share the same eligibility check (single chunk, plain Array of matching shape and eltype, non-Missing T). Extract that into a helper that returns the chunk index or `nothing`, so each caller's fast-path body shrinks to the I/O calls.

The previous version also re-derived `indranges` to verify full-chunk coverage; that check is redundant. If `size(arr) == chunks` and `length(blockr) == 1`, the slice extent equals the chunk extent within one chunk, which forces it to align with that chunk's range; otherwise the slice would straddle into a neighbouring chunk and bump `length(blockr)`. The helper's docstring states this so the elision is not mysterious.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The single-chunk fastpath in `writeblock!` went through `compress_raw` → `pipeline_encode`, which allocates a chunk-sized `Vector{UInt8}` and `copyto!`'s the bytes in. For the V2-uncompressed-no-filters case the encoded bytes are just the host-endian bit representation of the input, so `reinterpret(UInt8, ain)` is a valid byte source; pass that straight to the store and skip the allocation + memcpy.

Factor the fastpath body into `write_singlechunk_fastpath!` so the specialization can dispatch on metadata type (V2 + NoCompressor + Nothing filters) without touching the slow path. The slow path continues to materialize an owned `Vector{UInt8}` since its scratch buffer is reused across iterations and aliasing would race the writetask.

ZarrBenchmarks v2 uncompressed write throughput (256x256x32x50 chunks, NVMe on Btrfs):

- sequential: 1045 → 2513 MB/s (+141%), 95-101% of raw write() ceiling
- 8 threads: 1294 → 2444 MB/s (+89%)
- 512^2 chunk-size regression eliminated (the prior allocation dominated at large chunks).

Tests cover dispatch (`which()` inspection), round-trip across rank 0-4 and multiple bitstypes, all-fill-value chunk elision, and caller-array aliasing safety.


Per @lazarusA's PR review: collapse the five repeated [#280] entries into one bullet with the sub-items as a list, matching the v0.10.0 #241 formatting. Adds the zero-copy NoCompressor fastpath as a new sub-item.

lazarusA · 2026-05-29T10:46:06Z

There is still a lot of AI slop in between functions; maybe another pass to clean it up? Or change the comments to docstrings (making sure the statements are accurate) if you want to keep them. Other than that, it does the job!

lazarusA · 2026-05-30T15:48:06Z

Looking forward to the v3 improvements 😄. I will merge now, and if there are issues we can fix them later in smaller PRs.

asinghvi17 and others added 4 commits May 20, 2026 14:50

add CHANGELOG entries for #280 (V2 perf)

e9d9004


asinghvi17 force-pushed the as/performance branch from 6b4551a to 2e4d09b Compare May 20, 2026 19:56

lazarusA reviewed May 20, 2026 (comment thread on CHANGELOG.md, now outdated)

asinghvi17 and others added 5 commits May 20, 2026 17:21

consolidate #280 CHANGELOG entries

031bf40


asinghvi17 marked this pull request as ready for review May 28, 2026 10:13

lazarusA changed the title ~~[WIP] Performance improvements (v2 only for now)~~ Performance improvements (v2 only for now) May 29, 2026

Merge branch 'master' into as/performance

f129682

lazarusA approved these changes May 30, 2026


lazarusA merged commit d2a3f79 into master May 30, 2026
19 checks passed

asinghvi17 mentioned this pull request May 30, 2026 in "Performance: bulk-copy uncompressed encode + undef chunk buffer" #272 (draft, 5 tasks)

lazarusA merged 11 commits into master from as/performance
