Performance improvements (v2 only for now) #280

Merged
lazarusA merged 11 commits into master from as/performance on May 30, 2026
@asinghvi17 asinghvi17 commented May 20, 2026

Member

This PR works through @glwagner's list of V2 performance issues and lands the biggest wins as separate commits. V3 is out of scope here (covered separately).

Commits

| # | sha     | summary                                        |
|--:|---------|------------------------------------------------|
| 1 | 59644e4 | A1: bulk-copy zcompress! fallback              |
| 2 | 50fbc16 | B1: bulk-copy NoCompressor zuncompress!        |
| 3 | c046338 | A2: getchunkarray_undef skips dead zero-fill   |
| 4 | ebc3ea1 | A3-write: writeblock! single-chunk fast path   |
| 5 | d846986 | A3-read: readblock! single-chunk fast path     |
| 6 | 3839b95 | factor singlechunk_fastpath guard helper       |

A1 — zcompress! fallback (59644e4)

Replace empty! + append! with resize! + copyto! in the generic zcompress! fallback in Compressors.jl. For NoCompressor, zcompress returns a lazy reinterpret(UInt8, data) view that the old path walked element-by-element through _growend! / push! — about 67% of CPU on uncompressed V2 write profiles. The new path is a single SIMD/memcpy bulk copy. Real compressors (Blosc/Zstd/Zlib) already return a Vector{UInt8} and copyto! handles them identically.
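
The difference between the two strategies can be sketched in isolation. This is a minimal illustration, not the code from Compressors.jl: `fill_by_append!` and `fill_by_copy!` are hypothetical names, and the input is assumed to be a plain `Array` of an isbits element type.

```julia
# Hypothetical helpers that contrast the old and new fallback strategies.

# Old strategy: clear the output, then append from the lazy byte view.
function fill_by_append!(out::Vector{UInt8}, data::Array)
    bytes = reinterpret(UInt8, data)   # lazy view over `data`, no copy yet
    empty!(out)
    append!(out, bytes)
    return out
end

# New strategy: size the output once, then copy the bytes in bulk.
function fill_by_copy!(out::Vector{UInt8}, data::Array)
    bytes = reinterpret(UInt8, data)
    resize!(out, length(bytes))
    copyto!(out, bytes)
    return out
end

data = rand(Float32, 64, 64)
a = fill_by_append!(UInt8[], data)
b = fill_by_copy!(UInt8[], data)
```

Both functions should produce the same bytes; only the way the output vector is grown and filled differs.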

B1 — NoCompressor zuncompress! (50fbc16)

Mirror image for the decode side. The generic fallback ends up at copyto!(::Array{T}, ::ReinterpretArray), which walks element-by-element at ~17 GB/s on Apple silicon vs. ~85 GB/s for unsafe_copyto! on the same memory. Add a ::NoCompressor-dispatched zuncompress! method alongside the existing per-compressor versions (Zstd / Zlib / Blosc already had their own). Guards on data::Array{T} and isbitstype(T) keep the fast path off SenMissArray, MaxLengthString, ragged Vector{T} eltypes, and non-contiguous inputs.
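
A rough sketch of the guarded decode path follows. `RawCodec`, `decode_generic!` and `decode_bulk!` are hypothetical stand-ins, not the Zarr.jl definitions; the sketch only assumes that the uncompressed bytes are the host-endian representation of the output array.

```julia
# Hypothetical stand-in for an uncompressed codec type.
struct RawCodec end

# Generic path: copy through a lazy reinterpret view.
function decode_generic!(out::AbstractArray{T}, bytes::Vector{UInt8}, ::RawCodec) where {T}
    copyto!(out, reinterpret(T, bytes))
    return out
end

# Fast path: one bulk byte copy, taken only for contiguous isbits storage.
function decode_bulk!(out::Array{T}, bytes::Vector{UInt8}, c::RawCodec) where {T}
    isbitstype(T) || return decode_generic!(out, bytes, c)
    sizeof(out) == length(bytes) ||
        throw(DimensionMismatch("byte count does not match the output buffer"))
    GC.@preserve out bytes begin
        dest = convert(Ptr{UInt8}, pointer(out))
        unsafe_copyto!(dest, pointer(bytes), length(bytes))
    end
    return out
end

src = rand(Float64, 8, 4)
bytes = collect(reinterpret(UInt8, vec(src)))
o1 = decode_generic!(Array{Float64}(undef, 8, 4), bytes, RawCodec())
o2 = decode_bulk!(Array{Float64}(undef, 8, 4), bytes, RawCodec())
```

Restricting the method signature to `out::Array{T}` is what keeps wrapper types and non-contiguous inputs on the generic path in this sketch.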

A2 — getchunkarray_undef (c046338)

Both readblock! and writeblock! allocated their chunk-shaped scratch buffer via getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks), then immediately overwrote it (decode writes every element on reads; resetbuffer! re-initialises on writes). Add getchunkarray_undef(z::ZArray{T}) returning Array{T}(undef, chunks), used by readblock! always and by writeblock! when fill_value !== nothing. Falls back to the zero-fill when Missing <: T (the SenMissArray codec path needs the isbits inner buffer; Blosc rejects Union{Missing,T} directly). Port of the V2-side hunks from #272 verbatim, including the Missing <: T carve-out that PR added in 75bea38 to fix a Blosc rejection regression.
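
The allocation choice can be sketched as below. The helpers take an element type and a chunk shape rather than a ZArray, and their names are hypothetical. The `Missing <: T` branch here simply keeps an initialised buffer and does not reproduce the SenMissArray handling.

```julia
# Hypothetical helpers; the functions described above take a ZArray instead.
chunkarray_zeroed(::Type{T}, chunks::Dims) where {T} = fill(zero(T), chunks)

function chunkarray_undef(::Type{T}, chunks::Dims) where {T}
    if Missing <: T
        # Missing-capable eltypes keep an initialised buffer in this sketch.
        return fill!(Array{T}(undef, chunks), zero(nonmissingtype(T)))
    end
    # Plain eltypes skip the fill; the caller must overwrite every element.
    return Array{T}(undef, chunks)
end

buf = chunkarray_undef(Float32, (4, 4, 2))
fill!(buf, 1f0)                      # stands in for the decode that overwrites the buffer
m = chunkarray_undef(Union{Missing,Float64}, (3, 2))
```

The uninitialised buffer is only safe because the next step writes every element, which is the invariant the paragraph above relies on.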

A3-write — writeblock! fast path (ebc3ea1)

PProf of A1+B1+A2 showed 93% of write CPU pinned in __psynch_cvwait — the readtask/writetask 0-buffered channel ferry sync wait. For the dominant single-chunk full-overwrite case (z[:,:,:,t] = buf), add a fast path that bypasses the channels entirely: encode ain straight through compress_raw and call store_writechunk / store_deletechunk synchronously. Fill-value elision is preserved. Anything else falls through to the channel-based path. Skips Missing <: T so the SenMissArray codec contract stays intact.
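
As a rough illustration of a synchronous single-chunk write with fill-value elision, consider the sketch below. The `Dict` store, the chunk key and `write_chunk_sync!` are assumptions made for the example; the PR itself goes through `compress_raw`, `store_writechunk` and `store_deletechunk`.

```julia
# Hypothetical synchronous write into an in-memory store keyed by chunk name.
function write_chunk_sync!(store::Dict{String,Vector{UInt8}}, key::String,
                           chunk::Array{T}, fill_value) where {T}
    if fill_value !== nothing && all(isequal(fill_value), chunk)
        delete!(store, key)                      # elide chunks holding only the fill value
        return nothing
    end
    bytes = Vector{UInt8}(undef, sizeof(chunk))
    copyto!(bytes, reinterpret(UInt8, chunk))    # uncompressed encode as one bulk copy
    store[key] = bytes
    return nothing
end

store = Dict{String,Vector{UInt8}}()
chunk = Float32[1 2; 3 4]
write_chunk_sync!(store, "0.0", chunk, 0f0)
stored = copy(store["0.0"])
write_chunk_sync!(store, "0.0", zeros(Float32, 2, 2), 0f0)   # all fill: chunk is removed
```

No task or channel is involved: the caller encodes and stores in the same call, which is the property the fast path depends on.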

A3-read — readblock! fast path (d846986)

Same idea on the read side. Profile showed 80% read CPU in __psynch_cvwait plus 13% GC pressure from the per-call chunk-shaped scratch + final copy. The fast path reads the compressed bytes synchronously via store_readchunk and decodes directly into aout, eliminating both the channel ferry and the scratch buffer.

Cleanup — singlechunk_fastpath helper (3839b95)

Both fast paths share the same eligibility check. Factor it into singlechunk_fastpath(arr, z, blockr) -> Union{CartesianIndex, Nothing} so each caller's fast-path body is just the I/O. Also drops the redundant indranges coverage check — when size(arr) == chunks and length(blockr) == 1, the slice must align with that chunk's range (otherwise it would straddle into a neighbouring chunk and bump length(blockr)).
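
The alignment argument can be checked with a small sketch. `singlechunk_index` is a hypothetical function whose arguments differ from the PR's helper: it takes the chunk shape and the requested index ranges directly, and it omits the eltype comparison against the array metadata.

```julia
# Hypothetical eligibility check: returns the chunk index, or `nothing`.
function singlechunk_index(arr, chunks::Dims{N},
                           ranges::NTuple{N,UnitRange{Int}}) where {N}
    arr isa Array || return nothing
    Missing <: eltype(arr) && return nothing
    size(arr) == chunks || return nothing
    map(length, ranges) == chunks || return nothing
    # 1-based chunk index of the first and last requested element per dimension.
    lo = map((r, c) -> fld1(first(r), c), ranges, chunks)
    hi = map((r, c) -> fld1(last(r), c), ranges, chunks)
    lo == hi || return nothing       # the request straddles a chunk boundary
    return CartesianIndex(lo)
end

chunks = (128, 128, 16, 1)
buf = zeros(Float32, chunks)
aligned = singlechunk_index(buf, chunks, (1:128, 1:128, 1:16, 7:7))
shifted = singlechunk_index(buf, chunks, (65:192, 1:128, 1:16, 7:7))
```

A chunk-sized request that starts at element 65 necessarily ends in the next chunk, so the single-chunk test alone rejects it and no separate coverage check is needed.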

Combined effect — master vs HEAD

Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats per size. The A/B runs were matched (git checkout 03270b2 -- src/ then back) and run back-to-back in the same session, so cache state and power profile are comparable:

Writes

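| size                     | master (03270b2) | this PR   | speed-up |
|--------------------------|-----------------:|----------:|---------:|
| 128×128×16×50 (50 MiB)   | 710 MB/s         | 2172 MB/s | 3.06×    |
| 256×256×32×50 (400 MiB)  | 740 MB/s         | 2313 MB/s | 3.12×    |
| 512×512×32×50 (1.56 GiB) | 757 MB/s         | 2184 MB/s | 2.88×    |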

Reads

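| size                     | master (03270b2) | this PR   | speed-up |
|--------------------------|-----------------:|----------:|---------:|
| 128×128×16×50 (50 MiB)   | 3261 MB/s        | 7061 MB/s | 2.16×    |
| 256×256×32×50 (400 MiB)  | 3008 MB/s        | 6746 MB/s | 2.24×    |
| 512×512×32×50 (1.56 GiB) | 3304 MB/s        | 8010 MB/s | 2.42×    |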

vs Zarrs.jl

For reference, Zarrs.jl (the Rust-backed implementation) on the same workload:

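| size     | this PR write (MB/s) | Zarrs write (MB/s) | this PR read (MB/s) | Zarrs read (MB/s) |
|----------|---------------------:|-------------------:|--------------------:|------------------:|
| 50 MiB   | 2172                 | 1109               | 7061                | 5435              |
| 400 MiB  | 2313                 | 1889               | 6746                | 5053              |
| 1.56 GiB | 2184                 | 1991               | 8010                | 3697              |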

Zarr.jl is now faster than Zarrs.jl on uncompressed V2 across all three sizes, on both writes (1.10–1.96×) and reads (1.30–2.17×).
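
The quoted ranges follow directly from the table; a short calculation under the assumption that all four columns are in MB/s reproduces them. The variable names are illustrative only.

```julia
# Throughput figures copied from the comparison table (MB/s).
zarr_write  = [2172, 2313, 2184]
zarrs_write = [1109, 1889, 1991]
zarr_read   = [7061, 6746, 8010]
zarrs_read  = [5435, 5053, 3697]

# Ratio of Zarr.jl (this PR) to Zarrs.jl throughput per size.
write_ratio = round.(zarr_write ./ zarrs_write; digits = 2)
read_ratio  = round.(zarr_read ./ zarrs_read; digits = 2)

println("write: ", extrema(write_ratio))
println("read:  ", extrema(read_ratio))
```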

Tests

2499/2499 pass after each commit. Paths exercised:

  • SenMissArray / Missing <: T (the Fillvalue as missing and getindex/setindex amiss tests)
  • MaxLengthString large-chunk read path
  • Ragged arrays (Vector{Float64} eltype)
  • Python round-trip via PythonCall
  • V3 fill-value elision (v3_codecs.jl:336) — V3 path is unchanged

Not in this PR

  • V3 path. BytesCodec encode/decode bulk-copy and the exact-full-chunk overwrite path are still open; covered by upstream #272 ("Performance: bulk-copy uncompressed encode + undef chunk buffer", V3 portion) and a separate write-up.
  • A5 — thread a scratch Vector{UInt8} through pipeline_encode (main benefit is for compressed codecs, not NoCompressor).
  • A6: DirectoryStore mkdir cache (bigger win for many-small-chunks workloads).
  • Dropping all(isequal(fill_value), data). Confirmed no measurable win on dense bench data (all short-circuits at element 1 in nanoseconds) and has real semantic cost: it implements Zarr's fill-value chunk elision, which would change the on-disk shape for sparse arrays and re-writes-with-fill, plus break 3 tests.

`zcompress(data, NoCompressor())` returns a lazy `reinterpret(UInt8, data)`
view. The old `empty!` + `append!` fallback walked that view element by
element through `_growend!` / `push!`, materialising bytes one at a time —
roughly 67% of CPU time on uncompressed full-chunk V2 writes in profiling.

Replace with `resize!` + `copyto!` so the same path issues a single SIMD /
memcpy bulk copy. Real compressors (Blosc, Zlib, Zstd) are unaffected: they
already return a freshly-allocated `Vector{UInt8}` and `copyto!` handles
that case identically.

Bytes-on-disk are bit-identical; full test suite (2499 tests) passes.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1):

  | size                | baseline | patched  | speed-up |
  |---------------------|---------:|---------:|---------:|
  |  128×128×16×50  50M | 702 MB/s | 995 MB/s | 1.42×    |
  |  256×256×32×50 400M | 799 MB/s |1530 MB/s | 1.92×    |
  |  512×512×32×50 1.5G | 776 MB/s |1104 MB/s | 1.42×    |

Reads are unchanged (this fallback is not on the read path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coveralls commented May 20, 2026


Coverage Report for CI Build 26687704356

Coverage decreased (-0.005%) to 89.477%

Details

  • Coverage decreased (-0.005%) from the base build.
  • Patch coverage: 42 of 42 lines across 2 files are fully covered (100%).
  • 4 coverage regressions across 3 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

4 previously-covered lines in 3 files lost coverage.

| File            | Lines Losing Coverage | Coverage |
|-----------------|----------------------:|---------:|
| src/metadata.jl | 2                     | 91.75%   |
| ext/s3store.jl  | 1                     | 92.65%   |
| src/ZArray.jl   | 1                     | 94.68%   |

Coverage Stats

Coverage Status
Relevant Lines: 1834
Covered Lines: 1641
Line Coverage: 89.48%
Coverage Strength: 18922.87 hits per line

💛 - Coveralls

asinghvi17 and others added 4 commits May 20, 2026 14:50
The generic `zuncompress!` fallback at the top of `Compressors.jl`
ends up at `copyto!(::Array{T}, ::ReinterpretArray)` when
`c isa NoCompressor`, since `zuncompress(bytes, ::NoCompressor, T)`
returns a lazy `reinterpret(T, bytes)` view. That `copyto!` walks
element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for
`unsafe_copyto!` on the same memory — a 5× gap. End-to-end V2 reads
absorb most of it via DiskArrays slicing, channel ferry, and storage
I/O, leaving ~25-99% throughput improvement depending on chunk size.

Add a `::NoCompressor`-dispatched `zuncompress!` method alongside
the existing per-compressor versions for Zstd / Zlib / Blosc. Mirror
image of the encode-side bulk copy that landed in the post-A1
`zcompress!`. Guards on `data::Array{T}` and `isbitstype(T)` keep
the fast path off `SenMissArray`, `MaxLengthString`, ragged
`Vector{T}` eltypes, and any non-contiguous input — those fall back
to the existing generic `copyto!` path.

Tests: 2499/2499 pass.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash
on the same wall-clock:

  | size                | A1 only read | A1 + B1 read | speed-up |
  |---------------------|-------------:|-------------:|---------:|
  |  128×128×16×50  50M |  3080 MB/s   |  3884 MB/s   | +26%     |
  |  256×256×32×50 400M |  3155 MB/s   |  4130 MB/s   | +31%     |
  |  512×512×32×50 1.5G |  2243 MB/s   |  4455 MB/s   | +99%     |

Writes are unchanged (B1 doesn't touch the encode path).

Reference: Zarrs.jl on the same V2-uncompressed workload reads at
5166 / 4672 / 5418 MB/s, so B1 closes most of the read gap; the
1.56 GiB case goes from ~41% of Zarrs.jl (2243 / 5418) to ~82% (4455 / 5418).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch
buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`,
which zero-fills the chunk before any other work touches it. On the
read side that fill is always clobbered: `pipeline_decode!` writes
every element, or the fill-value branch in `uncompress_raw!` calls
`fill!` itself. On the write side it's clobbered for full-chunk
overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and
re-initialised by `resetbuffer!` for partial-chunk RMW.

Add `getchunkarray_undef(z::ZArray{T})` which returns
`Array{T}(undef, …)` for plain isbits eltypes and falls back to
`getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc
and friends reject `Union{Missing,T}` directly, so the inner buffer
has to be a real `Array{T}`).

- `readblock!`: always use `getchunkarray_undef`. Decode-into and
  fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when
  `z.metadata.fill_value !== nothing` (resetbuffer! handles init).
  Keep the legacy zero-fill when `fill_value === nothing` so partial
  writes to a fresh chunk still default un-written cells to zero.

Port of the V2-side hunks from #272 (verbatim, with
the same `Missing <: T` carve-out the PR added in commit 75bea38 to
fix a Blosc rejection regression).

Tests: 2499/2499 pass — including the SenMissArray-exercising
`Fillvalue as missing`, `getindex/setindex` (amiss),
`MaxLengthString large-chunk read path`, and `ragged arrays` tests.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:

  | size                | B1 only (R) | B1+A2 (R)  | speed-up |
  |---------------------|------------:|-----------:|---------:|
  |  128×128×16×50  50M |  3466 MB/s  | 3866 MB/s  | +12%     |
  |  256×256×32×50 400M |  3589 MB/s  | 4169 MB/s  | +16%     |
  |  512×512×32×50 1.5G |  ~3000 MB/s | ~3500 MB/s | noisy    |

Write side smaller (most write time is storage-bound, not memset):

  | size                | B1 only (W) | B1+A2 (W)  | speed-up |
  |---------------------|------------:|-----------:|---------:|
  |  128×128×16×50  50M |  1148 MB/s  | 1203 MB/s  | +5%      |
  |  256×256×32×50 400M |  1327 MB/s  | 1354 MB/s  | +2%      |
  |  512×512×32×50 1.5G |  1132 MB/s  | 1329 MB/s  | +17%     |

The 1.5 GiB row is the noisiest across runs because chunks no longer
fit in L3 — exact numbers vary with cache state at run start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One bullet per commit on the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The multi-paragraph rationale comments are now stale post-landing; one line each is enough to flag why bulk copy beats append! / generic ReinterpretArray copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
asinghvi17 and others added 5 commits May 20, 2026 17:21
The slow writeblock! path spawns a readtask and a writetask connected by
0-buffered channels for every call. Profiling the 400 MiB headline workload
showed 93% of write CPU pinned in __psynch_cvwait — the channel ferry
synchronisation cost — for the dominant case of writing one full chunk at
a time (`z[:,:,:,t] = buf`).

Add a fast path at the top of writeblock! that, when the call touches
exactly one chunk and `ain` is a plain Array matching the chunk shape and
eltype, encodes `ain` directly via `compress_raw` and calls
`store_writechunk`/`store_deletechunk` synchronously. Fill-value elision is
preserved (delete-if-initialised when encode returns `nothing`).

Restricted to non-Missing element types so the SenMissArray indirection
required by the codec pipeline still goes through the slow path.

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, 50 timesteps):

  size           write MiB/s        read MiB/s
                 before  after      before  after
  128²×16×50      1235    2228       3864    ~3500   (write +80%)
  256²×32×50      1392    2160       3870    ~3300   (write +55%)
  512²×32×50      1339    2240       4651    ~3500   (write +67%)

Read variance in the combined bench is system noise; the isolated
read-only bench shows reads unchanged.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the writeblock! fast path: for the common case of reading exactly
one full chunk into an Array of matching shape and eltype, bypass the
readtask + 0-buffered channel and the chunk-shaped scratch buffer. Read
the compressed bytes synchronously via `store_readchunk` and decode
directly into `aout`, eliminating both the channel-ferry sync wait (80%
of read CPU per profiling) and the per-call scratch allocation + final
copy (the 13% GC pressure source).

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, isolated read-only):

  size           before      after       speedup
  128²×16×50     1528 MiB/s  ~5300 MiB/s  ~3.5x
  256²×32×50     3128 MiB/s  ~4900 MiB/s  ~1.6x
  512²×32×50     3248 MiB/s  ~4800 MiB/s  ~1.5x

Same guards as the write fast path: single chunk touched, `aout isa Array`,
non-Missing eltype, eltype and shape match the chunk. Otherwise falls
through to the existing channel-based path.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two A3 fast paths share the same eligibility check (single chunk,
plain Array of matching shape and eltype, non-Missing T). Extract that
into a helper that returns the chunk index or `nothing`, so each
caller's fast-path body shrinks to the I/O calls.

The previous version also re-derived `indranges` to verify full-chunk
coverage; that check is redundant. If `size(arr) == chunks` and
`length(blockr) == 1`, the slice extent equals the chunk extent within
one chunk, which forces it to align with that chunk's range — otherwise
the slice would straddle into a neighbouring chunk and bump
`length(blockr)`. The helper's docstring states this so the elision
is not mysterious.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-chunk fastpath in writeblock! went through compress_raw →
pipeline_encode, which allocates a chunk-sized Vector{UInt8} and
copyto!'s the bytes in. For the V2-uncompressed-no-filters case the
encoded bytes are just the host-endian bit representation of the input,
so reinterpret(UInt8, ain) is a valid byte source — pass that straight
to the store and skip the allocation + memcpy.

Factor the fastpath body into write_singlechunk_fastpath! so the
specialization can dispatch on metadata type (V2 + NoCompressor +
Nothing filters) without touching the slow path. The slow path
continues to materialize an owned Vector{UInt8} since its scratch
buffer is reused across iterations and aliasing would race the
writetask.

ZarrBenchmarks v2 uncompressed write throughput (256x256x32x50 chunks,
NVMe on Btrfs):
  - sequential: 1045 → 2513 MB/s (+141%), 95-101% of raw write() ceiling
  - 8 threads:  1294 → 2444 MB/s (+89%)
  - 512^2 chunk-size regression eliminated (the prior allocation
    dominated at large chunks).

Tests cover dispatch (which() inspection), round-trip across rank 0-4
and multiple bitstypes, all-fill-value chunk elision, and caller-array
aliasing safety.
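
The trade-off between the zero-copy view and an owned byte vector can be illustrated with a small sketch. It uses an `IOBuffer` as a stand-in sink, which is an assumption made for the example and not how the store is written in the PR.

```julia
ain = Float32[1, 2, 3, 4]

view_bytes  = reinterpret(UInt8, ain)   # zero-copy: shares memory with `ain`
owned_bytes = collect(view_bytes)       # owned: one allocation plus one copy

# Hand the view straight to the sink while `ain` is still unchanged.
io = IOBuffer()
write(io, view_bytes)
written = take!(io)

# Mutating the caller's array afterwards changes the view but not the copy.
ain[1] = 99f0
view_changed = collect(view_bytes) != owned_bytes
```

The view is safe here because the sink consumes it before the caller touches `ain` again; a buffer that outlives the call, like the slow path's reused scratch buffer, needs the owned copy.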
Per @lazarusA's PR review: collapse the five repeated [#280] entries
into one bullet with the sub-items as a list, matching the v0.10.0
#241 formatting. Adds the zero-copy NoCompressor fastpath as a new
sub-item.
@asinghvi17 asinghvi17 marked this pull request as ready for review May 28, 2026 10:13
@lazarusA lazarusA changed the title [WIP] Performance improvements (v2 only for now) Performance improvements (v2 only for now) May 29, 2026
@lazarusA

Collaborator

There is still a lot of AI slop in between functions; maybe do another pass to clean it up? Or change the comments to docstrings (making sure the statements are accurate) if you want to keep them. Other than that, it does the job!

lazarusA commented May 30, 2026

Collaborator

Looking forward to the v3 improvements 😄. I will merge now, and if there are issues we can fix them later in smaller PRs.

@lazarusA lazarusA merged commit d2a3f79 into master May 30, 2026
19 checks passed
3 participants