Performance: bulk-copy uncompressed encode + undef chunk buffer by glwagner · Pull Request #272 · JuliaIO/Zarr.jl

glwagner · 2026-05-19T15:29:18Z

Summary

Targeted optimisation of the Zarr v2 uncompressed write path: ~2.4× speedup on NoCompressor writes by replacing an element-wise append!-driven byte materialisation with a single unsafe_copyto!, and by skipping a redundant zero-fill of the chunk scratch buffer on common paths.

Bytes on disk are bit-identical to current master. No public API change.

Scope note (updated): the headline fix targets pipeline_encode(::V2Pipeline, ...). The V3 encode path goes through Codecs.V3Codecs.codec_encode(::BytesCodec, ...) which already does a bulk reinterpret(UInt8, vec(data)) |> collect, so the V2 antipattern this patch fixes doesn't exist in V3. V3 still gains a small amount from the read-side getchunkarray_undef change, but the write number is essentially unchanged. The V3 hot path is a separate problem, out of scope for this PR.

Measured impact

256×256×32 Float32 chunks × 50 timesteps, single-threaded, NoCompressor, Apple M2 Ultra, APFS internal NVMe:

Branch	V2 write (MiB/s)	V3 write (MiB/s)
Zarr.jl master	479	858
this patch	1137	852
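As a quick sanity check on the numbers above, the workload size and the headline ratios can be recomputed from the table. This is only arithmetic on the figures already quoted; the variable names are illustrative.

```julia
# Workload arithmetic for the benchmark above (figures copied from the table).
chunk_bytes = 256 * 256 * 32 * sizeof(Float32)   # bytes in one Float32 chunk
chunk_mib   = chunk_bytes / 2^20                 # size of one chunk in MiB
total_mib   = chunk_mib * 50                     # 50 timesteps, one chunk each

v2_speedup = 1137 / 479                          # V2 write: patch vs master
v3_ratio   = 852 / 858                           # V3 write: patch vs master

println("chunk = ", chunk_mib, " MiB, total = ", total_mib, " MiB")
println("V2 speedup ≈ ", round(v2_speedup, digits = 2),
        ", V3 ratio ≈ ", round(v3_ratio, digits = 2))
```

Each chunk is 8 MiB, so the run writes 400 MiB in total, and the V2 ratio rounds to the ~2.4× quoted in the summary.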

The V2 fix brings Zarr.jl V2 write throughput close to the Rust-backed Zarrs.jl V3 (1137 MiB/s versus ~1300 MiB/s on the same workload). For users who want the fastest write path today, this PR makes Zarr.jl V2 + NoCompressor competitive without a Rust dependency.

Full investigation, profile, and four-way comparison plots (baseline / this patch / Prop B prototype / Zarrs.jl), split by V2 and V3: https://github.com/glwagner/ZarrBenchmarks

What's in the patch

src/pipeline.jl — pipeline_encode(::V2Pipeline, ...) gains a fast path for NoCompressor + no filters:

if p.compressor isa NoCompressor && p.filters === nothing
    n = sizeof(data)
    out = Vector{UInt8}(undef, n)
    GC.@preserve out data unsafe_copyto!(pointer(out),
                                         Ptr{UInt8}(pointer(data)), n)
    return out
end

This bypasses the generic zcompress! path, which for NoCompressor was funnelling through append!(compressed, reinterpret(UInt8, data)). append! of a reinterpret view materialises bytes one element at a time through _growend! / push!, and a flat profile showed this path consuming 67% of V2 write CPU. The new path is a single bulk memcpy.
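To make the contrast concrete, here is a minimal standalone sketch of the two shapes of byte materialisation. The function names `encode_append` and `encode_bulk` are hypothetical and are not part of Zarr.jl; how slow the `append!` route actually is depends on the Julia version, so treat the performance claim as the PR's measurement rather than something this sketch proves. What it does show is that both routes produce the same bytes.

```julia
# Slow shape (per the profile above): grow an empty byte vector from a lazy
# reinterpret view of the data.
function encode_append(data::Array{T}) where {T}
    out = UInt8[]
    append!(out, reinterpret(UInt8, vec(data)))
    return out
end

# Fast shape: allocate the output once, then issue one memcpy-style copy.
function encode_bulk(data::Array{T}) where {T}
    isbitstype(T) || throw(ArgumentError("bulk copy needs an isbits eltype"))
    n = sizeof(data)                         # total payload in bytes
    out = Vector{UInt8}(undef, n)
    # Keep both arrays rooted while raw pointers to them are in use.
    GC.@preserve out data unsafe_copyto!(pointer(out), Ptr{UInt8}(pointer(data)), n)
    return out
end

data = rand(Float32, 16, 16, 4)
@assert encode_append(data) == encode_bulk(data)   # identical bytes either way
```

Timing the two functions on a chunk-sized array (for example with `@time` after a warm-up call) is a simple way to check whether the gap reported in the PR reproduces on a given Julia version.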

The fast path also skips the all(isequal(fill_value), data) scan, an O(N) walk of the chunk on every write whose purpose is to avoid writing all-fill-value chunks. The all-codec fallback path retains that scan, so chunk-elision behaviour for less-common cases is unchanged.
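The scan being skipped can be sketched in isolation. The helper below, `encode_or_elide`, is a hypothetical stand-in and not Zarr.jl's function; it only mirrors the `all(isequal(fill_value), data)` check quoted above and returns `nothing` when a chunk holds nothing but the fill value.

```julia
# Hypothetical sketch of fill-value chunk elision.
function encode_or_elide(data::Array{T}, fill_value) where {T}
    # O(N) walk of the chunk; `isequal` (unlike `==`) treats NaN as equal to NaN,
    # so an all-NaN chunk with a NaN fill value is still elided.
    if fill_value !== nothing && all(isequal(fill_value), data)
        return nothing                       # caller can skip writing this chunk
    end
    return collect(reinterpret(UInt8, vec(data)))
end

@assert encode_or_elide(fill(NaN32, 4, 4), NaN32) === nothing
@assert length(encode_or_elide(ones(Float32, 4, 4), 0f0)) == 64
```

For a dense write, the scan usually fails on the first element and costs little, but an almost-all-fill chunk still pays for nearly the full walk before it is written anyway.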

src/ZArray.jl — adds a getchunkarray_undef(z) variant that skips the fill(_zero(eltype(z)), chunks) zero-fill. Used in:

readblock!: always safe — the buffer is fully overwritten by pipeline_decode! (which copyto!s into it) or fill! (via uncompress_raw!'s fill-value fallback) before any read.
writeblock!: use undef when fill_value !== nothing (the resetbuffer! path inside the loop handles initialisation explicitly). When fill_value === nothing we keep the zero-filled buffer to preserve the legacy "unwritten cells in a partial-chunk write default to zero" behaviour.

getchunkarray_undef defers to the original getchunkarray for >:Missing element types — the codec pipeline requires the buffer to be the isbits inner of a SenMissArray, not an Array{Union{Missing,T}} directly (Blosc and friends reject non-isbits eltypes). The undef fast path applies only to plain isbits eltypes.

Test plan

Pkg.test("Zarr") passes: 2593 / 2593 on this branch (with #273, "Fix CondaPkg branch in CI", applied to fix the unrelated test-environment instantiation issue).
Round-trip correctness across {NoCompressor, ZstdCompressor, BloscCompressor} × {fill_value: nothing, zero, NaN} × {1D, 2D, 3D, 4D shapes} × {Float32, Float64, Int32}.
Partial-chunk write preserves the fill_value === nothing zero-fallthrough behaviour.
Bytes-on-disk size matches expected: NoCompressor V2 chunk = exact length(data) * sizeof(T), no overhead.
>:Missing sentinel-value paths verified.
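The round-trip and byte-size items in this list can be illustrated without touching a store. The sketch below is an in-memory stand-in, not the PR's test code: `roundtrip` is a hypothetical helper that encodes an array to raw bytes, checks the `length(data) * sizeof(T)` size claim, and decodes it again.

```julia
# Hypothetical in-memory round trip for the uncompressed case.
function roundtrip(data::Array{T,N}) where {T,N}
    bytes = collect(reinterpret(UInt8, vec(data)))        # encode: raw bytes only
    @assert length(bytes) == length(data) * sizeof(T)     # no framing overhead
    return reshape(collect(reinterpret(T, bytes)), size(data))
end

# Sweep a few element types and ranks, as in the test matrix above.
for T in (Float32, Float64, Int32), dims in ((8,), (4, 4), (2, 3, 4), (2, 2, 2, 2))
    x = rand(T, dims)
    @assert isequal(roundtrip(x), x)
end
```

A real test would additionally write through a `ZArray` and compare the stored chunk's size on disk, which this sketch does not attempt.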

CI dependency

The CI on this PR currently fails at the pre-test "Generate Julia and Python v3 fixtures" step. That failure is unrelated to this PR — test/Project.toml pins CondaPkg to a branch on JamesWrigley/CondaPkg.jl that was deleted upstream between PR #270 (May 8, last passing master CI) and now. #273 fixes it. Once #273 lands I'll rebase, or I'm happy to absorb the test/Project.toml change into this PR if that's preferred.

Followups

The ZarrBenchmarks investigation also identified, in rough order of expected impact:

writeblock! spawns two @async tasks via Channels even for single-chunk writes (V2 + V3).
The read-modify-write readtask runs even when overwriting a chunk fully (V2 + V3).
pipeline_encode allocates a fresh Vector{UInt8} per chunk for the compressed codec path (V2 + V3).
V3-specific: BytesCodec.codec_encode does reinterpret(...) |> collect which is bulk-copy fast but still allocates a fresh Vector{UInt8} per call — could plumb a scratch buffer through.
V3-specific: pipeline_decode! copyto!(output, arr)s a final intermediate array; the codec chain could decode in-place into output for the common single-codec case.

Each is plausibly 5–15% on its own and they compose. Happy to follow up with separate PRs once this one lands.

glwagner · 2026-05-19T15:42:39Z

Does CI need a fix? I can push another PR independently if so

asinghvi17 · 2026-05-19T15:51:31Z

#273 should fix CI (we were pinning to a merged CondaPkg.jl branch)

lazarusA · 2026-05-19T16:03:59Z

also, I like the reviews done by copilot, but I think @mkitti has been the one doing the triggering for those.

Related performance PRs: "Fast partial-read path for sharding_indexed codec" (#264) and "[WIP] Internal threading + in-place codec API for sharded reads" (#265). Unfortunately, they are beyond my time bandwidth to study/review; just mentioning them in case you are in a performance sprint.

On the metadata-related issues mentioned in your benchmarks: #270 was supposed to fix them, but again I haven't been able to bring myself to get them to the finish line 😓.

Thanks for bringing in new code 😄 .

This patch targets a hot spot in the write path that costs ~3× on uncompressed writes. Profile of a 256×256×32 Float32 chunk written 50 times shows 67% of CPU spent in `pipeline_encode` → `zcompress!` → `append!` of a reinterpret view, i.e. materialising bytes one element at a time into a freshly-allocated `Vector{UInt8}`.

Two changes in this commit:

1. `src/pipeline.jl`: Add a fast path in `pipeline_encode(::V2Pipeline, ...)` for `NoCompressor` + no filters that does a single `unsafe_copyto!` of the input array's bytes, bypassing the generic `append!`-driven path. The fast path also skips the `all(isequal(fill_value), data)` scan, which is an O(N) read of the chunk on every write; the common dense-write case never benefits. (The fallback non-fast-path retains the scan and the existing `zcompress!` machinery, so other codecs and the chunk-elision behaviour are unchanged.)

2. `src/ZArray.jl`: Add a `getchunkarray_undef(z::ZArray)` variant that skips the `fill(_zero(eltype(z)), chunks)` zero-fill. Use it in `readblock!` (always safe: the buffer is fully written by `pipeline_decode!` or `fill!`-from-fill_value before any read) and in `writeblock!` when `z.metadata.fill_value !== nothing` (the `resetbuffer!` path handles initialisation explicitly). When `fill_value === nothing` we keep the old zero-filled buffer to preserve the legacy "unwritten cells in a partial-chunk write default to zero" behaviour.

Bytes on disk are bit-identical to the previous implementation. No public API change.

Measured on an Apple M2 Ultra (single-threaded, NoCompressor, 256×256×32×Float32 chunks, 50 timesteps; APFS internal NVMe):

baseline: write 460 MiB/s, read 1300 MiB/s
this patch: write 1410 MiB/s, read 1320 MiB/s

The write throughput is now at parity with the Rust-backed Zarrs.jl on the same workload (~1350 MiB/s) and within 25% of `Mmap.mmap` onto a flat file (~1700 MiB/s). On a sweep from 12 MiB to 3.1 GiB total data the speed-up holds at 2.5–2.8×.

Full benchmark code, profile, and four-way comparison plots (baseline / this patch / mmap / Zarrs.jl) at: https://github.com/glwagner/ZarrBenchmarks

The zstd write path is also slightly faster (~5%) since the fast path doesn't fire there but the related `all-fill-value` scan is now gated; the dominant zstd CPU is inside `libzstd` regardless.

Correctness verified with 109 round-trip tests across {NoCompressor, ZstdCompressor, BloscCompressor} × {fill_value: nothing, zero, NaN} × {1-4D shapes} × {Float32, Float64, Int32}, plus partial-chunk-write tests that exercise the `fill_value === nothing` zero-fallthrough path and the `fill_value !== nothing` undef-buffer path.

glwagner · 2026-05-19T16:12:00Z

Quick update on the CI status:

The pre-test failure on every job is unrelated to this PR. Every job is failing at the "Generate Julia and Python v3 fixtures" step (before the actual test step runs), because test/Project.toml's pin of CondaPkg = {rev = "manifest-check"} on JamesWrigley/CondaPkg.jl returns 404 — that branch was deleted upstream. The last passing master CI was on May 8 (PR #270); the branch must have been pruned between then and now. This is fixed by #273.

There was however a real bug in this PR that the broken CI hid. I checked out #273 locally, cherry-picked this PR on top, ran the full Pkg.test("Zarr") suite, and found 1 fail + 3 errors — all in the ZArray{>:Missing} (sentinel-value missing) paths. My getchunkarray_undef was returning Array{Union{Missing, T}}(undef, ...) rather than wrapping an isbits Array{T} in a SenMissArray, which Blosc / the codec pipeline then rejects with "buffer eltype must be isbits".

Fixed in 75bea38:

function getchunkarray_undef(z::ZArray{T}) where {T}
    Missing <: T && return getchunkarray(z)   # fall back for sentinel-missing case
    return Array{T}(undef, z.metadata.chunks)
end

For >:Missing element types we now defer to the standard getchunkarray (which builds a properly-wrapped SenMissArray). The fast path applies only to plain isbits eltypes, which is where the savings are anyway.
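The `Missing <: T` guard can be tried outside Zarr.jl. In the sketch below, `scratch_buffer` is a hypothetical allocator that takes a type and a shape instead of a `ZArray`; for the missing-capable case it simply returns an initialised array, which is an assumption made for illustration and is not how Zarr.jl's `SenMissArray` wrapping works.

```julia
# Hypothetical stand-in for a chunk scratch-buffer allocator (not Zarr.jl's API).
function scratch_buffer(::Type{T}, dims::Dims) where {T}
    if Missing <: T
        # Missing-capable eltype: take the conservative, initialised route.
        return fill!(Array{T}(undef, dims), missing)
    end
    # Plain eltype: skip initialisation; contents are arbitrary until overwritten.
    return Array{T}(undef, dims)
end

@assert !(Missing <: Float32)                      # plain isbits eltype
@assert Missing <: Union{Missing,Float32}          # sentinel-missing eltype

buf = scratch_buffer(Float32, (4, 4))
fill!(buf, 1f0)                                    # must overwrite before reading
@assert all(==(1f0), buf)
```

Because the guard depends only on the type parameter, the compiler can resolve it at specialisation time, so the check should cost nothing at run time for concrete `T`.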

With #273 + the amended commit applied locally, Pkg.test("Zarr") is 2593 / 2593 passing.

This PR's CI will go green once #273 lands and I rebase. Happy to either rebase now (pulling in #273's test/Project.toml change as part of this PR) or wait for #273 to merge first — whichever is easier on your side.

mkitti · 2026-05-19T18:28:39Z

Is this just Zarr v2?

Copilot

Pull request overview

Optimizes the Zarr v2 uncompressed write path by adding a bulk-copy fast path for NoCompressor and by avoiding unnecessary zero/fill initialization of chunk scratch buffers in common read/write paths.

Changes:

Added a NoCompressor && filters === nothing fast path in pipeline_encode(::V2Pipeline, ...) that memcpy’s chunk bytes into a Vector{UInt8}.
Introduced getchunkarray_undef and switched readblock! (and parts of writeblock!) to use an uninitialized chunk scratch buffer when it is guaranteed to be fully overwritten.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/ZArray.jl`	Adds `getchunkarray_undef` and uses it to skip scratch-buffer pre-fills where safe.
`src/pipeline.jl`	Adds a bulk-copy fast path for uncompressed v2 encoding.

Comments suppressed due to low confidence (1)

src/pipeline.jl:13

This unsafe_copyto! fast path assumes data is a contiguous, isbits-backed buffer. Since the method accepts AbstractArray, it can be called with strided/non-contiguous views or non-isbits element types (e.g. if a user forces filters=nothing for an object dtype), which would previously error via reinterpret but can now silently write incorrect bytes. To keep this safe, consider restricting the fast path to data isa Array (or at least contiguous StridedArray) and isbitstype(eltype(data)), and compute the byte count via Base.elsize(data) * length(data) rather than sizeof(data) to avoid ambiguity across array types.

    if p.compressor isa NoCompressor && p.filters === nothing
        n = sizeof(data)
        out = Vector{UInt8}(undef, n)
        GC.@preserve out data unsafe_copyto!(pointer(out),
                                             Ptr{UInt8}(pointer(data)), n)
        return out
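A guarded variant along the lines Copilot suggests could look like the following. This is a sketch under assumptions, not the code that was merged: `bulk_bytes` is a hypothetical name, and the fallback branch (densify, then reinterpret) is just one way to keep non-contiguous inputs correct.

```julia
# Hypothetical guarded fast path: only dense `Array`s of isbits elements take
# the raw pointer copy; everything else goes through a safe fallback.
function bulk_bytes(data::AbstractArray)
    if data isa Array && isbitstype(eltype(data))
        n = Base.elsize(data) * length(data)       # bytes per element × count
        out = Vector{UInt8}(undef, n)
        GC.@preserve out data unsafe_copyto!(pointer(out), Ptr{UInt8}(pointer(data)), n)
        return out
    end
    # Fallback: materialise a dense copy first, then take its bytes. For
    # non-isbits eltypes `reinterpret` throws, matching the old erroring behaviour.
    return collect(reinterpret(UInt8, vec(collect(data))))
end

a = rand(Float32, 8, 8)
v = view(a, 1:2:8, :)                              # strided, non-contiguous view
@assert bulk_bytes(v) == bulk_bytes(collect(v))    # same bytes via either branch
```

The strided view is the case the review comment worries about: without the `data isa Array` guard, a raw copy of `sizeof(v)` bytes starting at `pointer(v)` would read the parent's memory contiguously and include elements that are not part of the view.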


+    if p.compressor isa NoCompressor && p.filters === nothing
+        n = sizeof(data)
+        out = Vector{UInt8}(undef, n)
+        GC.@preserve out data unsafe_copyto!(pointer(out),
+                                             Ptr{UInt8}(pointer(data)), n)
+        return out
+    end
    if fill_value !== nothing && all(isequal(fill_value), data)
        return nothing


 function pipeline_encode(p::V2Pipeline, data::AbstractArray, fill_value)
+    # Fast path: NoCompressor + no filters is just a bulk byte copy. The
+    # generic zcompress! path below funnels through `append!` of a reinterpret
+    # view, which materialises the bytes one element at a time and dominates
+    # CPU for uncompressed writes. We also skip the all-fill-value scan
+    # because (a) it's an O(N) read of the chunk on every write and (b) the
+    # common dense-write use case never benefits.
+    if p.compressor isa NoCompressor && p.filters === nothing
+        n = sizeof(data)
+        out = Vector{UInt8}(undef, n)
+        GC.@preserve out data unsafe_copyto!(pointer(out),
+                                             Ptr{UInt8}(pointer(data)), n)
+        return out
+    end


asinghvi17 · 2026-05-19T18:36:26Z

@mkitti I believe so. The agent that Greg set on this also had a different idea which might be cleaner, which I'll explore today.

glwagner · 2026-05-19T18:49:28Z

Followups (not in this PR)

The ZarrBenchmarks investigation also flagged: (a) writeblock! spawns two @async tasks via Channels even for single-chunk writes, (b) the read-modify-write readtask runs even when overwriting a chunk fully, (c) pipeline_encode allocates a fresh Vector{UInt8} per chunk for the compressed codec path. Each is good for another 5–15%. Happy to follow up with separate PRs if this one lands well.

This:

The ZarrBenchmarks investigation also flagged: (a) writeblock! spawns two @async tasks via Channels even for single-chunk writes, (b) the read-modify-write readtask runs even when overwriting a chunk fully, (c) pipeline_encode allocates a fresh Vector{UInt8} per chunk for the compressed codec path. Each is good for another 5–15%. Happy to follow up with separate PRs if this one lands well.

Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`, which zero-fills the chunk before any other work touches it. On the read side that fill is always clobbered: `pipeline_decode!` writes every element, or the fill-value branch in `uncompress_raw!` calls `fill!` itself. On the write side it's clobbered for full-chunk overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and re-initialised by `resetbuffer!` for partial-chunk RMW.

Add `getchunkarray_undef(z::ZArray{T})` which returns `Array{T}(undef, …)` for plain isbits eltypes and falls back to `getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc and friends reject `Union{Missing,T}` directly, so the inner buffer has to be a real `Array{T}`).

- `readblock!`: always use `getchunkarray_undef`. Decode-into and fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when `z.metadata.fill_value !== nothing` (resetbuffer! handles init). Keep the legacy zero-fill when `fill_value === nothing` so partial writes to a fresh chunk still default un-written cells to zero.

Port of the V2-side hunks from #272 (verbatim, with the same `Missing <: T` carve-out the PR added in commit 75bea38 to fix a Blosc rejection regression).

Tests: 2499/2499 pass, including the SenMissArray-exercising `Fillvalue as missing`, `getindex/setindex` (amiss), `MaxLengthString large-chunk read path`, and `ragged arrays` tests.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:

| size                | B1 only (R) | B1+A2 (R)  | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M   | 3466 MB/s   | 3866 MB/s  | +12%     |
| 256×256×32×50 400M  | 3589 MB/s   | 4169 MB/s  | +16%     |
| 512×512×32×50 1.5G  | ~3000 MB/s  | ~3500 MB/s | noisy    |

Write side smaller (most write time is storage-bound, not memset):

| size                | B1 only (W) | B1+A2 (W)  | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M   | 1148 MB/s   | 1203 MB/s  | +5%      |
| 256×256×32×50 400M  | 1327 MB/s   | 1354 MB/s  | +2%      |
| 512×512×32×50 1.5G  | 1132 MB/s   | 1329 MB/s  | +17%     |

The 1.5 GiB row is the noisiest across runs because chunks no longer fit in L3; exact numbers vary with cache state at run start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
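The safety argument in this commit message (an uninitialised buffer is fine when every element is written before any read, while partial writes with no fill value rely on zeros) can be shown with plain arrays. The sketch uses no Zarr.jl calls; `decoded` simply stands in for whatever a decode step would produce.

```julia
chunk_shape = (4, 4, 2)

# Read side: the scratch buffer is fully overwritten, so `undef` is safe.
decoded = rand(Float32, chunk_shape)            # stand-in for decoded chunk data
buf = Array{Float32}(undef, chunk_shape)        # arbitrary contents, no zero-fill
copyto!(buf, decoded)                           # writes every element
@assert buf == decoded

# Write side with no fill value: a partial write must leave the rest at zero,
# which is why the zero-filled allocation is kept for that case.
legacy = zeros(Float32, chunk_shape)
legacy[1:2, :, :] .= 1f0                        # partial-chunk write
@assert all(iszero, legacy[3:4, :, :])          # untouched cells default to zero
```

Had `legacy` been allocated with `undef`, the final assertion would depend on whatever bytes happened to be in memory, which is the behaviour change the `fill_value === nothing` carve-out avoids.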

lazarusA · 2026-05-30T15:32:33Z

still a draft? there are some unresolved comments.

@lazarusA

* use bulk copy in zcompress! fallback to avoid per-byte growth

`zcompress(data, NoCompressor())` returns a lazy `reinterpret(UInt8, data)` view. The old `empty!` + `append!` fallback walked that view element by element through `_growend!` / `push!`, materialising bytes one at a time (roughly 67% of CPU time on uncompressed full-chunk V2 writes in profiling). Replace with `resize!` + `copyto!` so the same path issues a single SIMD / memcpy bulk copy.

Real compressors (Blosc, Zlib, Zstd) are unaffected: they already return a freshly-allocated `Vector{UInt8}` and `copyto!` handles that case identically.

Bytes-on-disk are bit-identical; full test suite (2499 tests) passes.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1):

| size                | baseline | patched   | speed-up |
|---------------------|---------:|----------:|---------:|
| 128×128×16×50 50M   | 702 MB/s | 995 MB/s  | 1.42×    |
| 256×256×32×50 400M  | 799 MB/s | 1530 MB/s | 1.92×    |
| 512×512×32×50 1.5G  | 776 MB/s | 1104 MB/s | 1.42×    |

Reads are unchanged (this fallback is not on the read path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* use bulk copy in NoCompressor zuncompress! to skip lazy-view walk

The generic `zuncompress!` fallback at the top of `Compressors.jl` ends up at `copyto!(::Array{T}, ::ReinterpretArray)` when `c isa NoCompressor`, since `zuncompress(bytes, ::NoCompressor, T)` returns a lazy `reinterpret(T, bytes)` view. That `copyto!` walks element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for `unsafe_copyto!` on the same memory, a 5× gap. End-to-end V2 reads absorb most of it via DiskArrays slicing, channel ferry, and storage I/O, leaving ~25-99% throughput improvement depending on chunk size.

Add a `::NoCompressor`-dispatched `zuncompress!` method alongside the existing per-compressor versions for Zstd / Zlib / Blosc. Mirror image of the encode-side bulk copy that landed in the post-A1 `zcompress!`.

Guards on `data::Array{T}` and `isbitstype(T)` keep the fast path off `SenMissArray`, `MaxLengthString`, ragged `Vector{T}` eltypes, and any non-contiguous input; those fall back to the existing generic `copyto!` path.

Tests: 2499/2499 pass.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash on the same wall-clock:

| size                | A1 only read | A1 + B1 read | speed-up |
|---------------------|-------------:|-------------:|---------:|
| 128×128×16×50 50M   | 3080 MB/s    | 3884 MB/s    | +26%     |
| 256×256×32×50 400M  | 3155 MB/s    | 4130 MB/s    | +31%     |
| 512×512×32×50 1.5G  | 2243 MB/s    | 4455 MB/s    | +99%     |

Writes are unchanged (B1 doesn't touch the encode path).

Reference: Zarrs.jl on the same V2-uncompressed workload reads at 5166 / 4672 / 5418 MB/s, so B1 closes most of the read gap; the 1.6 GiB case goes from ~58% of Zarrs.jl to ~82%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add getchunkarray_undef and skip dead zero-fill on full-overwrite paths

Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`, which zero-fills the chunk before any other work touches it. On the read side that fill is always clobbered: `pipeline_decode!` writes every element, or the fill-value branch in `uncompress_raw!` calls `fill!` itself. On the write side it's clobbered for full-chunk overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and re-initialised by `resetbuffer!` for partial-chunk RMW.

Add `getchunkarray_undef(z::ZArray{T})` which returns `Array{T}(undef, …)` for plain isbits eltypes and falls back to `getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc and friends reject `Union{Missing,T}` directly, so the inner buffer has to be a real `Array{T}`).

- `readblock!`: always use `getchunkarray_undef`. Decode-into and fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when `z.metadata.fill_value !== nothing` (resetbuffer! handles init). Keep the legacy zero-fill when `fill_value === nothing` so partial writes to a fresh chunk still default un-written cells to zero.

Port of the V2-side hunks from #272 (verbatim, with the same `Missing <: T` carve-out the PR added in commit 75bea38 to fix a Blosc rejection regression).

Tests: 2499/2499 pass, including the SenMissArray-exercising `Fillvalue as missing`, `getindex/setindex` (amiss), `MaxLengthString large-chunk read path`, and `ragged arrays` tests.

Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:

| size                | B1 only (R) | B1+A2 (R)  | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M   | 3466 MB/s   | 3866 MB/s  | +12%     |
| 256×256×32×50 400M  | 3589 MB/s   | 4169 MB/s  | +16%     |
| 512×512×32×50 1.5G  | ~3000 MB/s  | ~3500 MB/s | noisy    |

Write side smaller (most write time is storage-bound, not memset):

| size                | B1 only (W) | B1+A2 (W)  | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M   | 1148 MB/s   | 1203 MB/s  | +5%      |
| 256×256×32×50 400M  | 1327 MB/s   | 1354 MB/s  | +2%      |
| 512×512×32×50 1.5G  | 1132 MB/s   | 1329 MB/s  | +17%     |

The 1.5 GiB row is the noisiest across runs because chunks no longer fit in L3; exact numbers vary with cache state at run start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add CHANGELOG entries for #280 (V2 perf)

One bullet per commit on the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* shrink comments around zcompress! / zuncompress! fallbacks

The multi-paragraph rationale comments are now stale post-landing; one line each is enough to flag why bulk copy beats append! / generic ReinterpretArray copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* skip readtask/writetask ferry on single-chunk full-overwrite writes

The slow writeblock! path spawns a readtask and a writetask connected by 0-buffered channels for every call. Profiling the 400 MiB headline workload showed 93% of write CPU pinned in __psynch_cvwait (the channel ferry synchronisation cost) for the dominant case of writing one full chunk at a time (`z[:,:,:,t] = buf`).

Add a fast path at the top of writeblock! that, when the call touches exactly one chunk and `ain` is a plain Array matching the chunk shape and eltype, encodes `ain` directly via `compress_raw` and calls `store_writechunk`/`store_deletechunk` synchronously. Fill-value elision is preserved (delete-if-initialised when encode returns `nothing`). Restricted to non-Missing element types so the SenMissArray indirection required by the codec pipeline still goes through the slow path.

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, 50 timesteps):

| size        | write MiB/s before | write MiB/s after | read MiB/s before | read MiB/s after | note         |
|-------------|-------------------:|------------------:|------------------:|-----------------:|--------------|
| 128²×16×50  | 1235               | 2228              | 3864              | ~3500            | (write +80%) |
| 256²×32×50  | 1392               | 2160              | 3870              | ~3300            | (write +55%) |
| 512²×32×50  | 1339               | 2240              | 4651              | ~3500            | (write +67%) |

Read variance in the combined bench is system noise; the isolated read-only bench shows reads unchanged.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* skip readtask ferry on single-chunk full-overwrite reads

Mirrors the writeblock! fast path: for the common case of reading exactly one full chunk into an Array of matching shape and eltype, bypass the readtask + 0-buffered channel and the chunk-shaped scratch buffer. Read the compressed bytes synchronously via `store_readchunk` and decode directly into `aout`, eliminating both the channel-ferry sync wait (80% of read CPU per profiling) and the per-call scratch allocation + final copy (the 13% GC pressure source).

Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, isolated read-only):

| size        | before     | after       | speedup |
|-------------|-----------:|------------:|--------:|
| 128²×16×50  | 1528 MiB/s | ~5300 MiB/s | ~3.5x   |
| 256²×32×50  | 3128 MiB/s | ~4900 MiB/s | ~1.6x   |
| 512²×32×50  | 3248 MiB/s | ~4800 MiB/s | ~1.5x   |

Same guards as the write fast path: single chunk touched, `aout isa Array`, non-Missing eltype, eltype and shape match the chunk. Otherwise falls through to the existing channel-based path.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* factor singlechunk_fastpath guard out of readblock!/writeblock!

The two A3 fast paths share the same eligibility check (single chunk, plain Array of matching shape and eltype, non-Missing T). Extract that into a helper that returns the chunk index or `nothing`, so each caller's fast-path body shrinks to the I/O calls.

The previous version also re-derived `indranges` to verify full-chunk coverage; that check is redundant. If `size(arr) == chunks` and `length(blockr) == 1`, the slice extent equals the chunk extent within one chunk, which forces it to align with that chunk's range; otherwise the slice would straddle into a neighbouring chunk and bump `length(blockr)`. The helper's docstring states this so the elision is not mysterious.

Tests: 2499/2499 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* zero-copy chunk write for V2 NoCompressor + no filters

The single-chunk fastpath in writeblock! went through compress_raw → pipeline_encode, which allocates a chunk-sized Vector{UInt8} and copyto!'s the bytes in. For the V2-uncompressed-no-filters case the encoded bytes are just the host-endian bit representation of the input, so reinterpret(UInt8, ain) is a valid byte source: pass that straight to the store and skip the allocation + memcpy.

Factor the fastpath body into write_singlechunk_fastpath! so the specialization can dispatch on metadata type (V2 + NoCompressor + Nothing filters) without touching the slow path. The slow path continues to materialize an owned Vector{UInt8} since its scratch buffer is reused across iterations and aliasing would race the writetask.

ZarrBenchmarks v2 uncompressed write throughput (256x256x32x50 chunks, NVMe on Btrfs):

- sequential: 1045 → 2513 MB/s (+141%), 95-101% of raw write() ceiling
- 8 threads: 1294 → 2444 MB/s (+89%)
- 512^2 chunk-size regression eliminated (the prior allocation dominated at large chunks).

Tests cover dispatch (which() inspection), round-trip across rank 0-4 and multiple bitstypes, all-fill-value chunk elision, and caller-array aliasing safety.

* consolidate #280 CHANGELOG entries

Per @lazarusA's PR review: collapse the five repeated [#280] entries into one bullet with the sub-items as a list, matching the v0.10.0 #241 formatting. Adds the zero-copy NoCompressor fastpath as a new sub-item.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lazaro Alonso <lazarus.alon@gmail.com>
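The `resize!` + `copyto!` fallback described in the first commit of this message can be sketched as follows. `fill_bytes!` is a hypothetical helper written for illustration; it is not the `zcompress!` method from #280, only the same pattern applied to a reusable output buffer.

```julia
# Hypothetical sketch of the resize! + copyto! pattern on a reused byte buffer.
function fill_bytes!(compressed::Vector{UInt8}, data::Array{T}) where {T}
    src = reinterpret(UInt8, vec(data))   # lazy byte view of the input, no copy yet
    resize!(compressed, length(src))      # one resize instead of per-element growth
    copyto!(compressed, src)              # bulk copy into the caller's buffer
    return compressed
end

buf = UInt8[1, 2, 3]                      # stale contents from a previous chunk
x = Int32[10, 20, 30]
fill_bytes!(buf, x)
@assert length(buf) == 3 * sizeof(Int32)
@assert reinterpret(Int32, buf) == x      # bytes round-trip back to the input
```

Unlike the zero-copy variant in the last commit, this pattern still produces an owned byte vector, which is the property the slow path needs when its scratch buffer is reused across iterations.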

asinghvi17 · 2026-05-30T15:54:05Z

I was talking with @glwagner and would generally prefer #280 over the implementation here, it hits the same notes. We should put up a zarr v3 version of this though.


glwagner force-pushed the glw/prop-a-bulk-encode branch from 25a93ac to 75bea38 on May 19, 2026 16:11

Merge branch 'master' into glw/prop-a-bulk-encode

3d6b091

mkitti requested a review from Copilot May 19, 2026 18:28

Copilot started reviewing on behalf of mkitti May 19, 2026 18:29

Copilot AI reviewed May 19, 2026

glwagner mentioned this pull request May 19, 2026

Possible optimizations for Zarr v3? #276

Open

asinghvi17 mentioned this pull request May 20, 2026

Performance improvements (v2 only for now) #280

Merged

glwagner wants to merge 2 commits into JuliaIO:master from glwagner:glw/prop-a-bulk-encode


5 participants
