Performance: bulk-copy uncompressed encode + undef chunk buffer #272
glwagner wants to merge 2 commits into
Does CI need a fix? I can push another PR independently if so.

#273 should fix CI (we were pinning to a merged CondaPkg.jl branch).

Also, I like the reviews done by Copilot, but I think @mkitti has been the one triggering them.
As for the metadata-related issues mentioned in your benchmarks, #270 was supposed to fix them, but I haven't been able to get it over the finish line 😓. Thanks for bringing in new code 😄.

This patch targets a hot spot in the write path that costs ~3× on
uncompressed writes. Profile of a 256×256×32 Float32 chunk written 50
times shows 67% of CPU spent in `pipeline_encode` → `zcompress!` →
`append!` of a reinterpret view, i.e. materialising bytes one element
at a time into a freshly-allocated `Vector{UInt8}`.
Two changes in this commit:
1. `src/pipeline.jl`: Add a fast path in `pipeline_encode(::V2Pipeline,
...)` for `NoCompressor` + no filters that does a single
`unsafe_copyto!` of the input array's bytes, bypassing the generic
`append!`-driven path. The fast path also skips the
`all(isequal(fill_value), data)` scan, which is an O(N) read of the
chunk on every write — the common dense-write case never benefits.
(The fallback non-fast-path retains the scan and the existing
`zcompress!` machinery, so other codecs and the chunk-elision
behaviour are unchanged.)
2. `src/ZArray.jl`: Add a `getchunkarray_undef(z::ZArray)` variant that
skips the `fill(_zero(eltype(z)), chunks)` zero-fill. Use it in
`readblock!` (always safe: the buffer is fully written by
`pipeline_decode!` or `fill!`-from-fill_value before any read) and
in `writeblock!` when `z.metadata.fill_value !== nothing` (the
`resetbuffer!` path handles initialisation explicitly). When
`fill_value === nothing` we keep the old zero-filled buffer to
preserve the legacy "unwritten cells in a partial-chunk write
default to zero" behaviour.
Bytes on disk are bit-identical to the previous implementation. No
public API change.
Measured on an Apple M2 Ultra (single-threaded, NoCompressor,
256×256×32×Float32 chunks, 50 timesteps; APFS internal NVMe):
baseline write 460 MiB/s, read 1300 MiB/s
this patch write 1410 MiB/s, read 1320 MiB/s
The write throughput is now at parity with the Rust-backed Zarrs.jl
on the same workload (~1350 MiB/s) and within 25% of `Mmap.mmap` onto
a flat file (~1700 MiB/s). On a sweep from 12 MiB to 3.1 GiB total
data the speed-up holds at 2.5–2.8×.
Full benchmark code, profile, and four-way comparison plots
(baseline / this patch / mmap / Zarrs.jl) at:
https://github.com/glwagner/ZarrBenchmarks
The zstd write path is also slightly faster (~5%) since the fast path
doesn't fire there but the related `all-fill-value` scan is now
gated; the dominant zstd CPU is inside `libzstd` regardless.
Correctness verified with 109 round-trip tests across
{NoCompressor, ZstdCompressor, BloscCompressor} × {fill_value:
nothing, zero, NaN} × {1-4D shapes} × {Float32, Float64, Int32},
plus partial-chunk-write tests that exercise the
`fill_value === nothing` zero-fallthrough path and the
`fill_value !== nothing` undef-buffer path.
25a93ac to
75bea38
Quick update on the CI status: The pre-test failure on every job is unrelated to this PR. Every job is failing at the "Generate Julia and Python v3 fixtures" step (before the actual test step runs), because

There was however a real bug in this PR that the broken CI hid. I checked out #273 locally, cherry-picked this PR on top, ran the full

Fixed in 75bea38:

```julia
function getchunkarray_undef(z::ZArray{T}) where {T}
    Missing <: T && return getchunkarray(z)  # fall back for sentinel-missing case
    return Array{T}(undef, z.metadata.chunks)
end
```

For

With #273 + the amended commit applied locally,

This PR's CI will go green once #273 lands and I rebase. Happy to either rebase now (pulling in #273's

Is this just Zarr v2?

Pull request overview
Optimizes the Zarr v2 uncompressed write path by adding a bulk-copy fast path for NoCompressor and by avoiding unnecessary zero/fill initialization of chunk scratch buffers in common read/write paths.
Changes:
- Added a `NoCompressor && filters === nothing` fast path in `pipeline_encode(::V2Pipeline, ...)` that memcpy’s chunk bytes into a `Vector{UInt8}`.
- Introduced `getchunkarray_undef` and switched `readblock!` (and parts of `writeblock!`) to use an uninitialized chunk scratch buffer when it is guaranteed to be fully overwritten.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/ZArray.jl` | Adds `getchunkarray_undef` and uses it to skip scratch-buffer pre-fills where safe. |
| `src/pipeline.jl` | Adds a bulk-copy fast path for uncompressed v2 encoding. |
Comments suppressed due to low confidence (1)
src/pipeline.jl:13
- This `unsafe_copyto!` fast path assumes `data` is a contiguous, isbits-backed buffer. Since the method accepts `AbstractArray`, it can be called with strided/non-contiguous views or non-isbits element types (e.g. if a user forces `filters=nothing` for an object dtype), which would previously error via `reinterpret` but can now silently write incorrect bytes. To keep this safe, consider restricting the fast path to `data isa Array` (or at least contiguous `StridedArray`) and `isbitstype(eltype(data))`, and compute the byte count via `Base.elsize(data) * length(data)` rather than `sizeof(data)` to avoid ambiguity across array types.

```julia
if p.compressor isa NoCompressor && p.filters === nothing
    n = sizeof(data)
    out = Vector{UInt8}(undef, n)
    GC.@preserve out data unsafe_copyto!(pointer(out),
                                         Ptr{UInt8}(pointer(data)), n)
    return out
end
```
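To make the suggested guards concrete, here is a minimal standalone sketch. The name `encode_uncompressed` is hypothetical and is not part of Zarr.jl. It assumes the intended policy is "bulk-copy dense `Array`s of isbits elements, materialise everything else first"; the PR may well choose a different fallback.

```julia
# Hypothetical helper for illustration only; not the Zarr.jl implementation.
function encode_uncompressed(data::AbstractArray{T}) where {T}
    # Non-isbits eltypes (e.g. String) have no flat byte representation.
    isbitstype(T) || throw(ArgumentError("eltype $T is not isbits"))
    if data isa Array
        # Dense, contiguous memory: a single bulk copy is safe.
        n = sizeof(T) * length(data)
        out = Vector{UInt8}(undef, n)
        GC.@preserve out data unsafe_copyto!(pointer(out),
                                             Ptr{UInt8}(pointer(data)), n)
        return out
    end
    # Views and other array types: make a dense copy, then take its bytes.
    return collect(reinterpret(UInt8, vec(collect(data))))
end

dense = reshape(Float32.(1:24), 4, 6)   # plain Array, takes the fast path
strided = view(dense, 1:2:4, :)         # non-contiguous view, takes the fallback

dense_bytes = encode_uncompressed(dense)
strided_bytes = encode_uncompressed(strided)
```

With this shape, a strided view can never reach `pointer(data)`, so the "silently write incorrect bytes" failure mode described above is ruled out by construction.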
```julia
function pipeline_encode(p::V2Pipeline, data::AbstractArray, fill_value)
    # Fast path: NoCompressor + no filters is just a bulk byte copy. The
    # generic zcompress! path below funnels through `append!` of a reinterpret
    # view, which materialises the bytes one element at a time and dominates
    # CPU for uncompressed writes. We also skip the all-fill-value scan
    # because (a) it's an O(N) read of the chunk on every write and (b) the
    # common dense-write use case never benefits.
    if p.compressor isa NoCompressor && p.filters === nothing
        n = sizeof(data)
        out = Vector{UInt8}(undef, n)
        GC.@preserve out data unsafe_copyto!(pointer(out),
                                             Ptr{UInt8}(pointer(data)), n)
        return out
    end
    if fill_value !== nothing && all(isequal(fill_value), data)
        return nothing
    end
    # (excerpt ends here; the generic path follows)
```
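The all-fill-value scan that the fast path skips can be illustrated in isolation. This is a small sketch with a hypothetical helper name (`is_all_fill`); it only mirrors the condition shown in the hunk above and shows why `isequal` rather than `==` matters when the fill value is `NaN`.

```julia
# Hypothetical helper mirroring the condition in the hunk above.
is_all_fill(data, fill_value) =
    fill_value !== nothing && all(isequal(fill_value), data)

nan_chunk = fill(NaN32, 4, 4)

elided      = is_all_fill(nan_chunk, NaN32)     # true: isequal(NaN, NaN) holds
ieee_equal  = all(==(NaN32), nan_chunk)         # false: NaN == NaN is false
no_fill     = is_all_fill(nan_chunk, nothing)   # false: nothing to compare against
mixed_chunk = is_all_fill(Float32[NaN, 1], NaN32)  # false: one written cell
```

The scan has to read every element before it can answer `true`, which is the O(N) cost the commit message refers to.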

@mkitti I believe so. The agent that Greg set on this also had a different idea which might be cleaner, which I'll explore today.

This:
> The ZarrBenchmarks investigation also flagged: (a) `writeblock!` spawns two `@async` tasks via Channels even for single-chunk writes, (b) the read-modify-write `readtask` runs even when overwriting a chunk fully, (c) `pipeline_encode` allocates a fresh `Vector{UInt8}` per chunk for the compressed codec path. Each is good for another 5–15%. Happy to follow up with separate PRs if this one lands well.


Still a draft? There are some unresolved comments.

* use bulk copy in zcompress! fallback to avoid per-byte growth
`zcompress(data, NoCompressor())` returns a lazy `reinterpret(UInt8, data)`
view. The old `empty!` + `append!` fallback walked that view element by
element through `_growend!` / `push!`, materialising bytes one at a time —
roughly 67% of CPU time on uncompressed full-chunk V2 writes in profiling.
Replace with `resize!` + `copyto!` so the same path issues a single SIMD /
memcpy bulk copy. Real compressors (Blosc, Zlib, Zstd) are unaffected: they
already return a freshly-allocated `Vector{UInt8}` and `copyto!` handles
that case identically.
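A minimal sketch of the two fill strategies described above, using hypothetical function names and only Base. It shows that both produce the same bytes while the second sizes the buffer once. Whether `copyto!` actually lowers to a memcpy for a given source type depends on the Julia version, so treat the performance claim as the commit's measurement rather than something this snippet demonstrates.

```julia
# Old shape: clear the buffer, then grow it while walking the reinterpret view.
function fill_by_append!(buf::Vector{UInt8}, data::Array)
    empty!(buf)
    append!(buf, reinterpret(UInt8, vec(data)))
    return buf
end

# New shape: size the buffer once, then copy in a single call.
function fill_by_copy!(buf::Vector{UInt8}, data::Array)
    src = reinterpret(UInt8, vec(data))
    resize!(buf, length(src))
    copyto!(buf, src)
    return buf
end

big   = rand(Float32, 16, 16, 4)
small = rand(Float32, 4, 4)

scratch = UInt8[]
first_fill  = copy(fill_by_copy!(scratch, big))    # buffer grows to 4096 bytes
second_fill = copy(fill_by_copy!(scratch, small))  # same buffer shrinks to 64 bytes
```

Reusing `scratch` across calls is only safe when nothing else still holds a reference to the previous contents.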
Bytes-on-disk are bit-identical; full test suite (2499 tests) passes.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor + chunks=(Nx,Ny,Nz,1):
| size | baseline | patched | speed-up |
|---------------------|---------:|---------:|---------:|
| 128×128×16×50 50M | 702 MB/s | 995 MB/s | 1.42× |
| 256×256×32×50 400M | 799 MB/s |1530 MB/s | 1.92× |
| 512×512×32×50 1.5G | 776 MB/s |1104 MB/s | 1.42× |
Reads are unchanged (this fallback is not on the read path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* use bulk copy in NoCompressor zuncompress! to skip lazy-view walk
The generic `zuncompress!` fallback at the top of `Compressors.jl`
ends up at `copyto!(::Array{T}, ::ReinterpretArray)` when
`c isa NoCompressor`, since `zuncompress(bytes, ::NoCompressor, T)`
returns a lazy `reinterpret(T, bytes)` view. That `copyto!` walks
element by element at ~17 GB/s on Apple silicon vs. ~85 GB/s for
`unsafe_copyto!` on the same memory — a 5× gap. End-to-end V2 reads
absorb most of it via DiskArrays slicing, channel ferry, and storage
I/O, leaving ~25-99% throughput improvement depending on chunk size.
Add a `::NoCompressor`-dispatched `zuncompress!` method alongside
the existing per-compressor versions for Zstd / Zlib / Blosc. Mirror
image of the encode-side bulk copy that landed in the post-A1
`zcompress!`. Guards on `data::Array{T}` and `isbitstype(T)` keep
the fast path off `SenMissArray`, `MaxLengthString`, ragged
`Vector{T}` eltypes, and any non-contiguous input — those fall back
to the existing generic `copyto!` path.
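The decode-side guard can be sketched the same way. `decode_uncompressed!` is a hypothetical name, and the single-function `if` below stands in for the dispatch-based split the commit describes; it assumes the byte buffer holds exactly the destination's elements in host byte order.

```julia
# Hypothetical helper; the real change adds a `::NoCompressor` method instead.
function decode_uncompressed!(dest::AbstractArray{T}, bytes::Vector{UInt8}) where {T}
    if dest isa Array && isbitstype(T)
        n = sizeof(T) * length(dest)
        n == length(bytes) ||
            throw(DimensionMismatch("expected $n bytes, got $(length(bytes))"))
        # Dense isbits destination: one bulk copy.
        GC.@preserve dest bytes unsafe_copyto!(Ptr{UInt8}(pointer(dest)),
                                               pointer(bytes), n)
        return dest
    end
    # Anything else (views, wrapper arrays): generic element-wise copy.
    return copyto!(dest, reinterpret(T, bytes))
end

src   = reshape(Float32.(1:12), 4, 3)
bytes = collect(reinterpret(UInt8, vec(src)))

dense_out = decode_uncompressed!(Array{Float32}(undef, 4, 3), bytes)

backing = zeros(Float32, 4, 6)
decode_uncompressed!(view(backing, :, 1:2:6), bytes)   # fallback path
```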
Tests: 2499/2499 pass.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash
on the same wall-clock:
| size | A1 only read | A1 + B1 read | speed-up |
|---------------------|-------------:|-------------:|---------:|
| 128×128×16×50 50M | 3080 MB/s | 3884 MB/s | +26% |
| 256×256×32×50 400M | 3155 MB/s | 4130 MB/s | +31% |
| 512×512×32×50 1.5G | 2243 MB/s | 4455 MB/s | +99% |
Writes are unchanged (B1 doesn't touch the encode path).
Reference: Zarrs.jl on the same V2-uncompressed workload reads at
5166 / 4672 / 5418 MB/s, so B1 closes most of the read gap; the
1.6 GiB case goes from ~58% of Zarrs.jl to ~82%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* add getchunkarray_undef and skip dead zero-fill on full-overwrite paths
Both `readblock!` and `writeblock!` allocated the chunk-shaped scratch
buffer via `getchunkarray(z) = fill(_zero(eltype(z)), z.metadata.chunks)`,
which zero-fills the chunk before any other work touches it. On the
read side that fill is always clobbered: `pipeline_decode!` writes
every element, or the fill-value branch in `uncompress_raw!` calls
`fill!` itself. On the write side it's clobbered for full-chunk
overwrites (the common case for `z[:,:,:,t] = buf`-style appends) and
re-initialised by `resetbuffer!` for partial-chunk RMW.
Add `getchunkarray_undef(z::ZArray{T})` which returns
`Array{T}(undef, …)` for plain isbits eltypes and falls back to
`getchunkarray` for `Missing <: T` (the `SenMissArray` path: Blosc
and friends reject `Union{Missing,T}` directly, so the inner buffer
has to be a real `Array{T}`).
- `readblock!`: always use `getchunkarray_undef`. Decode-into and
fill-value branches both initialise every element.
- `writeblock!`: use `getchunkarray_undef` when
`z.metadata.fill_value !== nothing` (resetbuffer! handles init).
Keep the legacy zero-fill when `fill_value === nothing` so partial
writes to a fresh chunk still default un-written cells to zero.
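The buffer-selection rule in the two bullets above can be written out as a small sketch. All names here are hypothetical and operate on a bare element type and chunk shape rather than a `ZArray`. It also assumes that zero-filling means `zero` of the non-missing element type, which may differ from what Zarr.jl's `_zero` does.

```julia
# Zero-filled buffer (assumption: zero of the non-missing eltype).
zero_chunk(::Type{T}, chunks::Dims) where {T} =
    fill(zero(nonmissingtype(T)), chunks)

# Uninitialised buffer, except for Missing-capable eltypes.
function undef_chunk(::Type{T}, chunks::Dims) where {T}
    Missing <: T && return zero_chunk(T, chunks)
    return Array{T}(undef, chunks)
end

# Read side: every element is overwritten before it is read.
read_scratch(::Type{T}, chunks::Dims) where {T} = undef_chunk(T, chunks)

# Write side: keep the zero-fill only when there is no fill value.
write_scratch(::Type{T}, chunks::Dims, fill_value) where {T} =
    fill_value === nothing ? zero_chunk(T, chunks) : undef_chunk(T, chunks)

legacy  = write_scratch(Float32, (4, 4), nothing)   # zero-filled
fast    = write_scratch(Float32, (4, 4), NaN32)     # contents undefined
missing_buf = read_scratch(Union{Missing,Float32}, (2, 2))
```

The contents of `fast` must not be inspected before they are initialised; only its shape and element type are defined at this point.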
Port of the V2-side hunks from #272 (verbatim, with
the same `Missing <: T` carve-out the PR added in commit 75bea38 to
fix a Blosc rejection regression).
Tests: 2499/2499 pass — including the SenMissArray-exercising
`Fillvalue as missing`, `getindex/setindex` (amiss),
`MaxLengthString large-chunk read path`, and `ragged arrays` tests.
Measured on Apple M2 Ultra, Julia 1.12.6, V2 + NoCompressor +
chunks=(Nx,Ny,Nz,1), 1 warm-up + 7 timed repeats, A/B via git stash:
| size | B1 only (R) | B1+A2 (R) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 3466 MB/s | 3866 MB/s | +12% |
| 256×256×32×50 400M | 3589 MB/s | 4169 MB/s | +16% |
| 512×512×32×50 1.5G | ~3000 MB/s | ~3500 MB/s | noisy |
Write side smaller (most write time is storage-bound, not memset):
| size | B1 only (W) | B1+A2 (W) | speed-up |
|---------------------|------------:|-----------:|---------:|
| 128×128×16×50 50M | 1148 MB/s | 1203 MB/s | +5% |
| 256×256×32×50 400M | 1327 MB/s | 1354 MB/s | +2% |
| 512×512×32×50 1.5G | 1132 MB/s | 1329 MB/s | +17% |
The 1.5 GiB row is the noisiest across runs because chunks no longer
fit in L3 — exact numbers vary with cache state at run start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* add CHANGELOG entries for #280 (V2 perf)
One bullet per commit on the PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* shrink comments around zcompress! / zuncompress! fallbacks
The multi-paragraph rationale comments are now stale post-landing; one line each is enough to flag why bulk copy beats append! / generic ReinterpretArray copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* skip readtask/writetask ferry on single-chunk full-overwrite writes
The slow writeblock! path spawns a readtask and a writetask connected by
0-buffered channels for every call. Profiling the 400 MiB headline workload
showed 93% of write CPU pinned in __psynch_cvwait — the channel ferry
synchronisation cost — for the dominant case of writing one full chunk at
a time (`z[:,:,:,t] = buf`).
Add a fast path at the top of writeblock! that, when the call touches
exactly one chunk and `ain` is a plain Array matching the chunk shape and
eltype, encodes `ain` directly via `compress_raw` and calls
`store_writechunk`/`store_deletechunk` synchronously. Fill-value elision is
preserved (delete-if-initialised when encode returns `nothing`).
Restricted to non-Missing element types so the SenMissArray indirection
required by the codec pipeline still goes through the slow path.
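To show the shape of the two code paths, the sketch below contrasts a task-plus-unbuffered-channel hand-off with a direct synchronous call. Everything in it is a stand-in: a `Dict` plays the store, `encode` plays the codec pipeline, and the function names are hypothetical. It illustrates the structure only and makes no claim about the real overhead.

```julia
# Stand-in for the codec pipeline: the raw bytes of the chunk.
encode(chunk::Array) = collect(reinterpret(UInt8, vec(chunk)))

# Ferry shape: a writer task receives the encoded chunk over a 0-buffered channel.
function write_via_channel!(store::Dict{String,Vector{UInt8}}, key::String, chunk::Array)
    ch = Channel{Vector{UInt8}}(0)
    writer = @async begin
        store[key] = take!(ch)      # blocks until the producer hands over
    end
    put!(ch, encode(chunk))         # blocks until the writer takes
    wait(writer)
    close(ch)
    return store
end

# Fast-path shape: encode and store synchronously, no tasks involved.
function write_direct!(store::Dict{String,Vector{UInt8}}, key::String, chunk::Array)
    store[key] = encode(chunk)
    return store
end

chunk = rand(Float32, 8, 8)
ferried = write_via_channel!(Dict{String,Vector{UInt8}}(), "0.0", chunk)
direct  = write_direct!(Dict{String,Vector{UInt8}}(), "0.0", chunk)
```

For a single chunk the two produce the same store contents; the ferry only pays off when reading, encoding, and writing of different chunks can overlap.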
Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, 50 timesteps):
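| size       | write before (MiB/s) | write after (MiB/s) | read before (MiB/s) | read after (MiB/s) | write speed-up |
|------------|---------------------:|--------------------:|--------------------:|-------------------:|---------------:|
| 128²×16×50 | 1235 | 2228 | 3864 | ~3500 | +80% |
| 256²×32×50 | 1392 | 2160 | 3870 | ~3300 | +55% |
| 512²×32×50 | 1339 | 2240 | 4651 | ~3500 | +67% |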
Read variance in the combined bench is system noise; the isolated
read-only bench shows reads unchanged.
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* skip readtask ferry on single-chunk full-overwrite reads
Mirrors the writeblock! fast path: for the common case of reading exactly
one full chunk into an Array of matching shape and eltype, bypass the
readtask + 0-buffered channel and the chunk-shaped scratch buffer. Read
the compressed bytes synchronously via `store_readchunk` and decode
directly into `aout`, eliminating both the channel-ferry sync wait (80%
of read CPU per profiling) and the per-call scratch allocation + final
copy (the 13% GC pressure source).
Bench (M2 Ultra, ZS_REPEATS=3, V2 + NoCompressor, isolated read-only):
| size       | before     | after       | speed-up |
|------------|-----------:|------------:|---------:|
| 128²×16×50 | 1528 MiB/s | ~5300 MiB/s | ~3.5× |
| 256²×32×50 | 3128 MiB/s | ~4900 MiB/s | ~1.6× |
| 512²×32×50 | 3248 MiB/s | ~4800 MiB/s | ~1.5× |
Same guards as the write fast path: single chunk touched, `aout isa Array`,
non-Missing eltype, eltype and shape match the chunk. Otherwise falls
through to the existing channel-based path.
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* factor singlechunk_fastpath guard out of readblock!/writeblock!
The two A3 fast paths share the same eligibility check (single chunk,
plain Array of matching shape and eltype, non-Missing T). Extract that
into a helper that returns the chunk index or `nothing`, so each
caller's fast-path body shrinks to the I/O calls.
The previous version also re-derived `indranges` to verify full-chunk
coverage; that check is redundant. If `size(arr) == chunks` and
`length(blockr) == 1`, the slice extent equals the chunk extent within
one chunk, which forces it to align with that chunk's range — otherwise
the slice would straddle into a neighbouring chunk and bump
`length(blockr)`. The helper's docstring states this so the elision
is not mysterious.
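The alignment argument above can be checked with a small standalone sketch. The names `blockrange` and `singlechunk_index` are hypothetical and the signature is an assumption, not the helper from the commit. It also assumes 1-based indexing and that `size(arr)` equals the extent of the requested slice, which the caller guarantees.

```julia
# Chunk indices touched along one dimension by index range `r` with chunk length `c`.
blockrange(r::AbstractUnitRange{Int}, c::Int) = fld1(first(r), c):fld1(last(r), c)

# Returns the index of the single chunk fully covered by `inds`, or `nothing`.
function singlechunk_index(arr, ::Type{T}, chunks::NTuple{N,Int},
                           inds::NTuple{N,AbstractUnitRange{Int}}) where {T,N}
    Missing <: T && return nothing            # SenMissArray path stays on the slow path
    arr isa Array{T,N} || return nothing      # plain Array with matching eltype
    size(arr) == chunks || return nothing     # chunk-shaped
    blockr = map(blockrange, inds, chunks)
    all(r -> length(r) == 1, blockr) || return nothing
    return CartesianIndex(map(first, blockr))
end

buf = zeros(Float32, 4, 4)
aligned   = singlechunk_index(buf, Float32, (4, 4), (5:8, 1:4))
straddles = singlechunk_index(buf, Float32, (4, 4), (3:6, 1:4))
from_view = singlechunk_index(view(buf, :, :), Float32, (4, 4), (5:8, 1:4))
```

The `3:6` case has the right extent but touches chunks 1 and 2 along the first dimension, so `length(blockr[1]) == 2` rejects it without any separate coverage check.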
Tests: 2499/2499 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* zero-copy chunk write for V2 NoCompressor + no filters
The single-chunk fastpath in writeblock! went through compress_raw →
pipeline_encode, which allocates a chunk-sized Vector{UInt8} and
copyto!'s the bytes in. For the V2-uncompressed-no-filters case the
encoded bytes are just the host-endian bit representation of the input,
so reinterpret(UInt8, ain) is a valid byte source — pass that straight
to the store and skip the allocation + memcpy.
Factor the fastpath body into write_singlechunk_fastpath! so the
specialization can dispatch on metadata type (V2 + NoCompressor +
Nothing filters) without touching the slow path. The slow path
continues to materialize an owned Vector{UInt8} since its scratch
buffer is reused across iterations and aliasing would race the
writetask.
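The aliasing concern is easy to see with Base alone. In this sketch an `IOBuffer` stands in for the store, which is an assumption made for illustration: the reinterpret view shares memory with the caller's array, so it is only a valid byte source if the bytes are consumed before the array changes.

```julia
ain = Float32[1, 2, 3, 4]

aliased = reinterpret(UInt8, ain)   # zero-copy: same memory as `ain`
owned   = collect(aliased)          # owned copy: independent of `ain`

# Consuming the view immediately (here: writing to an in-memory sink) is fine.
sink = IOBuffer()
write(sink, aliased)
written = take!(sink)

# Mutating the source afterwards changes the view but not the owned copy.
ain[1] = 0f0
first_aliased = aliased[1:4]
first_owned   = owned[1:4]
```

This is why a synchronous single-chunk write can hand the view straight to the store, while a path that reuses its scratch buffer across iterations needs the owned copy.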
ZarrBenchmarks v2 uncompressed write throughput (256x256x32x50 chunks,
NVMe on Btrfs):
- sequential: 1045 → 2513 MB/s (+141%), 95-101% of raw write() ceiling
- 8 threads: 1294 → 2444 MB/s (+89%)
- 512^2 chunk-size regression eliminated (the prior allocation
dominated at large chunks).
Tests cover dispatch (which() inspection), round-trip across rank 0-4
and multiple bitstypes, all-fill-value chunk elision, and caller-array
aliasing safety.
* consolidate #280 CHANGELOG entries
Per @lazarusA's PR review: collapse the five repeated [#280] entries
into one bullet with the sub-items as a list, matching the v0.10.0
#241 formatting. Adds the zero-copy NoCompressor fastpath as a new
sub-item.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Lazaro Alonso <lazarus.alon@gmail.com>
Summary
Targeted optimisation of the Zarr v2 uncompressed write path: ~2.4× speedup on `NoCompressor` writes by replacing an element-wise `append!`-driven byte materialisation with a single `unsafe_copyto!`, and by skipping a redundant zero-fill of the chunk scratch buffer on common paths.

Bytes on disk are bit-identical to current `master`. No public API change.

Scope note (updated): the headline fix targets `pipeline_encode(::V2Pipeline, ...)`. The V3 encode path goes through `Codecs.V3Codecs.codec_encode(::BytesCodec, ...)`, which already does a bulk `reinterpret(UInt8, vec(data)) |> collect`, so the V2 antipattern this patch fixes doesn't exist in V3. V3 still gains a small amount from the read-side `getchunkarray_undef` change, but the write number is essentially unchanged. The V3 hot path is a separate problem, out of scope for this PR.

Measured impact
256×256×32 `Float32` chunks × 50 timesteps, single-threaded, `NoCompressor`, Apple M2 Ultra, APFS internal NVMe.

The V2 fix lands Zarr.jl V2 write throughput at parity with the Rust-backed Zarrs.jl V3 (~1300 MiB/s on the same workload). For users who want the fastest write path today, this PR makes Zarr.jl V2 + `NoCompressor` competitive without a Rust dependency.

Full investigation, profile, and four-way comparison plots (baseline / this patch / Prop B prototype / Zarrs.jl), split by V2 and V3: https://github.com/glwagner/ZarrBenchmarks

What's in the patch
- `src/pipeline.jl`: `pipeline_encode(::V2Pipeline, ...)` gains a fast path for `NoCompressor` + no filters.

  This bypasses the generic `zcompress!` path, which for `NoCompressor` was funnelling through `append!(compressed, reinterpret(UInt8, data))`. `append!` of a reinterpret view materialises bytes one element at a time through `_growend!` / `push!`, and a flat profile showed this path consuming 67% of V2 write CPU. The new path is a single bulk memcpy.

  The fast path also skips the `all(isequal(fill_value), data)` scan, an O(N) walk of the chunk on every write whose purpose is to avoid writing all-fill-value chunks. The all-codec fallback path retains that scan, so chunk-elision behaviour for less-common cases is unchanged.
- `src/ZArray.jl`: adds a `getchunkarray_undef(z)` variant that skips the `fill(_zero(eltype(z)), chunks)` zero-fill. Used in:
  - `readblock!`: always safe; the buffer is fully overwritten by `pipeline_decode!` (which `copyto!`s into it) or `fill!` (via `uncompress_raw!`'s fill-value fallback) before any read.
  - `writeblock!`: use undef when `fill_value !== nothing` (the `resetbuffer!` path inside the loop handles initialisation explicitly). When `fill_value === nothing` we keep the zero-filled buffer to preserve the legacy "unwritten cells in a partial-chunk write default to zero" behaviour.

  `getchunkarray_undef` defers to the original `getchunkarray` for `>:Missing` element types: the codec pipeline requires the buffer to be the `isbits` inner of a `SenMissArray`, not an `Array{Union{Missing,T}}` directly (Blosc and friends reject non-isbits eltypes). The undef fast path applies only to plain isbits eltypes.

Test plan
- `Pkg.test("Zarr")` passes: 2593 / 2593 on this branch (with #273, "Fix CondaPkg branch in CI", applied to fix the unrelated test-environment instantiation issue).
- `{NoCompressor, ZstdCompressor, BloscCompressor}` × `{fill_value: nothing, zero, NaN}` × `{1D, 2D, 3D, 4D shapes}` × `{Float32, Float64, Int32}`.
- `fill_value === nothing` zero-fallthrough behaviour.
- `length(data) * sizeof(T)`, no overhead.
- `>:Missing` sentinel-value paths verified.

CI dependency
The CI on this PR currently fails at the pre-test "Generate Julia and Python v3 fixtures" step. That failure is unrelated to this PR: `test/Project.toml` pins `CondaPkg` to a branch on `JamesWrigley/CondaPkg.jl` that was deleted upstream between PR #270 (May 8, last passing master CI) and now. #273 fixes it. Once #273 lands I'll rebase, or I'm happy to absorb the `test/Project.toml` change into this PR if that's preferred.

Followups
The ZarrBenchmarks investigation also identified, in rough order of expected impact:
1. `writeblock!` spawns two `@async` tasks via Channels even for single-chunk writes (V2 + V3).
2. `readtask` runs even when overwriting a chunk fully (V2 + V3).
3. `pipeline_encode` allocates a fresh `Vector{UInt8}` per chunk for the compressed codec path (V2 + V3).
4. `BytesCodec.codec_encode` does `reinterpret(...) |> collect`, which is bulk-copy fast but still allocates a fresh `Vector{UInt8}` per call; a scratch buffer could be plumbed through.
5. `pipeline_decode!` `copyto!(output, arr)`s a final intermediate array; the codec chain could decode in-place into `output` for the common single-codec case.

Each is plausibly 5–15% on its own and they compose. Happy to follow up with separate PRs once this one lands.