Summary
While benchmarking Zarr v3 write/read performance, I found two optimization opportunities that appear to substantially improve uncompressed V3 performance, especially for sequential full-chunk writes.
This is related to, but distinct from, the V2 NoCompressor bulk-copy optimization in #272. The V3-specific gap seems to come from the V3 codec pipeline and the generic chunk write path.
All code links below point at master @ 03270b2.
Optimization candidates
1. Fast path V3 BytesCodec encode/decode
Current encode. The V3 BytesCodec materializes bytes by reinterpreting + collecting an entire Vector:
|
function codec_encode(c::BytesCodec, data::AbstractArray) |
|
if _needs_bswap(c.endian) |
|
return reinterpret(UInt8, bswap.(vec(data))) |> collect |
|
else |
|
return reinterpret(UInt8, vec(data)) |> collect |
|
end |
|
end |
The reinterpret(UInt8, vec(data)) |> collect route walks the array element-wise to build the output Vector{UInt8} rather than issuing one memcpy-equivalent copy. For native-endian dense Arrays with isbitstype element types this can be replaced by a single bulk copy.
Proposed encode replacement. Add a method specialized on ::Array (not AbstractArray) that uses a bulk unsafe_copyto! for the native-endian path, leaving the bswap path unchanged:
function _copy_array_bytes(data::Array{T}) where T
isbitstype(T) || return reinterpret(UInt8, vec(data)) |> collect
out = Vector{UInt8}(undef, sizeof(data))
GC.@preserve out data unsafe_copyto!(pointer(out),
Ptr{UInt8}(pointer(data)),
length(out))
return out
end
function codec_encode(c::BytesCodec, data::Array)
if _needs_bswap(c.endian)
return reinterpret(UInt8, bswap.(vec(data))) |> collect
else
return _copy_array_bytes(data)
end
end
The original codec_encode(c::BytesCodec, data::AbstractArray) stays in place as the fallback (views, lazy arrays, etc.); the new ::Array method just wins by dispatch for the common dense case.
Current decode. Decode also allocates an intermediate typed array and then copyto!s the destination:
|
function codec_decode(c::BytesCodec, encoded::Vector{UInt8}, ::Type{T}, shape::NTuple{N,Int}; fill_value=nothing) where {T, N} |
|
arr = collect(reinterpret(T, encoded)) |
|
if _needs_bswap(c.endian) |
|
arr = bswap.(arr) |
|
end |
|
return reshape(arr, shape) |
|
end |
…and that intermediate is consumed by pipeline_decode!:
|
function pipeline_decode!(p::V3Pipeline, output::AbstractArray, compressed::Vector{UInt8}; fill_value=nothing) |
|
# Phase 3 reverse: bytes->bytes codecs (reverse order) |
|
bytes = compressed |
|
for codec in reverse(collect(p.bytes_bytes)) |
|
bytes = Codecs.V3Codecs.codec_decode(codec, bytes) |
|
end |
|
# Phase 2 reverse: bytes->array codec |
|
# Compute the intermediate shape — the shape data has after array_array encoding |
|
intermediate_shape = foldl( |
|
(sz, codec) -> Codecs.V3Codecs.encoded_shape(codec, sz), |
|
p.array_array; init=size(output) |
|
) |
|
arr = Codecs.V3Codecs.codec_decode(p.array_bytes, bytes, eltype(output), intermediate_shape; fill_value) |
|
# Phase 1 reverse: array->array codecs (reverse order) |
|
for codec in reverse(collect(p.array_array)) |
|
arr = Codecs.V3Codecs.codec_decode(codec, arr) |
|
end |
|
copyto!(output, arr) |
|
return output |
|
end |
For the common shape — V3Pipeline((), BytesCodec, ()), native endian, isbitstype element — the intermediate array is pure overhead. We can copy bytes straight into the destination.
Proposed decode replacement. Add a pipeline_decode! specialization for the bytes-only pipeline that does an in-place bulk copy, with a safe fallback for the non-bits / bswap case:
function pipeline_decode!(p::V3Pipeline{Tuple{},AB,Tuple{}}, output::Array{T},
compressed::Vector{UInt8}; fill_value=nothing) where {AB<:Codecs.V3Codecs.BytesCodec,T}
if !isbitstype(T) || Codecs.V3Codecs._needs_bswap(p.array_bytes.endian)
arr = Codecs.V3Codecs.codec_decode(p.array_bytes, compressed, T, size(output); fill_value)
copyto!(output, arr)
return output
end
length(compressed) == sizeof(output) || throw(DimensionMismatch(
"Encoded byte length $(length(compressed)) does not match output byte size $(sizeof(output))"
))
GC.@preserve output compressed unsafe_copyto!(Ptr{UInt8}(pointer(output)),
pointer(compressed),
length(compressed))
return output
end
The generic pipeline_decode!(::V3Pipeline, …) stays unchanged and handles any pipeline with array→array codecs, bytes→bytes codecs, or non-BytesCodec array→bytes codecs.
Behavior to preserve: fill-chunk elision. The existing V3 encode pipeline returns nothing when every element of the chunk equals fill_value, so the storage layer can skip writing the chunk entirely. In local testing, dropping that check failed this existing test:
|
@testset "V3Pipeline fill_value returns nothing" begin |
|
bytes_codec = Zarr.Codecs.V3Codecs.BytesCodec() |
|
pipeline = Zarr.V3Pipeline((), bytes_codec, ()) |
|
data = fill(Int32(0), 4) |
|
encoded = Zarr.pipeline_encode(pipeline, data, Int32(0)) |
|
@test encoded === nothing |
|
end |
A matching pipeline_encode specialization keeps the elision but skips the rest of the generic loop:
function pipeline_encode(p::V3Pipeline{Tuple{},AB,Tuple{}}, data::Array, fill_value) where {AB<:Codecs.V3Codecs.BytesCodec}
if fill_value !== nothing && all(isequal(fill_value), data)
return nothing
end
return Codecs.V3Codecs.codec_encode(p.array_bytes, data)
end
2. Direct full-chunk overwrite path
Current write path. writeblock! is shared by partial and full-chunk writes. For every chunk it touches, it reads the existing chunk (or fills the buffer with fill_value), copies user data into a scratch chunk buffer, routes through a read/write channel pair, and then encodes the scratch buffer — even when the caller is overwriting an entire chunk and none of that machinery is needed:
|
function writeblock!(ain::AbstractArray{<:Any,N}, z::ZArray{<:Any, N}, r::CartesianIndices{N}) where {N} |
|
|
|
z.writeable || error("Can not write to read-only ZArray") |
|
input_base_offsets = map(i->first(i)-1,r.indices) |
|
# Determines which chunks are affected |
|
blockr = CartesianIndices(map(trans_ind, r.indices, z.metadata.chunks)) |
|
# Allocate array of the size of a chunks where uncompressed data can be held |
|
#bufferdict = IdDict((current_task()=>getchunkarray(z),)) |
|
a = getchunkarray(z) |
|
# Now loop through the chunks |
|
readchannel = Channel{Pair{eltype(blockr),Union{Nothing,Vector{UInt8}}}}(channelsize(z.storage)) |
|
|
|
readtask = @async begin |
|
read_items!(z.storage, readchannel, z.metadata.chunk_key_encoding, z.path, blockr) |
|
end |
|
bind(readchannel,readtask) |
|
|
|
writechannel = Channel{Pair{eltype(blockr),Union{Nothing,Vector{UInt8}}}}(channelsize(z.storage)) |
|
|
|
writetask = @async begin |
|
write_items!(z.storage, writechannel, z.metadata.chunk_key_encoding, z.path, blockr) |
|
end |
|
bind(writechannel,writetask) |
|
|
|
try |
|
for i in 1:length(blockr) |
|
|
|
bI,chunk_compressed = take!(readchannel) |
|
|
|
current_chunk_offsets = map((s,i)->s*(i-1),size(a),Tuple(bI)) |
|
|
|
indranges = map(boundint,r.indices,size(a),current_chunk_offsets) |
|
|
|
if isnothing(chunk_compressed) || (length.(indranges) != size(a)) |
|
resetbuffer!(z.metadata.fill_value,a) |
|
end |
|
|
|
curchunk = if length.(indranges) != size(a) |
|
view(a,dotminus.(indranges,current_chunk_offsets)...) |
|
else |
|
a |
|
end |
|
|
|
if chunk_compressed !== nothing |
|
uncompress_raw!(a,z,chunk_compressed) |
|
end |
|
|
|
curchunk .= view(ain,dotminus.(indranges,input_base_offsets)...) |
|
|
|
put!(writechannel,bI=>compress_raw(maybeinner(a),z)) |
|
nothing |
|
end |
|
finally |
|
close(readchannel) |
|
close(writechannel) |
|
end |
|
ain |
|
end |
The terminal store_writechunk it eventually reaches is just:
|
store_writechunk(s::AbstractStore, v, p, i::CartesianIndex, e::AbstractChunkKeyEncoding) = s[p, citostring(e, i)] = v |
Proposed replacement. Add a guarded fast path that bypasses the channels and scratch buffer when the write is exactly one full chunk and the input is itself chunk-shaped:
function _same_indices_as_axes(ranges, a::AbstractArray{<:Any,N}) where N
all(ntuple(i -> first(ranges[i]) == first(axes(a, i)) &&
last(ranges[i]) == last(axes(a, i)), N))
end
function write_full_chunk_direct!(ain::AbstractArray{<:Any,N}, z::ZArray{<:Any,N},
r::CartesianIndices{N}, blockr, input_base_offsets) where N
length(blockr) == 1 || return false # exactly one chunk touched
Missing <: eltype(ain) && return false # SenMissArray path stays generic
bI = first(blockr)
current_chunk_offsets = map((s, i) -> s * (i - 1), z.metadata.chunks, Tuple(bI))
indranges = map(boundint, r.indices, z.metadata.chunks, current_chunk_offsets)
length.(indranges) == z.metadata.chunks || return false # write spans whole chunk
input_ranges = dotminus.(indranges, input_base_offsets)
size(ain) == z.metadata.chunks || return false # input is chunk-shaped
_same_indices_as_axes(input_ranges, ain) || return false # …and aligned to it
data_encoded = compress_raw(maybeinner(ain), z)
if data_encoded === nothing
# fill-chunk elision: delete any existing chunk on disk
if store_isinitialized(z.storage, z.path, bI, z.metadata.chunk_key_encoding)
store_deletechunk(z.storage, z.path, bI, z.metadata.chunk_key_encoding)
end
else
store_writechunk(z.storage, data_encoded, z.path, bI, z.metadata.chunk_key_encoding)
end
return true
end
…and an early return at the top of writeblock!:
function writeblock!(ain::AbstractArray{<:Any,N}, z::ZArray{<:Any,N}, r::CartesianIndices{N}) where {N}
z.writeable || error("Can not write to read-only ZArray")
input_base_offsets = map(i -> first(i) - 1, r.indices)
blockr = CartesianIndices(map(trans_ind, r.indices, z.metadata.chunks))
if write_full_chunk_direct!(ain, z, r, blockr, input_base_offsets)
return ain
end
# …existing channel/scratch-buffer path unchanged from here…
end
Every guard (length(blockr) == 1, full-chunk span, chunk-shaped input, aligned ranges, no Missing) falls back to the current implementation when violated. Partial writes, multi-chunk writes, missing-element arrays, and non-exact ranges all stay on the existing path.
A small companion change avoids paying a fill! for the scratch buffer when the existing path is taken but fill_value is set (the buffer gets initialised explicitly later anyway):
function getchunkarray_undef(z::ZArray{T}) where {T}
Missing <: T && return getchunkarray(z) # SenMissArray inner buffer needs zero-init
return Array{T}(undef, z.metadata.chunks)
end
# in writeblock!:
a = z.metadata.fill_value === nothing ? getchunkarray(z) : getchunkarray_undef(z)
This is independent of the fast-path guard and could be split out further if preferred.
Local benchmark results
Single-threaded, V3, NoCompressor, one chunk per timestep, 1 warm-up + 3 timed repeats. Throughput is MiB/s (median of the 3 timed repeats).
| Size |
Optimized Zarr.jl write |
Zarrs.jl write |
Write ratio |
Optimized Zarr.jl read |
Zarrs.jl read |
Read ratio |
| 64x64x16x50 |
382 |
360 |
1.06x |
1056 |
1879 |
0.56x |
| 128x128x16x50 |
1489 |
931 |
1.60x |
1509 |
2418 |
0.62x |
| 128x128x32x50 |
2331 |
1145 |
2.04x |
1608 |
2122 |
0.76x |
| 256x256x16x50 |
2974 |
1282 |
2.32x |
1037 |
2177 |
0.48x |
| 256x256x32x50 |
3577 |
1075 |
3.33x |
1484 |
2367 |
0.63x |
| 256x256x64x50 |
3709 |
1285 |
2.89x |
1613 |
1736 |
0.93x |
| 512x512x32x50 |
3727 |
1253 |
2.97x |
1517 |
1962 |
0.77x |
| 512x512x64x20 |
3828 |
1346 |
2.84x |
1526 |
1855 |
0.82x |
| 512x512x64x50 |
3744 |
1253 |
2.99x |
1664 |
1977 |
0.84x |
Across these sizes, the optimized local branch was about 2.3x geomean faster than Zarrs.jl on writes. Reads improved from the V3 bytes fast path, but Zarrs.jl was still faster overall on reads.
Validation so far
A local experiment branch preserving fill-chunk elision passed the V3 codec suite:
test/v3_codecs.jl: 182/182 passed
Also tested full-write and partial-write smoke cases to verify the exact full-chunk path falls back correctly for partial writes.
Proposed next step
If this direction sounds reasonable, I think this should probably be split into two PRs:
- V3
BytesCodec encode/decode fast paths (the ::Array method on codec_encode, plus the pipeline_encode / pipeline_decode! specializations for V3Pipeline{Tuple{},BytesCodec,Tuple{}}).
- Exact full-chunk overwrite fast path in
writeblock! (the write_full_chunk_direct! guard).
The first is smaller and lower risk; the second is the larger V3 write-side win.
Summary
While benchmarking Zarr v3 write/read performance, I found two optimization opportunities that appear to substantially improve uncompressed V3 performance, especially for sequential full-chunk writes.
This is related to, but distinct from, the V2
NoCompressorbulk-copy optimization in #272. The V3-specific gap seems to come from the V3 codec pipeline and the generic chunk write path.All code links below point at master @
03270b2.Optimization candidates
1. Fast path V3
BytesCodecencode/decodeCurrent encode. The V3
BytesCodecmaterializes bytes by reinterpreting + collecting an entireVector:Zarr.jl/src/Codecs/V3/V3.jl
Lines 626 to 632 in 03270b2
The
reinterpret(UInt8, vec(data)) |> collectroute walks the array element-wise to build the outputVector{UInt8}rather than issuing onememcpy-equivalent copy. For native-endian denseArrays withisbitstypeelement types this can be replaced by a single bulk copy.Proposed encode replacement. Add a method specialized on
::Array(notAbstractArray) that uses a bulkunsafe_copyto!for the native-endian path, leaving the bswap path unchanged:The original
codec_encode(c::BytesCodec, data::AbstractArray)stays in place as the fallback (views, lazy arrays, etc.); the new::Arraymethod just wins by dispatch for the common dense case.Current decode. Decode also allocates an intermediate typed array and then
copyto!s the destination:Zarr.jl/src/Codecs/V3/V3.jl
Lines 634 to 640 in 03270b2
…and that intermediate is consumed by
pipeline_decode!:Zarr.jl/src/pipeline.jl
Lines 33 to 52 in 03270b2
For the common shape —
V3Pipeline((), BytesCodec, ()), native endian,isbitstypeelement — the intermediate array is pure overhead. We can copy bytes straight into the destination.Proposed decode replacement. Add a
pipeline_decode!specialization for the bytes-only pipeline that does an in-place bulk copy, with a safe fallback for the non-bits / bswap case:The generic
pipeline_decode!(::V3Pipeline, …)stays unchanged and handles any pipeline with array→array codecs, bytes→bytes codecs, or non-BytesCodecarray→bytes codecs.Behavior to preserve: fill-chunk elision. The existing V3 encode pipeline returns
nothingwhen every element of the chunk equalsfill_value, so the storage layer can skip writing the chunk entirely. In local testing, dropping that check failed this existing test:Zarr.jl/test/v3_codecs.jl
Lines 336 to 342 in 03270b2
A matching
pipeline_encodespecialization keeps the elision but skips the rest of the generic loop:2. Direct full-chunk overwrite path
Current write path.
writeblock!is shared by partial and full-chunk writes. For every chunk it touches, it reads the existing chunk (or fills the buffer withfill_value), copies user data into a scratch chunk buffer, routes through a read/write channel pair, and then encodes the scratch buffer — even when the caller is overwriting an entire chunk and none of that machinery is needed:Zarr.jl/src/ZArray.jl
Lines 205 to 262 in 03270b2
The terminal
store_writechunkit eventually reaches is just:Zarr.jl/src/Storage/Storage.jl
Line 84 in 03270b2
Proposed replacement. Add a guarded fast path that bypasses the channels and scratch buffer when the write is exactly one full chunk and the input is itself chunk-shaped:
…and an early return at the top of
writeblock!:Every guard (
length(blockr) == 1, full-chunk span, chunk-shaped input, aligned ranges, noMissing) falls back to the current implementation when violated. Partial writes, multi-chunk writes, missing-element arrays, and non-exact ranges all stay on the existing path.A small companion change avoids paying a
fill!for the scratch buffer when the existing path is taken butfill_valueis set (the buffer gets initialised explicitly later anyway):This is independent of the fast-path guard and could be split out further if preferred.
Local benchmark results
Single-threaded, V3,
NoCompressor, one chunk per timestep, 1 warm-up + 3 timed repeats. Throughput is MiB/s (median of the 3 timed repeats).Across these sizes, the optimized local branch was about 2.3x geomean faster than Zarrs.jl on writes. Reads improved from the V3 bytes fast path, but Zarrs.jl was still faster overall on reads.
Validation so far
A local experiment branch preserving fill-chunk elision passed the V3 codec suite:
Also tested full-write and partial-write smoke cases to verify the exact full-chunk path falls back correctly for partial writes.
Proposed next step
If this direction sounds reasonable, I think this should probably be split into two PRs:
BytesCodecencode/decode fast paths (the::Arraymethod oncodec_encode, plus thepipeline_encode/pipeline_decode!specializations forV3Pipeline{Tuple{},BytesCodec,Tuple{}}).writeblock!(thewrite_full_chunk_direct!guard).The first is smaller and lower risk; the second is the larger V3 write-side win.