Codec-based alternative to attribute-based scale-offset encoding #11280
Description
What is your issue?
Zarr V3 has two new codec specs that might be of interest to xarray: a codec for casting arrays from one data type to another (cast_value) and a codec for applying a scale + offset transformation to an array (scale_offset). Together, these codecs can express the scale-offset encoding commonly used for compressing floating-point measurements as ints before serialization, which is currently implemented here in this library. The cast_value codec also supports remapping scalars within the same data type (e.g., mapping NaN to 0, and the reverse), which might also be useful in xarray.
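To make the transformation concrete, here is a minimal NumPy sketch of the scale-offset round trip that the scale_offset + cast_value codec pair would perform (this is just the arithmetic, not the codec API; the variable names are illustrative):

```python
import numpy as np

# CF-style scale/offset round trip:
#   decoded = raw * scale_factor + add_offset
scale_factor = 0.01
add_offset = 273.15

temps = np.array([273.16, 285.50, 301.23])  # float measurements

# encode: float -> int16 (scale_offset transform, then a narrowing cast)
raw = np.round((temps - add_offset) / scale_factor).astype(np.int16)

# decode: int16 -> float
decoded = raw.astype(np.float64) * scale_factor + add_offset

# lossy, but accurate to within half a quantization step
assert np.allclose(decoded, temps, atol=scale_factor / 2)
```

The cast step is what makes the encoding lossy: precision beyond one quantization step (here 0.01) is discarded in exchange for a 2-byte integer representation.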
See this gist for a demo. It shows that the combination of a scale_offset codec (defined inline) and the cast_value codec produces the same results as the CF-style zarr attributes plus xarray's encoding/decoding logic.
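For comparison, this is roughly what the attribute-based side of that equivalence looks like today: xarray's CF decoding turns raw ints plus `scale_factor`/`add_offset` attributes back into floats (a hedged sketch; the specific values are illustrative):

```python
import numpy as np
import xarray as xr

# raw int16 data with CF packing attributes attached
raw = np.array([1, 1235, 2808], dtype=np.int16)
ds = xr.Dataset(
    {"t": ("x", raw, {"scale_factor": 0.01, "add_offset": 273.15})}
)

# xarray applies raw * scale_factor + add_offset on decode
decoded = xr.decode_cf(ds)
print(decoded["t"].values)  # floats near [273.16, 285.50, 301.23]
```

The codec-based approach would move exactly this transformation out of the array attributes and into the zarr codec pipeline.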
the advantage of pushing this logic down into Zarr itself is that it makes the encoding portable across zarr implementations, and it takes load off the zarr array attributes, which IMO are not well suited for declaring how array scalars should be encoded / decoded.
A disadvantage is that you need implementations of the two new codecs. I cooked up an implementation of the cast_value codec in the cast-value.py package; there's a default numpy implementation and an optional rust implementation, and the rust version generally has better CPU and memory performance. I opened a PR in zarr-python to add the numpy cast_value implementation and a scale_offset codec (which is dead simple), but it didn't get any traction so I closed it 😆
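To give a sense of how simple the scale_offset part is, here is a minimal sketch of what its encode/decode body could look like (class and method names are illustrative, not the zarr-python codec interface):

```python
import numpy as np

class ScaleOffset:
    """Illustrative scale_offset codec body: a pure affine transform.

    A real codec would also carry dtype/serialization plumbing; the
    narrowing to ints is the separate cast_value codec's job.
    """

    def __init__(self, scale: float, offset: float):
        self.scale = scale
        self.offset = offset

    def encode(self, arr: np.ndarray) -> np.ndarray:
        # forward transform, applied before cast_value narrows the dtype
        return (arr - self.offset) / self.scale

    def decode(self, arr: np.ndarray) -> np.ndarray:
        # inverse transform, applied after cast_value widens the dtype
        return arr * self.scale + self.offset

codec = ScaleOffset(scale=0.01, offset=273.15)
x = np.array([273.16, 285.50])
roundtrip = codec.decode(codec.encode(x))
assert np.allclose(roundtrip, x)
```

All of the real complexity (dtype handling, scalar remapping, NaN treatment) lives in cast_value, which is why that codec is the one that needed a dedicated package.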
I don't know what integration path makes sense in xarray, so I can't propose any concrete code changes! But it might be worth thinking about how you can make codecs do more, and attributes do less, for data written to zarr.