Codec-based alternative to attribute-based scale-offset encoding #11280
Description
What is your issue?
Zarr V3 has two new codec specs that might be of interest to xarray: a codec for casting arrays from one data type to another (cast_value) and a codec for applying a scale + offset transformation to an array (scale_offset). Together, these codecs can express the scale-offset encoding commonly used for compressing floating-point measurements as ints before serialization, which is currently implemented here in this library. The cast_value codec also supports remapping scalars within the same data type (e.g., mapping NaN to 0, and the reverse), which might also be useful in xarray.
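To make the transformation concrete, here is a minimal NumPy sketch of the scale-offset round trip that the scale_offset + cast_value codec pair would perform (this is just the arithmetic, not the codec API; the variable names are illustrative):

```python
import numpy as np

# CF-style scale/offset round trip:
#   decoded = raw * scale_factor + add_offset
scale_factor = 0.01
add_offset = 273.15

temps = np.array([273.16, 285.50, 301.23])  # float measurements

# encode: float -> int16 (scale_offset transform, then a narrowing cast)
raw = np.round((temps - add_offset) / scale_factor).astype(np.int16)

# decode: int16 -> float
decoded = raw.astype(np.float64) * scale_factor + add_offset

# lossy, but accurate to within half a quantization step
assert np.allclose(decoded, temps, atol=scale_factor / 2)
```

The cast step is what makes the encoding lossy: precision beyond one quantization step (here 0.01) is discarded in exchange for a 2-byte integer representation.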
See this gist for a demo. It shows that the combination of a scale_offset codec (defined inline) and the cast_value codec produces the same results as the CF-style zarr attributes plus xarray's encoding/decoding logic.
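For comparison, this is roughly what the attribute-based side of that equivalence looks like today: xarray's CF decoding turns raw ints plus `scale_factor`/`add_offset` attributes back into floats (a hedged sketch; the specific values are illustrative):

```python
import numpy as np
import xarray as xr

# raw int16 data with CF packing attributes attached
raw = np.array([1, 1235, 2808], dtype=np.int16)
ds = xr.Dataset(
    {"t": ("x", raw, {"scale_factor": 0.01, "add_offset": 273.15})}
)

# xarray applies raw * scale_factor + add_offset on decode
decoded = xr.decode_cf(ds)
print(decoded["t"].values)  # floats near [273.16, 285.50, 301.23]
```

The codec-based approach would move exactly this transformation out of the array attributes and into the zarr codec pipeline.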
the advantage of pushing this logic down into Zarr itself is that it makes the encoding portable across zarr implementations, and it takes load off the zarr array attributes, which IMO are not well suited for declaring how array scalars should be encoded / decoded.
A disadvantage is that you need implementations of the two new codecs. I cooked up an implementation of the cast_value codec in the cast-value.py package; there's a default numpy implementation and an optional rust implementation, and the rust version generally has better CPU and memory performance. I opened a PR in zarr-python to add the numpy cast_value implementation and a scale_offset codec (which is dead simple), but it didn't get any traction so I closed it 😆
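To give a sense of how simple the scale_offset part is, here is a minimal sketch of what its encode/decode body could look like (class and method names are illustrative, not the zarr-python codec interface):

```python
import numpy as np

class ScaleOffset:
    """Illustrative scale_offset codec body: a pure affine transform.

    A real codec would also carry dtype/serialization plumbing; the
    narrowing to ints is the separate cast_value codec's job.
    """

    def __init__(self, scale: float, offset: float):
        self.scale = scale
        self.offset = offset

    def encode(self, arr: np.ndarray) -> np.ndarray:
        # forward transform, applied before cast_value narrows the dtype
        return (arr - self.offset) / self.scale

    def decode(self, arr: np.ndarray) -> np.ndarray:
        # inverse transform, applied after cast_value widens the dtype
        return arr * self.scale + self.offset

codec = ScaleOffset(scale=0.01, offset=273.15)
x = np.array([273.16, 285.50])
roundtrip = codec.decode(codec.encode(x))
assert np.allclose(roundtrip, x)
```

All of the real complexity (dtype handling, scalar remapping, NaN treatment) lives in cast_value, which is why that codec is the one that needed a dedicated package.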
I don't know what integration path makes sense in xarray, so I can't propose any concrete code changes! But it might be worth thinking about how you can make codecs do more, and attributes do less, for data written to zarr.