
fix BYTE_ARRAY_DECIMAL conversion#971

Merged
martindurant merged 3 commits into dask:main from wolejri:chore/fix_decimal_var_len_bytes
Mar 10, 2026

Conversation


@wolejri wolejri commented Mar 5, 2026

Fix BYTE_ARRAY DECIMAL Conversion Bug

Summary

This PR fixes a critical bug in DECIMAL conversion for variable-length BYTE_ARRAY types that was producing incorrect values (off by 14-15 orders of magnitude) and non-deterministic results across multiple reads of the same data.

Problem

When reading DECIMAL columns stored as variable-length BYTE_ARRAY in parquet files, fastparquet was producing garbage values instead of correct decimal numbers. For example, values that should have been in the range of 0-1000 were being read as hundreds of trillions.

Additionally, the bug was non-deterministic: reading the same parquet file multiple times produced different incorrect values on each read.

Root Cause

The bug was introduced in commit 53ceac2dbb141b76f603318a5cc0f78e64769d62 (2019), which fixed a legitimate issue with FIXED_LEN_BYTE_ARRAY decimals where numpy was truncating values during iteration. The fix changed the byte extraction approach from:

```python
for d in data:
    int.from_bytes(d, ...)
```

to:

```python
for i in range(len(data)):
    int.from_bytes(data.data[i:i + 1], ...)
```

This new approach works correctly for FIXED_LEN_BYTE_ARRAY (where data is stored in a flat, contiguous buffer), but breaks for BYTE_ARRAY (where each element is a variable-length bytes object).
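For context, the truncation that motivated the 2019 commit is easy to reproduce (a minimal sketch; the array contents are illustrative, not taken from the PR):

```python
import numpy as np

# Fixed-width 'S' arrays strip trailing NUL bytes when elements are
# accessed, which is the truncation the 2019 commit worked around.
data = np.array([b'\x01\x00\x00\x00'], dtype='S4')

element = data[0]                            # b'\x01' -- trailing zeros dropped
truncated = int.from_bytes(element, 'big')   # 1 (wrong)

# Decoding from the raw buffer preserves the full 4-byte payload.
full = int.from_bytes(data.tobytes()[0:4], 'big')  # 16777216
```

This is why the 2019 fix moved from element iteration to buffer slicing in the first place.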

Why It Failed

For BYTE_ARRAY decimals:

  • Data is stored as a numpy object array where each element is a bytes object
  • Each bytes object can have a different length (e.g., 1 byte, 3 bytes, 5 bytes)
  • The expression data.data[i:i + 1] attempts to slice the underlying buffer at position i, extracting only one byte
  • This reads from arbitrary memory locations, producing garbage values
  • The garbage values vary between reads because the memory layout is not guaranteed to be consistent
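The failure mode above can be contrasted with the correct decoding in a small sketch (the payloads are hypothetical big-endian two's-complement values, not data from the PR):

```python
import numpy as np

# Hypothetical BYTE_ARRAY payloads: big-endian two's-complement ints of
# varying width, stored as an object array of bytes objects -- the layout
# the buffer-slicing approach mishandles.
data = np.array([b'\x01', b'\x0f\x42\x40', b'\xff\x85\xee'], dtype=object)

# Correct path: each element is a complete bytes object, so
# int.from_bytes sees the whole payload.
values = [int.from_bytes(d, 'big', signed=True) for d in data]
# -> [1, 1000000, -31250]

# data.data[i:i + 1], by contrast, slices the array's raw buffer, which
# for an object array holds PyObject pointers, not the payload bytes.
```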

Solution

The fix distinguishes between FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY types and handles them appropriately:

For FIXED_LEN_BYTE_ARRAY

Use buffer slicing (preserving the 2019 fix):
```python
its = data.dtype.itemsize
by = data.tobytes()
for i in range(len(data)):
    int.from_bytes(by[i * its:(i + 1) * its], ...)
```

This works because:

  • Memory layout is flat and contiguous
  • Each element has a known, fixed size (itemsize)
  • We can safely slice the buffer at fixed intervals
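The fixed-width path can be sketched with a self-contained example (the 'S4' array is an illustrative stand-in for a 4-byte FIXED_LEN_BYTE_ARRAY column):

```python
import numpy as np

# Illustrative 4-byte fixed-width payloads. tobytes() preserves the full
# flat buffer, including NUL bytes that element access would strip.
data = np.array([b'\x00\x00\x00\x01', b'\x00\x0f\x42\x40'], dtype='S4')

its = data.dtype.itemsize           # 4: the fixed element width
by = data.tobytes()                 # flat, contiguous 8-byte buffer
values = [int.from_bytes(by[i * its:(i + 1) * its], 'big', signed=True)
          for i in range(len(data))]
# -> [1, 1000000]
```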

For BYTE_ARRAY

Iterate over elements directly (restoring pre-2019 behavior):
```python
for d in data:
    int.from_bytes(d, ...)
```

This works because:

  • Each element d is a complete bytes object with the correct length
  • We don't need to know the length in advance
  • We're not relying on buffer memory layout
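Putting the two branches together, the dispatch the fix introduces might look like this (a hypothetical sketch, not the actual fastparquet code; `decode_decimal_ints` is an invented name, and the dtype-kind check stands in for the library's type dispatch):

```python
import numpy as np

def decode_decimal_ints(data):
    """Sketch: decode big-endian two's-complement decimal payloads,
    choosing the path by how the bytes are stored."""
    if data.dtype.kind == 'O':
        # BYTE_ARRAY: object array of variable-length bytes objects.
        return [int.from_bytes(d, 'big', signed=True) for d in data]
    # FIXED_LEN_BYTE_ARRAY: flat buffer sliced at fixed intervals.
    its = data.dtype.itemsize
    by = data.tobytes()
    return [int.from_bytes(by[i * its:(i + 1) * its], 'big', signed=True)
            for i in range(len(data))]

var_len = np.array([b'\x01', b'\x0f\x42\x40'], dtype=object)
fixed = np.array([b'\x00\x00\x00\x01'], dtype='S4')
```

Both branches decode the same integer representation; only the byte-extraction step differs.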

Testing

New Test Case

Added test_byte_array_decimal() that tests variable-length BYTE_ARRAY decimal conversion with values of different byte lengths (1 byte, 3 bytes, etc.).

Regression Testing

The existing test_big_decimal() test for FIXED_LEN_BYTE_ARRAY continues to pass, ensuring the 2019 fix is preserved.

Test Results

All 15 tests in test_converted_types.py succeed (14 passed, 1 skipped due to a missing bson library).

Impact

Before Fix

  • ❌ BYTE_ARRAY decimals: Broken (garbage values, non-deterministic)
  • ✅ FIXED_LEN_BYTE_ARRAY decimals: Working correctly
  • ✅ INT32/INT64 decimals: Working correctly

After Fix

  • ✅ BYTE_ARRAY decimals: Fixed (correct values, deterministic)
  • ✅ FIXED_LEN_BYTE_ARRAY decimals: Still working correctly
  • ✅ INT32/INT64 decimals: Still working correctly

Backwards Compatibility

This fix:

  • ✅ Maintains the 2019 fix for FIXED_LEN_BYTE_ARRAY
  • ✅ Restores correct behavior for BYTE_ARRAY that was broken since 2019
  • ✅ No API changes
  • ✅ No breaking changes for users

Users who were reading parquet files with BYTE_ARRAY decimals will now get correct values instead of garbage. This may appear as a "change" in their data, but it's actually a fix - the previous values were completely incorrect.

Files Changed

  • fastparquet/converted_types.py: Updated DECIMAL conversion logic
  • fastparquet/test/test_converted_types.py: Added new test case

Related Issues

This should resolve issues where users report:

  • DECIMAL columns being read with wildly incorrect values (orders of magnitude off)
  • DECIMAL columns returning different values on each read
  • Validation failures when comparing DECIMAL columns between reads
  • Discrepancies between fastparquet and pyarrow when reading DECIMAL columns

@martindurant
Member

This indeed seems to fix the situation you found, and I am happy to include it.
Note that, with the arrival of pandas 3.0, fastparquet is not really being developed any more, but your fix might still be useful for those on earlier versions.

@martindurant
Member

(the CI env needs to be amended to specify pandas<3)

@martindurant martindurant merged commit 6daebf1 into dask:main Mar 10, 2026
13 of 15 checks passed
@wolejri
Author

wolejri commented Mar 13, 2026

@martindurant are you able to release a new version with this fix so we can use it, please?

@martindurant
Member

Ping me next week...

@wolejri
Author

wolejri commented Mar 17, 2026

Hello @martindurant , kind reminder for the release of a new version of this library. Thank you.

@martindurant
Member

Done.

@wolejri
Author

wolejri commented Mar 18, 2026

thank you

