
fix BYTE_ARRAY_DECIMAL conversion#971

Merged
martindurant merged 3 commits into dask:main from wolejri:chore/fix_decimal_var_len_bytes
Mar 10, 2026

Conversation


@wolejri wolejri commented Mar 5, 2026

Fix BYTE_ARRAY DECIMAL Conversion Bug

Summary

This PR fixes a critical bug in DECIMAL conversion for variable-length BYTE_ARRAY types that was producing incorrect values (off by 14-15 orders of magnitude) and non-deterministic results across multiple reads of the same data.

Problem

When reading DECIMAL columns stored as variable-length BYTE_ARRAY in parquet files, fastparquet was producing garbage values instead of correct decimal numbers. For example, values that should have been in the range of 0-1000 were being read as hundreds of trillions.

Additionally, the bug was non-deterministic: reading the same parquet file multiple times produced different incorrect values on each read.

Root Cause

The bug was introduced in commit 53ceac2dbb141b76f603318a5cc0f78e64769d62 (2019), which fixed a legitimate issue with FIXED_LEN_BYTE_ARRAY decimals where numpy was truncating values during iteration. The fix changed the byte extraction approach from:

```python
for d in data:
    int.from_bytes(d, ...)
```

to:

```python
for i in range(len(data)):
    int.from_bytes(data.data[i:i + 1], ...)
```

This new approach works correctly for FIXED_LEN_BYTE_ARRAY (where data is stored in a flat, contiguous buffer), but breaks for BYTE_ARRAY (where each element is a variable-length bytes object).
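For context, the truncation that motivated the 2019 commit is easy to reproduce (a minimal sketch; the array contents are illustrative, not taken from the PR):

```python
import numpy as np

# Fixed-width 'S' arrays strip trailing NUL bytes when elements are
# accessed, which is the truncation the 2019 commit worked around.
data = np.array([b'\x01\x00\x00\x00'], dtype='S4')

element = data[0]                            # b'\x01' -- trailing zeros dropped
truncated = int.from_bytes(element, 'big')   # 1 (wrong)

# Decoding from the raw buffer preserves the full 4-byte payload.
full = int.from_bytes(data.tobytes()[0:4], 'big')  # 16777216
```

This is why the 2019 fix moved from element iteration to buffer slicing in the first place.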

Why It Failed

For BYTE_ARRAY decimals:

  • Data is stored as a numpy object array where each element is a bytes object
  • Each bytes object can have a different length (e.g., 1 byte, 3 bytes, 5 bytes)
  • The expression data.data[i:i + 1] attempts to slice the underlying buffer at position i, extracting only one byte
  • This reads from arbitrary memory locations, producing garbage values
  • The garbage values vary between reads because the memory layout is not guaranteed to be consistent
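The failure mode above can be contrasted with the correct decoding in a small sketch (the payloads are hypothetical big-endian two's-complement values, not data from the PR):

```python
import numpy as np

# Hypothetical BYTE_ARRAY payloads: big-endian two's-complement ints of
# varying width, stored as an object array of bytes objects -- the layout
# the buffer-slicing approach mishandles.
data = np.array([b'\x01', b'\x0f\x42\x40', b'\xff\x85\xee'], dtype=object)

# Correct path: each element is a complete bytes object, so
# int.from_bytes sees the whole payload.
values = [int.from_bytes(d, 'big', signed=True) for d in data]
# -> [1, 1000000, -31250]

# data.data[i:i + 1], by contrast, slices the array's raw buffer, which
# for an object array holds PyObject pointers, not the payload bytes.
```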

Solution

The fix distinguishes between FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY types and handles them appropriately:

For FIXED_LEN_BYTE_ARRAY

Use buffer slicing (preserving the 2019 fix):
```python
its = data.dtype.itemsize
by = data.tobytes()
for i in range(len(data)):
    int.from_bytes(by[i * its:(i + 1) * its], ...)
```

This works because:

  • Memory layout is flat and contiguous
  • Each element has a known, fixed size (itemsize)
  • We can safely slice the buffer at fixed intervals
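The fixed-width path can be sketched with a self-contained example (the 'S4' array is an illustrative stand-in for a 4-byte FIXED_LEN_BYTE_ARRAY column):

```python
import numpy as np

# Illustrative 4-byte fixed-width payloads. tobytes() preserves the full
# flat buffer, including NUL bytes that element access would strip.
data = np.array([b'\x00\x00\x00\x01', b'\x00\x0f\x42\x40'], dtype='S4')

its = data.dtype.itemsize           # 4: the fixed element width
by = data.tobytes()                 # flat, contiguous 8-byte buffer
values = [int.from_bytes(by[i * its:(i + 1) * its], 'big', signed=True)
          for i in range(len(data))]
# -> [1, 1000000]
```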

For BYTE_ARRAY

Iterate over elements directly (restoring pre-2019 behavior):
```python
for d in data:
    int.from_bytes(d, ...)
```

This works because:

  • Each element d is a complete bytes object with the correct length
  • We don't need to know the length in advance
  • We're not relying on buffer memory layout
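Putting the two branches together, the dispatch the fix introduces might look like this (a hypothetical sketch, not the actual fastparquet code; `decode_decimal_ints` is an invented name, and the dtype-kind check stands in for the library's type dispatch):

```python
import numpy as np

def decode_decimal_ints(data):
    """Sketch: decode big-endian two's-complement decimal payloads,
    choosing the path by how the bytes are stored."""
    if data.dtype.kind == 'O':
        # BYTE_ARRAY: object array of variable-length bytes objects.
        return [int.from_bytes(d, 'big', signed=True) for d in data]
    # FIXED_LEN_BYTE_ARRAY: flat buffer sliced at fixed intervals.
    its = data.dtype.itemsize
    by = data.tobytes()
    return [int.from_bytes(by[i * its:(i + 1) * its], 'big', signed=True)
            for i in range(len(data))]

var_len = np.array([b'\x01', b'\x0f\x42\x40'], dtype=object)
fixed = np.array([b'\x00\x00\x00\x01'], dtype='S4')
```

Both branches decode the same integer representation; only the byte-extraction step differs.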

Testing

New Test Case

Added test_byte_array_decimal() that tests variable-length BYTE_ARRAY decimal conversion with values of different byte lengths (1 byte, 3 bytes, etc.).

Regression Testing

The existing test_big_decimal() test for FIXED_LEN_BYTE_ARRAY continues to pass, ensuring the 2019 fix is preserved.

Test Results

All 15 tests in test_converted_types.py succeed (14 passed, 1 skipped due to a missing bson library).

Impact

Before Fix

  • ❌ BYTE_ARRAY decimals: Broken (garbage values, non-deterministic)
  • ✅ FIXED_LEN_BYTE_ARRAY decimals: Working correctly
  • ✅ INT32/INT64 decimals: Working correctly

After Fix

  • ✅ BYTE_ARRAY decimals: Fixed (correct values, deterministic)
  • ✅ FIXED_LEN_BYTE_ARRAY decimals: Still working correctly
  • ✅ INT32/INT64 decimals: Still working correctly

Backwards Compatibility

This fix:

  • ✅ Maintains the 2019 fix for FIXED_LEN_BYTE_ARRAY
  • ✅ Restores correct behavior for BYTE_ARRAY that was broken since 2019
  • ✅ No API changes
  • ✅ No breaking changes for users

Users who were reading parquet files with BYTE_ARRAY decimals will now get correct values instead of garbage. This may appear as a "change" in their data, but it's actually a fix - the previous values were completely incorrect.

Files Changed

  • fastparquet/converted_types.py: Updated DECIMAL conversion logic
  • fastparquet/test/test_converted_types.py: Added new test case

Related Issues

This should resolve issues where users report:

  • DECIMAL columns being read with wildly incorrect values (orders of magnitude off)
  • DECIMAL columns returning different values on each read
  • Validation failures when comparing DECIMAL columns between reads
  • Discrepancies between fastparquet and pyarrow when reading DECIMAL columns

@martindurant
Member

This indeed seems to fix the situation you found, and I am happy to include it.
Note that, with the arrival of pandas 3.0, fastparquet is not really being developed any more, but your fix might still be useful for those on earlier versions.

@martindurant
Member

(the CI env needs to be amended to specify pandas<3)

@martindurant martindurant merged commit 6daebf1 into dask:main Mar 10, 2026
13 of 15 checks passed
@wolejri
Author

wolejri commented Mar 13, 2026

@martindurant are you able to release a new version with this fix so we can use it, please?

@martindurant
Member

Ping me next week...

@wolejri
Author

wolejri commented Mar 17, 2026

Hello @martindurant , kind reminder for the release of a new version of this library. Thank you.

@martindurant
Member

Done.

@wolejri
Author

wolejri commented Mar 18, 2026

thank you

