fix BYTE_ARRAY_DECIMAL conversion #971
Merged
martindurant merged 3 commits into dask:main on Mar 10, 2026
Member
This indeed seems to fix the situation you found, and I am happy to include it.
Member
(the CI env needs to be amended to specify pandas<3)
Author
@martindurant are you able to release a new version with this fix, so we can use it please?
Member
Ping me next week...
Author
Hello @martindurant, kind reminder for the release of a new version of this library. Thank you.
Member
Done.
Author
thank you
Fix BYTE_ARRAY DECIMAL Conversion Bug
Summary
This PR fixes a critical bug in DECIMAL conversion for variable-length BYTE_ARRAY types that was producing incorrect values (off by 14-15 orders of magnitude) and non-deterministic results across multiple reads of the same data.
Problem
When reading DECIMAL columns stored as variable-length `BYTE_ARRAY` in parquet files, fastparquet was producing garbage values instead of correct decimal numbers. For example, values that should have been in the range of 0-1000 were being read as hundreds of trillions.
Additionally, the bug was non-deterministic: reading the same parquet file multiple times would produce different incorrect values on each read.
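For context, the Parquet format stores a byte-array DECIMAL as the unscaled integer in big-endian two's complement, and the logical value is that integer multiplied by 10^-scale. The sketch below decodes a few hand-written byte strings of different lengths under that rule. The helper name `decode_decimal` and the scale of 2 are assumptions made for illustration; this is not fastparquet's API.

```python
from decimal import Decimal


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode one byte-array DECIMAL value.

    Hypothetical helper for illustration only (not part of fastparquet).
    """
    # Unscaled integer: big-endian, two's complement (signed).
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Move the decimal point `scale` digits to the left.
    return Decimal(unscaled).scaleb(-scale)


# Variable-length inputs of 1, 2 and 3 bytes, plus a negative value.
samples = [b"\x01", b"\x30\x39", b"\x01\xe2\x40", b"\xff"]
values = [decode_decimal(raw, scale=2) for raw in samples]
print(values)
```

With a scale of 2 these inputs decode to 0.01, 123.45, 1234.56 and -0.01, which is the kind of small, stable result a correct reader should return on every read.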
Root Cause
The bug was introduced in commit `53ceac2dbb141b76f603318a5cc0f78e64769d62` (2019), which fixed a legitimate issue with `FIXED_LEN_BYTE_ARRAY` decimals where numpy was truncating values during iteration. The fix changed the byte extraction approach from:
```python
for d in data:
    int.from_bytes(d, ...)
```
to:
```python
for i in range(len(data)):
    int.from_bytes(data.data[i:i + 1], ...)
```
This new approach works correctly for `FIXED_LEN_BYTE_ARRAY` (where data is stored in a flat, contiguous buffer), but breaks for `BYTE_ARRAY` (where each element is a variable-length bytes object).
Why It Failed
For `BYTE_ARRAY` decimals:
- Each element is a `bytes` object
- `data.data[i:i + 1]` attempts to slice the underlying buffer at position `i`, extracting only one byte
Solution
The fix distinguishes between `FIXED_LEN_BYTE_ARRAY` and `BYTE_ARRAY` types and handles them appropriately:
For FIXED_LEN_BYTE_ARRAY
Use buffer slicing (preserving the 2019 fix):
```python
its = data.dtype.itemsize
by = data.tobytes()
int.from_bytes(by[i * its:(i + 1) * its], ...)
```
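As a runnable sketch of the truncation issue and the slicing approach, the example below builds a small fixed-width numpy bytes array and decodes it both ways. The `byteorder="big", signed=True` arguments are an assumption based on the Parquet DECIMAL encoding, and the sample values are made up for illustration.

```python
import numpy as np

# Two 2-byte big-endian values: 256 (0x0100) and 1 (0x0001).
data = np.array([b"\x01\x00", b"\x00\x01"], dtype="S2")

# Iterating a fixed-width bytes array strips trailing NUL bytes,
# so the first element comes back as b"\x01" and decodes as 1.
by_iteration = [int.from_bytes(d, byteorder="big", signed=True) for d in data]

# Slicing the flat buffer keeps all `itemsize` bytes of every element.
its = data.dtype.itemsize
by = data.tobytes()
by_slicing = [
    int.from_bytes(by[i * its:(i + 1) * its], byteorder="big", signed=True)
    for i in range(len(data))
]

print(by_iteration)  # [1, 1]
print(by_slicing)    # [256, 1]
```

Only the slicing variant recovers 256, because the trailing zero byte is part of the value rather than padding.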
This works because:
- Each element has a fixed width in the flat, contiguous buffer (`itemsize`)
For BYTE_ARRAY
Iterate over elements directly (restoring pre-2019 behavior):
```python
for d in data:
    int.from_bytes(d, ...)
```
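A minimal sketch of the element-wise approach, assuming the column arrives as a numpy object array whose elements are Python `bytes` of differing lengths. The sample values, the scale of 2, and the `byteorder="big", signed=True` arguments are assumptions for illustration, not taken from the fastparquet source.

```python
import numpy as np

# Variable-length values held as Python bytes in an object array.
data = np.array([b"\x7f", b"\x01\x00", b"\xff\xff"], dtype=object)

scale = 2  # assumed DECIMAL scale for this example

# Each element is a full bytes object, so nothing is truncated,
# including the trailing zero byte of the second value.
unscaled = [int.from_bytes(d, byteorder="big", signed=True) for d in data]
values = np.array(unscaled) * 10.0 ** -scale

print(unscaled)  # [127, 256, -1]
print(values)
```

Unlike a fixed-width `S` array, an object array hands back each element unchanged, so iteration is safe here.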
This works because:
- Each `d` is a complete `bytes` object with the correct length
Testing
New Test Case
Added `test_byte_array_decimal()`, which tests variable-length BYTE_ARRAY decimal conversion with values of different byte lengths (1 byte, 3 bytes, etc.).
Regression Testing
The existing `test_big_decimal()` test for FIXED_LEN_BYTE_ARRAY continues to pass, ensuring the 2019 fix is preserved.
Test Results
All 15 tests in `test_converted_types.py` pass (14 passed, 1 skipped due to missing bson library).
Impact
Before Fix
After Fix
Backwards Compatibility
This fix:
Users who were reading parquet files with BYTE_ARRAY decimals will now get correct values instead of garbage. This may appear as a "change" in their data, but it's actually a fix - the previous values were completely incorrect.
Files Changed
- `fastparquet/converted_types.py`: Updated DECIMAL conversion logic
- `fastparquet/test/test_converted_types.py`: Added new test case
Related Issues
This fixes any issues where users report: