Component Selection
Describe the Bug
The Parquet writer can generate oversized data pages for BYTE_ARRAY columns, including VARCHAR and VARBINARY.
For plain BYTE_ARRAY encoding, the writer currently appends a whole write batch into the encoder first and only checks EstimatedDataEncodedSize() >= data_pagesize_ afterward. If a single batch contains enough encoded bytes, the page size limit is checked too late: the writer may already have buffered a data page larger than the configured page size.
In extreme cases, the generated page can exceed the Parquet PageHeader int32 size fields (compressed_page_size / uncompressed_page_size), causing integer truncation or invalid Parquet output.
This can also happen after dictionary fallback, when a VARCHAR / VARBINARY column switches from dictionary encoding to plain encoding.
Reproduction Steps
- Create a Parquet writer for a VARCHAR or VARBINARY column.
- Disable dictionary encoding, or force dictionary fallback with a small dictionary page limit.
- Set a small data page size for the column, for example:
writerOptions.enableDictionary = false;
writerOptions.columnDataPageSizeMap["c0"] = 100;
Bolt Version / Commit ID
95bf0f9
System Configuration
- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)
Logs / Stack Trace
Expected Behavior
The Parquet writer should split plain BYTE_ARRAY writes by encoded byte size before appending them to the encoder, while respecting record boundaries when required by page index or DataPageV2.
The writer should also validate data and dictionary page sizes before assigning them to Parquet PageHeader int32 fields.
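The validation described above could look roughly like the following. This is a minimal sketch, not the writer's actual code: `checkedPageSize` is a hypothetical helper name, and throwing `std::overflow_error` is an assumption (the real writer would presumably use its own error-reporting macros).

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrows a 64-bit page size to the int32 used by the
// Parquet PageHeader size fields, and throws instead of silently truncating.
inline int32_t checkedPageSize(int64_t size, const char* what) {
  if (size < 0 || size > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error(
        std::string(what) + " of " + std::to_string(size) +
        " bytes does not fit in a Parquet PageHeader int32 field");
  }
  return static_cast<int32_t>(size);
}
```

The same check would apply to both `compressed_page_size` and `uncompressed_page_size`, for data pages and dictionary pages alike.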
Additional context
Affected types:
- VARCHAR
- VARBINARY
- other Parquet BYTE_ARRAY-backed columns
Affected paths:
- plain BYTE_ARRAY encoding
- dictionary fallback to plain BYTE_ARRAY encoding
A robust fix should avoid buffering an oversized page first. For flat Arrow binary arrays, the writer can use Arrow offsets to cheaply estimate encoded byte size before writing the batch, and only fall back to per-value splitting when the batch may exceed the page limit.
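The offset-based approach above can be sketched as follows. This is only an illustration under stated assumptions: `splitByEncodedSize` is a hypothetical function, the offsets are passed as a plain `std::vector<int32_t>` with `numValues + 1` entries (the layout of an Arrow binary array's offsets buffer), nulls and record boundaries are ignored, and a page is closed as soon as it reaches the limit, mirroring the existing `>=` check. Plain BYTE_ARRAY encoding stores a 4-byte length prefix before each value, so the encoded size of a value range is the offset difference plus 4 bytes per value.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Plain BYTE_ARRAY encoding writes a 4-byte length before each value.
constexpr int64_t kLengthPrefixBytes = 4;

// Hypothetical helper: returns [begin, end) value ranges to append one at a
// time. The caller is assumed to run its usual page-size check after each
// range. bufferedBytes is what the current page already holds.
std::vector<std::pair<int64_t, int64_t>> splitByEncodedSize(
    const std::vector<int32_t>& offsets,
    int64_t bufferedBytes,
    int64_t pageSizeLimit) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  if (offsets.size() < 2) {
    return ranges; // No values in the batch.
  }
  const int64_t numValues = static_cast<int64_t>(offsets.size()) - 1;

  // Encoded size of values [begin, end), computed from offsets only.
  auto encodedSize = [&](int64_t begin, int64_t end) {
    return static_cast<int64_t>(offsets[end]) - offsets[begin] +
        kLengthPrefixBytes * (end - begin);
  };

  // Fast path: the whole batch fits below the limit, so write it as one range.
  if (bufferedBytes + encodedSize(0, numValues) < pageSizeLimit) {
    ranges.emplace_back(0, numValues);
    return ranges;
  }

  // Slow path: walk the values and close a range once the page reaches the
  // limit, so a page overshoots by at most one value.
  int64_t begin = 0;
  int64_t pageBytes = bufferedBytes;
  for (int64_t i = 0; i < numValues; ++i) {
    pageBytes += encodedSize(i, i + 1);
    if (pageBytes >= pageSizeLimit) {
      ranges.emplace_back(begin, i + 1);
      begin = i + 1;
      pageBytes = 0; // The caller flushes the page after this range.
    }
  }
  if (begin < numValues) {
    ranges.emplace_back(begin, numValues);
  }
  return ranges;
}
```

For example, four 10-byte values with a 30-byte limit encode to 14 bytes each, so the sketch yields the ranges `[0, 3)` and `[3, 4)`. A real fix would additionally need to move each split point to a record boundary when page index or DataPageV2 is enabled.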