[BUG] Parquet writer can produce oversized BYTE_ARRAY data pages for large VARCHAR/VARBINARY batches #612

@Weixin-Xu

Component Selection

  • Core Engine (Expression eval, Memory, Vector)
  • Connectors / File Formats (Hive, Parquet, etc.)
  • API / Bindings (Python, etc.)
  • Build
  • Other

Describe the Bug

The Parquet writer can generate oversized data pages for BYTE_ARRAY columns, including VARCHAR and VARBINARY.

For plain BYTE_ARRAY encoding, the writer currently appends a whole write batch into the encoder first and only checks EstimatedDataEncodedSize() >= data_pagesize_ afterward. If a single batch contains enough encoded bytes, the page size limit is checked too late: the writer may already have buffered a data page larger than the configured page size.

In extreme cases, the generated page can exceed the Parquet PageHeader int32 size fields (compressed_page_size / uncompressed_page_size), causing integer truncation or invalid Parquet output.

This can also happen after dictionary fallback, when a VARCHAR / VARBINARY column switches from dictionary encoding to plain encoding.
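To make the late check concrete, here is a minimal model of the append-then-check pattern described above. It is a sketch, not the writer's actual code: `pagesCheckAfterBatch` and `kLengthPrefixBytes` are hypothetical names, and the only Parquet detail assumed is that PLAIN BYTE_ARRAY stores each value as a 4-byte length followed by the value bytes.

```cpp
#include <cstdint>
#include <vector>

// PLAIN BYTE_ARRAY: each value is a 4-byte length prefix plus its bytes.
constexpr int64_t kLengthPrefixBytes = 4;

// Hypothetical model of "append the whole batch, then check the limit".
// Each inner vector holds the byte lengths of the values in one write batch.
// Returns the encoded size of every page the model would emit.
std::vector<int64_t> pagesCheckAfterBatch(
    const std::vector<std::vector<int64_t>>& batches,
    int64_t pageSizeLimit) {
  std::vector<int64_t> pages;
  int64_t buffered = 0;
  for (const auto& batch : batches) {
    // The entire batch is buffered before the limit is consulted.
    for (int64_t valueBytes : batch) {
      buffered += kLengthPrefixBytes + valueBytes;
    }
    // Late check: by now the page may be far larger than the limit.
    if (buffered >= pageSizeLimit) {
      pages.push_back(buffered);
      buffered = 0;
    }
  }
  if (buffered > 0) {
    pages.push_back(buffered);
  }
  return pages;
}
```

With a limit of 100 and a single batch of ten 96-byte values, this model emits one page of 1000 encoded bytes, ten times the configured limit.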

Reproduction Steps

  1. Create a Parquet writer for a VARCHAR or VARBINARY column.

  2. Disable dictionary encoding, or force dictionary fallback with a small dictionary page limit.

  3. Set a small data page size for the column, for example:

    // Force plain BYTE_ARRAY encoding.
    writerOptions.enableDictionary = false;
    // Use a tiny data page size limit for column c0.
    writerOptions.columnDataPageSizeMap["c0"] = 100;
    

Bolt Version / Commit ID

95bf0f9

System Configuration

- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)

Logs / Stack Trace

Expected Behavior

The Parquet writer should split plain BYTE_ARRAY writes by encoded byte size before appending them to the encoder, while respecting record boundaries when required by page index or DataPageV2.
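One possible shape for the splitting step is sketched below. It assumes a flat column where every value is its own record, so it does not model the record-boundary constraint for page index or DataPageV2. `splitByEncodedSize` is a hypothetical helper, not an existing writer function.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// PLAIN BYTE_ARRAY: each value is a 4-byte length prefix plus its bytes.
constexpr int64_t kLengthPrefixBytes = 4;

// Hypothetical helper: decides how many values go into each page so that a
// page is closed before the next value would push it past pageSizeLimit.
// A single value larger than the limit still gets a page of its own.
std::vector<std::size_t> splitByEncodedSize(
    const std::vector<int64_t>& valueBytes,
    int64_t pageSizeLimit) {
  std::vector<std::size_t> valuesPerPage;
  std::size_t count = 0;
  int64_t buffered = 0;
  for (int64_t bytes : valueBytes) {
    const int64_t encoded = kLengthPrefixBytes + bytes;
    // Check the limit before appending, not after.
    if (count > 0 && buffered + encoded > pageSizeLimit) {
      valuesPerPage.push_back(count);
      count = 0;
      buffered = 0;
    }
    buffered += encoded;
    ++count;
  }
  if (count > 0) {
    valuesPerPage.push_back(count);
  }
  return valuesPerPage;
}
```

For ten 96-byte values and a limit of 100, this sketch yields ten pages of one value each. Whether a page that lands exactly on the limit should be closed immediately is a design choice; the sketch allows it to stay open.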

The writer should also validate data and dictionary page sizes before assigning them to Parquet PageHeader int32 fields.
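The validation could look roughly like the following sketch. `checkedPageSize` is a hypothetical name, and the choice of `std::length_error` is an assumption; the real writer would presumably use its own error-reporting mechanism.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrows a 64-bit page size to the int32 used by the
// Parquet PageHeader size fields, and throws instead of silently truncating.
int32_t checkedPageSize(int64_t sizeBytes, const char* fieldName) {
  if (sizeBytes < 0 ||
      sizeBytes > std::numeric_limits<int32_t>::max()) {
    throw std::length_error(
        std::string(fieldName) + " does not fit in int32: " +
        std::to_string(sizeBytes));
  }
  return static_cast<int32_t>(sizeBytes);
}
```

A caller would apply it to both `compressed_page_size` and `uncompressed_page_size` right before filling in the header, for example `checkedPageSize(uncompressedBytes, "uncompressed_page_size")`, where `uncompressedBytes` is an illustrative variable name.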

Additional context

Affected types:

  • VARCHAR
  • VARBINARY
  • other Parquet BYTE_ARRAY-backed columns

Affected paths:

  • plain BYTE_ARRAY encoding
  • dictionary fallback to plain BYTE_ARRAY encoding

A robust fix should avoid buffering an oversized page first. For flat Arrow binary arrays, the writer can use Arrow offsets to cheaply estimate encoded byte size before writing the batch, and only fall back to per-value splitting when the batch may exceed the page limit.
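The offsets-based estimate can be sketched without depending on Arrow itself. The example below models the offsets buffer of a flat binary array as a `std::vector<int32_t>` with one more entry than there are values, which is how Arrow lays out variable-size binary data. `estimatePlainEncodedSize` and `mayExceedPage` are hypothetical names, and the sketch assumes no null values (nulls carry no length prefix in the encoded page, so the estimate would be an upper bound).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// PLAIN BYTE_ARRAY: each value is a 4-byte length prefix plus its bytes.
constexpr int64_t kLengthPrefixBytes = 4;

// Hypothetical helper: encoded size of values [begin, end) computed from the
// offsets alone, without touching the value bytes. Requires
// begin <= end < offsets.size().
int64_t estimatePlainEncodedSize(
    const std::vector<int32_t>& offsets,
    std::size_t begin,
    std::size_t end) {
  const int64_t numValues = static_cast<int64_t>(end - begin);
  const int64_t dataBytes =
      static_cast<int64_t>(offsets[end]) - offsets[begin];
  return dataBytes + numValues * kLengthPrefixBytes;
}

// Hypothetical fast-path test: true when appending the whole batch could
// reach the page limit, so the writer should fall back to per-value
// splitting.
bool mayExceedPage(
    const std::vector<int32_t>& offsets,
    int64_t bufferedBytes,
    int64_t pageSizeLimit) {
  const std::size_t numValues = offsets.size() - 1;
  return bufferedBytes +
      estimatePlainEncodedSize(offsets, 0, numValues) >= pageSizeLimit;
}
```

The estimate costs two offset reads per batch, so small batches stay on the existing whole-batch path and only batches that may cross the limit pay for per-value splitting.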
