[BUG] Parquet writer can produce oversized BYTE_ARRAY data pages for large VARCHAR/VARBINARY batches #612

### Component Selection

- [ ] Core Engine (Expression eval, Memory, Vector)
- [x] Connectors / File Formats (Hive, Parquet, etc.)
- [ ] API / Bindings (Python, etc.)
- [ ] Build
- [ ] Other

### Describe the Bug

The Parquet writer can generate oversized data pages for `BYTE_ARRAY` columns, including `VARCHAR` and `VARBINARY`.

For plain `BYTE_ARRAY` encoding, the writer appends an entire write batch to the encoder and only afterward checks `EstimatedDataEncodedSize() >= data_pagesize_`. If a single batch contains more encoded bytes than the configured page size, the check comes too late: the writer has already buffered a data page larger than the limit.

In extreme cases, the page size can exceed the range of the Parquet `PageHeader` `int32` size fields (`compressed_page_size` / `uncompressed_page_size`, at most 2^31 - 1 bytes), causing integer truncation or invalid Parquet output.

This can also happen after dictionary fallback, when a `VARCHAR` / `VARBINARY` column switches from dictionary encoding to plain encoding.
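
The late check can be illustrated with a toy model. This is a sketch, not the actual writer code: `plainEncodedSize` and `pagesWithLateCheck` are hypothetical names, and the only format fact assumed is that PLAIN `BYTE_ARRAY` stores each value as a 4-byte length prefix followed by the raw bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only (not the Bolt/Arrow writer). PLAIN BYTE_ARRAY stores each
// value as a 4-byte length prefix followed by the raw bytes.
inline int64_t plainEncodedSize(const std::string& value) {
  return 4 + static_cast<int64_t>(value.size());
}

// Hypothetical model of "append the whole batch, then check the limit".
// Returns the encoded size of every page that would be emitted.
std::vector<int64_t> pagesWithLateCheck(
    const std::vector<std::vector<std::string>>& batches,
    int64_t pageSizeLimit) {
  assert(pageSizeLimit > 0);
  std::vector<int64_t> pages;
  int64_t buffered = 0;
  for (const auto& batch : batches) {
    // The whole batch is buffered before the limit is consulted.
    for (const auto& value : batch) {
      buffered += plainEncodedSize(value);
    }
    // The check happens only here, after the batch is already in the page.
    if (buffered >= pageSizeLimit) {
      pages.push_back(buffered);
      buffered = 0;
    }
  }
  if (buffered > 0) {
    pages.push_back(buffered);
  }
  return pages;
}
```

In this model, a single batch of ten 96-byte values with a 100-byte limit produces one 1000-byte page, ten times the configured size.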

### Reproduction Steps

1. Create a Parquet writer for a `VARCHAR` or `VARBINARY` column.
2. Disable dictionary encoding, or force dictionary fallback with a small dictionary page limit.
3. Set a small data page size for the column, for example:

   ```cpp
   writerOptions.enableDictionary = false;
   writerOptions.columnDataPageSizeMap["c0"] = 100;
   ```

### Bolt Version / Commit ID

95bf0f9a9

### System Configuration

```markdown
- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)
```

### Logs / Stack Trace

```shell

```

### Expected Behavior

The Parquet writer should split plain `BYTE_ARRAY` writes by encoded byte size before appending them to the encoder, while respecting record boundaries when required by page index or DataPageV2.

The writer should also validate data and dictionary page sizes before assigning them to Parquet `PageHeader` int32 fields.
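
The size validation could look roughly like the following. This is a minimal sketch: `checkedPageSize` is a hypothetical helper name, and throwing `std::overflow_error` is an assumption about error handling, not the project's actual error mechanism.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow a 64-bit page size to the int32 used by the
// Parquet PageHeader size fields, failing loudly instead of truncating.
int32_t checkedPageSize(int64_t size, const char* fieldName) {
  assert(fieldName != nullptr);
  if (size < 0 || size > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error(
        std::string(fieldName) + " does not fit in int32: " +
        std::to_string(size));
  }
  return static_cast<int32_t>(size);
}
```

A writer would call such a helper for both `compressed_page_size` and `uncompressed_page_size`, for data pages and dictionary pages alike, before populating the header.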

### Additional context

Affected types:
- VARCHAR
- VARBINARY
- other Parquet BYTE_ARRAY-backed columns

Affected paths:
- plain BYTE_ARRAY encoding
- dictionary fallback to plain BYTE_ARRAY encoding

A robust fix should avoid buffering an oversized page first. For flat Arrow binary arrays, the writer can use Arrow offsets to cheaply estimate encoded byte size before writing the batch, and only fall back to per-value splitting when the batch may exceed the page limit.
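
The offset-based approach might be sketched as follows. To stay self-contained, the sketch models the Arrow binary offsets buffer as a plain `std::vector<int32_t>` with `numValues + 1` entries rather than using Arrow types. `estimatePlainSize` and `splitByEncodedSize` are hypothetical names. It assumes a flat, non-repeated column, so every value is its own record, and it counts a length prefix for every slot, which over-estimates when the batch contains nulls.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// PLAIN BYTE_ARRAY adds a 4-byte length prefix per value.
constexpr int64_t kLengthPrefixBytes = 4;

// Encoded size of values [begin, end), derived from the offsets in O(1).
int64_t estimatePlainSize(
    const std::vector<int32_t>& offsets, int64_t begin, int64_t end) {
  return static_cast<int64_t>(offsets[end]) - offsets[begin] +
      kLengthPrefixBytes * (end - begin);
}

// Hypothetical splitter: returns [begin, end) value ranges whose encoded size
// stays within pageSizeLimit. A single value larger than the limit still gets
// a chunk of its own, since a value cannot be split across pages.
std::vector<std::pair<int64_t, int64_t>> splitByEncodedSize(
    const std::vector<int32_t>& offsets, int64_t pageSizeLimit) {
  assert(!offsets.empty() && pageSizeLimit > 0);
  const int64_t numValues = static_cast<int64_t>(offsets.size()) - 1;
  std::vector<std::pair<int64_t, int64_t>> chunks;
  if (numValues == 0) {
    return chunks;
  }
  // Fast path: the whole batch fits, so no per-value work is needed.
  if (estimatePlainSize(offsets, 0, numValues) <= pageSizeLimit) {
    chunks.emplace_back(0, numValues);
    return chunks;
  }
  // Slow path: close the current chunk before the value that would overflow.
  int64_t begin = 0;
  for (int64_t i = 0; i < numValues; ++i) {
    if (i > begin && estimatePlainSize(offsets, begin, i + 1) > pageSizeLimit) {
      chunks.emplace_back(begin, i);
      begin = i;
    }
  }
  chunks.emplace_back(begin, numValues);
  return chunks;
}
```

Each returned range would be appended to the encoder and followed by a page-size check, so the buffered page never grows by more than one chunk past the limit. A real implementation would also need to account for bytes already buffered in the current page and, for repeated columns, align chunk ends to record boundaries.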
