fix(parquet): drain lazy preloadedRepDefs_ from seekToPage by JinyuanZhang617 · Pull Request #633 · bytedance/bolt

JinyuanZhang617 · 2026-06-05T09:14:44Z

Background

For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). guhaiyan@ commit 27e789e added an ahead-by-one-page invariant at the exit of decodeRepDefs() so that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.

That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10.

Fix

Hoist the lazy-load loop into seekToPage page-walk:

fired only when hasChunkRepDefs_ AND not isTopLevel_ AND maxRepeat_ greater than 0, so the top-level / no-repdef paths stay zero-cost
uses pageIndex_ + 1 >= numLeavesInPage_.size() so we always have the per-page leaf count for the page we are about to enter, mirroring guhaiyan@ ahead-by-one-page invariant on the read path
setPageRowInfo stays a pure lookup, keeping the original layering between rep/def producer (preloadRepDefs / loadMoreRepDefs) and consumer (setPageRowInfo / getLengthsAndNulls)
loadMoreRepDefs is called at most once per page header, vs. an earlier attempt where it could be called per setPageRowInfo call

Regression test

lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully.

Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4

What problem does this PR solve?

Issue Number: close #xxx

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

Describe your changes in detail.
For complex logic, explain the "Why" and "How".

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Click to view Benchmark Results

Paste your google-benchmark or TPC-H results here.
Before: 10.5s
After:   8.2s  (+20%)

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

Click to view Breaking Changes

Breaking Changes:
- Description of the breaking change.
- Possible solutions or workarounds.
- Any other relevant information.

guhaiyan0221 · 2026-06-05T10:35:36Z

      numRowsInPage_ = 0;
      break;
    }
+    // For nested (non top level) columns, 'preloadRepDefs' only fully


where is the fix? comments only ?

## Background For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). To ensure safety, there is an ahead-by-one-page invariant maintained at the exit of decodeRepDefs(), which ensures that the next consumer access can always read numLeavesInPage_[pageIndex_] safely. That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10. ## Fix Hoist the lazy-load loop into seekToPage page-walk: - fired only when hasChunkRepDefs_ AND not isTopLevel_ AND maxRepeat_ greater than 0, so the top-level / no-repdef paths stay zero-cost - uses pageIndex_ + 1 >= numLeavesInPage_.size() so we always have the per-page leaf count for the page we are about to enter, mirroring the ahead-by-one-page invariant on the read path - setPageRowInfo stays a pure lookup, keeping the original layering between rep/def producer (preloadRepDefs / loadMoreRepDefs) and consumer (setPageRowInfo / getLengthsAndNulls) - loadMoreRepDefs is called at most once per page header, vs. an earlier attempt where it could be called per setPageRowInfo call ## Regression test lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully. Co-Authored-By: Aime <aime@bytedance.com> Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4

JinyuanZhang617 requested a review from guhaiyan0221 June 5, 2026 09:14

JinyuanZhang617 force-pushed the fix-parquet-page-reader-github branch 3 times, most recently from 52280cd to 017ca21 Compare June 5, 2026 10:09

guhaiyan0221 reviewed Jun 5, 2026

View reviewed changes

JinyuanZhang617 force-pushed the fix-parquet-page-reader-github branch from 017ca21 to ee16d17 Compare June 5, 2026 10:38

JinyuanZhang617 closed this Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633

fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633
JinyuanZhang617 wants to merge 1 commit into
bytedance:mainfrom
JinyuanZhang617:fix-parquet-page-reader-github

JinyuanZhang617 commented Jun 5, 2026

Uh oh!

guhaiyan0221 Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JinyuanZhang617 commented Jun 5, 2026

Background

Fix

Regression test

What problem does this PR solve?

Type of Change

Description

Performance Impact

Release Note

Checklist (For Author)

Breaking Changes

Uh oh!

guhaiyan0221 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants