Skip to content

fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633

Closed
JinyuanZhang617 wants to merge 1 commit into
bytedance:mainfrom
JinyuanZhang617:fix-parquet-page-reader-github
Closed

fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633
JinyuanZhang617 wants to merge 1 commit into
bytedance:mainfrom
JinyuanZhang617:fix-parquet-page-reader-github

Conversation

@JinyuanZhang617

Copy link
Copy Markdown
Collaborator

Background

For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). guhaiyan@ commit 27e789e added an ahead-by-one-page invariant at the exit of decodeRepDefs() so that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.

That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10.

Fix

Hoist the lazy-load loop into seekToPage page-walk:

  • fired only when hasChunkRepDefs_ AND not isTopLevel_ AND maxRepeat_ greater than 0, so the top-level / no-repdef paths stay zero-cost
  • uses pageIndex_ + 1 >= numLeavesInPage_.size() so we always have the per-page leaf count for the page we are about to enter, mirroring guhaiyan@ ahead-by-one-page invariant on the read path
  • setPageRowInfo stays a pure lookup, keeping the original layering between rep/def producer (preloadRepDefs / loadMoreRepDefs) and consumer (setPageRowInfo / getLengthsAndNulls)
  • loadMoreRepDefs is called at most once per page header, vs. an earlier attempt where it could be called per setPageRowInfo call

Regression test

lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully.

Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4

What problem does this PR solve?

Issue Number: close #xxx

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

Describe your changes in detail.
For complex logic, explain the "Why" and "How".

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@JinyuanZhang617 JinyuanZhang617 force-pushed the fix-parquet-page-reader-github branch 3 times, most recently from 52280cd to 017ca21 Compare June 5, 2026 10:09
numRowsInPage_ = 0;
break;
}
// For nested (non top level) columns, 'preloadRepDefs' only fully

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the fix? comments only ?

## Background

For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). To ensure safety, there is an ahead-by-one-page invariant maintained at the exit of decodeRepDefs(), which ensures that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.

That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10.

## Fix

Hoist the lazy-load loop into seekToPage page-walk:

- fired only when hasChunkRepDefs_ AND not isTopLevel_ AND maxRepeat_ greater than 0, so the top-level / no-repdef paths stay zero-cost
- uses pageIndex_ + 1 >= numLeavesInPage_.size() so we always have the per-page leaf count for the page we are about to enter, mirroring the ahead-by-one-page invariant on the read path
- setPageRowInfo stays a pure lookup, keeping the original layering between rep/def producer (preloadRepDefs / loadMoreRepDefs) and consumer (setPageRowInfo / getLengthsAndNulls)
- loadMoreRepDefs is called at most once per page header, vs. an earlier attempt where it could be called per setPageRowInfo call

## Regression test

lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully.

Co-Authored-By: Aime <aime@bytedance.com>
Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4
@JinyuanZhang617 JinyuanZhang617 force-pushed the fix-parquet-page-reader-github branch from 017ca21 to ee16d17 Compare June 5, 2026 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants