fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633
Closed
JinyuanZhang617 wants to merge 1 commit into
Closed
fix(parquet): drain lazy preloadedRepDefs_ from seekToPage#633JinyuanZhang617 wants to merge 1 commit into
JinyuanZhang617 wants to merge 1 commit into
Conversation
52280cd to
017ca21
Compare
guhaiyan0221
reviewed
Jun 5, 2026
| numRowsInPage_ = 0; | ||
| break; | ||
| } | ||
| // For nested (non top level) columns, 'preloadRepDefs' only fully |
Collaborator
There was a problem hiding this comment.
where is the fix? comments only ?
## Background For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). To ensure safety, there is an ahead-by-one-page invariant maintained at the exit of decodeRepDefs(), which ensures that the next consumer access can always read numLeavesInPage_[pageIndex_] safely. That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10. ## Fix Hoist the lazy-load loop into seekToPage page-walk: - fired only when hasChunkRepDefs_ AND not isTopLevel_ AND maxRepeat_ greater than 0, so the top-level / no-repdef paths stay zero-cost - uses pageIndex_ + 1 >= numLeavesInPage_.size() so we always have the per-page leaf count for the page we are about to enter, mirroring the ahead-by-one-page invariant on the read path - setPageRowInfo stays a pure lookup, keeping the original layering between rep/def producer (preloadRepDefs / loadMoreRepDefs) and consumer (setPageRowInfo / getLengthsAndNulls) - loadMoreRepDefs is called at most once per page header, vs. an earlier attempt where it could be called per setPageRowInfo call ## Regression test lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully. Co-Authored-By: Aime <aime@bytedance.com> Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4
017ca21 to
ee16d17
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
For nested (non top level) columns Bolt only fully decodes the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). guhaiyan@ commit 27e789e added an ahead-by-one-page invariant at the exit of decodeRepDefs() so that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.
That invariant only fires on the read path driven by RepeatedColumnReader::readRepeatedFor. When a sibling top-level filter pushdown causes the nested column leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never touches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary, setPageRowInfo does ++pageIndex_ followed by numLeavesInPage_[pageIndex_] and raises (10 vs. 10) Seeking past known repdefs for non top level column page 10.
Fix
Hoist the lazy-load loop into seekToPage page-walk:
Regression test
lazyRepDefSkipPastSampledBoundary writes a single row group containing a sparse MAP STRING STRING column with dataPageSize=1024 so the chunk has many small pages, applies a sparse IN filter on a sibling BIGINT column and lowers decodeRepDefPageCount to 2 via RowReaderOptions to make the sampling boundary deterministic. Without this change the test reproduces the assertion failure; with the change the read completes successfully.
Change-Id: Iaec5fb28f07533762c1421cba0601096eb203fe4
What problem does this PR solve?
Issue Number: close #xxx
Type of Change
Description
Describe your changes in detail.
For complex logic, explain the "Why" and "How".
Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes