Component Selection
Describe the Bug
For nested (non-top-level) columns, Bolt fully decodes only the first decodeRepDefPageCount_ (default 10) pages of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes inside preloadedRepDefs_ and are decoded on demand by loadMoreRepDefs(). Commit 27e789e (by guhaiyan@) added an ahead-by-one-page invariant at the exit of decodeRepDefs(), so that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.
That invariant is enforced only on the read path driven by RepeatedColumnReader::readRepeatedFor. When filter pushdown on a sibling top-level column causes the nested column's leaf reader to take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never reaches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary (the last page decoded by preloadRepDefs()), setPageRowInfo executes ++pageIndex_ followed by an access to numLeavesInPage_[pageIndex_], and the check fails with: (10 vs. 10) Seeking past known repdefs for non top level column page 10.
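To make the failure mode concrete, here is a minimal, self-contained toy model of the lazy rep/def bookkeeping. It is not Bolt's code: `ToyPageReader`, `loadMore`, `advanceUnchecked`, and `advanceChecked` are hypothetical names, each page is reduced to a single leaf count, and `pageIndex` starts at 0 for simplicity. `advanceUnchecked` mimics the skip path described above (increment, then index), while `advanceChecked` sketches one possible direction for a fix, assuming the ahead-by-one invariant can be restored before the index is used.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy model only. Names loosely mirror the report; none of this is Bolt's API.
struct ToyPageReader {
  std::vector<int> rawLeavesPerPage; // stands in for preloadedRepDefs_ (all pages, undecoded)
  std::vector<int> numLeavesInPage;  // stands in for numLeavesInPage_ (decoded pages only)
  std::size_t pageIndex = 0;         // stands in for pageIndex_

  // Eagerly "decode" only the first `sampled` pages, like preloadRepDefs().
  ToyPageReader(std::vector<int> pages, std::size_t sampled)
      : rawLeavesPerPage(std::move(pages)) {
    assert(sampled > 0);
    for (std::size_t i = 0; i < sampled && i < rawLeavesPerPage.size(); ++i) {
      numLeavesInPage.push_back(rawLeavesPerPage[i]);
    }
  }

  // Stands in for loadMoreRepDefs(): decode one more page if any remain.
  bool loadMore() {
    if (numLeavesInPage.size() >= rawLeavesPerPage.size()) {
      return false;
    }
    numLeavesInPage.push_back(rawLeavesPerPage[numLeavesInPage.size()]);
    return true;
  }

  // Skip path as described in the report: advance, then index, with no decode.
  int advanceUnchecked() {
    ++pageIndex;
    if (pageIndex >= numLeavesInPage.size()) {
      throw std::out_of_range("Seeking past known repdefs");
    }
    return numLeavesInPage[pageIndex];
  }

  // Hypothetical fix: decode on demand until pageIndex is covered, then index.
  int advanceChecked() {
    ++pageIndex;
    while (pageIndex >= numLeavesInPage.size() && loadMore()) {
    }
    if (pageIndex >= numLeavesInPage.size()) {
      throw std::out_of_range("Seeking past end of column chunk");
    }
    return numLeavesInPage[pageIndex];
  }
};
```

With 12 pages and a sample size of 10, nine calls to `advanceUnchecked` succeed and the tenth throws, because `pageIndex` and the decoded count are both 10. The same sequence through `advanceChecked` decodes pages 10 and 11 on demand and only fails once the chunk is genuinely exhausted.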
Reproduction Steps
TEST_F(ParquetReaderTest, lazyRepDefSkipPastSampledBoundary)
Bolt Version / Commit ID
main branch @bad4660d5d3489204b177523901d0364b5a58f63
System Configuration
- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)
Logs / Stack Trace
Expected Behavior
No response
Additional context
No response