[Bug] Seeking past known repdefs for non top level column page #634

@JinyuanZhang617

Description

Component Selection

  • Core Engine (Expression eval, Memory, Vector)
  • Connectors / File Formats (Hive, Parquet, etc.)
  • API / Bindings (Python, etc.)
  • Build
  • Other

Describe the Bug

For nested (non-top-level) columns, Bolt fully decodes only the first decodeRepDefPageCount_ pages (default 10) of each column chunk in preloadRepDefs(). The remaining pages are kept as raw rep/def bytes in preloadedRepDefs_ and decoded on demand by loadMoreRepDefs(). Commit 27e789e by guhaiyan@ added an ahead-by-one-page invariant at the exit of decodeRepDefs(), so that the next consumer access can always read numLeavesInPage_[pageIndex_] safely.
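The lazy decoding scheme described above can be modeled with a toy structure. This is only a sketch under stated assumptions: LazyRepDefs, preload, loadMore and readPage are hypothetical names rather than Bolt's actual classes or signatures, and "decoding" a page is reduced to counting its raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-column-chunk rep/def state.
struct LazyRepDefs {
  std::vector<std::vector<uint8_t>> rawPages; // undecoded rep/def bytes, one entry per page
  std::vector<int32_t> numLeavesInPage;       // filled only for pages decoded so far
  size_t pageIndex = 0;                       // page the consumer is positioned on

  // Decode one more page on demand. Returns false when no raw pages remain.
  bool loadMore() {
    const size_t next = numLeavesInPage.size();
    if (next >= rawPages.size()) {
      return false;
    }
    // Toy "decode": the leaf count is just the number of raw bytes.
    numLeavesInPage.push_back(static_cast<int32_t>(rawPages[next].size()));
    return true;
  }

  // Eagerly decode up to sampleCount pages (the sampled prefix).
  void preload(size_t sampleCount) {
    while (numLeavesInPage.size() < sampleCount && loadMore()) {
    }
  }

  // Read path: consume the current page, then restore the ahead-by-one
  // invariant so numLeavesInPage[pageIndex] is valid for the next access.
  int32_t readPage() {
    assert(pageIndex < numLeavesInPage.size());
    const int32_t leaves = numLeavesInPage[pageIndex];
    ++pageIndex;
    if (pageIndex >= numLeavesInPage.size()) {
      loadMore();
    }
    return leaves;
  }
};
```

In this model, a chunk of 12 pages preloaded with a sample count of 10 has 10 decoded entries. After the tenth readPage() call the eleventh entry has been decoded, so reading can continue past the sampled prefix.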

That invariant is maintained only on the read path driven by RepeatedColumnReader::readRepeatedFor. When filter pushdown on a sibling top-level column makes the nested column's leaf reader take the skip path instead, control flows SelectiveColumnReader::skip -> PageReader::skip -> seekToPage -> prepareDataPageV1 -> setPageRowInfo and never reaches decodeRepDefs. If the skip distance is large enough to cross the sampled boundary (the first decodeRepDefPageCount_ pages), setPageRowInfo executes ++pageIndex_, then accesses numLeavesInPage_[pageIndex_], and fails with: (10 vs. 10) Seeking past known repdefs for non top level column page 10.
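The failure mode can be illustrated with a second toy model, again only as a sketch. ToyPageReader and skipPages are hypothetical names, setPageRowInfo mirrors only the two steps named in the description (increment, then index), and the order of the two numbers in the error text is an assumption chosen to resemble the reported message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical reader that holds leaf counts for the decoded prefix only.
struct ToyPageReader {
  std::vector<int32_t> numLeavesInPage; // decoded pages; the rest are still raw
  size_t pageIndex = 0;

  // Skip-path step: advance to the next page, then index into the decoded
  // prefix. Nothing here decodes further pages, so the bounds check is the
  // only thing that stops an out-of-range access.
  int32_t setPageRowInfo() {
    assert(pageIndex < numLeavesInPage.size()); // positioned on a decoded page
    ++pageIndex;
    if (pageIndex >= numLeavesInPage.size()) {
      throw std::runtime_error(
          "(" + std::to_string(pageIndex) + " vs. " +
          std::to_string(numLeavesInPage.size()) +
          ") Seeking past known repdefs for non top level column page " +
          std::to_string(pageIndex));
    }
    return numLeavesInPage[pageIndex];
  }

  // Skip forward by whole pages without going through the read path.
  void skipPages(size_t count) {
    for (size_t i = 0; i < count; ++i) {
      setPageRowInfo();
    }
  }
};
```

With 10 decoded pages, skipping 9 pages leaves the toy reader on page 9, the last decoded page. One more skipped page moves it to index 10, which equals the decoded count, and the check throws.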

Reproduction Steps

TEST_F(ParquetReaderTest, lazyRepDefSkipPastSampledBoundary)

Bolt Version / Commit ID

main branch @bad4660d5d3489204b177523901d0364b5a58f63

System Configuration

- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)

Logs / Stack Trace

Expected Behavior

No response

Additional context

No response
