[Feature] Optimize ParquetRowReader::Impl::skip to avoid IO for skipped row groups #530

@Weixin-Xu

Feature Category

Performance Optimization

Problem / Use Case

ParquetRowReader::skip currently always skips at the page level, so it issues IO even for row groups that the skip range covers entirely. These row groups can instead be skipped at the row group level without issuing any IO.

Proposed Solution

Key Idea

Instead of always performing page-level skipping, detect when the skip range fully covers one or more row groups and bypass them entirely without triggering IO.

Approach (based on the commit)

  • Track the remaining number of rows to skip (remainingSkip)
  • Before entering the page-reading path:
    • Check the current row group’s row count
    • If remainingSkip >= rowGroup.rowCount:
      • Decrease remainingSkip by rowGroup.rowCount
      • Advance to the next row group
      • Do not initialize streams or perform any IO for this row group
  • Only when remainingSkip is smaller than the current row group's row count (the skip ends inside that row group):
    • Fall back to existing page-level skipping logic (via page headers)
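
A minimal sketch of the row-group-level loop described above, to make the control flow concrete. This is not the actual Bolt code: `RowGroupInfo`, `SkipResult`, and `planSkip` are hypothetical names used only for illustration, and the sketch assumes the reader is positioned at the start of `currentRowGroup` when the skip is requested. It only uses row counts (which would come from the already-parsed footer metadata), so no streams are opened for bypassed row groups.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-row-group metadata from the file footer.
struct RowGroupInfo {
  uint64_t rowCount;
};

// Hypothetical result: where reading resumes after the skip.
struct SkipResult {
  size_t rowGroupIndex;      // row group in which reading resumes
  uint64_t rowsIntoGroup;    // rows left for the page-level skip path
  size_t rowGroupsBypassed;  // row groups skipped without any IO
};

// Bypass every row group that the skip range fully covers, using metadata
// only. Assumes the reader sits at the start of `currentRowGroup`.
SkipResult planSkip(
    const std::vector<RowGroupInfo>& rowGroups,
    size_t currentRowGroup,
    uint64_t rowsToSkip) {
  assert(currentRowGroup <= rowGroups.size());

  uint64_t remainingSkip = rowsToSkip;
  size_t index = currentRowGroup;
  size_t bypassed = 0;

  while (index < rowGroups.size() &&
         remainingSkip >= rowGroups[index].rowCount) {
    // Whole row group is skipped: no stream initialization, no IO.
    remainingSkip -= rowGroups[index].rowCount;
    ++index;
    ++bypassed;
  }

  // If index < rowGroups.size(), remainingSkip < rowCount of that group and
  // the caller falls back to page-level skipping for remainingSkip rows.
  // If index == rowGroups.size(), the skip ran past the end of the file.
  return {index, remainingSkip, bypassed};
}
```

For example, with row groups of 100, 200, and 50 rows, skipping 320 rows from the start bypasses the first two row groups and leaves 20 rows for the page-level path in the third. A skip that starts in the middle of a row group would first need to account for the rows already consumed there, which this sketch leaves out.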

References / Prior Art

No response

Importance

Blocker (Cannot use Bolt without this)

Willingness to Contribute

Yes, I can submit a PR
