Feature Category
Performance Optimization
Problem / Use Case
ParquetRowReader::skip currently always skips at the page level, so it still issues IO (to read page headers) for row groups that lie entirely within the skip range. Skipping those row groups at the row group level would avoid that IO.
Proposed Solution
Key Idea
Instead of always performing page-level skipping, detect when the skip range fully covers one or more row groups and bypass them entirely without triggering IO.
Approach (based on the commit)
- Track the remaining number of rows to skip (remainingSkip)
- Before entering the page-reading path:
  - Check the current row group’s row count
  - If remainingSkip >= rowGroup.rowCount:
    - Decrease remainingSkip
    - Advance to the next row group
    - Do not initialize streams or perform any IO for this row group
- Only when remainingSkip falls within a row group:
  - Fall back to existing page-level skipping logic (via page headers)
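
The steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the code from the commit: RowGroupInfo, SkipPlan, and planSkip are hypothetical names, and the sketch assumes the reader is positioned at the start of a row group when the skip begins. It only decides how many row groups to bypass and how many rows are left for the existing page-level path; it performs no IO itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for row group metadata; only the row count matters here.
struct RowGroupInfo {
  uint64_t rowCount;
};

// Hypothetical result type describing the outcome of a skip request.
struct SkipPlan {
  size_t rowGroupIndex;      // row group the reader is positioned at afterwards
  size_t rowGroupsBypassed;  // row groups skipped without any IO
  uint64_t pageLevelSkip;    // rows left for the existing page-level skip path
};

// Assumption: the reader sits at the first row of rowGroups[startRowGroup].
SkipPlan planSkip(
    const std::vector<RowGroupInfo>& rowGroups,
    size_t startRowGroup,
    uint64_t numRows) {
  assert(startRowGroup <= rowGroups.size());

  SkipPlan plan{startRowGroup, 0, 0};
  uint64_t remainingSkip = numRows;

  // Bypass every row group that the skip range covers completely.
  // No streams are initialized and no page headers are read for these.
  while (plan.rowGroupIndex < rowGroups.size() &&
         remainingSkip >= rowGroups[plan.rowGroupIndex].rowCount) {
    remainingSkip -= rowGroups[plan.rowGroupIndex].rowCount;
    ++plan.rowGroupIndex;
    ++plan.rowGroupsBypassed;
  }

  // The skip ends inside a row group: hand the remainder to page-level
  // skipping. If the skip ran past the last row group, nothing is left to do.
  if (plan.rowGroupIndex < rowGroups.size()) {
    plan.pageLevelSkip = remainingSkip;
  }
  return plan;
}
```

For example, with row groups of 100, 50, and 200 rows, planSkip(groups, 0, 170) bypasses the first two row groups and leaves 20 rows for page-level skipping in the third. A skip of 30 rows bypasses nothing and falls straight through to the page-level path.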
References / Prior Art
No response
Importance
Blocker (Cannot use Bolt without this)
Willingness to Contribute
Yes, I can submit a PR