Unify scan_plan across parquet and DuckDB-native scans #869
Open
kevkrist wants to merge 2 commits into
Collaborator
This PR touches a bunch of files that I deleted, which makes my rebase more difficult. Let's wait for #871 to merge before merging this.
Route the DuckDB-native GPU scan through the same scan_plan abstraction the parquet scan uses, replacing its parallel projected_cols / projected_types vectors and ad-hoc "keep first N columns" projection with the shared plan + assemble_scan_output.
- scan_plan::data_column gains optional is_rowid + logical-type fields, and build_scan_plan gains build_scan_plan_options (decode_rowid_columns + type_for) so the native path routes synthesized rowid columns into data_columns / output_layout and carries per-column types. Parquet is unchanged: default options reproduce today's behavior.
- duckdb_native_ingestible_table_info carries a shared scan_plan; the projected_column struct and the now-dead output_types / names / projection_ids fields are removed.
- The pipeline converter builds the plan once via build_scan_plan.
- The walker and decoder consume plan.data_columns directly.
- The ingestible builds filters over plan.batch_position_by_column_id and reshapes output via assemble_scan_output (needs_output_assembly), the same path parquet uses.
Also fixes a latent rowid-filter crash: convert_table_filters_to_expression resolved a filter column's type via returned_types.at(primary_idx), which is out of bounds for rowid's sentinel primary index. Rowid's filter type now resolves as BIGINT directly.
Tests: walker unit tests migrated to the scan_plan signature; new test_build_scan_plan.cpp covers rowid routing, the parquet default drop, the missing-type throw, pure-filter columns, and reordered projection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
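The rowid-filter fix described in this commit message can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the `logical_type` enum, the sentinel value `rowid_primary_idx`, and both helper functions are assumptions made for illustration, not the signatures used by convert_table_filters_to_expression in this PR. The point is only the shape of the bug (a bounds-checked lookup with a sentinel index) and of the fix (short-circuit rowid to BIGINT before the lookup).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for DuckDB's logical types; illustration only.
enum class logical_type { INTEGER, BIGINT, VARCHAR };

// Assumed sentinel for rowid's primary index; the real value is not stated in the PR.
constexpr std::size_t rowid_primary_idx = std::numeric_limits<std::size_t>::max();

// Old shape: every filter column is looked up in returned_types, so the
// rowid sentinel makes .at() throw std::out_of_range.
inline logical_type filter_type_unchecked(const std::vector<logical_type>& returned_types,
                                          std::size_t primary_idx) {
  return returned_types.at(primary_idx);
}

// New shape: rowid resolves to BIGINT without touching returned_types.
inline logical_type filter_type(const std::vector<logical_type>& returned_types,
                                std::size_t primary_idx) {
  if (primary_idx == rowid_primary_idx) {
    return logical_type::BIGINT;
  }
  return returned_types.at(primary_idx);
}

// Usage: a two-column table with one filter on column 1 and one on rowid.
inline void demo_filter_type() {
  const std::vector<logical_type> returned_types{logical_type::INTEGER,
                                                 logical_type::VARCHAR};
  assert(filter_type(returned_types, 1) == logical_type::VARCHAR);
  assert(filter_type(returned_types, rowid_primary_idx) == logical_type::BIGINT);
}
```

Ordinary columns still go through the bounds-checked lookup, so a genuinely bad index would continue to surface as an exception rather than being masked.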
ed00455 to
19fc910
The unified scan_plan carried two per-column fields (is_rowid, type) that
only the duckdb-native reader populates; parquet/pinned left them at
defaults. Fold them into a single std::optional<reader_decode_info> so the
"type and is_rowid travel together" invariant is enforced by construction
instead of a runtime loop, and so the native-only nature is self-documenting
(reader_info present => reader-typed path; empty => the reader resolves types
itself).
- data_column::{is_rowid,type} -> data_column::reader_info
(reader_decode_info{type, is_rowid}); reader_decode_info cannot exist
without a type, so the build-time check now only verifies presence.
- Native decoder/walker/ingestible read reader_info->{type,is_rowid};
duckdb_column_metadata's own is_rowid is unchanged.
- Note the future direction: when a second reader needs per-column decode
metadata (ORC/Arrow), promote reader_info to a std::variant with each
format's arm defined outside scan_plan.
Tests updated to the new shape. [scan] suite green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
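The refactor in this commit message can be sketched roughly as follows. The struct layouts, the `logical_type` enum, and the `all_reader_typed` helper are assumptions made for illustration; only the names data_column, reader_info, and reader_decode_info come from the commit message. The sketch shows how making the type a required constructor argument means a reader_decode_info cannot exist without one, so the build-time check only has to test for presence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for DuckDB's logical types; illustration only.
enum class logical_type { INTEGER, BIGINT, VARCHAR };

// No default constructor: type and is_rowid always travel together.
struct reader_decode_info {
  logical_type type;
  bool is_rowid;
  explicit reader_decode_info(logical_type t, bool rowid = false)
      : type(t), is_rowid(rowid) {}
};

// Assumed shape of a plan column. An empty reader_info means the reader
// resolves types itself (the parquet/pinned case in the commit message).
struct data_column {
  std::size_t column_id;
  std::optional<reader_decode_info> reader_info;
};

// The build-time check reduces to a presence test on every column.
inline bool all_reader_typed(const std::vector<data_column>& cols) {
  return std::all_of(cols.begin(), cols.end(),
                     [](const data_column& c) { return c.reader_info.has_value(); });
}

// Usage: a native-style plan with one regular column and one rowid column.
inline void demo_reader_info() {
  static_assert(!std::is_default_constructible_v<reader_decode_info>,
                "reader_decode_info must be built with a type");
  const std::vector<data_column> native{
      data_column{0, reader_decode_info{logical_type::VARCHAR}},
      data_column{1, reader_decode_info{logical_type::BIGINT, true}}};
  assert(all_reader_typed(native));
  assert(native[1].reader_info->is_rowid);
}
```

If a second reader later needs its own per-column metadata, the optional member is the natural place to swap in a std::variant, as the commit message notes.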
Collaborator
@kevkrist is this still ongoing work?
Collaborator
Author
@bwyogatama I will rebase and evaluate whether it serves any purpose after gpu_ingestible |
Collaborator
Yeah, and this is not P0, so don't worry too much about it.
Routes the DuckDB-native GPU scan through the same scan_plan abstraction the parquet scan already uses. The native bind data now carries a shared scan_plan instead of parallel projected_cols/projected_types vectors: the walker and decoder read plan.data_columns, filters resolve through plan.batch_position_by_column_id, and output is reshaped via assemble_scan_output instead of a hand-rolled projection. scan_plan gains optional is_rowid + per-column type fields (behind build_scan_plan_options) so the native decoder can synthesize rowid and resolve types up front; the parquet path is unchanged. Also fixes a latent crash when a filter references the rowid column.
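As a rough illustration of the plan-driven flow summarized above, the sketch below models a plan with a column-id to batch-position map and an output layout, then reshapes a decoded batch through it. The `scan_plan_sketch` struct, `make_plan`, and `assemble_output_sketch` are hypothetical simplifications: the member names mirror the PR's data_columns, batch_position_by_column_id, and output_layout, but their real types and the real assemble_scan_output signature are not shown in this PR text and are assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using column = std::vector<std::int64_t>;

// Assumed, simplified plan shape; the real scan_plan carries more than this.
struct scan_plan_sketch {
  std::vector<std::size_t> data_columns;  // column ids, in decoded-batch order
  std::unordered_map<std::size_t, std::size_t> batch_position_by_column_id;
  std::vector<std::size_t> output_layout;  // batch positions, in output order
};

// Build the plan once: record where each decoded column lands in the batch,
// then express the projection as positions into that batch.
// Throws std::out_of_range if a projected id was never decoded.
inline scan_plan_sketch make_plan(const std::vector<std::size_t>& decoded_ids,
                                  const std::vector<std::size_t>& projected_ids) {
  scan_plan_sketch plan;
  plan.data_columns = decoded_ids;
  for (std::size_t pos = 0; pos < decoded_ids.size(); ++pos) {
    plan.batch_position_by_column_id.emplace(decoded_ids[pos], pos);
  }
  for (const std::size_t id : projected_ids) {
    plan.output_layout.push_back(plan.batch_position_by_column_id.at(id));
  }
  return plan;
}

// Reshape a decoded batch into output order. Columns decoded only for
// filtering are absent from output_layout and are dropped here.
inline std::vector<column> assemble_output_sketch(const scan_plan_sketch& plan,
                                                  const std::vector<column>& batch) {
  assert(batch.size() == plan.data_columns.size());
  std::vector<column> out;
  out.reserve(plan.output_layout.size());
  for (const std::size_t pos : plan.output_layout) {
    out.push_back(batch.at(pos));
  }
  return out;
}
```

For example, decoding column ids {7, 3, 5} while projecting {5, 7} leaves column 3 available to filters at batch position 1 but omits it from the output, which also covers the reordered-projection case without assuming the projected columns are the first N.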