Skip to content

Unify scan_plan across parquet and DuckDB-native scans#869

Open
kevkrist wants to merge 2 commits into
sirius-db:devfrom
kevkrist:scan-plan-unification
Open

Unify scan_plan across parquet and DuckDB-native scans#869
kevkrist wants to merge 2 commits into
sirius-db:devfrom
kevkrist:scan-plan-unification

Conversation

@kevkrist

@kevkrist kevkrist commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Routes the DuckDB-native GPU scan through the same scan_plan abstraction the parquet scan already uses. The native bind data now carries a shared scan_plan instead of parallel projected_cols/projected_types vectors: the walker and decoder read plan.data_columns, filters resolve through plan.batch_position_by_column_id, and output is reshaped via assemble_scan_output instead of a hand-rolled projection. scan_plan gains optional is_rowid + per-column type fields (behind build_scan_plan_options) so the native decoder can synthesize rowid and resolve types up front; the parquet path is unchanged. Also fixes a latent crash when a filter references the rowid column.

@kevkrist kevkrist marked this pull request as ready for review June 3, 2026 19:46
@bwyogatama

Copy link
Copy Markdown
Collaborator

This PR touches a bunch of files that i deleted, which makes my rebase more difficult. Let's wait for #871 to merge before merging this

Route the DuckDB-native GPU scan through the same scan_plan abstraction the
parquet scan uses, replacing its parallel projected_cols / projected_types
vectors and ad-hoc "keep first N columns" projection with the shared plan +
assemble_scan_output.

- scan_plan::data_column gains optional is_rowid + logical-type fields, and
  build_scan_plan gains build_scan_plan_options (decode_rowid_columns +
  type_for) so the native path routes synthesized rowid columns into
  data_columns / output_layout and carries per-column types. Parquet is
  unchanged: default options reproduce today's behavior.
- duckdb_native_ingestible_table_info carries a shared scan_plan; the
  projected_column struct and the now-dead output_types / names /
  projection_ids fields are removed.
- The pipeline converter builds the plan once via build_scan_plan.
- The walker and decoder consume plan.data_columns directly.
- The ingestible builds filters over plan.batch_position_by_column_id and
  reshapes output via assemble_scan_output (needs_output_assembly), the same
  path parquet uses.

Also fixes a latent rowid-filter crash: convert_table_filters_to_expression
resolved a filter column's type via returned_types.at(primary_idx), out of
bounds for rowid's sentinel primary index. Rowid's filter type now resolves
as BIGINT directly.

Tests: walker unit tests migrated to the scan_plan signature; new
test_build_scan_plan.cpp covers rowid routing, the parquet default drop, the
missing-type throw, pure-filter columns, and reordered projection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevkrist kevkrist force-pushed the scan-plan-unification branch from ed00455 to 19fc910 Compare June 10, 2026 15:47
@kevkrist kevkrist changed the title Scan plan unification + duckdb scan manager metadata parsing Unify scan_plan across parquet and DuckDB-native scans Jun 10, 2026
@kevkrist kevkrist enabled auto-merge June 10, 2026 16:38
The unified scan_plan carried two per-column fields (is_rowid, type) that
only the duckdb-native reader populates; parquet/pinned left them at
defaults. Fold them into a single std::optional<reader_decode_info> so the
"type and is_rowid travel together" invariant is enforced by construction
instead of a runtime loop, and so the native-only nature is self-documenting
(reader_info present => reader-typed path; empty => the reader resolves types
itself).

- data_column::{is_rowid,type} -> data_column::reader_info
  (reader_decode_info{type, is_rowid}); reader_decode_info cannot exist
  without a type, so the build-time check now only verifies presence.
- Native decoder/walker/ingestible read reader_info->{type,is_rowid};
  duckdb_column_metadata's own is_rowid is unchanged.
- Note the future direction: when a second reader needs per-column decode
  metadata (ORC/Arrow), promote reader_info to a std::variant with each
  format's arm defined outside scan_plan.

Tests updated to the new shape. [scan] suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bwyogatama

Copy link
Copy Markdown
Collaborator

@kevkrist is this still an ongoing work?

@kevkrist

Copy link
Copy Markdown
Collaborator Author

@bwyogatama I will rebase and evaluate whether it serves any purpose after gpu_ingestible

@bwyogatama

Copy link
Copy Markdown
Collaborator

Yeah and this is not P0 so don't worry too much about it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants