Read nested parquet (struct/list/map) on GPU + transparent s3:// scan by ran-yuan-rui · Pull Request #872 · sirius-db/sirius

ran-yuan-rui · 2026-06-04T07:08:14Z

Summary

This PR adds Phase-1 nested Parquet support to the Sirius GPU parquet scan path.
The main goal is to let Sirius read and return common lakehouse nested columns from Parquet on the GPU path. This covers projecting top-level STRUCT, LIST / array, and MAP columns, including nested combinations such as struct-of-list, list-of-struct, nested struct, list-of-list, and MAP-as-list-of-struct layouts. The same scan/result materialization path works for both local Parquet files and S3 Parquet files.
This PR intentionally supports nested projection, not full nested expression semantics. Filtering on nested columns, joining on nested columns, grouping/aggregating by nested columns, subfield projection such as payload.user_id or items[1], and UNNEST remain out of scope for this phase. Those require planner, expression executor, and operator semantics over nested values, while this PR focuses on decoding nested Parquet layout and materializing DuckDB-compatible nested results. Unsupported nested operations return clear errors instead of silently falling back or producing partial semantics.
As a supporting change, this PR also registers a minimal Sirius-owned s3:// DuckDB FileSystem. This is not a CPU fallback path. It exists so DuckDB's native read_parquet('s3://...') can bind the file and read the Parquet footer through Sirius's existing s3_ioctx, without loading DuckDB httpfs and without exposing sirius_read_parquet as the public surface.

The public S3 SQL surface becomes:

SET gpu_execution = true;
SELECT ... FROM read_parquet('s3://bucket/file.parquet');

S3 is GPU-only — three guards prevent a silent CPU fallback

S3 data has no CPU path, so every route that could quietly read s3:// on the CPU is closed:

- The FileSystem refuses to open s3:// when gpu_execution is off (there would be no GPU consumer).
- The FileSystem refuses to open s3:// while an internal CPU-replay query is active (covers the indirect case where a failed GPU query is replayed on the CPU, including through a view).
- OnFinalizePrepare inspects the bound plan; if it reads s3:// and GPU translation fails, it raises a clear "S3 CPU fallback is not supported" error instead of handing the plan back to DuckDB's CPU engine.
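
The three guards can be modelled in a few lines. This is a minimal Python sketch of the decision logic only, not the Sirius C++ code: `check_s3_open`, `finalize_prepare`, and `S3GpuOnlyError` are hypothetical names, and the message for the CPU-replay guard is illustrative (the other two messages are the ones quoted in this PR).

```python
class S3GpuOnlyError(RuntimeError):
    """Raised when an s3:// read would otherwise be served by a CPU plan."""


def check_s3_open(path: str, gpu_execution: bool, internal_query_active: bool) -> None:
    """Guards 1 and 2: decide whether the FileSystem may open `path`."""
    if not path.startswith("s3://"):
        return  # not an S3 path, so these guards do not apply
    if not gpu_execution:
        # Guard 1: no GPU consumer exists when gpu_execution is off.
        raise S3GpuOnlyError("S3 is GPU-only; SET gpu_execution=true")
    if internal_query_active:
        # Guard 2: an internal query means a CPU replay is in progress.
        # (Illustrative message; the real wording is not shown in the PR.)
        raise S3GpuOnlyError("S3 is GPU-only; refusing s3:// open during CPU replay")


def finalize_prepare(plan_reads_s3: bool, gpu_translation_ok: bool) -> str:
    """Guard 3: choose an engine once the plan has been bound."""
    if gpu_translation_ok:
        return "gpu"
    if plan_reads_s3:
        raise S3GpuOnlyError("S3 CPU fallback is not supported")
    return "cpu"  # local files may still fall back to the CPU engine
```

Note the asymmetry in the sketch: a local-only plan that fails GPU translation still returns "cpu", while any plan that touches s3:// raises.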

Nested parquet read/project

- Recursive parquet-group → DuckDB STRUCT / LIST / MAP schema mapping at bind time.
- Nested cuDF column → DuckDB Vector materialization (list offset slicing, validity bit-packing, recursive child handling).
- Scan planning relaxed to allow nested projected columns; operators that cannot yet operate on nested values reject them explicitly rather than producing wrong results.
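
To illustrate the recursive schema mapping, here is a simplified Python sketch. It assumes a toy `Node` tree standing in for the Parquet schema (the real code walks the Parquet metadata in C++), a reduced leaf-type table, and the standard 3-level LIST and `key_value` MAP layouts. `Node`, `LEAF_TYPES`, and `to_duckdb_type` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for one Parquet schema element (hypothetical)."""
    name: str
    leaf_type: Optional[str] = None   # set on leaves only, e.g. "int64"
    logical: Optional[str] = None     # "LIST", "MAP", or None for a plain group
    children: List["Node"] = field(default_factory=list)


# Reduced leaf mapping for the sketch; the real mapping covers many more types.
LEAF_TYPES = {"int32": "INTEGER", "int64": "BIGINT", "double": "DOUBLE", "string": "VARCHAR"}


def to_duckdb_type(node: Node) -> str:
    """Recursively map a schema subtree to a DuckDB type string."""
    if not node.children:
        return LEAF_TYPES[node.leaf_type]
    if node.logical == "LIST":
        # 3-level encoding: <name> (LIST) -> repeated group -> element
        element = node.children[0].children[0]
        return to_duckdb_type(element) + "[]"
    if node.logical == "MAP":
        # <name> (MAP) -> repeated group key_value -> (key, value)
        key, value = node.children[0].children
        return f"MAP({to_duckdb_type(key)}, {to_duckdb_type(value)})"
    # Any other group is a STRUCT whose fields recurse.
    fields = ", ".join(f"{c.name} {to_duckdb_type(c)}" for c in node.children)
    return f"STRUCT({fields})"


# list-of-struct, one of the combinations named in the summary
items = Node("items", logical="LIST", children=[
    Node("list", children=[
        Node("element", children=[Node("id", "int64"), Node("tag", "string")]),
    ]),
])
print(to_duckdb_type(items))  # STRUCT(id BIGINT, tag VARCHAR)[]
```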

Out of scope (follow-ups)

- Operating on nested values (filter / join / group-by over STRUCT/LIST/MAP) and sub-field projection pushdown (cuDF-limited). This PR covers read + project only.
- No httpfs, no S3 write path (COPY … TO 's3://' is rejected), no S3 LIST/glob.
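
The read + project boundary can be pictured as a check that runs before an operator touches a column. The sketch below is a hedged Python model, not the Sirius implementation: `is_nested`, `check_operation`, the type-string test, and the error wording are all assumptions made for illustration.

```python
from typing import Dict, Iterable


def is_nested(duckdb_type: str) -> bool:
    """Crude test on a DuckDB type string (assumption for this sketch)."""
    return duckdb_type.startswith(("STRUCT", "MAP")) or duckdb_type.endswith("[]")


def check_operation(op: str, columns: Iterable[str], schema: Dict[str, str]) -> None:
    """Allow PROJECT on anything; reject every other operator on nested columns."""
    if op == "PROJECT":
        return
    for col in columns:
        if is_nested(schema[col]):
            # Name the column so the user sees exactly what is unsupported.
            raise NotImplementedError(
                f"{op} on nested column '{col}' ({schema[col]}) is not supported on the GPU path"
            )


schema = {"id": "BIGINT", "items": "INTEGER[]", "payload": "STRUCT(a INTEGER)"}
check_operation("PROJECT", ["id", "items", "payload"], schema)  # accepted
check_operation("FILTER", ["id"], schema)                       # accepted
```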

Notes for reviewers

The sirius_s3_filesystem is intentionally bind-only and read-only: it backs the footer read for the GPU path and never serves a CPU scan. The three guards above are the load-bearing part of the "S3 is GPU-only, no CPU fallback" contract.

kevkrist · 2026-06-04T11:50:28Z

@ran-yuan-rui are you sure a lot of this isn't already supported in the sirius parquet scan operator?
See #663

ran-yuan-rui · 2026-06-04T11:56:50Z

> @ran-yuan-rui are you sure a lot of this isn't already supported in the sirius parquet scan operator? See #663

Good question — checked, and no, it's not. The GPU scan's cuDF→DuckDB type mapping had no STRUCT/LIST/MAP cases, so a nested column hit an "unsupported type" throw. Locally that just falls back to DuckDB's CPU read_parquet (so it looks supported), but on S3 — GPU-only, no CPU fallback — nested files were unreadable.

This PR adds the missing bits so the GPU scan reads + projects nested directly. Operating on nested (filter/join/group-by) is still rejected on purpose — that's cuDF-gated follow-up, not this PR.

The GPU parquet scan previously threw at bind on any struct/list/map column, making nested-column files unreadable on the S3 (GPU-only) path and forcing a CPU fallback for local files. This adds read + projection pass-through of nested columns.

- extract_schema recursively maps nested parquet schema subtrees to DuckDB STRUCT / LIST / MAP types, matching DuckDB's own read_parquet bind shape (the 3-level LIST encoding and the MAP key_value group are handled).
- host_table_chunk_reader materializes nested cuDF columns into DuckDB struct/list/map vectors recursively: list offsets become list_entry_t, the value/field children recurse, and validity is copied per level (distinguishing null vs empty list and null struct vs null field).
- result collection carries the full DuckDB result types (with nested child types/names) so nested vectors are built faithfully; the flat GPU type representation, which cannot hold nested children, is bypassed for result materialization and tolerates nested types as placeholders elsewhere.
- parquet scan planning accepts a projected nested top-level column, keeps nested top-level names out of hive-partition / schema-evolution detection, and reserves bytes for all leaf chunks under a nested column.
- operating on a nested column raises a clear unsupported error naming the column instead of crashing: WHERE / GROUP BY / JOIN ON, including predicates pushed down into the scan's table filters; a nested column absent from some files under schema evolution is likewise rejected clearly.

Adds parquet fixtures and tests covering schema mapping, bind shape vs DuckDB, scan values vs a DuckDB CPU oracle (struct/list/map/deep, incl. null and empty edges and a multi-chunk boundary case), and the unsupported-operation boundary.
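
The list materialization step can be sketched in plain Python. This models only the arithmetic: `n + 1` offsets become `n` (offset, length) entries, and a packed validity bitmask (least-significant bit first, as in the Arrow columnar layout) decides whether a row is NULL or merely empty. The function names are hypothetical and the child is a flat Python list rather than a cuDF column.

```python
from typing import List, Optional, Sequence, Tuple


def bit_is_set(mask: bytes, i: int) -> bool:
    """Read validity bit i, least-significant bit first within each byte."""
    return (mask[i // 8] >> (i % 8)) & 1 == 1


def list_entries(offsets: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn n + 1 offsets into n (offset, length) pairs."""
    return [(offsets[i], offsets[i + 1] - offsets[i]) for i in range(len(offsets) - 1)]


def materialize_list(offsets: Sequence[int], validity: Optional[bytes], child: Sequence) -> list:
    """Build one Python value per row: None for NULL, [] for an empty list."""
    rows = []
    for i, (start, length) in enumerate(list_entries(offsets)):
        if validity is not None and not bit_is_set(validity, i):
            rows.append(None)  # NULL list: the validity bit is clear
        else:
            rows.append(list(child[start:start + length]))  # may be empty
    return rows


# Rows: [10, 20], [] (empty), NULL, [30]. Validity bits 0, 1, 3 set -> 0b1011.
print(materialize_list([0, 2, 2, 2, 3], bytes([0b1011]), [10, 20, 30]))
# [[10, 20], [], None, [30]]
```

Rows 1 and 2 have identical offsets (length 0); only the validity bit separates the empty list from the NULL, which is why validity has to be copied at every nesting level.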

Make `SET gpu_execution=true; SELECT ... FROM read_parquet('s3://bucket/file.parquet')` work transparently: no httpfs, no sirius_read_parquet rewrite, no S3 CPU fallback.

Register a read-only Sirius DuckDB FileSystem for the s3:// scheme so DuckDB's native read_parquet binds an s3:// object by reading the parquet footer through s3_ioctx. The resulting native scan is captured by the existing transparent optimizer hook and executed on GPU, where column data is read via s3_ioctx (not this FileSystem). The FileSystem is stateless and resolves the per-connection backend lazily (FileOpener -> ClientContext -> SiriusContext -> scan_manager).

S3 stays GPU-only; three guards keep CPU off the S3 data path:

- OpenFile refuses an s3:// open unless gpu_execution is enabled ("S3 is GPU-only; SET gpu_execution=true").
- OpenFile refuses an s3:// open while an internal query is active (the CPU fallback replay path), so a query that reaches s3:// indirectly (e.g. via a view) cannot be served to a CPU plan.
- When a captured plan that reads s3:// fails GPU translation, OnFinalizePrepare raises "S3 CPU fallback is not supported" instead of running DuckDB's CPU plan. Detection is plan-based (walk the LogicalGet tree for an s3:// MultiFileBindData), covering views whose body reads s3:// parquet.

Tests: FileSystem unit + MinIO integration (scheme match, glob/write rejection, positional read, short-read/negative guards), transparent flat and nested (struct/list/map) projections matching a DuckDB CPU oracle, gpu_execution=false rejection, and GPU-unsupported fallback rejection. Drops an obsolete test that asserted s3:// reads fail when no S3 filesystem is registered.
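
The plan-based detection amounts to a recursive walk over the bound plan. The Python sketch below uses a toy `PlanNode` in place of DuckDB's logical operators and a plain `files` list in place of MultiFileBindData; both are assumptions for illustration. Because it inspects the bound plan rather than the SQL text, an s3:// scan reached through a view body is found as well.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Toy logical operator (hypothetical stand-in for a DuckDB plan node)."""
    op: str
    files: List[str] = field(default_factory=list)  # populated on GET nodes only
    children: List["PlanNode"] = field(default_factory=list)


def plan_reads_s3(node: PlanNode) -> bool:
    """Return True if any GET node anywhere in the tree scans an s3:// file."""
    if node.op == "GET" and any(f.startswith("s3://") for f in node.files):
        return True
    return any(plan_reads_s3(child) for child in node.children)


# A join of a local scan with an S3 scan hidden under a projection.
plan = PlanNode("JOIN", children=[
    PlanNode("GET", files=["/data/local.parquet"]),
    PlanNode("PROJECTION", children=[
        PlanNode("GET", files=["s3://bucket/file.parquet"]),
    ]),
])
print(plan_reads_s3(plan))  # True
```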

The async libcurl-multi S3 ioctx is the default backend now, so the perf baseline should track it rather than the legacy blocking ioctx. Build the benchmark context from the async s3_ioctx and give it a dedicated CHUNK_SIZE (1 MiB) host memory resource for its device-read staging (the blocking backend reads host-side and needs none). Drop the now-unused blocking include and using.

Cover the common SQL operation shapes on S3-backed tables and check each against the same query reading the local parquet on DuckDB CPU: LEFT/RIGHT OUTER JOIN, GROUP BY / HAVING / COUNT DISTINCT, string LIKE / IN, a mixed local + S3 scan join, NULL / DECIMAL / TIMESTAMP edge values (with a new edge_types.parquet fixture), and CTE / subquery / ordered-limit shapes. Each case uses a total ORDER BY for deterministic comparison and tolerant matching for DECIMAL columns.
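
A comparison helper of the kind described might look like the following Python sketch. It assumes both result sets arrive already ordered by the same total ORDER BY, compares DECIMAL values within a tolerance, and requires exact equality elsewhere. The function name and the tolerance value are assumptions; the PR does not state the tolerance the tests use.

```python
from decimal import Decimal
from typing import Sequence, Tuple


def rows_match(gpu_rows: Sequence[Tuple], cpu_rows: Sequence[Tuple],
               tol: Decimal = Decimal("0.0001")) -> bool:
    """Compare two ordered result sets row by row and column by column."""
    if len(gpu_rows) != len(cpu_rows):
        return False
    for gpu_row, cpu_row in zip(gpu_rows, cpu_rows):
        if len(gpu_row) != len(cpu_row):
            return False
        for g, c in zip(gpu_row, cpu_row):
            if g is None or c is None:
                if g is not c:          # NULL only matches NULL
                    return False
            elif isinstance(g, Decimal) and isinstance(c, Decimal):
                if abs(g - c) > tol:    # tolerant match for DECIMAL columns
                    return False
            elif g != c:                # exact match for everything else
                return False
    return True
```

Relying on a total ORDER BY keeps the helper simple: a positional comparison is enough, with no need to sort or hash rows first.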

ran-yuan-rui requested review from aminaramoon and bwyogatama June 4, 2026 07:08

ran-yuan-rui changed the title ~~nested parquet read for Sirius~~ Read nested parquet (struct/list/map) on GPU + transparent s3:// scan Jun 4, 2026

ran-yuan-rui force-pushed the feature-S3datasource-transparent-readparquet branch 3 times, most recently from 195679b to 4632b09, June 9, 2026 09:50

ran-yuan-rui added 4 commits June 13, 2026 08:09

ran-yuan-rui force-pushed the feature-S3datasource-transparent-readparquet branch from 15aa1c6 to cf3de83, June 13, 2026 01:06

ran-yuan-rui wants to merge 4 commits into sirius-db:dev from ran-yuan-rui:feature-S3datasource-transparent-readparquet

3 participants
