Read nested parquet (struct/list/map) on GPU + transparent s3:// scan #872

Open
ran-yuan-rui wants to merge 4 commits into sirius-db:dev from ran-yuan-rui:feature-S3datasource-transparent-readparquet


@ran-yuan-rui

Contributor

Summary

This PR adds Phase-1 nested Parquet support to the Sirius GPU parquet scan path.
The main goal is to let Sirius read and return common lakehouse nested columns from Parquet on the GPU path. This covers projecting top-level STRUCT, LIST / array, and MAP columns, including nested combinations such as struct-of-list, list-of-struct, nested struct, list-of-list, and MAP-as-list-of-struct layouts. The same scan/result materialization path works for both local Parquet files and S3 Parquet files.
This PR intentionally supports nested projection, not full nested expression semantics. Filtering on nested columns, joining on nested columns, grouping/aggregating by nested columns, subfield projection such as payload.user_id or items[1], and UNNEST remain out of scope for this phase. Those require planner, expression executor, and operator semantics over nested values, while this PR focuses on decoding nested Parquet layout and materializing DuckDB-compatible nested results. Unsupported nested operations return clear errors instead of silently falling back or producing partial semantics.
As a supporting change, this PR also registers a minimal Sirius-owned s3:// DuckDB FileSystem. This is not a CPU fallback path. It exists so DuckDB's native read_parquet('s3://...') can bind the file and read the Parquet footer through Sirius's existing s3_ioctx, without loading DuckDB httpfs and without exposing sirius_read_parquet as the public surface.

The public S3 SQL surface becomes:

SET gpu_execution = true;
SELECT ... FROM read_parquet('s3://bucket/file.parquet');

S3 is GPU-only — three guards prevent a silent CPU fallback

S3 data has no CPU path, so every route that could quietly read s3:// on the CPU is closed:

  • The FileSystem refuses to open s3:// when gpu_execution is off (there would be no
    GPU consumer).
  • The FileSystem refuses to open s3:// while an internal CPU-replay query is active
    (covers the indirect case where a failed GPU query is replayed on the CPU, including
    through a view).
  • OnFinalizePrepare inspects the bound plan; if it reads s3:// and GPU translation
    fails, it raises a clear "S3 CPU fallback is not supported" error instead of handing the
    plan back to DuckDB's CPU engine.
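
The three guards above can be modelled in a few lines. This is a minimal sketch in Python, not the Sirius C++ code: `SessionState`, `open_file`, `finalize_prepare`, and the message used for the CPU-replay guard are hypothetical names chosen for illustration. Only the two quoted error strings for the first and third guards come from this PR.

```python
from dataclasses import dataclass


class S3GuardError(RuntimeError):
    """Raised when a request would put s3:// data on a CPU path."""


@dataclass
class SessionState:
    # Hypothetical stand-ins for per-connection state; not Sirius identifiers.
    gpu_execution: bool = False
    internal_cpu_replay_active: bool = False


def is_s3(path: str) -> bool:
    return path.startswith("s3://")


def open_file(path: str, state: SessionState) -> str:
    """Guards 1 and 2: refuse s3:// opens that have no GPU consumer."""
    if is_s3(path):
        if not state.gpu_execution:
            raise S3GuardError("S3 is GPU-only; SET gpu_execution=true")
        if state.internal_cpu_replay_active:
            # Message text is an assumption; the PR does not quote it.
            raise S3GuardError("s3:// cannot be opened during CPU replay")
    return f"handle:{path}"


def finalize_prepare(scan_paths, gpu_translation_ok: bool) -> str:
    """Guard 3: a plan that reads s3:// is never handed to the CPU engine."""
    if gpu_translation_ok:
        return "gpu"
    if any(is_s3(p) for p in scan_paths):
        raise S3GuardError("S3 CPU fallback is not supported")
    return "cpu"
```

In this model a local-only plan that fails GPU translation still falls back to the CPU, while any plan touching s3:// raises instead.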

Nested parquet read/project

  • Recursive parquet-group → DuckDB STRUCT / LIST / MAP schema mapping at bind time.
  • Nested cuDF column → DuckDB Vector materialization (list offset slicing, validity
    bit-packing, recursive child handling).
  • Scan planning relaxed to allow nested projected columns; operators that cannot yet
    operate on nested values reject them explicitly rather than producing wrong results.
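
As a rough illustration of the recursive schema mapping in the first bullet, the sketch below walks a simplified Parquet schema tree and produces a DuckDB-style type string. It is a Python model under stated assumptions, not the PR's `extract_schema`: the `Node` class, the `LEAF_TYPES` table, and `to_duckdb_type` are hypothetical, and only a handful of primitive types are covered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical, simplified view of one Parquet schema element.
    name: str
    physical: Optional[str] = None    # set for leaves, e.g. "INT64"
    annotation: Optional[str] = None  # "LIST", "MAP", or None for a plain group
    children: List["Node"] = field(default_factory=list)


# Small illustrative subset of primitive mappings.
LEAF_TYPES = {
    "BOOLEAN": "BOOLEAN",
    "INT32": "INTEGER",
    "INT64": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
}


def to_duckdb_type(node: Node) -> str:
    if not node.children:
        return LEAF_TYPES[node.physical]
    if node.annotation == "LIST":
        # 3-level encoding: LIST group -> repeated group -> element.
        (repeated,) = node.children
        (element,) = repeated.children
        return to_duckdb_type(element) + "[]"
    if node.annotation == "MAP":
        # MAP group -> repeated key_value group -> (key, value).
        (key_value,) = node.children
        key, value = key_value.children
        return f"MAP({to_duckdb_type(key)}, {to_duckdb_type(value)})"
    # Any other group is a struct; recurse into each field.
    fields = ", ".join(f"{c.name} {to_duckdb_type(c)}" for c in node.children)
    return f"STRUCT({fields})"
```

The recursion is what lets combinations such as list-of-struct or struct-of-list fall out without special cases.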

Out of scope (follow-ups)

  • Operating on nested values — filter / join / group-by over STRUCT/LIST/MAP — and
    sub-field projection pushdown (cuDF-limited). This PR covers read + project only.
  • No httpfs, no S3 write path (COPY … TO 's3://' is rejected), no S3 LIST/glob.

Notes for reviewers

  • The sirius_s3_filesystem is intentionally bind-only and read-only: it backs the footer read for the GPU path and never serves a CPU scan. The three guards above are the load-bearing part of the "S3 is GPU-only, no CPU fallback" contract.

@kevkrist

kevkrist commented Jun 4, 2026

Collaborator

@ran-yuan-rui are you sure a lot of this isn't already supported in the sirius parquet scan operator?
See #663

@ran-yuan-rui

Contributor Author

@ran-yuan-rui are you sure a lot of this isn't already supported in the sirius parquet scan operator? See #663

Good question — checked, and no, it's not. The GPU scan's cuDF→DuckDB type mapping had no STRUCT/LIST/MAP cases, so a nested column hit an "unsupported type" throw. Locally that just falls back to DuckDB's CPU read_parquet (so it looks supported), but on S3 — GPU-only, no CPU fallback — nested files were unreadable.

This PR adds the missing pieces so the GPU scan reads and projects nested columns directly. Operating on nested values (filter/join/group-by) is still rejected on purpose; that is a cuDF-gated follow-up, not part of this PR.

@ran-yuan-rui ran-yuan-rui changed the title from "nested parquet read for Sirius" to "Read nested parquet (struct/list/map) on GPU + transparent s3:// scan" Jun 4, 2026
@ran-yuan-rui ran-yuan-rui force-pushed the feature-S3datasource-transparent-readparquet branch 3 times, most recently from 195679b to 4632b09 Compare June 9, 2026 09:50

The GPU parquet scan previously threw at bind on any struct/list/map column,
making nested-column files unreadable on the S3 (GPU-only) path and forcing a
CPU fallback for local files. This adds read + projection pass-through of
nested columns.

- extract_schema recursively maps nested parquet schema subtrees to DuckDB
  STRUCT / LIST / MAP types, matching DuckDB's own read_parquet bind shape
  (the 3-level LIST encoding and the MAP key_value group are handled).
- host_table_chunk_reader materializes nested cuDF columns into DuckDB
  struct/list/map vectors recursively: list offsets become list_entry_t, the
  value/field children recurse, and validity is copied per level (distinguishing
  null vs empty list and null struct vs null field).
- result collection carries the full DuckDB result types (with nested child
  types/names) so nested vectors are built faithfully; the flat GPU type
  representation, which cannot hold nested children, is bypassed for result
  materialization and tolerates nested types as placeholders elsewhere.
- parquet scan planning accepts a projected nested top-level column, keeps
  nested top-level names out of hive-partition / schema-evolution detection,
  and reserves bytes for all leaf chunks under a nested column.
- operating on a nested column raises a clear unsupported error naming the
  column instead of crashing: WHERE / GROUP BY / JOIN ON, including predicates
  pushed down into the scan's table filters; a nested column absent from some
  files under schema evolution is likewise rejected clearly.
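
The list materialization described above (offsets become list entries, with validity kept separate so null and empty lists stay distinct) can be sketched as follows. This is a Python model, not the PR's `host_table_chunk_reader`: `bit_is_valid` and `list_entries` are hypothetical helpers, and the sketch assumes an Arrow-style validity bitmap (least-significant bit first, 1 means valid) and an offsets array with one more element than there are rows.

```python
from typing import List, Optional, Tuple


def bit_is_valid(mask: bytes, i: int) -> bool:
    """Read bit i of an LSB-first validity bitmap (1 = valid)."""
    return (mask[i // 8] >> (i % 8)) & 1 == 1


def list_entries(
    offsets: List[int], mask: Optional[bytes]
) -> List[Tuple[int, int, bool]]:
    """Turn n+1 list offsets into n (offset, length, valid) entries.

    Offsets are rebased to the first offset so that a sliced chunk, whose
    offsets do not start at 0, indexes into its own child values.
    """
    base = offsets[0]
    entries = []
    for i in range(len(offsets) - 1):
        valid = mask is None or bit_is_valid(mask, i)
        entries.append((offsets[i] - base, offsets[i + 1] - offsets[i], valid))
    return entries
```

For example, `list_entries([2, 4, 4, 4, 7], bytes([0b1011]))` yields a two-element list, an empty but valid list, a null list, and a three-element list. The second and third rows both have length 0 and differ only in the validity flag.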

Adds parquet fixtures and tests covering schema mapping, bind shape vs DuckDB,
scan values vs a DuckDB CPU oracle (struct/list/map/deep, incl. null and empty
edges and a multi-chunk boundary case), and the unsupported-operation boundary.

Make `SET gpu_execution=true; SELECT ... FROM read_parquet('s3://bucket/file.parquet')`
work transparently — no httpfs, no sirius_read_parquet rewrite, no S3 CPU fallback.

Register a read-only Sirius DuckDB FileSystem for the s3:// scheme so DuckDB's
native read_parquet binds an s3:// object by reading the parquet footer through
s3_ioctx. The resulting native scan is captured by the existing transparent
optimizer hook and executed on GPU, where column data is read via s3_ioctx (not
this FileSystem). The FileSystem is stateless and resolves the per-connection
backend lazily (FileOpener -> ClientContext -> SiriusContext -> scan_manager).

S3 stays GPU-only; three guards keep CPU off the S3 data path:
- OpenFile refuses an s3:// open unless gpu_execution is enabled ("S3 is GPU-only;
  SET gpu_execution=true").
- OpenFile refuses an s3:// open while an internal query is active (the CPU
  fallback replay path), so a query that reaches s3:// indirectly (e.g. via a
  view) cannot be served to a CPU plan.
- When a captured plan that reads s3:// fails GPU translation, OnFinalizePrepare
  raises "S3 CPU fallback is not supported" instead of running DuckDB's CPU plan.
  Detection is plan-based (walk the LogicalGet tree for an s3:// MultiFileBindData),
  covering views whose body reads s3:// parquet.

Tests: FileSystem unit + MinIO integration (scheme match, glob/write rejection,
positional read, short-read/negative guards), transparent flat and nested
(struct/list/map) projections matching a DuckDB CPU oracle, gpu_execution=false
rejection, and GPU-unsupported fallback rejection. Drops an obsolete test that
asserted s3:// reads fail when no S3 filesystem is registered.

The async libcurl-multi S3 ioctx is the default backend now, so the perf
baseline should track it rather than the legacy blocking ioctx. Build the
benchmark context from the async s3_ioctx and give it a dedicated
CHUNK_SIZE (1 MiB) host memory resource for its device-read staging (the
blocking backend reads host-side and needs none). Drop the now-unused
blocking include and using declaration.

Cover the common SQL operation shapes on S3-backed tables and check each
against the same query reading the local parquet on DuckDB CPU: LEFT/RIGHT
OUTER JOIN, GROUP BY / HAVING / COUNT DISTINCT, string LIKE / IN, a mixed
local + S3 scan join, NULL / DECIMAL / TIMESTAMP edge values (with a new
edge_types.parquet fixture), and CTE / subquery / ordered-limit shapes.
Each case uses a total ORDER BY for deterministic comparison and tolerant
matching for DECIMAL columns.
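
A comparison of that kind might look like the sketch below: both result sets are assumed to arrive already sorted by a total ORDER BY, so rows are compared position by position, with exact equality everywhere except DECIMAL values, which are compared within a tolerance. The function name `rows_match` and the default tolerance are assumptions for illustration, not values taken from the PR's test suite.

```python
from decimal import Decimal
from typing import Any, Sequence


def rows_match(
    gpu_rows: Sequence[Sequence[Any]],
    cpu_rows: Sequence[Sequence[Any]],
    tol: Decimal = Decimal("0.01"),  # hypothetical tolerance
) -> bool:
    """Compare two ordered result sets; DECIMAL cells may differ by at most tol."""
    if len(gpu_rows) != len(cpu_rows):
        return False
    for gpu_row, cpu_row in zip(gpu_rows, cpu_rows):
        if len(gpu_row) != len(cpu_row):
            return False
        for g, c in zip(gpu_row, cpu_row):
            if g is None or c is None:
                # NULL only matches NULL.
                if g is not c:
                    return False
            elif isinstance(g, Decimal) and isinstance(c, Decimal):
                if abs(g - c) > tol:
                    return False
            elif g != c:
                return False
    return True
```

Without a total ORDER BY, ties could come back in different orders on the two engines and a positional comparison like this one would report false mismatches.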
@ran-yuan-rui ran-yuan-rui force-pushed the feature-S3datasource-transparent-readparquet branch from 15aa1c6 to cf3de83 Compare June 13, 2026 01:06

3 participants