Read nested parquet (struct/list/map) on GPU + transparent s3:// scan #872
Open
ran-yuan-rui wants to merge 4 commits into
Collaborator
@ran-yuan-rui are you sure a lot of this isn't already supported in the sirius parquet scan operator?
Contributor
Author
Good question. I checked, and no, it's not. The GPU scan's cuDF→DuckDB type mapping had no STRUCT/LIST/MAP cases, so a nested column hit an "unsupported type" throw. Locally that just falls back to DuckDB's CPU read_parquet (so it looks supported), but on S3, which is GPU-only with no CPU fallback, nested files were unreadable. This PR adds the missing pieces so the GPU scan reads and projects nested columns directly. Operating on nested columns (filter/join/group-by) is still rejected on purpose; that is a cuDF-gated follow-up, not this PR.
195679b to 4632b09
The GPU parquet scan previously threw at bind on any struct/list/map column, making nested-column files unreadable on the S3 (GPU-only) path and forcing a CPU fallback for local files. This adds read + projection pass-through of nested columns.
- extract_schema recursively maps nested parquet schema subtrees to DuckDB STRUCT / LIST / MAP types, matching DuckDB's own read_parquet bind shape (the 3-level LIST encoding and the MAP key_value group are handled).
- host_table_chunk_reader materializes nested cuDF columns into DuckDB struct/list/map vectors recursively: list offsets become list_entry_t, the value/field children recurse, and validity is copied per level (distinguishing null vs empty list and null struct vs null field).
- result collection carries the full DuckDB result types (with nested child types/names) so nested vectors are built faithfully; the flat GPU type representation, which cannot hold nested children, is bypassed for result materialization and tolerates nested types as placeholders elsewhere.
- parquet scan planning accepts a projected nested top-level column, keeps nested top-level names out of hive-partition / schema-evolution detection, and reserves bytes for all leaf chunks under a nested column.
- operating on a nested column raises a clear unsupported error naming the column instead of crashing: WHERE / GROUP BY / JOIN ON, including predicates pushed down into the scan's table filters; a nested column absent from some files under schema evolution is likewise rejected clearly.
Adds parquet fixtures and tests covering schema mapping, bind shape vs DuckDB, scan values vs a DuckDB CPU oracle (struct/list/map/deep, incl. null and empty edges and a multi-chunk boundary case), and the unsupported-operation boundary.
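For reviewers less familiar with offset-encoded lists, here is a minimal Python sketch of the "list offsets become list_entry_t" step and of the null-vs-empty distinction. It is not the code in this PR: `offsets_to_entries` and `materialize_lists` are hypothetical names, and plain Python lists stand in for the cuDF offsets column, the per-row validity, and the child column.

```python
from typing import List, Optional, Tuple

Entry = Optional[Tuple[int, int]]  # (offset, length), or None for a null row


def offsets_to_entries(offsets: List[int], valid: List[bool]) -> List[Entry]:
    """Turn n+1 list offsets into n (offset, length) entries.

    A null row yields None; an empty list yields a real entry of length 0,
    so the two cases stay distinguishable after conversion.
    """
    entries: List[Entry] = []
    for i in range(len(offsets) - 1):
        if not valid[i]:
            entries.append(None)
        else:
            entries.append((offsets[i], offsets[i + 1] - offsets[i]))
    return entries


def materialize_lists(child: List[int], offsets: List[int],
                      valid: List[bool]) -> List[Optional[List[int]]]:
    """Slice the flat child column by each entry to rebuild the rows."""
    rows: List[Optional[List[int]]] = []
    for entry in offsets_to_entries(offsets, valid):
        if entry is None:
            rows.append(None)
        else:
            offset, length = entry
            rows.append(child[offset:offset + length])
    return rows


# Rows: [1, 2], [], NULL, [3]
print(materialize_lists([1, 2, 3], [0, 2, 2, 2, 3], [True, True, False, True]))
# prints [[1, 2], [], None, [3]]
```

Note that the empty row and the null row have identical offsets (2, 2); only the validity tells them apart, which is why validity has to be copied at every nesting level.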
Make `SET gpu_execution=true; SELECT ... FROM read_parquet('s3://bucket/file.parquet')`
work transparently — no httpfs, no sirius_read_parquet rewrite, no S3 CPU fallback.
Register a read-only Sirius DuckDB FileSystem for the s3:// scheme so DuckDB's
native read_parquet binds an s3:// object by reading the parquet footer through
s3_ioctx. The resulting native scan is captured by the existing transparent
optimizer hook and executed on GPU, where column data is read via s3_ioctx (not
this FileSystem). The FileSystem is stateless and resolves the per-connection
backend lazily (FileOpener -> ClientContext -> SiriusContext -> scan_manager).
S3 stays GPU-only; three guards keep CPU off the S3 data path:
- OpenFile refuses an s3:// open unless gpu_execution is enabled ("S3 is GPU-only;
SET gpu_execution=true").
- OpenFile refuses an s3:// open while an internal query is active (the CPU
fallback replay path), so a query that reaches s3:// indirectly (e.g. via a
view) cannot be served to a CPU plan.
- When a captured plan that reads s3:// fails GPU translation, OnFinalizePrepare
raises "S3 CPU fallback is not supported" instead of running DuckDB's CPU plan.
Detection is plan-based (walk the LogicalGet tree for an s3:// MultiFileBindData),
covering views whose body reads s3:// parquet.
Tests: FileSystem unit + MinIO integration (scheme match, glob/write rejection,
positional read, short-read/negative guards), transparent flat and nested
(struct/list/map) projections matching a DuckDB CPU oracle, gpu_execution=false
rejection, and GPU-unsupported fallback rejection. Drops an obsolete test that
asserted s3:// reads fail when no S3 filesystem is registered.
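The plan-based detection mentioned above can be pictured with a small Python sketch. This is an illustration under assumptions, not the extension's code: `PlanNode` and `reads_s3` are hypothetical stand-ins for DuckDB's logical operator tree and the walk over its scan nodes. The sketch assumes a view's body has already been expanded into the plan, so its scan shows up as an ordinary node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical stand-in for a logical plan operator."""
    kind: str                                   # e.g. "GET", "PROJECTION", "JOIN"
    paths: List[str] = field(default_factory=list)   # files bound by a GET node
    children: List["PlanNode"] = field(default_factory=list)


def reads_s3(node: PlanNode) -> bool:
    """Return True if any scan node in the subtree binds an s3:// path."""
    if node.kind == "GET" and any(p.startswith("s3://") for p in node.paths):
        return True
    return any(reads_s3(child) for child in node.children)


# A join of a local scan with a projection over an S3 scan.
plan = PlanNode("JOIN", children=[
    PlanNode("GET", paths=["/data/local.parquet"]),
    PlanNode("PROJECTION", children=[
        PlanNode("GET", paths=["s3://bucket/file.parquet"]),
    ]),
])
print(reads_s3(plan))  # prints True
```

Walking the bound plan rather than matching the SQL text is what lets the check catch indirect reads such as views.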
The async libcurl-multi S3 ioctx is the default backend now, so the perf baseline should track it rather than the legacy blocking ioctx. Build the benchmark context from the async s3_ioctx and give it a dedicated CHUNK_SIZE (1 MiB) host memory resource for its device-read staging (the blocking backend reads host-side and needs none). Drop the now-unused blocking include and using.
Cover the common SQL operation shapes on S3-backed tables and check each against the same query reading the local parquet on DuckDB CPU: LEFT/RIGHT OUTER JOIN, GROUP BY / HAVING / COUNT DISTINCT, string LIKE / IN, a mixed local + S3 scan join, NULL / DECIMAL / TIMESTAMP edge values (with a new edge_types.parquet fixture), and CTE / subquery / ordered-limit shapes. Each case uses a total ORDER BY for deterministic comparison and tolerant matching for DECIMAL columns.
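As a rough illustration of the comparison described above, the following Python sketch compares two already-ordered result sets row by row, with exact matching for most values and a tolerance for DECIMAL values. The function names and the `1e-9` tolerance are assumptions for the example, not values taken from the test suite.

```python
from decimal import Decimal
from typing import Any, Sequence


def values_match(a: Any, b: Any, tol: Decimal = Decimal("1e-9")) -> bool:
    """Exact match, except DECIMAL values, which may differ by at most tol."""
    if a is None or b is None:
        return a is None and b is None      # NULL only matches NULL
    if isinstance(a, Decimal) or isinstance(b, Decimal):
        return abs(Decimal(str(a)) - Decimal(str(b))) <= tol
    return a == b


def results_match(gpu_rows: Sequence[Sequence[Any]],
                  cpu_rows: Sequence[Sequence[Any]]) -> bool:
    """Compare positionally; both inputs must already be totally ordered."""
    if len(gpu_rows) != len(cpu_rows):
        return False
    return all(
        len(g) == len(c) and all(values_match(x, y) for x, y in zip(g, c))
        for g, c in zip(gpu_rows, cpu_rows)
    )


gpu = [(1, Decimal("10.50"), None), (2, Decimal("3.25"), "x")]
cpu = [(1, Decimal("10.5000000001"), None), (2, Decimal("3.25"), "x")]
print(results_match(gpu, cpu))  # prints True
```

Because the comparison is positional, a total ORDER BY on both sides is what makes it deterministic; without it, equal result sets in different row orders would be reported as a mismatch.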
15aa1c6 to cf3de83
Summary
This PR adds Phase-1 nested Parquet support to the Sirius GPU parquet scan path.

The main goal is to let Sirius read and return common lakehouse nested columns from Parquet on the GPU path. This covers projecting top-level `STRUCT`, `LIST` / array, and `MAP` columns, including nested combinations such as struct-of-list, list-of-struct, nested struct, list-of-list, and MAP-as-list-of-struct layouts. The same scan/result materialization path works for both local Parquet files and S3 Parquet files.

This PR intentionally supports nested projection, not full nested expression semantics. Filtering on nested columns, joining on nested columns, grouping/aggregating by nested columns, subfield projection such as `payload.user_id` or `items[1]`, and `UNNEST` remain out of scope for this phase. Those require planner, expression executor, and operator semantics over nested values, while this PR focuses on decoding nested Parquet layout and materializing DuckDB-compatible nested results. Unsupported nested operations return clear errors instead of silently falling back or producing partial semantics.

As a supporting change, this PR also registers a minimal Sirius-owned `s3://` DuckDB FileSystem. This is not a CPU fallback path. It exists so DuckDB's native `read_parquet('s3://...')` can bind the file and read the Parquet footer through Sirius's existing `s3_ioctx`, without loading DuckDB `httpfs` and without exposing `sirius_read_parquet` as the public surface.

The public S3 SQL surface becomes:
SET gpu_execution = true;
SELECT ... FROM read_parquet('s3://bucket/file.parquet');
S3 is GPU-only — three guards prevent a silent CPU fallback
S3 data has no CPU path, so every route that could quietly read `s3://` on the CPU is closed:
- `OpenFile` refuses `s3://` when `gpu_execution` is off (there would be no GPU consumer).
- `OpenFile` refuses `s3://` while an internal CPU-replay query is active (covers the indirect case where a failed GPU query is replayed on the CPU, including through a view).
- `OnFinalizePrepare` inspects the bound plan; if it reads `s3://` and GPU translation fails, it raises a clear "S3 CPU fallback is not supported" error instead of handing the plan back to DuckDB's CPU engine.
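The three guards can be read as two small decision functions. The Python sketch below is only a model of that contract, not the extension's implementation: the function names and the exception type are hypothetical, and the message used for the second guard is invented for the example (the other two messages are the ones quoted in this PR).

```python
from typing import Callable, Optional


class S3GpuOnlyError(RuntimeError):
    """Hypothetical error type for a refused s3:// access."""


def check_open(path: str, gpu_execution: bool, internal_query_active: bool) -> None:
    """Guards 1 and 2: decide whether an s3:// open may proceed."""
    if not path.startswith("s3://"):
        return  # non-S3 paths are not this filesystem's concern
    if not gpu_execution:
        raise S3GpuOnlyError("S3 is GPU-only; SET gpu_execution=true")
    if internal_query_active:
        # Message text is an assumption; the PR does not quote this one.
        raise S3GpuOnlyError("s3:// open refused during CPU fallback replay")


def check_finalize(plan_reads_s3: bool, gpu_translation_ok: bool) -> None:
    """Guard 3: never hand an S3-reading plan back to the CPU engine."""
    if plan_reads_s3 and not gpu_translation_ok:
        raise S3GpuOnlyError("S3 CPU fallback is not supported")


def rejected(fn: Callable[..., None], *args) -> Optional[str]:
    """Return the rejection message, or None if the call was allowed."""
    try:
        fn(*args)
    except S3GpuOnlyError as err:
        return str(err)
    return None


print(rejected(check_open, "s3://bucket/file.parquet", False, False))
# prints S3 is GPU-only; SET gpu_execution=true
print(rejected(check_open, "s3://bucket/file.parquet", True, False))
# prints None
```

Each guard fails closed: an S3 read either runs on the GPU or raises, and no branch lets a CPU scan continue.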
Nested parquet read/project
- `STRUCT`/`LIST`/`MAP` schema mapping at bind time.
- `Vector` materialization (list offset slicing, validity bit-packing, recursive child handling).
- Operators that would operate on nested values reject them explicitly rather than producing wrong results.
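To make the bind-time schema mapping concrete, here is a small Python sketch of a recursive walk over a Parquet-style schema tree that produces DuckDB type strings. It is a simplified model, not the PR's `extract_schema`: `SchemaNode` and `to_duckdb_type` are hypothetical, leaves are assumed to carry a DuckDB type name already, and only the standard 3-level LIST layout and the MAP `key_value` group are handled (legacy 2-level lists are not).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaNode:
    """Hypothetical stand-in for one node of a Parquet schema tree."""
    name: str
    leaf_type: Optional[str] = None    # DuckDB type name for a leaf, e.g. "INTEGER"
    annotation: Optional[str] = None   # "LIST", "MAP", or None for a plain group
    children: List["SchemaNode"] = field(default_factory=list)


def to_duckdb_type(node: SchemaNode) -> str:
    if not node.children:
        return node.leaf_type or "UNKNOWN"
    if node.annotation == "LIST":
        # 3-level encoding: group (LIST) > repeated group > element
        element = node.children[0].children[0]
        return to_duckdb_type(element) + "[]"
    if node.annotation == "MAP":
        # group (MAP) > repeated group key_value > key, value
        key, value = node.children[0].children
        return f"MAP({to_duckdb_type(key)}, {to_duckdb_type(value)})"
    fields = ", ".join(f"{c.name} {to_duckdb_type(c)}" for c in node.children)
    return f"STRUCT({fields})"


# items: a list of structs, in the 3-level layout
items = SchemaNode("items", annotation="LIST", children=[
    SchemaNode("list", children=[
        SchemaNode("element", children=[
            SchemaNode("id", leaf_type="INTEGER"),
            SchemaNode("tag", leaf_type="VARCHAR"),
        ]),
    ]),
])
print(to_duckdb_type(items))  # prints STRUCT(id INTEGER, tag VARCHAR)[]
```

The intermediate `list` and `key_value` groups never appear in the resulting type; they exist only in the Parquet encoding, which is why the walk skips over them.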
Out of scope (follow-ups)
- Sub-field projection pushdown (cuDF-limited). This PR covers read + project only.
- No `httpfs`, no S3 write path (`COPY … TO 's3://'` is rejected), no S3 `LIST`/glob.
Notes for reviewers
`sirius_s3_filesystem` is intentionally bind-only and read-only: it backs the footer read for the GPU path and never serves a CPU scan. The three guards above are the load-bearing part of the "S3 is GPU-only, no CPU fallback" contract.