feat(starrocks): infer FILES() parquet schema in the compute node#962

Open
mbrobbel wants to merge 1 commit into sirius-db:dev from mbrobbel:starrocks-cn-file-schema

@mbrobbel

Member

Implement the PInternalService get_file_schema BRPC handler so the FE can resolve the schema of SELECT * FROM FILES('...parquet') against the Rust compute node, instead of failing because no backend answered the schema proxy request.

The handler decodes the binary-thrift TGetFileSchemaRequest attachment, reads the parquet footer asynchronously, and maps the file's top-level columns to StarRocks slot descriptors. Type mapping mirrors the native scanner so the inferred schema is one the FILE_SCAN path can actually read:

  • integers map by physical width (INT32 -> INT, INT64 -> BIGINT)
  • DECIMAL32/64/128 by precision; wider precision falls back to VARCHAR
  • raw BYTE_ARRAY -> VARBINARY, FIXED_LEN_BYTE_ARRAY -> VARCHAR
  • DATE/TIME/TIMESTAMP, JSON, BSON handled via logical and legacy converted types
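
The width- and precision-based rules above can be sketched as a small pure function. This is an illustration only, not the code in this PR: `Physical`, `Annotation`, `SlotType`, and `infer_slot_type` are hypothetical names, the decimal precision bounds (9, 18, 38) are assumed from the usual DECIMAL32/64/128 limits, and the temporal, JSON, and BSON cases are left out because the description does not spell out their targets.

```rust
/// Hypothetical, simplified physical type of a parquet leaf column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    Int32,
    Int64,
    ByteArray,
    FixedLenByteArray,
}

/// Hypothetical, simplified logical annotation on a column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Annotation {
    /// No logical or converted type: a "raw" column.
    Plain,
    Decimal { precision: u32, scale: u32 },
}

/// Hypothetical stand-in for the StarRocks slot type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotType {
    Int,
    BigInt,
    Decimal32 { precision: u32, scale: u32 },
    Decimal64 { precision: u32, scale: u32 },
    Decimal128 { precision: u32, scale: u32 },
    Varchar,
    Varbinary,
}

pub fn infer_slot_type(physical: Physical, annotation: Annotation) -> SlotType {
    // Decimals are chosen by precision; the bounds are assumptions.
    if let Annotation::Decimal { precision, scale } = annotation {
        return match precision {
            1..=9 => SlotType::Decimal32 { precision, scale },
            10..=18 => SlotType::Decimal64 { precision, scale },
            19..=38 => SlotType::Decimal128 { precision, scale },
            // Wider (or otherwise out-of-range) precision falls back to VARCHAR.
            _ => SlotType::Varchar,
        };
    }
    // Unannotated columns map by physical type.
    match physical {
        Physical::Int32 => SlotType::Int,
        Physical::Int64 => SlotType::BigInt,
        Physical::ByteArray => SlotType::Varbinary,
        Physical::FixedLenByteArray => SlotType::Varchar,
    }
}
```

Keeping the mapping free of I/O makes it easy to compare case by case against whatever the native scanner accepts.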

Inputs outside the supported surface are rejected with a clear error rather than resolving a schema the scanner would later choke on: nested columns, multi-file ranges, case-insensitive duplicate names, remote URI schemes, and non-local file authorities. Schema reading uses the parquet crate's native metadata reader (no arrow dependency).
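
A rough sketch of three of these rejections, using only the standard library. The function names, error strings, and the decision to accept an empty or `localhost` authority and scheme-less paths are assumptions made for illustration; the handler in this PR may draw these lines differently.

```rust
use std::collections::HashSet;

/// Accept only local files: a bare path, or a `file://` URI whose authority
/// is empty or `localhost` (assumed definition of "local").
pub fn local_path(uri: &str) -> Result<&str, String> {
    let Some((scheme, rest)) = uri.split_once("://") else {
        // No scheme at all: treat the input as a plain local path.
        return Ok(uri);
    };
    if !scheme.eq_ignore_ascii_case("file") {
        return Err(format!("unsupported URI scheme '{scheme}'"));
    }
    // `rest` has the shape "<authority>/<path>".
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    if !(authority.is_empty() || authority.eq_ignore_ascii_case("localhost")) {
        return Err(format!("non-local file authority '{authority}'"));
    }
    if path.is_empty() {
        return Err("file URI has no path".to_string());
    }
    Ok(path)
}

/// Reject column names that collide once case is ignored.
pub fn check_unique_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(name.to_lowercase()) {
            return Err(format!("duplicate column name '{name}' (case-insensitive)"));
        }
    }
    Ok(())
}

/// Reject requests that carry anything other than exactly one file range.
pub fn check_single_range(range_count: usize) -> Result<(), String> {
    if range_count == 1 {
        Ok(())
    } else {
        Err(format!("expected exactly one file range, got {range_count}"))
    }
}
```

Returning an error from each check before any schema is produced matches the stated goal of failing early rather than handing the scanner a schema it cannot read.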

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mbrobbel mbrobbel marked this pull request as ready for review June 16, 2026 20:40