
[BugFix] Honor per-snapshot schema for Iceberg time travel reads #74711

Draft: GavinMar wants to merge 1 commit into StarRocks:main from GavinMar:fix_ice_time_travel_schema



GavinMar commented Jun 12, 2026


Why I'm doing:

Iceberg time travel queries (VERSION AS OF / TIMESTAMP AS OF) resolved the target snapshot only to select data files, while column resolution, the result-set metadata, the descriptor sent to BE, and scan planning all used the latest table schema. After a schema change (e.g. a column rename), a time-travel read returned the current column names instead of the schema bound to the targeted snapshot:

```sql
CREATE TABLE t (id INT, c INT);                  -- snapshot S1 committed with schema (id, c)
ALTER TABLE t RENAME COLUMN c TO c_renamed;
INSERT INTO t VALUES (...);                      -- snapshot S2 with schema (id, c_renamed)

SELECT * FROM t VERSION AS OF <S1>;              -- returned column `c_renamed` instead of `c`
SELECT * FROM t VERSION AS OF <S1> WHERE c = 1;  -- failed: `c` could not be resolved
```

This diverges from Iceberg semantics. The Iceberg spec defines a snapshot's schema-id as "ID of the table's current schema when the snapshot was created". The Iceberg reference implementation (SnapshotUtil.schemaFor, DataScan#useSnapshotSchema) and other query engines read snapshot-id, timestamp, and tag time travel with the snapshot's schema.
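A minimal model of the snapshot-to-schema lookup may make the rename failure concrete. It assumes a schema can be reduced to a map from column name to field id. The classes below (SnapshotSchemaModel, Schema, Table, schemaFor) are hypothetical stand-ins, not the Iceberg Java API. The fallback to the current schema when a snapshot records no schema-id is modeled on Iceberg's SnapshotUtil.schemaFor(table, snapshotId) and should be read as an approximation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of Iceberg table metadata; these are NOT the Iceberg API classes.
final class SnapshotSchemaModel {

    /** A schema maps column names to field ids; field ids are stable across renames. */
    static final class Schema {
        final int schemaId;
        final Map<String, Integer> fieldIdsByName;

        Schema(int schemaId, Map<String, Integer> fieldIdsByName) {
            this.schemaId = schemaId;
            this.fieldIdsByName = fieldIdsByName;
        }

        /** Returns the field id bound to a column name, or null if the name is not in this schema. */
        Integer findFieldId(String name) {
            return fieldIdsByName.get(name);
        }
    }

    static final class Table {
        final Map<Integer, Schema> schemasById = new HashMap<>();
        // Snapshot id -> schema-id recorded at commit time (null if the snapshot predates schema ids).
        final Map<Long, Integer> schemaIdBySnapshot = new HashMap<>();
        int currentSchemaId;

        Schema currentSchema() {
            return schemasById.get(currentSchemaId);
        }

        /** Schema used to read a snapshot; approximates Iceberg's SnapshotUtil.schemaFor. */
        Schema schemaFor(long snapshotId) {
            if (!schemaIdBySnapshot.containsKey(snapshotId)) {
                throw new IllegalArgumentException("Cannot find snapshot " + snapshotId);
            }
            Integer schemaId = schemaIdBySnapshot.get(snapshotId);
            if (schemaId == null) {
                return currentSchema(); // no schema-id recorded: fall back to the current schema
            }
            Schema schema = schemasById.get(schemaId);
            if (schema == null) {
                throw new IllegalStateException("Cannot find schema " + schemaId);
            }
            return schema;
        }
    }

    /** The rename scenario: S1 (id 1) uses schema 0 (id, c); S2 (id 2) uses schema 1 (id, c_renamed). */
    static Table renameScenario() {
        Table t = new Table();
        t.schemasById.put(0, new Schema(0, Map.of("id", 1, "c", 2)));
        t.schemasById.put(1, new Schema(1, Map.of("id", 1, "c_renamed", 2)));
        t.schemaIdBySnapshot.put(1L, 0);
        t.schemaIdBySnapshot.put(2L, 1);
        t.currentSchemaId = 1;
        return t;
    }
}
```

Field ids survive renames. So the snapshot schema resolves `c` to field id 2, while the current schema has no column named `c`. This is why `WHERE c = 1` fails when it is bound against the latest schema.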

What I'm doing:

Resolve the query period once during analysis and bind the snapshot schema through the whole read path:

  • QueryAnalyzer: after table resolution, resolve the query period to a version range via the new QueryPeriodResolver, which is extracted from RelationTransformer so both layers share one implementation. Pin the resolved range on the TableRelation so the transformer does not resolve it again. When the targeted snapshot's schema differs from the current one, replace the table with an immutable per-query copy created by IcebergTable#withReadSchema.
  • IcebergTable: withReadSchema rebuilds the StarRocks-side full schema from the snapshot schema; getEffectiveIcebergSchema feeds partition/bucket column resolution and toThrift (columns, iceberg_schema, partition expressions and source column names) so FE and BE see one consistent schema. Queries whose current partition spec references a column missing from the snapshot schema are rejected with a clear error.
  • IcebergMetadata: convert pushdown predicates against the effective (snapshot) schema, matching the schema the scan binds expressions against.
  • StarRocksIcebergTableScan: the scan schema is the snapshot schema for time travel (DataScan#useSnapshotSchema), while table partition specs stay bound to the current schema. Rebind specs to the scan schema (scanSpecsById) for the residual/partition/metrics evaluators, DeleteFileIndex, ManifestGroup, and manifest filtering, so filter expressions bind consistently. A spec that cannot be rebound degrades to no pruning with the full filter kept as residual.
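The spec-rebinding step and its fallback can be sketched under two assumptions: a schema is a map from field id to column name, and a spec is a list of (source field id, transform) pairs. SpecRebindSketch, rebind, and plan are hypothetical names. The real scan rebinds Iceberg PartitionSpec objects (scanSpecsById) and passes them to Iceberg's evaluators, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the StarRocksIcebergTableScan or Iceberg PartitionSpec API.
final class SpecRebindSketch {

    /** A partition field refers to its source column by field id, not by name. */
    static final class PartitionField {
        final int sourceId;
        final String transform;

        PartitionField(int sourceId, String transform) {
            this.sourceId = sourceId;
            this.transform = transform;
        }
    }

    /**
     * Rebinds a spec to the scan schema (field id -> column name).
     * Returns empty if any source field is missing from the scan schema.
     */
    static Optional<List<String>> rebind(List<PartitionField> spec, Map<Integer, String> scanSchema) {
        List<String> bound = new ArrayList<>();
        for (PartitionField field : spec) {
            String sourceName = scanSchema.get(field.sourceId);
            if (sourceName == null) {
                return Optional.empty();
            }
            bound.add(field.transform + "(" + sourceName + ")");
        }
        return Optional.of(bound);
    }

    /** Prune with the rebound spec, or degrade to no pruning and keep the full filter as residual. */
    static String plan(List<PartitionField> spec, Map<Integer, String> scanSchema, String filter) {
        Optional<List<String>> bound = rebind(spec, scanSchema);
        if (!bound.isPresent()) {
            return "no pruning; residual=" + filter;
        }
        return "prune by " + bound.get();
    }
}
```

Rebinding by field id lets a spec written against `c_renamed` bind to `c` in the snapshot schema. A spec whose source column is absent does not fail the query; it only loses pruning.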

Branch semantics: a branch reference is resolved to its head snapshot and read with that snapshot's schema, matching Trino. Spark instead reads branches with the current table schema (SnapshotUtil.schemaFor(table, ref)). Snapshot-id, timestamp, and tag reads behave identically across all three engines.
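The two branch policies can be contrasted in a few lines, assuming schemas are identified only by schema id. RefSchemaPolicy and its methods are hypothetical. The Spark-style method mirrors the behavior described above for SnapshotUtil.schemaFor(table, ref) and is not Spark code.

```java
import java.util.Map;

// Hypothetical sketch of how the read schema for a ref is chosen; schemas are identified by id only.
final class RefSchemaPolicy {

    enum RefType { BRANCH, TAG }

    static final class Ref {
        final RefType type;
        final long headSnapshotId;

        Ref(RefType type, long headSnapshotId) {
            this.type = type;
            this.headSnapshotId = headSnapshotId;
        }
    }

    private static int snapshotSchemaId(Ref ref, Map<Long, Integer> schemaIdBySnapshot) {
        Integer schemaId = schemaIdBySnapshot.get(ref.headSnapshotId);
        if (schemaId == null) {
            throw new IllegalArgumentException("Unknown snapshot " + ref.headSnapshotId);
        }
        return schemaId;
    }

    /** As described for this PR and Trino: branches and tags both use the head snapshot's schema. */
    static int snapshotSchemaPolicy(Ref ref, Map<Long, Integer> schemaIdBySnapshot, int currentSchemaId) {
        // currentSchemaId is unused here; it is kept so both policies share one signature.
        return snapshotSchemaId(ref, schemaIdBySnapshot);
    }

    /** As described for Spark: branches use the current table schema, tags use the snapshot's schema. */
    static int sparkStylePolicy(Ref ref, Map<Long, Integer> schemaIdBySnapshot, int currentSchemaId) {
        return ref.type == RefType.BRANCH ? currentSchemaId : snapshotSchemaId(ref, schemaIdBySnapshot);
    }
}
```

For a branch whose head is the pre-rename snapshot, the two policies return different schema ids. For a tag on the same snapshot, they agree.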

Verification:

  • New UTs cover schema rebinding, the analyzer path (including the pre-resolved external table path), the BE descriptor, and scan planning with predicates on renamed columns.
  • New SQL tests (test_timetravel_snapshot_schema) cover tag/branch reads, old/new column name resolution, predicates on renamed columns, and a partitioned table whose partition source column was renamed.
  • Verified end to end on a real cluster against a Hive-catalog Iceberg table: time travel to the pre-rename snapshot now returns the original column name and supports filtering on it, matching PyIceberg, Spark, and Trino.

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5

Time travel queries (VERSION/TIMESTAMP AS OF) resolved the target
snapshot only to select data files, while column resolution, the
descriptor sent to BE, and scan planning all used the latest table
schema. After a schema change (e.g. a column rename), such queries
returned the current column names instead of the schema bound to the
targeted snapshot, diverging from Iceberg semantics.

Resolve the query period once during analysis, pin the resolved
version range on the table relation, and rebind the IcebergTable to
the snapshot schema via an immutable per-query copy. Convert pushdown
predicates against the same schema and rebind partition specs to the
scan schema inside StarRocksIcebergTableScan so partition, metrics,
residual, and manifest evaluators bind consistently.

Signed-off-by: GavinMar <yangguansuo@starrocks.com>
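The "resolve once, pin, copy on schema mismatch" flow in the commit message can be sketched as follows. TableHandle, Relation, analyze, and snapshotForPlanning are hypothetical stand-ins for IcebergTable, TableRelation, QueryAnalyzer, and the transformer. The period resolution itself is reduced to a placeholder that takes a snapshot id directly.

```java
import java.util.Map;

// Hypothetical sketch of "resolve once, pin, copy on schema mismatch"; not StarRocks classes.
final class ResolveOnceSketch {

    /** Immutable table handle bound to one read schema; shared handles are never mutated. */
    static final class TableHandle {
        final String name;
        final int readSchemaId;

        TableHandle(String name, int readSchemaId) {
            this.name = name;
            this.readSchemaId = readSchemaId;
        }

        /** Returns a per-query copy bound to another schema instead of mutating this handle. */
        TableHandle withReadSchema(int schemaId) {
            return new TableHandle(name, schemaId);
        }
    }

    /** Stand-in for a table relation that can carry a pinned, already-resolved snapshot. */
    static final class Relation {
        TableHandle table;
        Long pinnedSnapshotId; // null until the analyzer resolves the query period
        int resolutions;       // counts how often the period was resolved

        Relation(TableHandle table) {
            this.table = table;
        }
    }

    /** Analyzer step: resolve the period once, pin it, and rebind the table if its schema differs. */
    static void analyze(Relation rel, long requestedSnapshotId, Map<Long, Integer> schemaIdBySnapshot) {
        if (rel.pinnedSnapshotId == null) {
            rel.pinnedSnapshotId = requestedSnapshotId; // placeholder for real period-to-version resolution
            rel.resolutions++;
        }
        Integer schemaId = schemaIdBySnapshot.get(rel.pinnedSnapshotId);
        if (schemaId != null && schemaId != rel.table.readSchemaId) {
            rel.table = rel.table.withReadSchema(schemaId);
        }
    }

    /** Later planning step: reuse the pinned snapshot rather than resolving the period again. */
    static long snapshotForPlanning(Relation rel) {
        if (rel.pinnedSnapshotId == null) {
            throw new IllegalStateException("query period was not resolved during analysis");
        }
        return rel.pinnedSnapshotId;
    }
}
```

The copy keeps the cached, shared table object bound to the current schema, so concurrent queries that are not time-travelling are unaffected.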

github-actions commented:

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions commented:

[FE Incremental Coverage Report]

fail : 4 / 7 (57.14%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/catalog/IcebergTable.java | 0 | 3 | 00.00% | [200, 205, 225] |
| 🔵 com/starrocks/sql/optimizer/transformer/RelationTransformer.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/sql/analyzer/QueryAnalyzer.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/connector/iceberg/IcebergApiConverter.java | 2 | 2 | 100.00% | [] |

github-actions commented:

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e8c5c5528

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +433 to +436:

```java
for (PartitionField field : getNativeTable().spec().fields()) {
    if (currentSchema.findColumnName(field.sourceId()) != null
            && schema.findColumnName(field.sourceId()) == null) {
        throw new StarRocksConnectorException(
        // (excerpt truncated in the review view)
```


P2: Don't reject older snapshots after partition evolution

Consider a table whose older snapshot was readable before a later `ALTER TABLE ... ADD COLUMN p` followed by a partition-spec replacement on `p`. This check rejects `VERSION AS OF` that older snapshot solely because the current partition spec references a field absent from the snapshot schema. Iceberg plans snapshots with the specs attached to their manifests, and this change already carries per-scan specs elsewhere, so the current spec should not make historical snapshots unusable. Users time-travelling to pre-evolution snapshots will now get a `Time travel is not supported` error instead of the old rows.
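Here is a minimal model of the quoted check that reproduces the reviewer's scenario. It assumes schemas can be reduced to sets of Iceberg field ids; the real code calls Schema#findColumnName on the current and snapshot schemas. PartitionCheckSketch and rejects are hypothetical names.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of the quoted check; schemas are reduced to sets of field ids.
final class PartitionCheckSketch {

    /**
     * Mirrors the quoted loop: reject when a source column of the CURRENT partition spec
     * exists in the current schema but is absent from the snapshot's schema.
     */
    static boolean rejects(List<Integer> currentSpecSourceIds,
                           Set<Integer> currentSchemaFieldIds,
                           Set<Integer> snapshotSchemaFieldIds) {
        for (int sourceId : currentSpecSourceIds) {
            if (currentSchemaFieldIds.contains(sourceId) && !snapshotSchemaFieldIds.contains(sourceId)) {
                return true;
            }
        }
        return false;
    }
}
```

In the reviewer's scenario, the current spec partitions by `p` (field id 3) and the old snapshot's schema is `{1, 2}`, so `rejects` returns true. That snapshot's manifests were written under an older spec that never referenced `p`, so the rejection is not required to read it.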

