[core] Unify single-column global index writer #8275

Merged
JingsongLi merged 3 commits into apache:master from JingsongLi:codex/global-index-single-column-writer on Jun 18, 2026

Conversation

@JingsongLi

Contributor

Summary

This PR merges the previous singleton and parallel single-column global index writer APIs into one GlobalIndexSingleColumnWriter interface. Single-column index writers now receive the caller-provided shard-relative row id through write(@Nullable Object key, long relativeRowId).
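For illustration, the call shape described above can be sketched as follows. Only the `write(key, relativeRowId)` signature comes from this description; the `SingleColumnWriter` interface, `RecordingWriter`, `writeShard`, and the rule that a relative row id is the global row id minus the shard's first row id are hypothetical stand-ins, not the Paimon code.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleColumnWriterSketch {

    /** Hypothetical stand-in for the unified writer; only write(key, relativeRowId) is from the PR text. */
    interface SingleColumnWriter {
        void write(Object key, long relativeRowId); // key may be null
    }

    /** Records every call so the caller-provided row ids can be inspected. */
    static final class RecordingWriter implements SingleColumnWriter {
        final List<String> calls = new ArrayList<>();

        @Override
        public void write(Object key, long relativeRowId) {
            calls.add(relativeRowId + "=" + key);
        }
    }

    /** Assumed convention: relative row id = global row id minus the shard's first row id. */
    static void writeShard(SingleColumnWriter writer, long shardStart, long[] rowIds, Object[] keys) {
        for (int i = 0; i < rowIds.length; i++) {
            writer.write(keys[i], rowIds[i] - shardStart);
        }
    }

    public static void main(String[] args) {
        RecordingWriter writer = new RecordingWriter();
        // A sparse shard starting at row 100: the writer sees 0, 3, 7 rather than 0, 1, 2.
        writeShard(writer, 100L, new long[] {100L, 103L, 107L}, new Object[] {"a", null, "c"});
        System.out.println(writer.calls); // prints [0=a, 3=null, 7=c]
    }
}
```

The point of the sketch is that the caller, not the writer, decides the row id, so gaps in a shard's row range survive into the index.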

Changes

  • Replace GlobalIndexSingletonWriter and GlobalIndexParallelWriter with GlobalIndexSingleColumnWriter.
  • Update BTree, Vector, Lumina, and Tantivy global index writers to implement the unified single-column writer API.
  • Pass explicit shard-relative row ids from BTree, Flink, and Spark index build paths.
  • Keep vector/full-text row counts as logical row counts while persisting caller-provided row ids for non-null indexed entries.
  • Update test helper index formats and affected tests to read/write explicit relative row ids.

Testing

  • mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DskipTests test-compile
  • mvn -pl paimon-lumina -am -Pfast-build -DskipTests test-compile
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests test-compile
  • mvn -pl paimon-flink/paimon-flink-common -am -Pfast-build -DskipTests test-compile
  • mvn -pl paimon-spark/paimon-spark-common -am -Pfast-build -DskipTests compile
  • git diff --check

@JingsongLi JingsongLi force-pushed the codex/global-index-single-column-writer branch from 3d2839a to fd0e495 Compare June 18, 2026 03:24

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the update. I re-reviewed the latest revision and the single-column writer API now looks consistent across BTree, vector, Lumina, Tantivy, Flink and Spark build paths.

The important semantic change is also correct: vector/full-text writers now persist the caller-provided relative row id instead of relying on an implicit dense sequence, while rowCount still represents logical rows processed for index metadata. This matches sparse/sharded row ranges and the added multiple-index-file tests cover the regression case.
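A minimal sketch of that distinction, under stated assumptions: `CountingWriter` is a hypothetical in-memory class, not one of the Paimon writers, and it assumes that a row with a null key still counts as a logical row while only non-null entries have their row id persisted.

```java
import java.util.ArrayList;
import java.util.List;

public class RowCountSketch {

    /** Hypothetical in-memory writer, not the Paimon implementation. */
    static final class CountingWriter {
        long rowCount = 0; // logical rows processed (assumed to include null keys)
        final List<Long> persistedRowIds = new ArrayList<>(); // ids of non-null entries only

        void write(Object key, long relativeRowId) {
            rowCount++;
            if (key != null) {
                // Store the caller-provided id instead of an implicit 0, 1, 2, ... sequence.
                persistedRowIds.add(relativeRowId);
            }
        }
    }

    public static void main(String[] args) {
        CountingWriter writer = new CountingWriter();
        writer.write(new float[] {1f, 0f}, 2L);
        writer.write(null, 5L);
        writer.write(new float[] {0f, 1f}, 9L);
        System.out.println(writer.rowCount + " " + writer.persistedRowIds); // prints 3 [2, 9]
    }
}
```

With an implicit dense sequence the two vectors would have been stored under ids 0 and 1; here they keep the sparse ids 2 and 9, while the count still reports three rows.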

I also ran the focused checks locally:

  • mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest='VectorSearchBuilderTest#testVectorSearchWithMultipleIndexFiles,FullTextSearchBuilderTest#testFullTextSearchWithMultipleIndexFiles' test
  • mvn -pl paimon-lumina -am -Pfast-build -DfailIfNoTests=false -Dtest='LuminaVectorGlobalIndexWriterTest' test

Both passed. +1 from my side.

@JingsongLi JingsongLi merged commit 3bee09f into apache:master Jun 18, 2026
14 checks passed
2 participants