Support Concurrent Clustering of Small MOR File Groups with Upserts Using pendingReplacedFileIdMap #18433

suryaprasanna · 2026-04-01T06:37:04Z

I would like to discuss a design for supporting concurrent clustering and upserts for MOR tables using a new structure called pendingReplacedFileIdMap in the FileSystemView.

Problem:

Assume there are file groups F1 and F2 with the following file structure.
F1: 1 base file + 2 log files
F2: 1 base file + 3 log files
Assume clustering selects file groups F1 and F2 and plans to replace them with a new file group F3.

The issue is that once clustering has decided to replace F1 and F2 with F3, new writes should ideally stop going to F1/F2 and start going to F3.

Another motivation is write amplification.

Even though Spark MOR upsert can handle small base files by routing inserts into an existing small file group and rewriting a new base file, doing this from the ingestion writer side can still cause significant write amplification. The ingestion path ends up repeatedly rewriting small file groups while regular writes are flowing, instead of letting clustering handle file-group consolidation in a more controlled way.
The same concern applies to Flink as well. Even if small-file handling is available on the writer side, relying on ingestion-time correction can still introduce unnecessary write amplification, especially under continuous writes.
Because of this, it is desirable for clustering to publish its replacement target early and let future writes move to the pending replacement file group, instead of continuing to correct small files through the ingestion writer path.

Proposal

Store a pendingReplacedFileIdMap in the file system view interface.
For the example above, the map would be F1 -> F3 and F2 -> F3. The requested clustering instant would then contain the following information, which requires a change in the clustering plan version:

input file groups being replaced
replacement file group ids
pendingReplacedFileIdMap
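
A minimal sketch of what such a plan payload could carry, assuming a single replacement file group per plan. The class name `PendingClusteringPlan` and its fields are hypothetical and only illustrate the three items listed above; this is not Hudi's actual clustering plan schema.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hudi's clustering plan schema.
final class PendingClusteringPlan {
    final List<String> inputFileGroupIds;               // file groups being replaced
    final List<String> replacementFileGroupIds;         // file groups clustering will create
    final Map<String, String> pendingReplacedFileIdMap; // input id -> replacement id

    PendingClusteringPlan(List<String> inputFileGroupIds, String replacementFileGroupId) {
        this.inputFileGroupIds = List.copyOf(inputFileGroupIds);
        this.replacementFileGroupIds = List.of(replacementFileGroupId);
        // Every input file group maps to the same replacement, e.g. F1 -> F3 and F2 -> F3.
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String inputId : inputFileGroupIds) {
            mapping.put(inputId, replacementFileGroupId);
        }
        this.pendingReplacedFileIdMap = Collections.unmodifiableMap(mapping);
    }
}
```

With inputs F1 and F2 and replacement F3, the map holds exactly the two entries described above.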

Expected behavior

Writer

If clustering plans to replace F1 and F2 with F3, then new writes should be redirected to F3 instead of continuing to write to F1 and F2. This may not be achievable in a straightforward manner, so let us consider each scenario and what happens in it.

Spark (Batch flow)

Scenario 1: If the clustering plan is created before the write job is routed, the job sees pendingReplacedFileIdMap and writes directly to F3.
Scenario 2: If the clustering plan is created after the write job has already been routed, that job still writes to F1 and F2. In that case clustering should fail conflict resolution and retry. The next write job should see pendingReplacedFileIdMap and route to F3.
Scenario 3: A partially routed case is not really a Spark batch scenario because one Spark batch job uses one routing snapshot.
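
The Spark scenarios above can be sketched roughly as follows. `BatchWriteRouter` and `ClusteringConflictCheck` are hypothetical names, and the conflict rule (any overlap between clustering inputs and file groups written after the plan was created) is an assumption made for illustration, not Hudi's actual conflict resolution strategy.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical: one routing snapshot per batch job, taken before routing starts.
final class BatchWriteRouter {
    private final Map<String, String> snapshot;

    BatchWriteRouter(Map<String, String> pendingReplacedFileIdMap) {
        this.snapshot = Map.copyOf(pendingReplacedFileIdMap);
    }

    // Scenario 1: a file group with a pending replacement is redirected.
    // With an empty snapshot (Scenario 2) the original file group is kept.
    String route(String candidateFileId) {
        return snapshot.getOrDefault(candidateFileId, candidateFileId);
    }
}

// Hypothetical: clustering conflicts if a concurrent write touched any of its inputs.
final class ClusteringConflictCheck {
    static boolean hasConflict(Set<String> clusteringInputs, Set<String> fileIdsWrittenAfterPlan) {
        return !Collections.disjoint(clusteringInputs, fileIdsWrittenAfterPlan);
    }
}
```

A job that took its snapshot before the plan existed keeps writing to F1, so the check reports a conflict and clustering retries; the next job sees the map and routes to F3.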

Flink (Streaming flow)

I do not have complete knowledge of Flink internals or of which operator is best suited for this change, so I will leave that to the Flink experts to decide. Overall, the following scenarios can happen:

Scenario 1: If pendingReplacedFileIdMap is visible before routing, writes that would have gone to F1/F2 can be redirected to F3.
Scenario 2: If some records were already routed before pendingReplacedFileIdMap became visible, those records may still go to F1/F2, and clustering may need conflict handling for that case.
Scenario 3: I think a partially routed case may be possible for Flink: some writes may still go to an existing file group, say F1, while others go to the new file group F3. Even then ingestion does not fail; clustering fails on the first attempt, and the logic is similar to Scenario 2.
I am not sure whether we want to persist pendingReplacedFileIdMap in the operator for seamless translation of F1 -> F3 and F2 -> F3.
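
To make the partially routed case concrete, here is a small sketch in which the map becomes visible while records are still flowing. `StreamingRouter` is a hypothetical class, not a Flink operator or a Hudi API; it only assumes that the router can atomically swap in a newly published map.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical long-running router; the visible map can change between records.
final class StreamingRouter {
    private final AtomicReference<Map<String, String>> visibleMap =
        new AtomicReference<>(Map.of());

    // Called when a requested clustering plan becomes visible to the writer.
    void publish(Map<String, String> pendingReplacedFileIdMap) {
        visibleMap.set(Map.copyOf(pendingReplacedFileIdMap));
    }

    // Records routed before publish() keep their original file group;
    // records routed afterwards are redirected to the replacement.
    String route(String candidateFileId) {
        return visibleMap.get().getOrDefault(candidateFileId, candidateFileId);
    }
}
```

A record routed to F1 just before `publish` and another routed just after it land in F1 and F3 respectively, which is the situation Scenario 3 describes.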

Reader

Read Optimized

Read optimized is easy to handle as it returns the base files directly.

Snapshot reads:

Snapshot reads might need some changes: while F1 and F2 are being replaced by F3, the F3 base file may not have been created yet, but updates to records in F1 and F2 might already be present in an F3 log file.
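
One possible reading of this, sketched under assumptions: until F3 has a base file, a snapshot reader would merge the files of every input file group that maps to F3 together with F3's own log files. `SnapshotFileResolver` and the file names used with it are hypothetical; the thread does not specify the actual reader change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical resolver for a pending replacement file group with no base file yet.
final class SnapshotFileResolver {
    static List<String> filesToMerge(String replacementId,
                                     Map<String, String> pendingReplacedFileIdMap,
                                     Map<String, List<String>> filesByFileGroup) {
        List<String> result = new ArrayList<>();
        // First the base and log files of every input that maps to the replacement.
        for (Map.Entry<String, String> entry : pendingReplacedFileIdMap.entrySet()) {
            if (entry.getValue().equals(replacementId)) {
                result.addAll(filesByFileGroup.getOrDefault(entry.getKey(), List.of()));
            }
        }
        // Then the log files already written to the replacement itself.
        result.addAll(filesByFileGroup.getOrDefault(replacementId, List.of()));
        return result;
    }
}
```

For the example in the problem statement, that is F1's base file and 2 log files, F2's base file and 3 log files, plus whatever log files F3 has received so far.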

suryaprasanna · 2026-04-01T06:37:17Z

Author

CC @danny0405

1 reply

danny0405 Apr 2, 2026
Collaborator

Thanks for the ideas.

For Flink scenario #2 and scenario #3, the clustering may fail continuously because of overlapping file group modifications, since the Flink ingestion job is long-running and the upserts are spread randomly across partitions/file groups. The same applies to DeltaStreamer, I think.

BTW, in the consistent-hashing bucket index, the ingestion job does a dual write to both the replaced file group id and the new target one until the clustering finishes (consistent hashing utilizes clustering to merge/split file groups). The dual write is there to ensure the visibility of the dataset for readers. I think there are many similarities between the two (consistent hashing and the solution proposed here). The consistent hashing ring plays a similar role to the pendingReplacedFileIdMap you mentioned here.
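
As a rough illustration of the dual-write idea applied to the proposed map, the sketch below returns both the original and the replacement file group as write targets while a replacement is pending. `DualWriteRouter` is a hypothetical name and this is not the consistent-hashing bucket index implementation.

```java
import java.util.List;
import java.util.Map;

// Hypothetical: while a replacement is pending, write to both file groups.
final class DualWriteRouter {
    static List<String> targets(String fileId, Map<String, String> pendingReplacedFileIdMap) {
        String replacement = pendingReplacedFileIdMap.get(fileId);
        // No pending replacement: single write to the original file group.
        // Pending replacement: dual write until clustering completes.
        return replacement == null ? List.of(fileId) : List.of(fileId, replacement);
    }
}
```

Because the original file group keeps receiving every write, readers of F1/F2 see a complete dataset until the replacement is committed, at the cost of writing each affected record twice.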

kbuci · 2026-04-01T07:42:21Z


Thanks for the discussion (since I was also curious how we could prevent clustering <-> upsert conflicts for frequent writes to MOR).
Just to clarify:

Unlike the current clustering plan, which just stores the set of all replaced files per partition, the pendingReplacedFileIdMap structure needs to store a mapping of [input file groups] -> output file group, right? So this isn't intended for all clustering strategies (like sorting all records within a partition), just for the types of clustering where we can explicitly "know" that multiple specific input files map to one output file (like the recent clustering strategy that uses parquet APIs to combine multiple small files into one)?
Once a clustering plan is scheduled, it can never be rolled back, right?

cshuo · 2026-04-02T13:17:36Z

Collaborator

Thanks for writing this up. I have a few concerns with respect to the proposal:

One concern on the writer routing side: today the writer may choose F1 based on the small-file profile policy. With this proposal, if we directly redirect that write to F3, is it possible that F3 would not actually qualify as a valid small-file target under the existing writer logic? In other words, this seems to bypass the current file-group selection policy and force writes into a pending replacement file group that may not have been chosen otherwise. Is F3 intended to be treated as a special routing target outside the normal small-file policy?
Is the redirection intended only for inserts, or also for updates? Redirecting updates seems much trickier because record location/indexing would still point to F1/F2 until the replace commit completes.
For the Flink writer, committing and table service scheduling happen asynchronously on the coordinator (like the Spark driver) when it receives the checkpoint success event, while the data ingestion flow is continuous and non-blocking. So it seems scenarios 2 and 3 will probably happen frequently, and there is a risk that clustering keeps retrying and never makes progress.
