Proposal: Implement checkpoints for SyncMetas in compactors by waltherlee · Pull Request #8765 · thanos-io/thanos

waltherlee · 2026-04-12T05:22:32Z

A small proposal to implement an in-memory cache for the compactor's MetaSync to speed up sequential syncs, especially the periodic ones for progress metrics, plus an on-disk checkpoint to recover a long MetaSync interrupted by a restart.

The main problems this is trying to solve are:

As the number of blocks grows, listing all the IDs in storage takes a long time. With recursive discovery, it can easily take about an hour with 1M+ blocks.
This not only halts compactions, it also reduces the frequency of periodic progress metrics.
On top of that, it makes it hard to scale compactors with a VPA: they take a long time to start compacting, and when they finally do and resource usage goes up, a VPA recommendation restarts them, starting a whole new sync all over again.

I tried to keep it short, but this is something I already implemented on a v0.39.2 fork and has been running for about 1 month with no issues in multiple buckets with 1M+ blocks, so I'm happy to share any more details you need 🙂

Signed-off-by: Walther Lee <walthere.lee@gmail.com>

GiedriusS · 2026-04-14T09:29:35Z

+* `knownBlockIDs` `map[ulid.ULID]struct{}`, all block IDs known since last compaction.  
+* `checkpointDir` `string`, for the on-disk checkpoint.
+
+`blocksDownloading` keeps track of blocks downloading. After completion, only blocks with no previous blocks pending will be used for checkpoints.


only blocks with no previous blocks pending ? 😄 what does that mean?

Yeah, I struggled to explain the relations here, but I meant pending = download in progress 😅

I described it a little more in the paragraph below. Block meta files are downloaded concurrently, so if blocks A, B and C are being downloaded, it can happen that:

C finishes but A is still pending. Without this tracking, the lastCheckpointableID would be C.

The compactor is restarted before the download of A finishes. A checkpoint is saved with C.

After the restart, the compactor resumes listing blocks after C, so A is never downloaded.

By keeping track of blocks being downloaded, it only uses a block as a checkpoint when no lexicographically lower block is still pending full download.

Realistically, meta files are small, and the compactor may well wait for in-progress downloads to finish when it receives a SIGTERM, but I wanted to cover edge cases such as storage throttling, sudden connection issues, etc.
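The out-of-order download case above can be sketched roughly as follows. This is a minimal illustration, not the code from the fork: `pendingTracker`, `Start`, `Finish`, and `LastCheckpointable` are hypothetical names, block IDs are plain strings instead of `ulid.ULID`, and it assumes IDs are registered in listing (lexicographic) order.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pendingTracker is a hypothetical stand-in for blocksDownloading plus
// lastCheckpointableID. IDs are plain strings that sort lexicographically.
type pendingTracker struct {
	mu                 sync.Mutex
	pending            []string // sorted IDs whose download is in progress
	done               []string // sorted IDs finished but blocked by a lower pending ID
	lastCheckpointable string
}

// insertSorted inserts id into the sorted slice s, keeping it sorted.
func insertSorted(s []string, id string) []string {
	i := sort.SearchStrings(s, id)
	s = append(s, "")
	copy(s[i+1:], s[i:])
	s[i] = id
	return s
}

// Start registers a download in progress.
func (t *pendingTracker) Start(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending = insertSorted(t.pending, id)
}

// Finish marks a download as complete and advances the checkpoint only
// past IDs that have no lexicographically lower download still pending.
func (t *pendingTracker) Finish(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	i := sort.SearchStrings(t.pending, id)
	if i == len(t.pending) || t.pending[i] != id {
		return // unknown ID, nothing to do
	}
	t.pending = append(t.pending[:i], t.pending[i+1:]...)
	t.done = insertSorted(t.done, id)
	n := 0
	for n < len(t.done) && (len(t.pending) == 0 || t.done[n] < t.pending[0]) {
		t.lastCheckpointable = t.done[n]
		n++
	}
	t.done = t.done[n:]
}

// LastCheckpointable returns the highest ID that is safe to resume after.
func (t *pendingTracker) LastCheckpointable() string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.lastCheckpointable
}

func main() {
	tr := &pendingTracker{}
	for _, id := range []string{"A", "B", "C"} {
		tr.Start(id)
	}
	tr.Finish("C")
	fmt.Printf("%q\n", tr.LastCheckpointable()) // "" because A and B are still pending
	tr.Finish("A")
	fmt.Printf("%q\n", tr.LastCheckpointable()) // "A" because B is still pending
	tr.Finish("B")
	fmt.Printf("%q\n", tr.LastCheckpointable()) // "C" because nothing is pending
}
```

With this shape, a restart after C finishes but before A does would save an empty checkpoint rather than C, so the next listing would still cover A.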

GiedriusS · 2026-04-14T09:30:01Z

+
+For the on-disk checkpoint, `lastCheckpointableID` and `knownBlockIDs` are stored in a gzip-compressed JSON file in `checkpointDir`. If a file exists, it is loaded to `BaseFetcher` when the compactor starts. 
+
+To resume a sync, the `GetActiveAndPartialBlockIDs` method in the `Lister` interface in the package `block` takes a new `startAfter` `string` that is passed to the object storage implementation in `obj_store`. Storage must respond by listing only objects alphabetically higher than the value. In this case, it doesn’t matter if it’s inclusive or not.


Could you show some example of how this param could be used? Is there no way to pass a date there? 🤔

Interesting. It's definitely possible. The implementations in objstore take string keys, and the prefix in Thanos is the block ID, so it made sense to me to use what we already had in the listing results.

But yes, this could be a date or timestamp as well.

You mean to avoid referencing blocks in Lister?

GiedriusS · 2026-04-14T09:30:52Z

+
+At the end of a cycle it removes the file and clears the in-memory checkpoint in `BaseFetcher`.
+
+The checkpoint implementation is by default disabled and will have to be enabled with a flag.


Why not enable this for all? This seems like a great addition.

No reason I can think of. You mean removing the flag or just enabling it by default?

GiedriusS · 2026-04-14T09:31:23Z

+
+The compactor saves the checkpoint file after every `syncMetas` in the main compaction cycle. Basically before starting compactions, before each of both downsampling cycles and before applying retention policies. It also saves a checkpoint if an error interrupts the main `runCompact`. 
+
+At the end of a cycle it removes the file and clears the in-memory checkpoint in `BaseFetcher`.


Why do we need to delete the checkpoint?

To drop from the checkpoint any blocks deleted in the cycle that just ended.

I implemented this in v0.39.2, so this may already have changed in the current version, but in v0.39, after finishing compactions and downsampling, the compactor deletes old blocks marked for deletion.

If we keep those in the checkpoint, the next cycle will fail when it checks whether they have a deletion mark (or chunks), and that will trigger a full sync anyway, so I decided to clear the checkpoint here 🙂

GiedriusS · 2026-04-14T09:32:12Z

+* `enableCheckpoint` `bool`, true to enable checkpoints. False by default.  
+* `blocksDownloading` `struct`, sorted list to keep track of downloads pending.   
+* `lastCheckpointableID` `ulid.ULID`, all blocks up to this ID have been completely downloaded.  
+* `knownBlockIDs` `map[ulid.ULID]struct{}`, all block IDs known since last compaction.  


Why do we need this? Could you expand a little bit?

Sure! This backfills resp.metas (here) with all blocks that were already downloaded before the checkpoint, without re-fetching them from object storage.

We could recreate this from disk, but I've seen storage issues in interrupted compactors. For example, compactor shards that finish successfully delete blocks as expected, but stuck shards keep resuming and piling up meta files because cleanup only occurs after a full cycle. So these interrupted shards were hoarding blocks that were already deleted.

I also had shards that had been stuck for months, so the list of blocks on disk was way longer than the one on storage. By keeping only a list of blocks from the latest sync, I skip those deleted blocks and save disk lookups.
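A very simplified sketch of that backfill idea, under stated assumptions: `meta`, `syncFromCheckpoint`, the `localCache` map, and the `fetch` callback are hypothetical stand-ins for the real metadata type, the on-disk meta cache, and the object storage download. The point it illustrates is that only IDs in the known set are read from the local cache, so stale entries left on disk by earlier interrupted runs are never looked at.

```go
package main

import "fmt"

// meta is a hypothetical, minimal stand-in for a block's metadata.
type meta struct{ ID string }

// syncFromCheckpoint builds the metas map for a resumed sync.
// known:      IDs recorded in the checkpoint (already downloaded earlier).
// localCache: metas available locally, possibly including stale entries.
// listed:     IDs returned by the listing that resumed after the checkpoint.
// fetch:      downloads a meta from object storage.
func syncFromCheckpoint(
	known map[string]struct{},
	localCache map[string]meta,
	listed []string,
	fetch func(id string) (meta, error),
) (map[string]meta, error) {
	metas := make(map[string]meta, len(known)+len(listed))
	// Backfill known blocks from the local cache, without touching storage.
	for id := range known {
		if m, ok := localCache[id]; ok {
			metas[id] = m
			continue
		}
		m, err := fetch(id) // fall back to storage if the local copy is missing
		if err != nil {
			return nil, err
		}
		metas[id] = m
	}
	// Only blocks listed after the checkpoint need a download.
	for _, id := range listed {
		if _, ok := metas[id]; ok {
			continue
		}
		m, err := fetch(id)
		if err != nil {
			return nil, err
		}
		metas[id] = m
	}
	return metas, nil
}

func main() {
	known := map[string]struct{}{"A": {}, "B": {}}
	localCache := map[string]meta{"A": {"A"}, "B": {"B"}, "OLD": {"OLD"}}
	fetched := 0
	fetch := func(id string) (meta, error) {
		fetched++
		return meta{ID: id}, nil
	}
	metas, err := syncFromCheckpoint(known, localCache, []string{"C"}, fetch)
	_, hasOld := metas["OLD"]
	fmt.Println(len(metas), fetched, hasOld, err)
}
```

Here the stale `OLD` entry is skipped and only `C` is downloaded.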

GiedriusS · 2026-04-14T09:32:46Z

+The compactors already use `BaseFetcher` to sync metadata. This adds 5 new fields:
+
+* `enableCheckpoint` `bool`, true to enable checkpoints. False by default.  
+* `blocksDownloading` `struct`, sorted list to keep track of downloads pending.   


Also, why is this needed? Isn't it enough to store the last timestamp embedded in a block?

Same as here: #8765 (comment)

To prevent a race condition where out-of-order downloads cause the checkpoint to jump ahead. This ensures we don't skip pending blocks that are lexicographically earlier but slower to download, avoiding gaps after a restart.

GiedriusS · 2026-06-12T13:18:42Z

@waltherlee out of curiosity: how long does it take to list blocks in your case right now? Is it really about an hour? If you can answer: do you use a homegrown or open-source object storage implementation, or something from a SaaS provider? Is it the same with both listing strategies?

waltherlee · 2026-06-19T22:01:24Z

@GiedriusS Yes, over an hour with AWS S3 buckets in multiple clusters. It was mostly caused by a high rate of new blocks, a long list of pending ones (1M+ total per cluster), and a low compaction rate in some shards.

I could also check our GCP buckets, but the ones I checked when working on this were S3. However, we've had these checkpoints implemented in an internal Thanos fork for around 3 months now, so it no longer takes that long and our compactors have already caught up with the queue. It now takes around 7 minutes to sync from disk, and milliseconds from the in-memory checkpoint.

waltherlee · 2026-06-19T22:07:59Z

Is it the same with both listing strategies?

The 1h+ figure was with recursive. I did try eager and remember it was faster, but not by much. I can check my notes next week. But the main problem we had was the interaction with the autoscaler. We have heavy blocks, so a fresh compactor usually needs time and multiple restarts to reach the right resources to process them. However, if the pod is constantly restarting and spending most of its time just reading from S3, the VPA scales it down, and the compactors fall into a cycle of scaling up and down while only listing. That's how we ended up with the long list of pending blocks.

waltherlee · 2026-06-19T23:07:59Z

Ok, I checked my notes and eager was taking ~30m in total, including disk lookups, but we still had 3 issues that checkpoints solved:

The scaling trap I mentioned above.
Disk lookups to check whether a block's meta was cached on disk during the initial listing were also taking a long time, so we wanted to avoid redundant lookups.
And we wanted to rely on recursive to avoid blasting S3 when dozens of shards are restarted at about the same time after a deployment.

waltherlee added 2 commits April 11, 2026 21:44

add proposal for checkpoints for SyncMetas in compactors

5385367

Signed-off-by: Walther Lee <walthere.lee@gmail.com>

fix typo

e19a306

Signed-off-by: Walther Lee <walthere.lee@gmail.com>

pull-request-size Bot added the size/M label Apr 12, 2026

waltherlee mentioned this pull request Apr 12, 2026

.*: Add support for StartAfter option in ListObjects thanos-io/objstore#256

Open

2 tasks

GiedriusS reviewed Apr 14, 2026


Merge branch 'main' into compactor-meta-sync-checkpoints-proposal

726646f

waltherlee requested a review from GiedriusS June 19, 2026 22:10


