Proposal: Implement checkpoints for SyncMetas in compactors #8765

Open
waltherlee wants to merge 3 commits into thanos-io:main from waltherlee:compactor-meta-sync-checkpoints-proposal

@waltherlee

Contributor

A small proposal to implement an in-memory cache for the compactor's MetaSync to speed up sequential syncs, especially the periodic ones for progress metrics, plus an on-disk checkpoint to recover a long MetaSync interrupted by a restart.

The main problems this is trying to solve are:

  • As the number of blocks grows, listing all the IDs in storage takes a long time. With recursive discovery, it can easily take about an hour with 1M+ blocks.
  • This not only halts compactions, it also reduces the frequency of the periodic progress metrics.
  • On top of that, it makes it hard to scale compactors with a VPA: they take a long time to start compacting, and when they finally do and resource usage goes up, a VPA recommendation restarts them, which starts a whole new sync all over again.

I tried to keep it short, but this is something I already implemented on a v0.39.2 fork and has been running for about 1 month with no issues in multiple buckets with 1M+ blocks, so I'm happy to share any more details you need 🙂

Signed-off-by: Walther Lee <walthere.lee@gmail.com>
Signed-off-by: Walther Lee <walthere.lee@gmail.com>
* `knownBlockIDs` `map[ulid.ULID]struct{}`, all block IDs known since last compaction.
* `checkpointDir` `string`, for the on-disk checkpoint.

`blocksDownloading` keeps track of blocks downloading. After completion, only blocks with no previous blocks pending will be used for checkpoints.

Member

"only blocks with no previous blocks pending"? 😄 What does that mean?

Contributor Author

Yeah, I struggled to explain the relations here, but I meant pending = download in progress 😅

I described it a little more in the paragraph below. Block meta files are downloaded concurrently, so if blocks A, B and C are being downloaded, it can happen that:

  1. C finishes but A is still pending. Without this tracking, lastCheckpointableID would be C.
  2. The compactor is restarted before the download of A finishes. A checkpoint is saved with C.
  3. After the restart, the compactor resumes listing blocks after C, but it never downloaded A.

By keeping track of blocks being downloaded, it only uses a block as a checkpoint if no lexicographically lower block is still pending full download.

Realistically, meta files are small, and maybe the compactors wait to finish all in-progress downloads when they receive a SIGTERM, but I wanted to cover some edge cases like throttling in storage, sudden conn issues, etc.
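
The A/B/C scenario above can be sketched roughly as follows. This is only an illustration, not the code from the fork: `downloadTracker` and its methods are hypothetical names, IDs are plain strings rather than `ulid.ULID`, and it assumes downloads are started in lexicographic order and that calls come from a single goroutine (a real version would need a mutex).

```go
package main

import (
	"fmt"
	"sort"
)

// downloadTracker is a hypothetical stand-in for blocksDownloading.
type downloadTracker struct {
	pending   map[string]struct{} // downloads still in progress
	completed []string            // finished downloads not yet covered by the checkpoint
	last      string              // lastCheckpointableID; "" means none yet
}

func newDownloadTracker() *downloadTracker {
	return &downloadTracker{pending: map[string]struct{}{}}
}

// Start records that the download of id has begun.
func (t *downloadTracker) Start(id string) { t.pending[id] = struct{}{} }

// Finish records a completed download and advances the checkpoint only
// past IDs that have no lexicographically lower download still pending.
func (t *downloadTracker) Finish(id string) {
	delete(t.pending, id)
	t.completed = append(t.completed, id)
	sort.Strings(t.completed)

	// Find the lowest ID that is still downloading, if any.
	lowest := ""
	for p := range t.pending {
		if lowest == "" || p < lowest {
			lowest = p
		}
	}

	// Consume completed IDs that sit below the lowest pending one.
	i := 0
	for i < len(t.completed) && (lowest == "" || t.completed[i] < lowest) {
		if t.completed[i] > t.last {
			t.last = t.completed[i]
		}
		i++
	}
	t.completed = t.completed[i:]
}

// LastCheckpointableID returns the highest ID that is safe to resume after.
func (t *downloadTracker) LastCheckpointableID() string { return t.last }

func main() {
	t := newDownloadTracker()
	for _, id := range []string{"A", "B", "C"} {
		t.Start(id)
	}
	t.Finish("C")
	fmt.Printf("after C: %q\n", t.LastCheckpointableID()) // after C: ""
	t.Finish("A")
	fmt.Printf("after A: %q\n", t.LastCheckpointableID()) // after A: "A"
	t.Finish("B")
	fmt.Printf("after B: %q\n", t.LastCheckpointableID()) // after B: "C"
}
```

With this shape, a restart right after C finishes would save no checkpoint beyond what was already safe, so A would be listed and downloaded again.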


For the on-disk checkpoint, `lastCheckpointableID` and `knownBlockIDs` are stored in a gzip-compressed JSON file in `checkpointDir`. If a file exists, it is loaded into `BaseFetcher` when the compactor starts.

To resume a sync, the `GetActiveAndPartialBlockIDs` method in the `Lister` interface in the package `block` takes a new `startAfter` `string` that is passed to the object storage implementation in `obj_store`. Storage must respond by listing only objects lexicographically greater than that value. In this case, it doesn't matter whether the bound is inclusive.

Member

Could you show some example of how this param could be used? Is there no way to pass a date there? 🤔

Contributor Author

Interesting. It's def possible. The implementations in objstore take string keys, and the prefix in Thanos is the block ID, so it made sense to me to use what we already had in the listing results.

But yes, this could be a date or timestamp as well.

You mean to avoid referencing blocks in Lister?


At the end of a cycle it removes the file and clears the in-memory checkpoint in `BaseFetcher`.

The checkpoint implementation is by default disabled and will have to be enabled with a flag.

Member

Why not enable this for all? This seems like a great addition.

Contributor Author

No reason I can think of. You mean removing the flag or just enabling it by default?


The compactor saves the checkpoint file after every `syncMetas` in the main compaction cycle: before starting compactions, before each of the two downsampling cycles, and before applying retention policies. It also saves a checkpoint if an error interrupts the main `runCompact`.

At the end of a cycle it removes the file and clears the in-memory checkpoint in `BaseFetcher`.

Member

Why do we need to delete the checkpoint?

Contributor Author

To drop from the checkpoint the blocks that were deleted in the cycle that just ended.

I implemented this in v0.39.2, so this may already have changed in the current version, but in v0.39, after finishing compactions and downsampling, the compactor deletes old blocks marked for deletion.

If we keep those in the checkpoint, the next cycle will fail when it tries to check whether they have a deletion mark (or chunks), and that will trigger a full sync anyway, so I just decided to clear it here 🙂

* `enableCheckpoint` `bool`, true to enable checkpoints. False by default.
* `blocksDownloading` `struct`, sorted list to keep track of downloads pending.
* `lastCheckpointableID` `ulid.ULID`, all blocks up to this ID have been completely downloaded.
* `knownBlockIDs` `map[ulid.ULID]struct{}`, all block IDs known since last compaction.

Member

Why do we need this? Could you expand a little bit?

Contributor Author

Sure! This backfills resp.metas (here) with all blocks that were already downloaded before the checkpoint, without re-fetching them from object storage.

We could recreate this from disk, but I've seen storage issues in interrupted compactors. For example, compactor shards that finish successfully delete blocks as expected, but stuck shards keep resuming and piling up meta files because cleanup only occurs after a full cycle. So these interrupted shards were hoarding blocks that were already deleted.

I also had shards that had been stuck for months, so the list of blocks on disk was way longer than the one on storage. By keeping only a list of blocks from the latest sync, I skip those deleted blocks and save disk lookups.

The compactors already use `BaseFetcher` to sync metadata. This adds 5 new fields:

* `enableCheckpoint` `bool`, true to enable checkpoints. False by default.
* `blocksDownloading` `struct`, sorted list to keep track of downloads pending.

Member

Also, why is this needed? Isn't it enough to store the last timestamp embedded in a block?

Contributor Author

Same as here: #8765 (comment)

To prevent a race condition where out-of-order downloads cause the checkpoint to jump ahead. This ensures we don't skip pending blocks that are lexicographically earlier but slower to download, avoiding gaps after a restart.

@GiedriusS

Member

@waltherlee out of curiosity: how long does it take to list blocks in your case right now? Is it really about an hour? If you can answer: do you use some homegrown or open-source object storage implementation, or something from SaaS providers? Is it the same with both listing strategies?

@waltherlee

Contributor Author

@GiedriusS Yes, over an hour with AWS S3 buckets in multiple clusters. It was mostly caused by a high rate of new blocks, a long list of pending ones (1M+ total per cluster), and a low compaction rate in some shards.

I could also check our GCP buckets, but the ones I checked when working on this were S3. However, we've had these checkpoints implemented in an internal Thanos fork for around 3 months now, so it no longer takes that long and our compactors already caught up with the queue. It takes around 7 minutes now to sync from disk, and milliseconds from the in-memory checkpoint.

@waltherlee

waltherlee commented Jun 19, 2026

Contributor Author

> Is it the same with both listing strategies?

The 1h+ figure was with recursive. I did try eager and remember it being faster, but not by much. I can check my notes next week. But the main problem we had was the interaction with the autoscaler. We have heavy blocks, so a fresh compactor usually needs time and multiple restarts to reach the right resources to process them. However, if the pod is constantly restarting and spending most of its time just reading from S3, the VPA scales it down, and the compactors fall into a cycle of scaling up and down while only listing. That's how we ended up with the long list of pending blocks.

@waltherlee waltherlee requested a review from GiedriusS June 19, 2026 22:10
@waltherlee

Contributor Author

Ok, I checked my notes and eager was taking ~30m in total, including disk lookups, but we still had 3 issues that checkpoints solved:

  • The scaling trap I mentioned above.
  • Disk lookups to check whether a block's meta was cached on disk during the initial listing were also taking a long time, so we wanted to avoid redundant lookups.
  • And we wanted to rely on recursive to avoid blasting S3 when dozens of shards are restarted at about the same time after a deployment.
