Checkpoint hanging when object store is enabled#1647
Open
KiKoS0 wants to merge 3 commits intomicrosoft:mainfrom
Open
Checkpoint hanging when object store is enabled#1647KiKoS0 wants to merge 3 commits intomicrosoft:mainfrom
KiKoS0 wants to merge 3 commits intomicrosoft:mainfrom
Conversation
9be6a9a to
879bc71
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a checkpoint deadlock in Tsavorite’s state machine when two-store (main + object store) checkpoints call TrackLastVersion for the same version, which could orphan a semaphore in waitingList and hang ProcessWaitingListAsync (seen in Garnet with object store + in-flight transactions).
Changes:
- Prevent
TrackLastVersionfrom creating/enqueuing more than one semaphore per version. - Add a regression test that exercises calling
TrackLastVersiontwice for the same transaction version and verifies no orphaned waiters remain.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Adds a guard in TrackLastVersion to avoid enqueueing duplicate semaphores for the same version (prevents deadlock). |
| libs/storage/Tsavorite/cs/test/StateMachineDriverTests.cs | Adds a regression test to validate the fix and prevent recurrence. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6f2dabf to
96eb445
Compare
TrackLastVersion is called once per store during IN_PROGRESS. Each call creates a new semaphore and overwrites lastVersionTransactionsDone, orphaning the previous one in the waitingList. DecrementActiveTransactions only releases the last one. ProcessWaitingListAsync blocks forever on the orphaned semaphore. Since both stores share the same transaction counter, we only need one semaphore per version. If TrackLastVersion has already been called for a given version, subsequent calls return immediately. Includes a regression test that fails without the fix.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
96eb445 to
ece3b7a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
We noticed a primary Garnet hanging on a checkpoint forever so I took a closer look at a heap dump, noticed this hanging stack and traced it to a semaphore deadlock that can happen when the Object store is enabled and there are in-flight transactions that need to be awaited before a checkpoint can proceed.
It was easily reproducible locally, I just hammered transactional commands and
BGSAVEaggressively (i can share it if needed)This has also broken the replication link but i'm not yet sure if that's a different issue or a cascading effect of this issue yet.
TrackLastVersionis called once per store duringIN_PROGRESS. Each call creates a new semaphore and overwriteslastVersionTransactionsDone, orphaning the previous one in thewaitingList.DecrementActiveTransactionsonly releases the last one.ProcessWaitingListAsyncblocks forever on the orphaned semaphore.Since both stores share the same transaction counter, we only need one semaphore per version. If
TrackLastVersionhas already been called for a given version, subsequent calls return immediately.Includes a regression test that fails without the fix.