Checkpoint hanging when object store is enabled #1647

Open
KiKoS0 wants to merge 3 commits into microsoft:main from KiKoS0:riadh/fix-deadlock

KiKoS0 commented Mar 27, 2026

We noticed a primary Garnet instance hanging on a checkpoint indefinitely, so I took a closer look at a heap dump, found the hanging stack below, and traced it to a semaphore deadlock that can occur when the object store is enabled and there are in-flight transactions that must be awaited before a checkpoint can proceed.

It was easy to reproduce locally: I just hammered transactional commands and BGSAVE aggressively (I can share the repro if needed).

This has also broken the replication link, but I'm not yet sure whether that is a separate issue or a cascading effect of this one.

STACK 11
00007e9d56365678 00007f9539410b90 ( ) System.Threading.SemaphoreSlim+TaskNode
  00007e9d563656d0 00007f953a9be440 (1) Tsavorite.core.StateMachineDriver+<ProcessWaitingListAsync>d__34
    00007e9d56365758 00007f953a9be7f8 (0) Tsavorite.core.StateMachineDriver+<RunStateMachine>d__35
      00007e9d563657d0 00007f953a9bebb0 (0) Tsavorite.core.StateMachineDriver+<RunAsync>d__28
        00007e9d56365850 00007f953a9bef58 (0) Garnet.server.DatabaseManagerBase+<InitiateCheckpointAsync>d__70
          00007e9d563658f0 00007f953a9bf358 (0) Garnet.server.DatabaseManagerBase+<TakeCheckpointAsync>d__55
            00007e9d56365990 00007f953a9bfa78 (0) Garnet.server.SingleDatabaseManager+<TaskCheckpointBasedOnAofSizeLimitAsync>d__16
              00007e9ad6c00420 00007f953a1fc740 (1) Garnet.server.StoreWrapper+<AutoCheckpointBasedOnAofSizeLimit>d__77

TrackLastVersion is called once per store during IN_PROGRESS. Each call creates a new semaphore and overwrites lastVersionTransactionsDone, orphaning the previous one in the waitingList. DecrementActiveTransactions only releases the last one. ProcessWaitingListAsync blocks forever on the orphaned semaphore.

Since both stores share the same transaction counter, we only need one semaphore per version. If TrackLastVersion has already been called for a given version, subsequent calls return immediately.

Includes a regression test that fails without the fix.
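
To make the failure mode described above concrete, here is a minimal toy model in Python. It is not the Tsavorite code: the class `BuggyDriver`, its method names, and the single shared counter are assumptions chosen to mirror the description (two `TrackLastVersion` calls for one version, a release that only reaches the most recent semaphore). A timeout stands in for "blocks forever" so the script terminates.

```python
import threading


class BuggyDriver:
    """Hypothetical model of the pre-fix behavior; names are illustrative only."""

    def __init__(self):
        self.active_transactions = 0              # counter shared by both stores
        self.last_version_transactions_done = None
        self.waiting_list = []

    def begin_transaction(self):
        self.active_transactions += 1

    def track_last_version(self, version):
        # Each call creates a fresh semaphore and overwrites the field,
        # but the previous semaphore stays in the waiting list.
        if self.active_transactions > 0:
            self.last_version_transactions_done = threading.Semaphore(0)
            self.waiting_list.append(self.last_version_transactions_done)

    def end_transaction(self):
        self.active_transactions -= 1
        if self.active_transactions == 0 and self.last_version_transactions_done is not None:
            # Only the most recently created semaphore is released.
            self.last_version_transactions_done.release()

    def process_waiting_list(self, timeout):
        # The real driver would wait without a timeout; False marks a waiter
        # that was never released.
        return [sem.acquire(timeout=timeout) for sem in self.waiting_list]


driver = BuggyDriver()
driver.begin_transaction()
driver.track_last_version(1)   # called for the main store
driver.track_last_version(1)   # called again for the object store, same version
driver.end_transaction()

results = driver.process_waiting_list(timeout=0.1)
print(results)  # [False, True]: the first semaphore is orphaned
```

In this model the first entry in the waiting list can never be released, which corresponds to the `SemaphoreSlim+TaskNode` frame at the top of the stack in the heap dump.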

Copilot AI review requested due to automatic review settings March 27, 2026 02:31
KiKoS0 force-pushed the riadh/fix-deadlock branch from 9be6a9a to 879bc71, March 27, 2026 02:32
Copilot AI left a comment

Pull request overview

Fixes a checkpoint deadlock in Tsavorite’s state machine when two-store (main + object store) checkpoints call TrackLastVersion for the same version, which could orphan a semaphore in waitingList and hang ProcessWaitingListAsync (seen in Garnet with object store + in-flight transactions).

Changes:

  • Prevent TrackLastVersion from creating/enqueuing more than one semaphore per version.
  • Add a regression test that exercises calling TrackLastVersion twice for the same transaction version and verifies no orphaned waiters remain.
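
The two changes listed above can be sketched with a small Python model. This is an illustration under assumptions, not the code in `StateMachineDriver.cs` or `StateMachineDriverTests.cs`: `GuardedDriver`, its fields, and `run_double_track_scenario` are hypothetical names, and the guard shown (skip if the version was already tracked) is one plausible reading of "subsequent calls return immediately".

```python
import threading


class GuardedDriver:
    """Hypothetical model of the post-fix behavior; names are illustrative only."""

    def __init__(self):
        self.active_transactions = 0
        self.last_version = None
        self.last_version_transactions_done = None
        self.waiting_list = []

    def begin_transaction(self):
        self.active_transactions += 1

    def track_last_version(self, version):
        # Guard: a semaphore already exists for this version, so do nothing.
        if self.last_version == version:
            return
        if self.active_transactions > 0:
            self.last_version = version
            self.last_version_transactions_done = threading.Semaphore(0)
            self.waiting_list.append(self.last_version_transactions_done)

    def end_transaction(self):
        self.active_transactions -= 1
        if self.active_transactions == 0 and self.last_version_transactions_done is not None:
            self.last_version_transactions_done.release()


def run_double_track_scenario():
    """Regression-style check: track the same version twice, then drain waiters."""
    driver = GuardedDriver()
    driver.begin_transaction()
    driver.track_last_version(1)   # main store
    driver.track_last_version(1)   # object store, same version
    driver.end_transaction()
    # A short timeout keeps the check from hanging if a waiter were orphaned.
    released = [sem.acquire(timeout=0.1) for sem in driver.waiting_list]
    return len(driver.waiting_list), released


waiter_count, released = run_double_track_scenario()
assert waiter_count == 1       # only one semaphore was enqueued
assert released == [True]      # and it was released, so nothing is orphaned
```

Removing the guard from `track_last_version` in this model makes both assertions fail, which is the property a regression test for this fix needs.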

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Adds a guard in TrackLastVersion to avoid enqueueing duplicate semaphores for the same version (prevents deadlock). |
| libs/storage/Tsavorite/cs/test/StateMachineDriverTests.cs | Adds a regression test to validate the fix and prevent recurrence. |

KiKoS0 force-pushed the riadh/fix-deadlock branch 2 times, most recently from 6f2dabf to 96eb445, March 27, 2026 15:17
KiKoS0 and others added 2 commits March 30, 2026 22:57
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
KiKoS0 force-pushed the riadh/fix-deadlock branch from 96eb445 to ece3b7a, March 30, 2026 21:57