Fix snapshot flush and checkpoint exception propagation [dev]#1679

Merged
badrishc merged 15 commits into dev from fix/cluster-replication-tests on Apr 8, 2026

Conversation

Collaborator

@badrishc badrishc commented Apr 7, 2026

Fix snapshot flush ObjectIdMap use-after-free and checkpoint exception propagation

Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync accessed the page's ObjectIdMap after releasing epoch protection. Under low memory with small segments, HeadAddress could advance during epoch gaps, causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception in FlushRunner. Since FlushRunner ran via Task.Run, the exception was unobserved, the completion semaphore was never released, and the checkpoint hung indefinitely.
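The hang pattern described above can be reduced to a small sketch. All names here are illustrative stand-ins, not the actual Tsavorite code: a background flush throws, the fault is unobserved because nothing awaits the `Task.Run` task, and the semaphore waiter blocks forever.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FlushHangSketch
{
    static readonly SemaphoreSlim completion = new SemaphoreSlim(0);

    static void FlushRunner()
    {
        // Stand-in for the Debug.Assert failure in objectIdMap.Get().
        throw new InvalidOperationException("ObjectIdMap cleared by eviction (tail = -1)");
        // completion.Release();  // never reached, so the waiter hangs
    }

    // Returns true when the waiter times out, i.e. the checkpoint would hang.
    public static async Task<bool> Demonstrate()
    {
        _ = Task.Run(FlushRunner);            // fire-and-forget: fault is unobserved
        var waiter = completion.WaitAsync();
        return await Task.WhenAny(waiter, Task.Delay(200)) != waiter;
    }

    static async Task Main() =>
        Console.WriteLine(await Demonstrate()
            ? "timed out: semaphore never released"
            : "released");
}
```

The timeout in `Demonstrate` stands in for the indefinite checkpoint hang; in the real code there was no timeout at all.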

Fixes:

  • ObjectAllocatorImpl: Capture objectIdMap under epoch protection with a HeadAddress guard. Skip objectIdMap dereference (KeyOverflow, ValueOverflow, ValueObject) when HeadAddress advances past the current record, marking the record invalid instead. All objectIdMap accesses now occur strictly under epoch protection.
  • AllocatorBase: Change AsyncFlushPagesForSnapshot and AsyncFlushDeltaToDevice to return Task instead of SemaphoreSlim, so exceptions in FlushRunner propagate naturally through the Task to the state machine.
  • StateMachineDriver: Unify waiting list as List<(StateMachineTaskType, Task)>. Add StateMachineTaskType enum for identifying faulting tasks in error logs. ProcessWaitingListAsync logs the task type before re-throwing exceptions.
  • AsyncFlushPageForSnapshotCallback: Move result.Release() into finally block to ensure release even if dirty-bit clearing throws.
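The shape of the Task-based fix can be sketched as follows. The method name mirrors the PR description, but the body and signatures are illustrative: the flush entry point returns the running Task, so a faulting FlushRunner surfaces at the state machine's await instead of being lost inside `Task.Run`.

```csharp
using System;
using System.Threading.Tasks;

class TaskPropagationSketch
{
    // Illustrative: returns the Task instead of a SemaphoreSlim.
    public static Task AsyncFlushPagesForSnapshot() =>
        Task.Run(() => throw new InvalidOperationException("flush failed"));

    public static async Task<string> RunStateMachineStep()
    {
        try
        {
            await AsyncFlushPagesForSnapshot();   // fault propagates here
            return "completed";
        }
        catch (InvalidOperationException ex)
        {
            return $"checkpoint observed: {ex.Message}";
        }
    }

    static async Task Main() => Console.WriteLine(await RunStateMachineStep());
}
```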

Test fixes:

  • ClusterTestUtils.Checkpoint: Retry on 'checkpoint already in progress' (race with on-demand checkpoint from replication).
  • ClusterTestContext.AttachAndWaitForSync: Fix loop bounds bug (was i < replica_count, should be i < replicaStartIndex + replicaCount). Improve parameter names for clarity.
  • ClusterReplicationBaseTests: Use Task.WhenAll for parallel task waiting with diagnostic status on timeout.
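The checkpoint retry can be sketched as below. This is a hedged sketch, not the actual ClusterTestUtils code: `tryCheckpoint` (returning an error string or null), the 100 ms back-off, and the bound of 10 retries are all illustrative assumptions.

```csharp
using System;
using System.Threading;

static class CheckpointRetrySketch
{
    // Returns the attempt number that succeeded, for illustration.
    public static int CheckpointWithRetry(Func<string> tryCheckpoint, int maxRetries = 10)
    {
        for (int attempt = 1; attempt <= maxRetries; attempt++)
        {
            var error = tryCheckpoint();
            if (error is null) return attempt;                 // success
            if (!error.Contains("checkpoint already in progress",
                                StringComparison.OrdinalIgnoreCase))
                throw new InvalidOperationException(error);    // unrelated failure: fail fast
            Thread.Sleep(100);                                 // the on-demand checkpoint may finish soon
        }
        throw new TimeoutException("checkpoint still in progress after retries");
    }
}
```

Only the "already in progress" error is retried; any other error is rethrown immediately so real failures are not masked.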

Copilot AI review requested due to automatic review settings April 7, 2026 21:20
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch 2 times, most recently from f206b42 to 71fd6e9 on April 7, 2026 21:27
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a snapshot-checkpoint flush correctness issue (ObjectIdMap use-after-free under epoch gaps) and improves checkpoint failure behavior by propagating flush exceptions through the checkpoint state machine rather than leaving them unobserved/hanging.

Changes:

  • Adjust snapshot flush logic in ObjectAllocatorImpl to only access ObjectIdMap under epoch protection and to handle HeadAddress advancement by skipping object serialization and invalidating affected records.
  • Refactor Tsavorite checkpoint waiting from SemaphoreSlim-based completion to Task-based waiting, and enhance state-machine diagnostics via StateMachineTaskType.
  • Stabilize cluster replication tests by fixing replica attach loop bounds, adding checkpoint retry on “already in progress”, and improving timeout diagnostics.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Summary per file:

  • test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs: Increases test timeout, fixes shard query index, and improves parallel task waiting/diagnostics.
  • test/Garnet.test.cluster/ClusterTestUtils.cs: Retries checkpoint when a checkpoint is already in progress to reduce replication race flakiness.
  • test/Garnet.test.cluster/ClusterTestContext.cs: Fixes replica attach loop bounds and parameter clarity; aligns primary index usage in sync waits.
  • libs/storage/Tsavorite/cs/src/core/Utilities/PageAsyncResultTypes.cs: Updates the flush completion tracking structure in support of Task-based waiting changes.
  • libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs: Tags checkpoint waiters with task types for improved fault logging.
  • libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/RecoveryInfo.cs: Converts the checkpoint flush completion tracking field from a semaphore to a task.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineTaskType.cs: Adds an enum to identify faulting checkpoint wait tasks in logs.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs: Switches the waiting list to typed tasks and logs the task type on failure before rethrowing.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs: Waits on Task-based snapshot flush completion and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs: Waits on Task-based delta flush completion and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs: Converts the foldover flush completion semaphore to a Task waiter and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs: Fixes epoch scoping around ObjectIdMap usage during snapshot flush to avoid use-after-free.
  • libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs: Refactors snapshot/delta flush completion to Task-based signaling and adjusts callback release behavior.


@badrishc badrishc marked this pull request as draft April 7, 2026 21:31
@badrishc badrishc marked this pull request as ready for review April 7, 2026 21:50
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch from eae2688 to 61a93f3 on April 7, 2026 22:07
badrishc and others added 4 commits April 7, 2026 15:23
…n propagation

Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync
accessed the page's ObjectIdMap after releasing epoch protection. Under low
memory with small segments, HeadAddress could advance during epoch gaps,
causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent
objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception
in FlushRunner. Since FlushRunner ran via Task.Run, the exception was
unobserved, the completion semaphore was never released, and the checkpoint
hung indefinitely.

Fixes:
- ObjectAllocatorImpl: Capture objectIdMap under epoch protection with a
  HeadAddress guard. Skip objectIdMap dereference (KeyOverflow, ValueOverflow,
  ValueObject) when HeadAddress advances past the current record, marking the
  record invalid instead. All objectIdMap accesses now occur strictly under
  epoch protection.
- AllocatorBase: Change AsyncFlushPagesForSnapshot and AsyncFlushDeltaToDevice
  to return Task instead of SemaphoreSlim, so exceptions in FlushRunner
  propagate naturally through the Task to the state machine.
- StateMachineDriver: Unify waiting list as List<(StateMachineTaskType, Task)>.
  Add StateMachineTaskType enum for identifying faulting tasks in error logs.
  ProcessWaitingListAsync logs the task type before re-throwing exceptions.
- AsyncFlushPageForSnapshotCallback: Move result.Release() into finally block
  to ensure release even if dirty-bit clearing throws.

Test fixes:
- ClusterTestUtils.Checkpoint: Retry on 'checkpoint already in progress'
  (race with on-demand checkpoint from replication).
- ClusterTestContext.AttachAndWaitForSync: Fix loop bounds bug (was
  i < replica_count, should be i < replicaStartIndex + replicaCount).
  Improve parameter names for clarity.
- ClusterReplicationBaseTests: Use Task.WhenAll for parallel task waiting
  with diagnostic status on timeout. Increase CTS timeout to 120s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace SemaphoreSlim with TaskCompletionSource in FlushCompletionTracker
so that exceptions propagate to the state machine via faulted Task, and
IO callback completion is correctly tracked (TCS completes when count
reaches 0). Keep SemaphoreSlim only for per-page throttle waiting.

Remove dead _completedSemaphore from AsyncFlushPagesForSnapshot.
Fix final newline in StateMachineTaskType.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When flushSize <= 0, the page was skipped without calling CompleteFlush(),
so the FlushCompletionTracker count never reached 0, causing the
completionTcs to never complete — a potential deadlock. Now call
CompleteFlush() for skipped pages so the count is properly decremented.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
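The tracker design described in the two commits above can be sketched as follows. This is a minimal sketch of the count/TCS idea, not the actual FlushCompletionTracker: a pending-page counter paired with a TaskCompletionSource that completes when the count reaches zero, completes immediately when there is nothing to flush, and can be faulted so exceptions reach the awaiter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FlushCompletionTrackerSketch
{
    readonly TaskCompletionSource<bool> completionTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    int pending;

    public FlushCompletionTrackerSketch(int pageCount)
    {
        pending = pageCount;
        if (pending == 0)
            completionTcs.TrySetResult(true);    // nothing to flush: complete now
    }

    // Must be called for flushed AND skipped (flushSize <= 0) pages,
    // otherwise the count never reaches zero and the awaiter deadlocks.
    public void CompleteFlush()
    {
        if (Interlocked.Decrement(ref pending) == 0)
            completionTcs.TrySetResult(true);
    }

    public void Fault(Exception ex) => completionTcs.TrySetException(ex);

    public Task WhenFlushed => completionTcs.Task;
}
```

`RunContinuationsAsynchronously` keeps IO-callback threads from running state-machine continuations inline.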
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch 2 times, most recently from 03e740e to 0bd520a on April 7, 2026 22:38
Replace SemaphoreSlim parameter with bool enableThrottling so the
semaphore is a private implementation detail. Move FlushCompletionTracker
from PageAsyncResultTypes.cs to its own file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch from 0bd520a to 9588892 on April 7, 2026 22:47
badrishc and others added 3 commits April 7, 2026 15:53
Complete the TCS immediately when count is 0, so the state machine
does not hang if all pages in the range have flushSize <= 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The objectIdMap reference captured outside epoch is safe — no
dereference occurs until the per-record loop, which is already
inside epoch protection with a HeadAddress guard that skips
records via goto NextRecord when the page has been evicted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.



badrishc and others added 2 commits April 7, 2026 17:23
…n; harden test utility

StateMachineDriver: Store SemaphoreSlim directly in waiting list tuple
instead of eagerly calling WaitAsync(). Call WaitAsync(token) at await-time
in ProcessWaitingListAsync so cancellation propagates correctly and no
orphaned waits are created.

ClusterTestUtils.Checkpoint: Use StringComparison.OrdinalIgnoreCase for
error message check. Guard against null context with null-conditional.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert all remaining SemaphoreSlim-based completion signals to
TaskCompletionSource so the state machine waiting list is uniformly
Task-based. This eliminates orphaned SemaphoreSlim.WaitAsync() tasks
that could consume future Release() calls after cancellation.

Converted:
- StateMachineDriver.lastVersionTransactionsDone
- IndexCheckpoint.mainIndexCheckpointSemaphore -> mainIndexCheckpointTcs
- MallocFixedPageSize.checkpointSemaphore -> checkpointTcs
- AllocatorBase.notifyFlushedUntilAddressSemaphore -> notifyFlushedUntilAddressTcs
- ShiftReadOnlyToTail now returns out Task instead of out SemaphoreSlim

StateMachineDriver.AddToWaitingList reduced to single Task overload.
ProcessWaitingListAsync uses task.WaitAsync(token) for cancellation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
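The orphaned-waiter pitfall eliminated by the commit above can be demonstrated in isolation. This sketch is not Tsavorite code; it assumes `Task.WaitAsync(CancellationToken)`, which is available in .NET 6 and later: an eagerly started then abandoned `SemaphoreSlim.WaitAsync()` still consumes a later `Release()`, whereas canceling a `Task.WaitAsync(token)` leaves the underlying completion signal untouched.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class OrphanedWaiterSketch
{
    public static async Task<(bool semaphoreLost, bool taskSignalSurvives)> Demonstrate()
    {
        // Semaphore: the orphaned waiter eats the Release meant for 'next'.
        var sem = new SemaphoreSlim(0);
        var orphan = sem.WaitAsync();             // abandoned (e.g. after cancellation)
        sem.Release();                            // satisfied by the orphan...
        var next = sem.WaitAsync();
        bool semaphoreLost = !next.IsCompleted;   // ...so the real waiter starves

        // Task: cancellation aborts the wait, not the signal.
        var tcs = new TaskCompletionSource<bool>();
        using var cts = new CancellationTokenSource(50);
        try { await tcs.Task.WaitAsync(cts.Token); }
        catch (OperationCanceledException) { }
        tcs.TrySetResult(true);                   // still completes normally
        return (semaphoreLost, tcs.Task.IsCompletedSuccessfully);
    }

    static async Task Main()
    {
        var (lost, survives) = await Demonstrate();
        Console.WriteLine($"semaphore release lost: {lost}, task signal survives: {survives}");
    }
}
```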
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.



…ernal enum

- AllocatorBase: Null-conditional on notifyFlushedUntilAddressTcs to
  prevent NullReferenceException after Dispose or when no notification
  was requested.
- AllocatorBase: Drain throttle semaphore after asyncResult.Release() in
  the WriteNotIssued path, matching the flushSize<=0 pattern, so a
  no-op page's semaphore release doesn't satisfy the next real page's
  WaitOneFlush.
- StateMachineTaskType: Changed from public to internal — only used
  internally for logging and waiting list bookkeeping.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs:86

  • BeginMainIndexCheckpoint() uses a fire-and-forget Task.Run(FlushRunner) when throttling is enabled, but FlushRunner exceptions are not observed and mainIndexCheckpointTcs is never faulted/completed in that case. That can leave WAIT_INDEX_CHECKPOINT waiting forever. Consider wrapping FlushRunner in try/catch that calls mainIndexCheckpointTcs.TrySetException(ex), and/or storing/awaiting the Task so faults propagate (similar to the other flush paths that now return Tasks).
            mainIndexCheckpointTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

            if (throttleCheckpointFlushDelayMs >= 0)
                Task.Run(FlushRunner);
            else
                FlushRunner();
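The suggested guard can be sketched as follows; a later commit in this PR adopts this approach. The `Action` parameter and the class wrapper are illustrative, and normal completion is assumed to be signaled elsewhere (by the IO callback), so only the fault path touches the TCS here.

```csharp
using System;
using System.Threading.Tasks;

class GuardedFlushSketch
{
    public static readonly TaskCompletionSource<bool> mainIndexCheckpointTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public static void GuardedFlushRunner(Action flushRunner)
    {
        try
        {
            flushRunner();   // normal completion is signaled by the IO callback
        }
        catch (Exception ex)
        {
            mainIndexCheckpointTcs.TrySetException(ex);   // fault the TCS, don't hang
        }
    }
}
```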


badrishc and others added 3 commits April 7, 2026 18:17
- ObjectAllocatorImpl.WriteAsyncToDeviceForSnapshot: Recalculate
  pageFlushSize when fromAddress is adjusted due to HeadAddress
  advancement. Previously the original size was used, causing
  endOffset to overshoot untilAddress.

- AllocatorBase.AsyncFlushPageForSnapshotCallback: Advance
  startAddress alongside physicalAddress in the dirty-bit clearing
  loop. Previously CreateLogRecord was called with the same
  startAddress each iteration, always returning the first record.

- IndexCheckpoint.BeginMainIndexCheckpoint: Wrap FlushRunner in
  try/catch to fault the TCS on exception, preventing the state
  machine from hanging if index flush fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Setting asyncResult.partial = true when HeadAddress bumps fromAddress
lets WriteAsync recalculate endOffset and numBytesToWrite from the
actual from/until range. This is what the partial flag was designed
for and also correctly handles sector-aligned final write size.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HeadAddress can advance between reads. Capture it once so both the
full-eviction check and the partial-adjustment use a consistent value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ClusterTestUtils.Checkpoint: Add max 10 retries to prevent infinite
  loop if checkpoint stays in progress indefinitely.
- AllocatorBase.Dispose: Call TrySetCanceled() on notifyFlushedUntilAddressTcs
  before nulling, so any awaiter unblocks with OperationCanceledException
  instead of hanging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
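The Dispose behavior described above can be sketched like this. The field name mirrors the commit, but the class is illustrative, not the actual AllocatorBase: canceling the TCS before nulling the field unblocks any awaiter with OperationCanceledException instead of leaving it hanging.

```csharp
using System;
using System.Threading.Tasks;

class DisposeCancelSketch
{
    TaskCompletionSource<bool> notifyFlushedUntilAddressTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task Waiter => notifyFlushedUntilAddressTcs?.Task ?? Task.CompletedTask;

    public void Dispose()
    {
        notifyFlushedUntilAddressTcs?.TrySetCanceled();   // unblock awaiters first
        notifyFlushedUntilAddressTcs = null;
    }

    // Returns true when the awaiter unblocks via cancellation.
    public static async Task<bool> Demonstrate()
    {
        var sketch = new DisposeCancelSketch();
        var waiter = sketch.Waiter;
        sketch.Dispose();
        try { await waiter; return false; }
        catch (OperationCanceledException) { return true; }
    }

    static async Task Main() =>
        Console.WriteLine(await Demonstrate() ? "awaiter unblocked by cancellation" : "unexpected");
}
```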
Contributor

@vazois vazois left a comment


LGTM!

@badrishc badrishc merged commit 1dc10ec into dev Apr 8, 2026
30 checks passed
@badrishc badrishc deleted the fix/cluster-replication-tests branch April 8, 2026 02:31