Fix snapshot flush and checkpoint exception propagation [dev]#1679

Merged
badrishc merged 15 commits into dev from fix/cluster-replication-tests on Apr 8, 2026

Conversation

Collaborator

@badrishc badrishc commented Apr 7, 2026

Fix snapshot flush ObjectIdMap use-after-free and checkpoint exception propagation

Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync accessed the page's ObjectIdMap after releasing epoch protection. Under low memory with small segments, HeadAddress could advance during epoch gaps, causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception in FlushRunner. Since FlushRunner ran via Task.Run, the exception was unobserved, the completion semaphore was never released, and the checkpoint hung indefinitely.
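The hang pattern described above can be reduced to a small sketch. All names here are illustrative stand-ins, not the actual Tsavorite code: a background flush throws, the fault is unobserved because nothing awaits the `Task.Run` task, and the semaphore waiter blocks forever.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FlushHangSketch
{
    static readonly SemaphoreSlim completion = new SemaphoreSlim(0);

    static void FlushRunner()
    {
        // Stand-in for the Debug.Assert failure in objectIdMap.Get().
        throw new InvalidOperationException("ObjectIdMap cleared by eviction (tail = -1)");
        // completion.Release();  // never reached, so the waiter hangs
    }

    // Returns true when the waiter times out, i.e. the checkpoint would hang.
    public static async Task<bool> Demonstrate()
    {
        _ = Task.Run(FlushRunner);            // fire-and-forget: fault is unobserved
        var waiter = completion.WaitAsync();
        return await Task.WhenAny(waiter, Task.Delay(200)) != waiter;
    }

    static async Task Main() =>
        Console.WriteLine(await Demonstrate()
            ? "timed out: semaphore never released"
            : "released");
}
```

The timeout in `Demonstrate` stands in for the indefinite checkpoint hang; in the real code there was no timeout at all.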

Fixes:

  • ObjectAllocatorImpl: Capture objectIdMap under epoch protection with a HeadAddress guard. Skip objectIdMap dereference (KeyOverflow, ValueOverflow, ValueObject) when HeadAddress advances past the current record, marking the record invalid instead. All objectIdMap accesses now occur strictly under epoch protection.
  • AllocatorBase: Change AsyncFlushPagesForSnapshot and AsyncFlushDeltaToDevice to return Task instead of SemaphoreSlim, so exceptions in FlushRunner propagate naturally through the Task to the state machine.
  • StateMachineDriver: Unify waiting list as List<(StateMachineTaskType, Task)>. Add StateMachineTaskType enum for identifying faulting tasks in error logs. ProcessWaitingListAsync logs the task type before re-throwing exceptions.
  • AsyncFlushPageForSnapshotCallback: Move result.Release() into finally block to ensure release even if dirty-bit clearing throws.
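The shape of the Task-based fix can be sketched as follows. The method name mirrors the PR description, but the body and signatures are illustrative: the flush entry point returns the running Task, so a faulting FlushRunner surfaces at the state machine's await instead of being lost inside `Task.Run`.

```csharp
using System;
using System.Threading.Tasks;

class TaskPropagationSketch
{
    // Illustrative: returns the Task instead of a SemaphoreSlim.
    public static Task AsyncFlushPagesForSnapshot() =>
        Task.Run(() => throw new InvalidOperationException("flush failed"));

    public static async Task<string> RunStateMachineStep()
    {
        try
        {
            await AsyncFlushPagesForSnapshot();   // fault propagates here
            return "completed";
        }
        catch (InvalidOperationException ex)
        {
            return $"checkpoint observed: {ex.Message}";
        }
    }

    static async Task Main() => Console.WriteLine(await RunStateMachineStep());
}
```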

Test fixes:

  • ClusterTestUtils.Checkpoint: Retry on 'checkpoint already in progress' (race with on-demand checkpoint from replication).
  • ClusterTestContext.AttachAndWaitForSync: Fix loop bounds bug (was i < replica_count, should be i < replicaStartIndex + replicaCount). Improve parameter names for clarity.
  • ClusterReplicationBaseTests: Use Task.WhenAll for parallel task waiting with diagnostic status on timeout.
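The checkpoint retry can be sketched as below. This is a hedged sketch, not the actual ClusterTestUtils code: `tryCheckpoint` (returning an error string or null), the 100 ms back-off, and the bound of 10 retries are all illustrative assumptions.

```csharp
using System;
using System.Threading;

static class CheckpointRetrySketch
{
    // Returns the attempt number that succeeded, for illustration.
    public static int CheckpointWithRetry(Func<string> tryCheckpoint, int maxRetries = 10)
    {
        for (int attempt = 1; attempt <= maxRetries; attempt++)
        {
            var error = tryCheckpoint();
            if (error is null) return attempt;                 // success
            if (!error.Contains("checkpoint already in progress",
                                StringComparison.OrdinalIgnoreCase))
                throw new InvalidOperationException(error);    // unrelated failure: fail fast
            Thread.Sleep(100);                                 // the on-demand checkpoint may finish soon
        }
        throw new TimeoutException("checkpoint still in progress after retries");
    }
}
```

Only the "already in progress" error is retried; any other error is rethrown immediately so real failures are not masked.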

Copilot AI review requested due to automatic review settings April 7, 2026 21:20
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch 2 times, most recently from f206b42 to 71fd6e9 on April 7, 2026 21:27
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a snapshot-checkpoint flush correctness issue (ObjectIdMap use-after-free under epoch gaps) and improves checkpoint failure behavior by propagating flush exceptions through the checkpoint state machine rather than leaving them unobserved/hanging.

Changes:

  • Adjust snapshot flush logic in ObjectAllocatorImpl to only access ObjectIdMap under epoch protection and to handle HeadAddress advancement by skipping object serialization and invalidating affected records.
  • Refactor Tsavorite checkpoint waiting from SemaphoreSlim-based completion to Task-based waiting, and enhance state-machine diagnostics via StateMachineTaskType.
  • Stabilize cluster replication tests by fixing replica attach loop bounds, adding checkpoint retry on “already in progress”, and improving timeout diagnostics.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Summary per file:

  • test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs: Increases test timeout, fixes shard query index, and improves parallel task waiting/diagnostics.
  • test/Garnet.test.cluster/ClusterTestUtils.cs: Retries checkpoint when a checkpoint is already in progress to reduce replication race flakiness.
  • test/Garnet.test.cluster/ClusterTestContext.cs: Fixes replica attach loop bounds and parameter clarity; aligns primary index usage in sync waits.
  • libs/storage/Tsavorite/cs/src/core/Utilities/PageAsyncResultTypes.cs: Updates the flush completion tracking structure in support of Task-based waiting changes.
  • libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs: Tags checkpoint waiters with task types for improved fault logging.
  • libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/RecoveryInfo.cs: Converts the checkpoint flush completion tracking field from a semaphore to a task.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineTaskType.cs: Adds an enum to identify faulting checkpoint wait tasks in logs.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs: Switches the waiting list to typed tasks and logs the task type on failure before rethrowing.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs: Waits on Task-based snapshot flush completion and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs: Waits on Task-based delta flush completion and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs: Converts the foldover flush completion semaphore to a Task waiter and tags it for diagnostics.
  • libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs: Fixes epoch scoping around ObjectIdMap usage during snapshot flush to avoid use-after-free.
  • libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs: Refactors snapshot/delta flush completion to Task-based signaling and adjusts callback release behavior.


@badrishc badrishc marked this pull request as draft April 7, 2026 21:31
@badrishc badrishc marked this pull request as ready for review April 7, 2026 21:50
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch from eae2688 to 61a93f3 on April 7, 2026 22:07
badrishc and others added 4 commits April 7, 2026 15:23
…n propagation

Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync
accessed the page's ObjectIdMap after releasing epoch protection. Under low
memory with small segments, HeadAddress could advance during epoch gaps,
causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent
objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception
in FlushRunner. Since FlushRunner ran via Task.Run, the exception was
unobserved, the completion semaphore was never released, and the checkpoint
hung indefinitely.

Fixes:
- ObjectAllocatorImpl: Capture objectIdMap under epoch protection with a
  HeadAddress guard. Skip objectIdMap dereference (KeyOverflow, ValueOverflow,
  ValueObject) when HeadAddress advances past the current record, marking the
  record invalid instead. All objectIdMap accesses now occur strictly under
  epoch protection.
- AllocatorBase: Change AsyncFlushPagesForSnapshot and AsyncFlushDeltaToDevice
  to return Task instead of SemaphoreSlim, so exceptions in FlushRunner
  propagate naturally through the Task to the state machine.
- StateMachineDriver: Unify waiting list as List<(StateMachineTaskType, Task)>.
  Add StateMachineTaskType enum for identifying faulting tasks in error logs.
  ProcessWaitingListAsync logs the task type before re-throwing exceptions.
- AsyncFlushPageForSnapshotCallback: Move result.Release() into finally block
  to ensure release even if dirty-bit clearing throws.

Test fixes:
- ClusterTestUtils.Checkpoint: Retry on 'checkpoint already in progress'
  (race with on-demand checkpoint from replication).
- ClusterTestContext.AttachAndWaitForSync: Fix loop bounds bug (was
  i < replica_count, should be i < replicaStartIndex + replicaCount).
  Improve parameter names for clarity.
- ClusterReplicationBaseTests: Use Task.WhenAll for parallel task waiting
  with diagnostic status on timeout. Increase CTS timeout to 120s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace SemaphoreSlim with TaskCompletionSource in FlushCompletionTracker
so that exceptions propagate to the state machine via faulted Task, and
IO callback completion is correctly tracked (TCS completes when count
reaches 0). Keep SemaphoreSlim only for per-page throttle waiting.

Remove dead _completedSemaphore from AsyncFlushPagesForSnapshot.
Fix final newline in StateMachineTaskType.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When flushSize <= 0, the page was skipped without calling CompleteFlush(),
so the FlushCompletionTracker count never reached 0, causing the
completionTcs to never complete — a potential deadlock. Now call
CompleteFlush() for skipped pages so the count is properly decremented.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
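The tracker design described in the two commits above can be sketched as follows. This is a minimal sketch of the count/TCS idea, not the actual FlushCompletionTracker: a pending-page counter paired with a TaskCompletionSource that completes when the count reaches zero, completes immediately when there is nothing to flush, and can be faulted so exceptions reach the awaiter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FlushCompletionTrackerSketch
{
    readonly TaskCompletionSource<bool> completionTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    int pending;

    public FlushCompletionTrackerSketch(int pageCount)
    {
        pending = pageCount;
        if (pending == 0)
            completionTcs.TrySetResult(true);    // nothing to flush: complete now
    }

    // Must be called for flushed AND skipped (flushSize <= 0) pages,
    // otherwise the count never reaches zero and the awaiter deadlocks.
    public void CompleteFlush()
    {
        if (Interlocked.Decrement(ref pending) == 0)
            completionTcs.TrySetResult(true);
    }

    public void Fault(Exception ex) => completionTcs.TrySetException(ex);

    public Task WhenFlushed => completionTcs.Task;
}
```

`RunContinuationsAsynchronously` keeps IO-callback threads from running state-machine continuations inline.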
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch 2 times, most recently from 03e740e to 0bd520a on April 7, 2026 22:38
Replace SemaphoreSlim parameter with bool enableThrottling so the
semaphore is a private implementation detail. Move FlushCompletionTracker
from PageAsyncResultTypes.cs to its own file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the fix/cluster-replication-tests branch from 0bd520a to 9588892 on April 7, 2026 22:47
badrishc and others added 3 commits April 7, 2026 15:53
Complete the TCS immediately when count is 0, so the state machine
does not hang if all pages in the range have flushSize <= 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The objectIdMap reference captured outside epoch is safe — no
dereference occurs until the per-record loop, which is already
inside epoch protection with a HeadAddress guard that skips
records via goto NextRecord when the page has been evicted.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.



badrishc and others added 2 commits April 7, 2026 17:23
…n; harden test utility

StateMachineDriver: Store SemaphoreSlim directly in waiting list tuple
instead of eagerly calling WaitAsync(). Call WaitAsync(token) at await-time
in ProcessWaitingListAsync so cancellation propagates correctly and no
orphaned waits are created.

ClusterTestUtils.Checkpoint: Use StringComparison.OrdinalIgnoreCase for
error message check. Guard against null context with null-conditional.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert all remaining SemaphoreSlim-based completion signals to
TaskCompletionSource so the state machine waiting list is uniformly
Task-based. This eliminates orphaned SemaphoreSlim.WaitAsync() tasks
that could consume future Release() calls after cancellation.

Converted:
- StateMachineDriver.lastVersionTransactionsDone
- IndexCheckpoint.mainIndexCheckpointSemaphore -> mainIndexCheckpointTcs
- MallocFixedPageSize.checkpointSemaphore -> checkpointTcs
- AllocatorBase.notifyFlushedUntilAddressSemaphore -> notifyFlushedUntilAddressTcs
- ShiftReadOnlyToTail now returns out Task instead of out SemaphoreSlim

StateMachineDriver.AddToWaitingList reduced to single Task overload.
ProcessWaitingListAsync uses task.WaitAsync(token) for cancellation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
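The orphaned-waiter pitfall eliminated by the commit above can be demonstrated in isolation. This sketch is not Tsavorite code; it assumes `Task.WaitAsync(CancellationToken)`, which is available in .NET 6 and later: an eagerly started then abandoned `SemaphoreSlim.WaitAsync()` still consumes a later `Release()`, whereas canceling a `Task.WaitAsync(token)` leaves the underlying completion signal untouched.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class OrphanedWaiterSketch
{
    public static async Task<(bool semaphoreLost, bool taskSignalSurvives)> Demonstrate()
    {
        // Semaphore: the orphaned waiter eats the Release meant for 'next'.
        var sem = new SemaphoreSlim(0);
        var orphan = sem.WaitAsync();             // abandoned (e.g. after cancellation)
        sem.Release();                            // satisfied by the orphan...
        var next = sem.WaitAsync();
        bool semaphoreLost = !next.IsCompleted;   // ...so the real waiter starves

        // Task: cancellation aborts the wait, not the signal.
        var tcs = new TaskCompletionSource<bool>();
        using var cts = new CancellationTokenSource(50);
        try { await tcs.Task.WaitAsync(cts.Token); }
        catch (OperationCanceledException) { }
        tcs.TrySetResult(true);                   // still completes normally
        return (semaphoreLost, tcs.Task.IsCompletedSuccessfully);
    }

    static async Task Main()
    {
        var (lost, survives) = await Demonstrate();
        Console.WriteLine($"semaphore release lost: {lost}, task signal survives: {survives}");
    }
}
```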
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.



…ernal enum

- AllocatorBase: Null-conditional on notifyFlushedUntilAddressTcs to
  prevent NullReferenceException after Dispose or when no notification
  was requested.
- AllocatorBase: Drain throttle semaphore after asyncResult.Release() in
  the WriteNotIssued path, matching the flushSize<=0 pattern, so a
  no-op page's semaphore release doesn't satisfy the next real page's
  WaitOneFlush.
- StateMachineTaskType: Changed from public to internal — only used
  internally for logging and waiting list bookkeeping.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs:86

  • BeginMainIndexCheckpoint() uses a fire-and-forget Task.Run(FlushRunner) when throttling is enabled, but FlushRunner exceptions are not observed and mainIndexCheckpointTcs is never faulted/completed in that case. That can leave WAIT_INDEX_CHECKPOINT waiting forever. Consider wrapping FlushRunner in try/catch that calls mainIndexCheckpointTcs.TrySetException(ex), and/or storing/awaiting the Task so faults propagate (similar to the other flush paths that now return Tasks).
            mainIndexCheckpointTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

            if (throttleCheckpointFlushDelayMs >= 0)
                Task.Run(FlushRunner);
            else
                FlushRunner();
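The suggested guard can be sketched as follows; a later commit in this PR adopts this approach. The `Action` parameter and the class wrapper are illustrative, and normal completion is assumed to be signaled elsewhere (by the IO callback), so only the fault path touches the TCS here.

```csharp
using System;
using System.Threading.Tasks;

class GuardedFlushSketch
{
    public static readonly TaskCompletionSource<bool> mainIndexCheckpointTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public static void GuardedFlushRunner(Action flushRunner)
    {
        try
        {
            flushRunner();   // normal completion is signaled by the IO callback
        }
        catch (Exception ex)
        {
            mainIndexCheckpointTcs.TrySetException(ex);   // fault the TCS, don't hang
        }
    }
}
```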


badrishc and others added 3 commits April 7, 2026 18:17
- ObjectAllocatorImpl.WriteAsyncToDeviceForSnapshot: Recalculate
  pageFlushSize when fromAddress is adjusted due to HeadAddress
  advancement. Previously the original size was used, causing
  endOffset to overshoot untilAddress.

- AllocatorBase.AsyncFlushPageForSnapshotCallback: Advance
  startAddress alongside physicalAddress in the dirty-bit clearing
  loop. Previously CreateLogRecord was called with the same
  startAddress each iteration, always returning the first record.

- IndexCheckpoint.BeginMainIndexCheckpoint: Wrap FlushRunner in
  try/catch to fault the TCS on exception, preventing the state
  machine from hanging if index flush fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Setting asyncResult.partial = true when HeadAddress bumps fromAddress
lets WriteAsync recalculate endOffset and numBytesToWrite from the
actual from/until range. This is what the partial flag was designed
for and also correctly handles sector-aligned final write size.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HeadAddress can advance between reads. Capture it once so both the
full-eviction check and the partial-adjustment use a consistent value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ClusterTestUtils.Checkpoint: Add max 10 retries to prevent infinite
  loop if checkpoint stays in progress indefinitely.
- AllocatorBase.Dispose: Call TrySetCanceled() on notifyFlushedUntilAddressTcs
  before nulling, so any awaiter unblocks with OperationCanceledException
  instead of hanging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
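The Dispose behavior described above can be sketched like this. The field name mirrors the commit, but the class is illustrative, not the actual AllocatorBase: canceling the TCS before nulling the field unblocks any awaiter with OperationCanceledException instead of leaving it hanging.

```csharp
using System;
using System.Threading.Tasks;

class DisposeCancelSketch
{
    TaskCompletionSource<bool> notifyFlushedUntilAddressTcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task Waiter => notifyFlushedUntilAddressTcs?.Task ?? Task.CompletedTask;

    public void Dispose()
    {
        notifyFlushedUntilAddressTcs?.TrySetCanceled();   // unblock awaiters first
        notifyFlushedUntilAddressTcs = null;
    }

    // Returns true when the awaiter unblocks via cancellation.
    public static async Task<bool> Demonstrate()
    {
        var sketch = new DisposeCancelSketch();
        var waiter = sketch.Waiter;
        sketch.Dispose();
        try { await waiter; return false; }
        catch (OperationCanceledException) { return true; }
    }

    static async Task Main() =>
        Console.WriteLine(await Demonstrate() ? "awaiter unblocked by cancellation" : "unexpected");
}
```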
Contributor

@vazois vazois left a comment


LGTM!

@badrishc badrishc merged commit 1dc10ec into dev Apr 8, 2026
30 checks passed
@badrishc badrishc deleted the fix/cluster-replication-tests branch April 8, 2026 02:31