Fix snapshot flush and checkpoint exception propagation [dev]#1679
Fix snapshot flush and checkpoint exception propagation [dev]#1679
Conversation
f206b42 to
71fd6e9
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a snapshot-checkpoint flush correctness issue (ObjectIdMap use-after-free under epoch gaps) and improves checkpoint failure behavior by propagating flush exceptions through the checkpoint state machine rather than leaving them unobserved/hanging.
Changes:
- Adjust snapshot flush logic in
ObjectAllocatorImplto only accessObjectIdMapunder epoch protection and to handle HeadAddress advancement by skipping object serialization and invalidating affected records. - Refactor Tsavorite checkpoint waiting from
SemaphoreSlim-based completion toTask-based waiting, and enhance state-machine diagnostics viaStateMachineTaskType. - Stabilize cluster replication tests by fixing replica attach loop bounds, adding checkpoint retry on “already in progress”, and improving timeout diagnostics.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Increases test timeout, fixes shard query index, and improves parallel task waiting/diagnostics. |
| test/Garnet.test.cluster/ClusterTestUtils.cs | Retries checkpoint when a checkpoint is already in progress to reduce replication race flakiness. |
| test/Garnet.test.cluster/ClusterTestContext.cs | Fixes replica attach loop bounds and parameter clarity; aligns primary index usage in sync waits. |
| libs/storage/Tsavorite/cs/src/core/Utilities/PageAsyncResultTypes.cs | Updates flush completion tracking structure in support of Task-based waiting changes. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs | Tags checkpoint waiters with task types for improved fault logging. |
| libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/RecoveryInfo.cs | Converts checkpoint flush completion tracking field from semaphore to task. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineTaskType.cs | Adds enum to identify faulting checkpoint wait tasks in logs. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Switches waiting list to typed tasks and logs task type on failure before rethrowing. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/SnapshotCheckpointSMTask.cs | Waits on Task-based snapshot flush completion and tags it for diagnostics. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/IncrementalSnapshotCheckpointSMTask.cs | Waits on Task-based delta flush completion and tags it for diagnostics. |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/FoldOverSMTask.cs | Converts foldover flush completion semaphore to a Task waiter and tags it for diagnostics. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | Fixes epoch scoping around ObjectIdMap usage during snapshot flush to avoid use-after-free. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Refactors snapshot/delta flush completion to Task-based signaling and adjusts callback release behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
libs/storage/Tsavorite/cs/src/core/Utilities/PageAsyncResultTypes.cs
Outdated
Show resolved
Hide resolved
eae2688 to
61a93f3
Compare
…n propagation Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync accessed the page's ObjectIdMap after releasing epoch protection. Under low memory with small segments, HeadAddress could advance during epoch gaps, causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception in FlushRunner. Since FlushRunner ran via Task.Run, the exception was unobserved, the completion semaphore was never released, and the checkpoint hung indefinitely. Fixes: - ObjectAllocatorImpl: Capture objectIdMap under epoch protection with a HeadAddress guard. Skip objectIdMap dereference (KeyOverflow, ValueOverflow, ValueObject) when HeadAddress advances past the current record, marking the record invalid instead. All objectIdMap accesses now occur strictly under epoch protection. - AllocatorBase: Change AsyncFlushPagesForSnapshot and AsyncFlushDeltaToDevice to return Task instead of SemaphoreSlim, so exceptions in FlushRunner propagate naturally through the Task to the state machine. - StateMachineDriver: Unify waiting list as List<(StateMachineTaskType, Task)>. Add StateMachineTaskType enum for identifying faulting tasks in error logs. ProcessWaitingListAsync logs the task type before re-throwing exceptions. - AsyncFlushPageForSnapshotCallback: Move result.Release() into finally block to ensure release even if dirty-bit clearing throws. Test fixes: - ClusterTestUtils.Checkpoint: Retry on 'checkpoint already in progress' (race with on-demand checkpoint from replication). - ClusterTestContext.AttachAndWaitForSync: Fix loop bounds bug (was i < replica_count, should be i < replicaStartIndex + replicaCount). Improve parameter names for clarity. - ClusterReplicationBaseTests: Use Task.WhenAll for parallel task waiting with diagnostic status on timeout. Increase CTS timeout to 120s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace SemaphoreSlim with TaskCompletionSource in FlushCompletionTracker so that exceptions propagate to the state machine via faulted Task, and IO callback completion is correctly tracked (TCS completes when count reaches 0). Keep SemaphoreSlim only for per-page throttle waiting. Remove dead _completedSemaphore from AsyncFlushPagesForSnapshot. Fix final newline in StateMachineTaskType.cs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When flushSize <= 0, the page was skipped without calling CompleteFlush(), so the FlushCompletionTracker count never reached 0, causing the completionTcs to never complete — a potential deadlock. Now call CompleteFlush() for skipped pages so the count is properly decremented. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
03e740e to
0bd520a
Compare
Replace SemaphoreSlim parameter with bool enableThrottling so the semaphore is a private implementation detail. Move FlushCompletionTracker from PageAsyncResultTypes.cs to its own file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0bd520a to
9588892
Compare
Complete the TCS immediately when count is 0, so the state machine does not hang if all pages in the range have flushSize <= 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The objectIdMap reference captured outside epoch is safe — no dereference occurs until the per-record loop, which is already inside epoch protection with a HeadAddress guard that skips records via goto NextRecord when the page has been evicted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs
Outdated
Show resolved
Hide resolved
…n; harden test utility StateMachineDriver: Store SemaphoreSlim directly in waiting list tuple instead of eagerly calling WaitAsync(). Call WaitAsync(token) at await-time in ProcessWaitingListAsync so cancellation propagates correctly and no orphaned waits are created. ClusterTestUtils.Checkpoint: Use StringComparison.OrdinalIgnoreCase for error message check. Guard against null context with null-conditional. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert all remaining SemaphoreSlim-based completion signals to TaskCompletionSource so the state machine waiting list is uniformly Task-based. This eliminates orphaned SemaphoreSlim.WaitAsync() tasks that could consume future Release() calls after cancellation. Converted: - StateMachineDriver.lastVersionTransactionsDone - IndexCheckpoint.mainIndexCheckpointSemaphore -> mainIndexCheckpointTcs - MallocFixedPageSize.checkpointSemaphore -> checkpointTcs - AllocatorBase.notifyFlushedUntilAddressSemaphore -> notifyFlushedUntilAddressTcs - ShiftReadOnlyToTail now returns out Task instead of out SemaphoreSlim StateMachineDriver.AddToWaitingList reduced to single Task overload. ProcessWaitingListAsync uses task.WaitAsync(token) for cancellation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineTaskType.cs
Outdated
Show resolved
Hide resolved
…ernal enum - AllocatorBase: Null-conditional on notifyFlushedUntilAddressTcs to prevent NullReferenceException after Dispose or when no notification was requested. - AllocatorBase: Drain throttle semaphore after asyncResult.Release() in the WriteNotIssued path, matching the flushSize<=0 pattern, so a no-op page's semaphore release doesn't satisfy the next real page's WaitOneFlush. - StateMachineTaskType: Changed from public to internal — only used internally for logging and waiting list bookkeeping. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexCheckpoint.cs:86
- BeginMainIndexCheckpoint() uses a fire-and-forget Task.Run(FlushRunner) when throttling is enabled, but FlushRunner exceptions are not observed and mainIndexCheckpointTcs is never faulted/completed in that case. That can leave WAIT_INDEX_CHECKPOINT waiting forever. Consider wrapping FlushRunner in try/catch that calls mainIndexCheckpointTcs.TrySetException(ex), and/or storing/awaiting the Task so faults propagate (similar to the other flush paths that now return Tasks).
mainIndexCheckpointTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
if (throttleCheckpointFlushDelayMs >= 0)
Task.Run(FlushRunner);
else
FlushRunner();
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- ObjectAllocatorImpl.WriteAsyncToDeviceForSnapshot: Recalculate pageFlushSize when fromAddress is adjusted due to HeadAddress advancement. Previously the original size was used, causing endOffset to overshoot untilAddress. - AllocatorBase.AsyncFlushPageForSnapshotCallback: Advance startAddress alongside physicalAddress in the dirty-bit clearing loop. Previously CreateLogRecord was called with the same startAddress each iteration, always returning the first record. - IndexCheckpoint.BeginMainIndexCheckpoint: Wrap FlushRunner in try/catch to fault the TCS on exception, preventing the state machine from hanging if index flush fails. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Setting asyncResult.partial = true when HeadAddress bumps fromAddress lets WriteAsync recalculate endOffset and numBytesToWrite from the actual from/until range. This is what the partial flag was designed for and also correctly handles sector-aligned final write size. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HeadAddress can advance between reads. Capture it once so both the full-eviction check and the partial-adjustment use a consistent value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ClusterTestUtils.Checkpoint: Add max 10 retries to prevent infinite loop if checkpoint stays in progress indefinitely. - AllocatorBase.Dispose: Call TrySetCanceled() on notifyFlushedUntilAddressTcs before nulling, so any awaiter unblocks with OperationCanceledException instead of hanging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix snapshot flush ObjectIdMap use-after-free and checkpoint exception propagation
Root cause: During snapshot checkpoint flush, ObjectAllocatorImpl.WriteAsync accessed the page's ObjectIdMap after releasing epoch protection. Under low memory with small segments, HeadAddress could advance during epoch gaps, causing page eviction which clears the ObjectIdMap (tail = -1). Subsequent objectIdMap.Get() hit a Debug.Assert failure, throwing an unhandled exception in FlushRunner. Since FlushRunner ran via Task.Run, the exception was unobserved, the completion semaphore was never released, and the checkpoint hung indefinitely.
Fixes:
Test fixes: