Fix snapshot install livelock with background snapshot IO#118
Open
antonio2368 wants to merge 1 commit into
Open
Fix snapshot install livelock with background snapshot IO#118antonio2368 wants to merge 1 commit into
antonio2368 wants to merge 1 commit into
Conversation
With use_bg_thread_for_snapshot_io_, create_sync_snapshot_req only queues the object read and returns, so the peer's snapshot_sync_ctx stays at offset 0 for the whole first-object round trip. The sync IO path never exposes this state: it reads and sends the object inside the same call while the peer is busy, so offset 0 with a live context is never observed concurrently. On the async path that window is open, and the leader's append cycle can during it: - re-select a newer snapshot via the "offset == 0" clause in create_sync_snapshot_req, replacing the in-flight sync context, or - clear the context through the "peer caught up past snapshot" branch in create_append_entries_req. Either way the follower's accepted response for object 0 arrives to find no matching sync context and is dropped, offset never advances, and the next cycle restarts the install from object 0 against a newer snapshot. With a state machine that creates snapshots frequently the transfer restarts forever and the follower can never catch up. Fix by restoring the serialization invariant the sync path relies on, as explicit per-peer state on snapshot_sync_ctx: - async_snapshot_transfer_started_ pins the snapshot for the lifetime of the install: no re-selection on offset 0 and no caught-up clear while a transfer is in flight, - async_snapshot_request_in_progress_ serializes one object round trip: request_append_entries is a no-op for the peer while a request is queued, being read, or awaiting its response. The flags are cleared on every exit path: response handled (continue, done, declined), snapshot read failure, queue push failure, RPC error (including errors without a request), sync context timeout, leadership loss, peer removal, and io manager shutdown. The background worker also revalidates leadership and sync context identity after the object read and before sending, dropping stale reads instead of sending them. The existing sync context timeout still applies while a transfer is pinned, so a dead follower cannot wedge the leader's snapshot state. The wire protocol is unchanged. Observed in production as a follower stuck in RECEIVING_SNAPSHOT for 40 minutes: 43 install attempts, all at obj 0, each acked by the follower with NextIndex=1 and each ack dropped by the leader, while the same snapshot transferred in 14 seconds once a leader using sync IO took over. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
With
use_bg_thread_for_snapshot_io_enabled, snapshot install to a lagging follower can livelock: the transfer restarts from object 0 forever.On the sync path, object 0 is read and sent inside
create_sync_snapshot_req, sooffset == 0with a live sync context is never observed concurrently. On the async path the call only queues the read, sooffsetstays 0 for the whole first-object round trip. During that window the leader's append cycle either re-selects a newer snapshot (theoffset == 0clause) or clears the context (the "caught up past snapshot" branch) — so the follower's object-0 ack finds no live context and is dropped, and the install restarts against a newer snapshot every time.Observed in production on a ClickHouse Keeper cluster (~14 GB snapshots, new snapshot every ~40 s): 43 install attempts over 40 minutes, all stuck at object 0. The same snapshot transferred in 14 s once a sync-IO leader took over.
Fix
Add per-peer pin state to
snapshot_sync_ctx:async_snapshot_transfer_started_pins the snapshot for the whole install: no re-selection atoffset == 0, no caught-up clear while a transfer is in flight.async_snapshot_request_in_progress_serializes one object round trip.The pin is cleared on every exit path (response done/continue/declined, read or push failure, RPC error, sync-context timeout, leadership loss, shutdown), and the background worker revalidates leadership and context identity after the read, before sending. The timeout backstop still applies while pinned, so a dead follower cannot wedge the leader. Wire protocol unchanged.
Testing
New unit tests (
snapshot_test.cxx) and ASIO assertions; a deterministic ClickHouse integration test reproduces the livelock without this fix and passes with it.