Skip to content

Fix snapshot install livelock with background snapshot IO#118

Open
antonio2368 wants to merge 1 commit into
masterfrom
fix-async-snapshot-io-livelock
Open

Fix snapshot install livelock with background snapshot IO#118
antonio2368 wants to merge 1 commit into
masterfrom
fix-async-snapshot-io-livelock

Conversation

@antonio2368

@antonio2368 antonio2368 commented Jun 10, 2026

Copy link
Copy Markdown
Member

Problem

With use_bg_thread_for_snapshot_io_ enabled, snapshot install to a lagging follower can livelock: the transfer restarts from object 0 forever.

On the sync path, object 0 is read and sent inside create_sync_snapshot_req, so offset == 0 with a live sync context is never observed concurrently. On the async path the call only queues the read, so offset stays 0 for the whole first-object round trip. During that window the leader's append cycle either re-selects a newer snapshot (the offset == 0 clause) or clears the context (the "caught up past snapshot" branch) — so the follower's object-0 ack finds no live context and is dropped, and the install restarts against a newer snapshot every time.

Observed in production on a ClickHouse Keeper cluster (~14 GB snapshots, new snapshot every ~40 s): 43 install attempts over 40 minutes, all stuck at object 0. The same snapshot transferred in 14 s once a sync-IO leader took over.

Fix

Add per-peer pin state to snapshot_sync_ctx:

  • async_snapshot_transfer_started_ pins the snapshot for the whole install: no re-selection at offset == 0, no caught-up clear while a transfer is in flight.
  • async_snapshot_request_in_progress_ serializes one object round trip.

The pin is cleared on every exit path (response done/continue/declined, read or push failure, RPC error, sync-context timeout, leadership loss, shutdown), and the background worker revalidates leadership and context identity after the read, before sending. The timeout backstop still applies while pinned, so a dead follower cannot wedge the leader. Wire protocol unchanged.

Testing

New unit tests (snapshot_test.cxx) and ASIO assertions; a deterministic ClickHouse integration test reproduces the livelock without this fix and passes with it.

With use_bg_thread_for_snapshot_io_, create_sync_snapshot_req only
queues the object read and returns, so the peer's snapshot_sync_ctx
stays at offset 0 for the whole first-object round trip. The sync IO
path never exposes this state: it reads and sends the object inside the
same call while the peer is busy, so offset 0 with a live context is
never observed concurrently.

On the async path that window is open, and the leader's append cycle
can during it:
 - re-select a newer snapshot via the "offset == 0" clause in
   create_sync_snapshot_req, replacing the in-flight sync context, or
 - clear the context through the "peer caught up past snapshot" branch
   in create_append_entries_req.

Either way the follower's accepted response for object 0 arrives to
find no matching sync context and is dropped, offset never advances,
and the next cycle restarts the install from object 0 against a newer
snapshot. With a state machine that creates snapshots frequently the
transfer restarts forever and the follower can never catch up.

Fix by restoring the serialization invariant the sync path relies on,
as explicit per-peer state on snapshot_sync_ctx:
 - async_snapshot_transfer_started_ pins the snapshot for the lifetime
   of the install: no re-selection on offset 0 and no caught-up clear
   while a transfer is in flight,
 - async_snapshot_request_in_progress_ serializes one object round
   trip: request_append_entries is a no-op for the peer while a request
   is queued, being read, or awaiting its response.

The flags are cleared on every exit path: response handled (continue,
done, declined), snapshot read failure, queue push failure, RPC error
(including errors without a request), sync context timeout, leadership
loss, peer removal, and io manager shutdown. The background worker also
revalidates leadership and sync context identity after the object read
and before sending, dropping stale reads instead of sending them.

The existing sync context timeout still applies while a transfer is
pinned, so a dead follower cannot wedge the leader's snapshot state.
The wire protocol is unchanged.

Observed in production as a follower stuck in RECEIVING_SNAPSHOT for 40
minutes: 43 install attempts, all at obj 0, each acked by the follower
with NextIndex=1 and each ack dropped by the leader, while the same
snapshot transferred in 14 seconds once a leader using sync IO took
over.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant