Fix snapshot install livelock with background snapshot IO by antonio2368 · Pull Request #118 · ClickHouse/NuRaft

antonio2368 · 2026-06-10T08:58:23Z

Problem

With use_bg_thread_for_snapshot_io_ enabled, snapshot install to a lagging follower can livelock: the transfer restarts from object 0 forever.

On the sync path, object 0 is read and sent inside create_sync_snapshot_req, so offset == 0 with a live sync context is never observed concurrently. On the async path the call only queues the read, so offset stays 0 for the whole first-object round trip. During that window the leader's append cycle either re-selects a newer snapshot (the offset == 0 clause) or clears the context (the "caught up past snapshot" branch) — so the follower's object-0 ack finds no live context and is dropped, and the install restarts against a newer snapshot every time.

Observed in production on a ClickHouse Keeper cluster (~14 GB snapshots, new snapshot every ~40 s): 43 install attempts over 40 minutes, all stuck at object 0. The same snapshot transferred in 14 s once a sync-IO leader took over.

Fix

Add per-peer pin state to snapshot_sync_ctx:

async_snapshot_transfer_started_ pins the snapshot for the whole install: no re-selection at offset == 0, no caught-up clear while a transfer is in flight.
async_snapshot_request_in_progress_ serializes one object round trip.

The pin is cleared on every exit path (response done/continue/declined, read or push failure, RPC error, sync-context timeout, leadership loss, shutdown), and the background worker revalidates leadership and context identity after the read, before sending. The timeout backstop still applies while pinned, so a dead follower cannot wedge the leader. Wire protocol unchanged.

Testing

New unit tests (snapshot_test.cxx) and ASIO assertions; a deterministic ClickHouse integration test reproduces the livelock without this fix and passes with it.

With use_bg_thread_for_snapshot_io_, create_sync_snapshot_req only queues the object read and returns, so the peer's snapshot_sync_ctx stays at offset 0 for the whole first-object round trip. The sync IO path never exposes this state: it reads and sends the object inside the same call while the peer is busy, so offset 0 with a live context is never observed concurrently. On the async path that window is open, and the leader's append cycle can during it: - re-select a newer snapshot via the "offset == 0" clause in create_sync_snapshot_req, replacing the in-flight sync context, or - clear the context through the "peer caught up past snapshot" branch in create_append_entries_req. Either way the follower's accepted response for object 0 arrives to find no matching sync context and is dropped, offset never advances, and the next cycle restarts the install from object 0 against a newer snapshot. With a state machine that creates snapshots frequently the transfer restarts forever and the follower can never catch up. Fix by restoring the serialization invariant the sync path relies on, as explicit per-peer state on snapshot_sync_ctx: - async_snapshot_transfer_started_ pins the snapshot for the lifetime of the install: no re-selection on offset 0 and no caught-up clear while a transfer is in flight, - async_snapshot_request_in_progress_ serializes one object round trip: request_append_entries is a no-op for the peer while a request is queued, being read, or awaiting its response. The flags are cleared on every exit path: response handled (continue, done, declined), snapshot read failure, queue push failure, RPC error (including errors without a request), sync context timeout, leadership loss, peer removal, and io manager shutdown. The background worker also revalidates leadership and sync context identity after the object read and before sending, dropping stale reads instead of sending them. The existing sync context timeout still applies while a transfer is pinned, so a dead follower cannot wedge the leader's snapshot state. The wire protocol is unchanged. Observed in production as a follower stuck in RECEIVING_SNAPSHOT for 40 minutes: 43 install attempts, all at obj 0, each acked by the follower with NextIndex=1 and each ack dropped by the leader, while the same snapshot transferred in 14 seconds once a leader using sync IO took over. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix snapshot install livelock with background snapshot IO#118

Fix snapshot install livelock with background snapshot IO#118
antonio2368 wants to merge 1 commit into
masterfrom
fix-async-snapshot-io-livelock

antonio2368 commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antonio2368 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

antonio2368 commented Jun 10, 2026 •

edited

Loading