fix: do not cancel in-progress blob metadata recovery when gaining new shards by halfprice · Pull Request #3462 · MystenLabs/walrus

halfprice · 2026-06-12T06:55:37Z

Description

Found this issue while working on node recovery across epoch.

When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task:

skipped metadata recovery entirely,
only started syncs for the newly gained shards (the earlier epoch's shards never reached record_start_shard_sync, so they stayed at status None), and
incorrectly flipped the node status from RecoverMetadata to Active.

As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned, with no repair on restart because restart_syncs only resumes shards persisted as ActiveSync/ActiveRecover and the node status was already Active.

Fix

sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status:

When the task observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped in start_new_shard_sync), so a task that aborts its predecessor adopts the predecessor's unstarted work. This mirrors what restart_syncs already does in the RecoverMetadata branch, and that branch now reuses the same code path.
Stored shards the node does not own — for example, shards locked for transfer to another node in the same epoch change — are filtered out, so the sync task cannot clobber a LockedToMove status back to ActiveSync and start syncing a shard the node no longer owns. (This exposure previously existed in the restart_syncs RecoverMetadata branch as well, which synced all existing shard storages unconditionally.)
The RecoverMetadata -> Active transition is now only performed by the task that actually executed the metadata recovery, so a concurrent task can no longer prematurely mark the node Active.

This makes aborting the previous sync-shards task safe, because the replacement task reconstructs its work from persisted state. Note that aborting the orchestrator task never cancelled the already-running per-shard sync tasks (they are tracked separately in shard_sync_in_progress), so long-running shard transfers are unaffected.

Test plan

Two new simtests, using a new pause fail point in sync_certified_blob_metadata:

sync_shard_new_epoch_does_not_cancel_metadata_recovery: parks metadata recovery in flight, gains another shard via a second start_sync_shards call, releases the pause, and asserts that the node reaches Active with both shards fully synced and an unowned locked shard untouched. Fails on main (the first shard remains at status None); passes with this fix.
sync_shard_recover_metadata_skips_unowned_shards: node in RecoverMetadata holds storages for shards it no longer owns in the current committee (one locked for transfer); asserts owned shards sync to Active while the locked shard keeps its LockedToMove status. Fails without the ownership filter (the locked shard is clobbered to ActiveSync and retries syncing a shard the source nodes reject); passes with this fix.

Also ran:

cargo simtest failure_injection_tests — all 41 shard-sync failure-injection simtests pass
cargo nextest run -p walrus-service sync_shard shard_sync — 30 tests pass

…w shards When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task skipped metadata recovery, only started syncs for the newly gained shards, and incorrectly flipped the node status from RecoverMetadata to Active. As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned at status None, with no repair on restart because the node status was already Active. Fix: sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status. When it observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped), so a task that aborts its predecessor adopts the predecessor's unstarted work. Stored shards the node does not own (for example, shards locked for transfer to another node in the same epoch change) are filtered out so their status is not clobbered. The RecoverMetadata -> Active transition now only happens in the task that actually performed the metadata recovery. Also adds a pause fail point in sync_certified_blob_metadata and simtests that reproduce both scenarios.

sadhansood · 2026-06-15T17:59:52Z

-            .node
-            .storage
-            .node_status()
-            .expect("failed to read node status from db");


@halfprice i think this is problematic - i did a quick lookup and it seems like it is possible that node can go into all kinds of state without consulting its previous state i.e. continue_event_stream can take the node into RecoveryCatchUpWithIncompleteHistory, enter_recovery_mode can take it into RecoveryCatchup, and if node stops being a part of the committee it enters into StandBy during epoch change. So i think we need to read the node_status from db again before doing if node_status == NodeStatus::RecoverMetadata.

Great point. This also reminds me of a broader question beyond this PR's scope. Beyond checking node_status again right before the compare-and-set, we could centralize all status transitions in one place that enforces the transition rules. Each call site only send requests that get serialized. Or we can use similar atomic compare-and-set mechanism to avoid this type of issue.

halfprice force-pushed the zhewu/start_sync_shards_bug branch from c0a102c to 44f418c Compare June 12, 2026 06:57

halfprice force-pushed the zhewu/start_sync_shards_bug branch from 44f418c to d42d2a3 Compare June 12, 2026 07:34

halfprice requested review from sadhansood and shuowang12 June 12, 2026 07:42

sadhansood reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462

fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462
halfprice wants to merge 1 commit into
mainfrom
zhewu/start_sync_shards_bug

halfprice commented Jun 12, 2026 •

edited

Loading

Uh oh!

sadhansood Jun 15, 2026

Uh oh!

shuowang12 Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

halfprice commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Fix

Test plan

Uh oh!

sadhansood Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

shuowang12 Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

halfprice commented Jun 12, 2026 •

edited

Loading