Skip to content

fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462

Open
halfprice wants to merge 1 commit into
mainfrom
zhewu/start_sync_shards_bug
Open

fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462
halfprice wants to merge 1 commit into
mainfrom
zhewu/start_sync_shards_bug

Conversation

@halfprice

@halfprice halfprice commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Description

Found this issue while working on node recovery across epoch.

When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task:

  • skipped metadata recovery entirely,
  • only started syncs for the newly gained shards (the earlier epoch's shards never reached record_start_shard_sync, so they stayed at status None), and
  • incorrectly flipped the node status from RecoverMetadata to Active.

As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned, with no repair on restart because restart_syncs only resumes shards persisted as ActiveSync/ActiveRecover and the node status was already Active.

Fix

sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status:

  • When the task observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped in start_new_shard_sync), so a task that aborts its predecessor adopts the predecessor's unstarted work. This mirrors what restart_syncs already does in the RecoverMetadata branch, and that branch now reuses the same code path.
  • Stored shards the node does not own — for example, shards locked for transfer to another node in the same epoch change — are filtered out, so the sync task cannot clobber a LockedToMove status back to ActiveSync and start syncing a shard the node no longer owns. (This exposure previously existed in the restart_syncs RecoverMetadata branch as well, which synced all existing shard storages unconditionally.)
  • The RecoverMetadata -> Active transition is now only performed by the task that actually executed the metadata recovery, so a concurrent task can no longer prematurely mark the node Active.

This makes aborting the previous sync-shards task safe, because the replacement task reconstructs its work from persisted state. Note that aborting the orchestrator task never cancelled the already-running per-shard sync tasks (they are tracked separately in shard_sync_in_progress), so long-running shard transfers are unaffected.

Test plan

Two new simtests, using a new pause fail point in sync_certified_blob_metadata:

  • sync_shard_new_epoch_does_not_cancel_metadata_recovery: parks metadata recovery in flight, gains another shard via a second start_sync_shards call, releases the pause, and asserts that the node reaches Active with both shards fully synced and an unowned locked shard untouched. Fails on main (the first shard remains at status None); passes with this fix.
  • sync_shard_recover_metadata_skips_unowned_shards: node in RecoverMetadata holds storages for shards it no longer owns in the current committee (one locked for transfer); asserts owned shards sync to Active while the locked shard keeps its LockedToMove status. Fails without the ownership filter (the locked shard is clobbered to ActiveSync and retries syncing a shard the source nodes reject); passes with this fix.

Also ran:

  • cargo simtest failure_injection_tests — all 41 shard-sync failure-injection simtests pass
  • cargo nextest run -p walrus-service sync_shard shard_sync — 30 tests pass

@halfprice halfprice force-pushed the zhewu/start_sync_shards_bug branch from c0a102c to 44f418c Compare June 12, 2026 06:57
…w shards

When a node joins the committee, it enters RecoverMetadata status and starts a
sync-shards task that first recovers all certified blob metadata before
starting the individual shard syncs. If metadata recovery takes longer than an
epoch and the node gains another shard in a subsequent epoch, the new
start_sync_shards call aborts the in-flight task. Since the node is no longer
"newly joining", the replacement task skipped metadata recovery, only started
syncs for the newly gained shards, and incorrectly flipped the node status
from RecoverMetadata to Active. As a result, metadata recovery was silently
lost and the shards gained in the earlier epoch were orphaned at status None,
with no repair on restart because the node status was already Active.

Fix: sync_shards_task no longer takes recover_metadata from the caller and
instead derives it from the persisted node status. When it observes
RecoverMetadata, it runs the metadata recovery and then starts syncs for all
shards that the node owns in the current committee and stores locally
(already-active shards are skipped), so a task that aborts its predecessor
adopts the predecessor's unstarted work. Stored shards the node does not own
(for example, shards locked for transfer to another node in the same epoch
change) are filtered out so their status is not clobbered. The
RecoverMetadata -> Active transition now only happens in the task that
actually performed the metadata recovery.

Also adds a pause fail point in sync_certified_blob_metadata and simtests
that reproduce both scenarios.
@halfprice halfprice force-pushed the zhewu/start_sync_shards_bug branch from 44f418c to d42d2a3 Compare June 12, 2026 07:34
.node
.storage
.node_status()
.expect("failed to read node status from db");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@halfprice i think this is problematic - i did a quick lookup and it seems like it is possible that node can go into all kinds of state without consulting its previous state i.e. continue_event_stream can take the node into RecoveryCatchUpWithIncompleteHistory, enter_recovery_mode can take it into RecoveryCatchup, and if node stops being a part of the committee it enters into StandBy during epoch change. So i think we need to read the node_status from db again before doing if node_status == NodeStatus::RecoverMetadata.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. This also reminds me of a broader question beyond this PR's scope. Beyond checking node_status again right before the compare-and-set, we could centralize all status transitions in one place that enforces the transition rules. Each call site only send requests that get serialized. Or we can use similar atomic compare-and-set mechanism to avoid this type of issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants