fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462
fix: do not cancel in-progress blob metadata recovery when gaining new shards#3462halfprice wants to merge 1 commit into
Conversation
c0a102c to
44f418c
Compare
…w shards When a node joins the committee, it enters RecoverMetadata status and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the new start_sync_shards call aborts the in-flight task. Since the node is no longer "newly joining", the replacement task skipped metadata recovery, only started syncs for the newly gained shards, and incorrectly flipped the node status from RecoverMetadata to Active. As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned at status None, with no repair on restart because the node status was already Active. Fix: sync_shards_task no longer takes recover_metadata from the caller and instead derives it from the persisted node status. When it observes RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped), so a task that aborts its predecessor adopts the predecessor's unstarted work. Stored shards the node does not own (for example, shards locked for transfer to another node in the same epoch change) are filtered out so their status is not clobbered. The RecoverMetadata -> Active transition now only happens in the task that actually performed the metadata recovery. Also adds a pause fail point in sync_certified_blob_metadata and simtests that reproduce both scenarios.
44f418c to
d42d2a3
Compare
| .node | ||
| .storage | ||
| .node_status() | ||
| .expect("failed to read node status from db"); |
There was a problem hiding this comment.
@halfprice i think this is problematic - i did a quick lookup and it seems like it is possible that node can go into all kinds of state without consulting its previous state i.e. continue_event_stream can take the node into RecoveryCatchUpWithIncompleteHistory, enter_recovery_mode can take it into RecoveryCatchup, and if node stops being a part of the committee it enters into StandBy during epoch change. So i think we need to read the node_status from db again before doing if node_status == NodeStatus::RecoverMetadata.
There was a problem hiding this comment.
Great point. This also reminds me of a broader question beyond this PR's scope. Beyond checking node_status again right before the compare-and-set, we could centralize all status transitions in one place that enforces the transition rules. Each call site only send requests that get serialized. Or we can use similar atomic compare-and-set mechanism to avoid this type of issue.
Description
Found this issue while working on node recovery across epoch.
When a node joins the committee, it enters
RecoverMetadatastatus and starts a sync-shards task that first recovers all certified blob metadata before starting the individual shard syncs. If metadata recovery takes longer than an epoch and the node gains another shard in a subsequent epoch, the newstart_sync_shardscall aborts the in-flight task. Since the node is no longer "newly joining", the replacement task:record_start_shard_sync, so they stayed at statusNone), andRecoverMetadatatoActive.As a result, metadata recovery was silently lost and the shards gained in the earlier epoch were orphaned, with no repair on restart because
restart_syncsonly resumes shards persisted asActiveSync/ActiveRecoverand the node status was alreadyActive.Fix
sync_shards_taskno longer takesrecover_metadatafrom the caller and instead derives it from the persisted node status:RecoverMetadata, it runs the metadata recovery and then starts syncs for all shards that the node owns in the current committee and stores locally (already-active shards are skipped instart_new_shard_sync), so a task that aborts its predecessor adopts the predecessor's unstarted work. This mirrors whatrestart_syncsalready does in theRecoverMetadatabranch, and that branch now reuses the same code path.LockedToMovestatus back toActiveSyncand start syncing a shard the node no longer owns. (This exposure previously existed in therestart_syncsRecoverMetadatabranch as well, which synced all existing shard storages unconditionally.)RecoverMetadata -> Activetransition is now only performed by the task that actually executed the metadata recovery, so a concurrent task can no longer prematurely mark the nodeActive.This makes aborting the previous sync-shards task safe, because the replacement task reconstructs its work from persisted state. Note that aborting the orchestrator task never cancelled the already-running per-shard sync tasks (they are tracked separately in
shard_sync_in_progress), so long-running shard transfers are unaffected.Test plan
Two new simtests, using a new pause fail point in
sync_certified_blob_metadata:sync_shard_new_epoch_does_not_cancel_metadata_recovery: parks metadata recovery in flight, gains another shard via a secondstart_sync_shardscall, releases the pause, and asserts that the node reachesActivewith both shards fully synced and an unowned locked shard untouched. Fails onmain(the first shard remains at statusNone); passes with this fix.sync_shard_recover_metadata_skips_unowned_shards: node inRecoverMetadataholds storages for shards it no longer owns in the current committee (one locked for transfer); asserts owned shards sync toActivewhile the locked shard keeps itsLockedToMovestatus. Fails without the ownership filter (the locked shard is clobbered toActiveSyncand retries syncing a shard the source nodes reject); passes with this fix.Also ran:
cargo simtest failure_injection_tests— all 41 shard-sync failure-injection simtests passcargo nextest run -p walrus-service sync_shard shard_sync— 30 tests pass