fix: 🐛 Propagate errors during block processing #665
Conversation
…o enable retry on restart

- Change `process_sync_block`, `process_finality_events`, and related functions to return `Result`
- Update `validate_bucket_mutations_for_msp` to return `Result<bool>` with proper error handling
- Update `last_finalised_block_processed` after each successful intermediate block
- Handle `QueryMspIdOfBucketIdError::BucketNotFound` as valid (deleted bucket) vs `InternalError`
- Blocks that fail processing are not marked as processed, ensuring retry on restart

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
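The `BucketNotFound`-vs-`InternalError` distinction in that commit can be sketched as follows. Everything here is a simplified stand-in, not the actual StorageHub types: the real `validate_bucket_mutations_for_msp` takes a bucket ID and queries the runtime API, and the real error type carries more variants.

```rust
// Hypothetical stand-in for the runtime API error described in the commit.
#[derive(Debug, PartialEq)]
enum QueryMspIdOfBucketIdError {
    BucketNotFound,
    InternalError(String),
}

// Sketch of the new contract: return a Result instead of swallowing errors.
// BucketNotFound means the bucket was deleted, so the mutation is still valid;
// an internal error is propagated so the block is not marked as processed
// and can be retried on restart.
fn validate_bucket_mutations_for_msp(
    query_result: Result<u64, QueryMspIdOfBucketIdError>,
    own_msp_id: u64,
) -> Result<bool, String> {
    match query_result {
        Ok(msp_id) => Ok(msp_id == own_msp_id),
        // Deleted bucket: treated as valid, not as a failure.
        Err(QueryMspIdOfBucketIdError::BucketNotFound) => Ok(true),
        // Real failure: propagate so the caller retries the block later.
        Err(QueryMspIdOfBucketIdError::InternalError(e)) => Err(e),
    }
}
```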
… catchup

Update `process_sync_reorg` and `catch_up_missed_blocks` to return `Result<(), anyhow::Error>` and propagate errors instead of silently logging and continuing.

Previously, if finality event processing failed during sync reorg or startup catchup, the error was logged but processing continued, which could mark failed blocks as processed. Now errors are propagated and block trackers are not updated on failure, ensuring failed finalized blocks can be retried on the next finality notification or restart.
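The retry guarantee described above boils down to only advancing the block tracker after a block is processed successfully. A minimal sketch, with illustrative names (`BlockTracker` and the closure-based `process` are assumptions; the real functions return `anyhow::Error` and work over finality notifications):

```rust
// Illustrative stand-in for the service's block tracker state.
struct BlockTracker {
    last_finalised_block_processed: u64,
}

// Process finalized blocks up to `up_to`, advancing the tracker only after
// each success. On failure the `?` returns early and leaves the tracker
// untouched, so the failed block (and everything after it) is retried on
// the next finality notification or restart.
fn process_finalised_blocks<F>(
    tracker: &mut BlockTracker,
    up_to: u64,
    mut process: F,
) -> Result<(), String>
where
    F: FnMut(u64) -> Result<(), String>,
{
    for block in (tracker.last_finalised_block_processed + 1)..=up_to {
        process(block)?; // propagate instead of logging and continuing
        tracker.last_finalised_block_processed = block;
    }
    Ok(())
}
```

Re-running the same loop after a failure picks up exactly where processing stopped, which is the behaviour the commit describes for restart.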
Update forest root change processing to return `Result` and propagate errors instead of silently logging and continuing.

- `bsp`/`msp_process_forest_root_changing_events` now return `Result`
- `apply_forest_root_changes` now returns `Result` and propagates errors
- `forest_root_changes_catchup` now returns `Result` and propagates errors
- `bsp`/`msp_init_block_processing` now return `Result`
- Call sites handle errors gracefully, logging and continuing
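The "call sites log and continue" pattern from the last bullet can be sketched like this. The function names and error bodies here are illustrative, not the actual handler code; the real call sites use the `error!(target: LOG_TARGET, ...)` macro rather than `eprintln!`:

```rust
// Illustrative inner function that now returns a Result instead of
// logging internally and returning ().
fn apply_forest_root_changes(block_number: u64) -> Result<(), String> {
    if block_number == 0 {
        return Err("no forest root changes for genesis".to_string());
    }
    Ok(())
}

// Sketch of a non-critical call site: the error is surfaced by the inner
// function, but this caller logs it and keeps the service running rather
// than aborting block import. Returns whether the step succeeded.
fn on_block_import(block_number: u64) -> bool {
    match apply_forest_root_changes(block_number) {
        Ok(()) => true,
        Err(e) => {
            // Real code: error!(target: LOG_TARGET, "...", block_number, e);
            eprintln!("Failed to apply forest root changes for block #{block_number}: {e}");
            false
        }
    }
}
```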
```rust
error!(target: LOG_TARGET, "CRITICAL ❗️❗️ Failed to apply mutations and verify root for Bucket [{:?}]. \nError: {:?}", bucket_id, e);
return;
};
self.apply_forest_mutations_and_verify_root(
```
This change is worth discussing. If we push this to the deployed nodes right now and there's a single bucket whose root is corrupted, the whole MSP will stop working.
We know for a fact that there are several of those right now, so this change will block the MSP.
Having this "fail but continue" policy so far has allowed us to remain operational even after errors. I would be OK with eventually introducing this change, if/when we reach a point where forest root errors/desynchronisation RARELY happens. But not right now.
I agree, I added a TODO and made it log and continue for now.
What about the BSP? I don't think we should continue and ignore the error in that case.
storage-hub/client/blockchain-service/src/handler_bsp.rs
Lines 556 to 576 in fd6f4c9
…ng error

A single corrupted bucket root should not block the entire MSP from processing other buckets.
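The per-bucket isolation that commit describes can be sketched as a loop that logs and skips a failing bucket instead of aborting. Everything here is illustrative (bucket IDs as `u32`, a fake failure for bucket 2); the real code works over forest roots and uses the `error!` macro:

```rust
// Illustrative stand-in: applying the root change for one bucket.
// Bucket 2 simulates a corrupted root.
fn apply_bucket_root_change(bucket_id: u32) -> Result<(), String> {
    if bucket_id == 2 {
        return Err("corrupted bucket root".to_string());
    }
    Ok(())
}

// Process every bucket; a failure in one bucket is logged and skipped so
// the remaining buckets are still processed. Returns the buckets that
// were applied successfully.
fn process_all_buckets(bucket_ids: &[u32]) -> Vec<u32> {
    let mut processed = Vec::new();
    for &bucket_id in bucket_ids {
        match apply_bucket_root_change(bucket_id) {
            Ok(()) => processed.push(bucket_id),
            // TODO (per the review discussion): propagate this once forest
            // root desynchronisation rarely happens in deployed nodes.
            Err(e) => eprintln!("Failed to apply root change for bucket {bucket_id}: {e}"),
        }
    }
    processed
}
```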
```
# Conflicts:
#	client/blockchain-service/src/handler_msp.rs
```
```rust
error!(
    target: LOG_TARGET,
    "Failed to process BSP forest root changes for block #{}: {:?}",
    block_number, e
);
```
This log is misplaced, I believe. Here you're calling `bsp_init_block_processing` inside of `process_block_import`. It doesn't seem intuitive that, whatever error you get, you immediately assume "Failed to process BSP forest root changes".
I'm not sure I want to push forward with this change. The fact that MSPs/BSPs don't get stuck on one block if there's an error can be a benefit. Indexers, on the other hand, absolutely should not move forward. But the fact that MSPs/BSPs don't get stuck trying to process a block can be a feature, not a bug. In fact, that has been the case for us so far: it has been beneficial that they don't get stuck when we faced errors due to our own bugs. Anyways, worth discussing.
Summary
Ensures that failed finalized blocks are not marked as processed, allowing them to be retried automatically.
Problem
Previously, when block processing encountered errors (such as runtime API failures or event-fetching failures), the code would log the error but continue processing. As a result, failed blocks were still marked as processed and were never retried.
Solution