fix: 🐛 Propagate errors during block processing #665
Conversation
…o enable retry on restart

- Change `process_sync_block`, `process_finality_events`, and related functions to return `Result`
- Update `validate_bucket_mutations_for_msp` to return `Result<bool>` with proper error handling
- Update `last_finalised_block_processed` after each successful intermediate block
- Handle `QueryMspIdOfBucketIdError::BucketNotFound` as valid (deleted bucket) vs `InternalError`
- Blocks that fail processing are not marked as processed, ensuring retry on restart

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
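The `BucketNotFound`-vs-`InternalError` distinction in that commit can be sketched as follows. Everything here is a simplified stand-in, not the actual StorageHub types: the real `validate_bucket_mutations_for_msp` takes a bucket ID and queries the runtime API, and the real error type carries more variants.

```rust
// Hypothetical stand-in for the runtime API error described in the commit.
#[derive(Debug, PartialEq)]
enum QueryMspIdOfBucketIdError {
    BucketNotFound,
    InternalError(String),
}

// Sketch of the new contract: return a Result instead of swallowing errors.
// BucketNotFound means the bucket was deleted, so the mutation is still valid;
// an internal error is propagated so the block is not marked as processed
// and can be retried on restart.
fn validate_bucket_mutations_for_msp(
    query_result: Result<u64, QueryMspIdOfBucketIdError>,
    own_msp_id: u64,
) -> Result<bool, String> {
    match query_result {
        Ok(msp_id) => Ok(msp_id == own_msp_id),
        // Deleted bucket: treated as valid, not as a failure.
        Err(QueryMspIdOfBucketIdError::BucketNotFound) => Ok(true),
        // Real failure: propagate so the caller retries the block later.
        Err(QueryMspIdOfBucketIdError::InternalError(e)) => Err(e),
    }
}
```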
… catchup

Update `process_sync_reorg` and `catch_up_missed_blocks` to return `Result<(), anyhow::Error>` and propagate errors instead of silently logging and continuing.

Previously, if finality event processing failed during sync reorg or startup catchup, the error was logged but processing continued, which could mark failed blocks as processed. Now errors are propagated and block trackers are not updated on failure, ensuring failed finalized blocks can be retried on the next finality notification or restart.
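The retry guarantee described above boils down to only advancing the block tracker after a block is processed successfully. A minimal sketch, with illustrative names (`BlockTracker` and the closure-based `process` are assumptions; the real functions return `anyhow::Error` and work over finality notifications):

```rust
// Illustrative stand-in for the service's block tracker state.
struct BlockTracker {
    last_finalised_block_processed: u64,
}

// Process finalized blocks up to `up_to`, advancing the tracker only after
// each success. On failure the `?` returns early and leaves the tracker
// untouched, so the failed block (and everything after it) is retried on
// the next finality notification or restart.
fn process_finalised_blocks<F>(
    tracker: &mut BlockTracker,
    up_to: u64,
    mut process: F,
) -> Result<(), String>
where
    F: FnMut(u64) -> Result<(), String>,
{
    for block in (tracker.last_finalised_block_processed + 1)..=up_to {
        process(block)?; // propagate instead of logging and continuing
        tracker.last_finalised_block_processed = block;
    }
    Ok(())
}
```

Re-running the same loop after a failure picks up exactly where processing stopped, which is the behaviour the commit describes for restart.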
Update forest root change processing to return `Result` and propagate errors instead of silently logging and continuing.

- `bsp`/`msp_process_forest_root_changing_events` now return `Result`
- `apply_forest_root_changes` now returns `Result` and propagates errors
- `forest_root_changes_catchup` now returns `Result` and propagates errors
- `bsp`/`msp_init_block_processing` now return `Result`
- Call sites handle errors gracefully, logging and continuing
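The "call sites log and continue" pattern from the last bullet can be sketched like this. The function names and error bodies here are illustrative, not the actual handler code; the real call sites use the `error!(target: LOG_TARGET, ...)` macro rather than `eprintln!`:

```rust
// Illustrative inner function that now returns a Result instead of
// logging internally and returning ().
fn apply_forest_root_changes(block_number: u64) -> Result<(), String> {
    if block_number == 0 {
        return Err("no forest root changes for genesis".to_string());
    }
    Ok(())
}

// Sketch of a non-critical call site: the error is surfaced by the inner
// function, but this caller logs it and keeps the service running rather
// than aborting block import. Returns whether the step succeeded.
fn on_block_import(block_number: u64) -> bool {
    match apply_forest_root_changes(block_number) {
        Ok(()) => true,
        Err(e) => {
            // Real code: error!(target: LOG_TARGET, "...", block_number, e);
            eprintln!("Failed to apply forest root changes for block #{block_number}: {e}");
            false
        }
    }
}
```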
```rust
error!(target: LOG_TARGET, "CRITICAL ❗️❗️ Failed to apply mutations and verify root for Bucket [{:?}]. \nError: {:?}", bucket_id, e);
return;
};
self.apply_forest_mutations_and_verify_root(
```
This change is worth discussing. If we push this to the deployed nodes right now and there's a single bucket whose root is corrupted, the whole MSP will stop working.
We know for a fact that there are several of those right now, so this change will block the MSP.
Having this "fail but continue" policy so far has allowed us to remain operational even after errors. I would be OK with eventually introducing this change, if/when we reach a point where forest root errors/desynchronisation RARELY happens. But not right now.
I agree, I added a TODO and made it log and continue for now.
What about the BSP? I don't think we should continue and ignore the error in that case.
storage-hub/client/blockchain-service/src/handler_bsp.rs
Lines 556 to 576 in fd6f4c9
…ng error

A single corrupted bucket root should not block the entire MSP from processing other buckets.
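The per-bucket isolation that commit describes can be sketched as a loop that logs and skips a failing bucket instead of aborting. Everything here is illustrative (bucket IDs as `u32`, a fake failure for bucket 2); the real code works over forest roots and uses the `error!` macro:

```rust
// Illustrative stand-in: applying the root change for one bucket.
// Bucket 2 simulates a corrupted root.
fn apply_bucket_root_change(bucket_id: u32) -> Result<(), String> {
    if bucket_id == 2 {
        return Err("corrupted bucket root".to_string());
    }
    Ok(())
}

// Process every bucket; a failure in one bucket is logged and skipped so
// the remaining buckets are still processed. Returns the buckets that
// were applied successfully.
fn process_all_buckets(bucket_ids: &[u32]) -> Vec<u32> {
    let mut processed = Vec::new();
    for &bucket_id in bucket_ids {
        match apply_bucket_root_change(bucket_id) {
            Ok(()) => processed.push(bucket_id),
            // TODO (per the review discussion): propagate this once forest
            // root desynchronisation rarely happens in deployed nodes.
            Err(e) => eprintln!("Failed to apply root change for bucket {bucket_id}: {e}"),
        }
    }
    processed
}
```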
```
# Conflicts:
#	client/blockchain-service/src/handler_msp.rs
```
```rust
error!(
    target: LOG_TARGET,
    "Failed to process BSP forest root changes for block #{}: {:?}",
    block_number, e
);
```
This log is misplaced, I believe. Here you're calling `bsp_init_block_processing` inside of `process_block_import`. It doesn't seem intuitive that, whatever error you get, you immediately assume "Failed to process BSP forest root changes".
I'm not sure I want to push forward with this change. The fact that MSPs/BSPs don't get stuck on one block if there's an error can be a benefit. Indexers, on the other hand, absolutely should not move forward. But the fact that MSPs/BSPs don't get stuck trying to process a block can be a feature, not a bug. In fact, that has been the case for us so far: it has been beneficial that they don't get stuck when we faced errors due to our own bugs. Anyways, worth discussing.
Summary
Ensures that failed finalized blocks are not marked as processed, allowing them to be retried automatically.
Problem
Previously, when block processing encountered errors (such as runtime API failures or event-fetching failures), the code would log the error but continue processing. As a result, failed blocks were still marked as processed and were never retried.
Solution