
fix: 🐛 Propagate errors during block processing#665

Draft
snowmead wants to merge 11 commits into main from fix/block-processing-error-propagation

Conversation

Contributor

@snowmead snowmead commented Jan 27, 2026

Summary

Ensures that failed finalized blocks are not marked as processed, allowing them to be retried automatically.

Problem

Previously, when block processing encountered errors (such as runtime API failures or event fetching failures), the code would log the error but continue processing. This caused several issues:

  • Failed blocks were marked as processed - Block trackers were updated even when processing failed, meaning failed blocks would never be retried.
  • Errors were silently swallowed - Some functions treated "bucket not managed by this MSP" and "runtime API call failed" the same way, making it impossible to distinguish recoverable scenarios from actual failures.
  • Intermediate finalized blocks weren't tracked individually - When processing a batch of finalized blocks (e.g., blocks 5 through 10), if block 7 failed, there was no way to avoid reprocessing blocks 5 and 6 on retry.
  • Forest root change failures were ignored - During sync and block import, if applying forest root mutations failed, processing continued anyway, potentially leaving the local state inconsistent.

Solution

  • Block processing functions now return errors instead of silently continuing, allowing callers to decide how to handle failures.
  • Block trackers are only updated after successful processing. When processing a batch of finalized blocks, each successful block is tracked individually, so only the failed block and later blocks need to be retried.
  • Runtime API failures are now properly distinguished from expected scenarios like "bucket not managed by this MSP" or "bucket was deleted".
  • Forest root change processing now propagates errors through the entire chain.
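The per-block tracker update described above can be sketched roughly as follows. All names here are hypothetical stand-ins (the real client updates `last_finalised_block_processed`); the point is that the tracker only advances after a block succeeds, so a failure at block 7 in a 5–10 batch leaves the tracker at 6 and only 7 onward is retried:

```rust
// Hypothetical stand-in for real block processing, which can fail
// (e.g. on a runtime API error). `failing` marks the block that errors.
fn process_block(n: u64, failing: u64) -> Result<(), String> {
    if n == failing {
        Err(format!("runtime API failure at block {n}"))
    } else {
        Ok(())
    }
}

// Process a batch of finalized blocks, advancing the tracker only after
// each successful block. A failure at block `k` leaves the tracker at
// `k - 1`, so earlier blocks are not reprocessed on retry.
fn process_finalized_batch(
    from: u64,
    to: u64,
    last_processed: &mut u64,
    failing: u64,
) -> Result<(), String> {
    for n in from..=to {
        process_block(n, failing)?;
        *last_processed = n; // only updated on success
    }
    Ok(())
}
```

With a batch of blocks 5 through 10 and a failure at 7, the tracker ends at 6; a later retry starting at 7 completes the batch.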

snowmead and others added 2 commits January 26, 2026 12:49
…o enable retry on restart

- Change process_sync_block, process_finality_events, and related functions to return Result
- Update validate_bucket_mutations_for_msp to return Result<bool> with proper error handling
- Update last_finalised_block_processed after each successful intermediate block
- Handle QueryMspIdOfBucketIdError::BucketNotFound as valid (deleted bucket) vs InternalError
- Blocks that fail processing are not marked as processed, ensuring retry on restart

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
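The `BucketNotFound` vs `InternalError` distinction in the commit above can be sketched like this. The types and signatures here are illustrative stand-ins, not the client's actual API — the real `QueryMspIdOfBucketIdError` lives in the runtime API crate:

```rust
// Hypothetical stand-in for the runtime API error type.
#[derive(Debug)]
enum QueryMspIdOfBucketIdError {
    BucketNotFound,
    InternalError,
}

#[derive(Debug, PartialEq)]
enum BucketCheck {
    ManagedByUs,
    NotOurs, // bucket exists but belongs to another MSP
    Deleted, // BucketNotFound: treated as valid (deleted bucket)
}

// Distinguish expected outcomes (not ours / deleted) from real failures:
// only InternalError propagates as an error to the caller.
fn classify(
    result: Result<Option<u64>, QueryMspIdOfBucketIdError>,
    our_msp_id: u64,
) -> Result<BucketCheck, String> {
    match result {
        Ok(Some(id)) if id == our_msp_id => Ok(BucketCheck::ManagedByUs),
        Ok(_) => Ok(BucketCheck::NotOurs),
        Err(QueryMspIdOfBucketIdError::BucketNotFound) => Ok(BucketCheck::Deleted),
        Err(e) => Err(format!("runtime API call failed: {e:?}")),
    }
}
```

This is the shape of the fix: two conditions that were previously collapsed into one "log and continue" path now map to distinct, recoverable `Ok` variants, while genuine runtime failures become `Err`.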
@snowmead snowmead added the B5-clientnoteworthy (changes should be mentioned in client-related release notes), D2-noauditneeded 🙈 (PR doesn't need to be audited), and not-breaking (does not need to be mentioned in breaking changes) labels Jan 27, 2026
@snowmead snowmead marked this pull request as ready for review January 27, 2026 21:25
@snowmead snowmead requested review from TDemeco and ffarall January 27, 2026 21:25
snowmead and others added 5 commits January 27, 2026 16:40
… catchup

  Update `process_sync_reorg` and `catch_up_missed_blocks` to return
  `Result<(), anyhow::Error>` and propagate errors instead of silently
  logging and continuing.

  Previously, if finality event processing failed during sync reorg or
  startup catchup, the error was logged but processing continued, which
  could mark failed blocks as processed. Now errors are propagated and
  block trackers are not updated on failure, ensuring failed finalized
  blocks can be retried on the next finality notification or restart.

  Update forest root change processing to return Result and propagate
  errors instead of silently logging and continuing.
  - bsp/msp_process_forest_root_changing_events now return Result
  - apply_forest_root_changes now returns Result and propagates errors
  - forest_root_changes_catchup now returns Result and propagates errors
  - bsp/msp_init_block_processing now return Result
  - Call sites handle errors gracefully, logging and continuing
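A minimal sketch of the propagation chain this commit describes, with function names mirroring the commit message and plain `String` errors standing in for `anyhow::Error` so the snippet is self-contained:

```rust
// Lowest level: applying forest root changes can fail.
fn apply_forest_root_changes(fail: bool) -> Result<(), String> {
    if fail { Err("root mismatch".into()) } else { Ok(()) }
}

// Middle layer: adds context instead of logging and swallowing the error.
fn msp_process_forest_root_changing_events(fail: bool) -> Result<(), String> {
    apply_forest_root_changes(fail)
        .map_err(|e| format!("forest root change failed: {e}"))
}

// Top of the chain: errors bubble up with `?` to the call site,
// which decides whether to retry, log, or abort.
fn msp_init_block_processing(fail: bool) -> Result<(), String> {
    msp_process_forest_root_changing_events(fail)?;
    Ok(())
}
```

The design choice is that the decision of what to do with a failure moves up to the call site, rather than each layer independently logging and continuing.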
error!(target: LOG_TARGET, "CRITICAL ❗️❗️ Failed to apply mutations and verify root for Bucket [{:?}]. \nError: {:?}", bucket_id, e);
return;
};
self.apply_forest_mutations_and_verify_root(
Collaborator


This change is worth discussing. If we push this to the deployed nodes right now and there's a single bucket whose root is corrupted, the whole MSP will stop working.

We know for a fact that there are several of those right now, so this change will block the MSP.

Having this "fail but continue" policy so far has allowed us to remain operational even after errors. I would be OK with eventually introducing this change, if/when we reach a point where forest root errors/desynchronisation RARELY happen. But not right now.

Contributor Author


I agree, I added a TODO and made it log and continue for now.

What about the BSP? I don't think we should continue and ignore the error in this case.

// Apply forest root changes to the BSP's Forest Storage.
// At this point, we only apply the mutation of this file and its metadata to the Forest of this BSP,
// and not to the File Storage.
// This is because if in a future block built on top of this one, the BSP needs to provide
// a proof, it will be against the Forest root with this change applied.
// For file deletions, we will remove the file from the File Storage only after finality is reached.
// This gives us the opportunity to put the file back in the Forest if this block is re-orged.
let current_forest_key = CURRENT_FOREST_KEY.to_vec();
self.apply_forest_mutations_and_verify_root(
    current_forest_key,
    &mutations,
    revert,
    old_root,
    new_root,
)
.await
.map_err(|e| {
    let err_msg = format!("CRITICAL ❗️❗️ Failed to apply mutations and verify root for BSP [{:?}]. \nError: {:?}", provider_id, e);
    error!(target: LOG_TARGET, "{}", err_msg);
    anyhow::anyhow!(err_msg)
})?;

…ng error

A single corrupted bucket root should not block the entire MSP from
processing other buckets.
# Conflicts:
#	client/blockchain-service/src/handler_msp.rs
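That interim MSP policy — skip the bucket whose root fails, keep processing the rest — might look like this sketch (function names and the failure condition are hypothetical, not the client's code):

```rust
// Hypothetical per-bucket application: bucket 7 stands in for one with a
// corrupted root.
fn apply_mutations_for_bucket(bucket_id: u32) -> Result<(), String> {
    if bucket_id == 7 { Err("corrupted root".into()) } else { Ok(()) }
}

// Interim MSP policy: a corrupted root for one bucket is logged and
// skipped so the remaining buckets keep being processed, unlike the BSP
// path, which propagates the error with `?`.
fn apply_bucket_root_changes(bucket_ids: &[u32]) -> usize {
    let mut applied = 0;
    for bucket_id in bucket_ids {
        // TODO: propagate once forest-root desyncs become rare.
        match apply_mutations_for_bucket(*bucket_id) {
            Ok(()) => applied += 1,
            Err(e) => {
                eprintln!("CRITICAL: failed to apply mutations for Bucket [{bucket_id}]: {e}");
                continue;
            }
        }
    }
    applied
}
```

One corrupted bucket then costs only that bucket, not the whole MSP, at the price of needing a separate mechanism to revisit the skipped bucket later.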
@santikaplan santikaplan requested a review from ffarall February 13, 2026 13:12
Comment on lines +1994 to +1998
error!(
    target: LOG_TARGET,
    "Failed to process BSP forest root changes for block #{}: {:?}",
    block_number, e
);
Collaborator


This log is misplaced, I believe. Here you're calling bsp_init_block_processing inside process_block_import; it doesn't seem intuitive that, whatever error you get, you immediately assume "Failed to process BSP forest root changes".

@ffarall
Collaborator

ffarall commented Feb 13, 2026

I'm not sure I want to push forward with this change. Indexers absolutely should not move forward past a failed block. But the fact that MSPs/BSPs don't get stuck trying to process a block can be a feature, not a bug. So far that has been the case for us: it has been beneficial that they don't get stuck when we faced errors due to our own bugs.

Anyways, worth discussing.

@snowmead snowmead marked this pull request as draft February 18, 2026 20:05
