Skip to content

Cumulus v3: find_parent() adjustments#12296

Open
serban300 wants to merge 5 commits into
paritytech:masterfrom
serban300:cumulus-v3-find-parent
Open

Cumulus v3: find_parent() adjustments#12296
serban300 wants to merge 5 commits into
paritytech:masterfrom
serban300:cumulus-v3-find-parent

Conversation

@serban300

Copy link
Copy Markdown
Contributor

Related to #11624

Implements the changes discussed in #11624 .

The plan is to add unit & integration tests in follow-up PRs because:

  • incremental changes are easier to review
  • the V2 logic is not changed and the V3 logic is not used yet
  • it's good to have all the pieces in place as soon as possible for E2E testing

@serban300 serban300 self-assigned this Jun 8, 2026
@serban300 serban300 added the T9-cumulus This PR/Issue is related to cumulus. label Jun 8, 2026
serban300 added 4 commits June 8, 2026 11:32
- avoid querying the relay_client for detecting session changes
- reorder some arguments
- avoid passing the ancestry_lookback as an argument
  (read it from the relay_client)
@serban300 serban300 force-pushed the cumulus-v3-find-parent branch from f0cfc16 to ef77928 Compare June 8, 2026 08:41
@serban300 serban300 requested review from alindima and iulianbarbu June 8, 2026 09:25

@eskimor eskimor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick first pass. I am not sure yet personally on correctness, I find the code a bit confusing - left a few suggestions for improvements.


let included_hash = included_header.hash();
// If the included block is not locally known, we can't do anything.
let Some(_) = get_para_header(backend, included_hash) else {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is this a concern of this function? It is a bit weird to do an unrelated validation, where we are not even interested in the result.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need this validation, and we need it both when we fetch the included header at relay parent and at scheduling parent, I think it's good to do it here. Otherwise we would duplicate it.

@@ -93,35 +100,27 @@ pub async fn find_parent_for_building<B: BlockT>(
// `OccupiedCoreAssumption::Included` so the candidate pending availability gets enacted
// before being returned to us.
let pending_header = relay_client

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit: I find it confusing that we use a helper for the included block, but none for the pending one, despite both fetches being almost 100% the same code. This either warrants a helper or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading a function, ideally the reader stays at one abstraction layer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to reorganize the logic. PTAL !

break None;
};
if current_hash == para_best_hash {
para_best_header = Some(current_header.clone());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just initialize para_best_header before the loop already and make it immutable?

This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree about the simplification ideas.

moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?

In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.

My intention was to avoid duplication as much as possible. Every bit of logic that we get outside of the loop will be duplicated logic, because it's needed inside the loop as well.

Moved the para_best_header initialization outside the loop and also added some comments and did some small adjustments. Please let me know if it is easier to follow now.

Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

Apart from the code duplication, another problem with moving has_ancestor_relay_parent_info outside the loop is calling it twice unnecessarily when the start + 1 == para_best_header number . That's why I would suggest to keep it as it is. But I can move it if there are strong preferences.

I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.

Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?

@alindima alindima Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?

oh right. but what if they are not. the DFS we did for v2 would have handled that, no?
I mean if the best header is temporarily on some fork that will be discarded

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are not, we return the start_header. Is that correct ? I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent

hmm why not? @iulianbarbu I think you proposed getting rid of the deepest valid parent search.

Comment thread cumulus/client/consensus/common/src/parent_search.rs
@skunert skunert self-requested a review June 9, 2026 08:20

@iulianbarbu iulianbarbu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, delta the comments from @eskimor related to get_para_header concern, or for the included/PA blocks fetching abstraction.

break None;
};
if current_hash == para_best_hash {
para_best_header = Some(current_header.clone());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just initialize para_best_header before the loop already and make it immutable?

This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

@alindima alindima left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also update the v3 throughput in zombienet tests

let scheduling_parent = *params.scheduling_parent();

let Some(ParentSearchStart { included: included_header, start: (start_hash, start_header) }) =
get_parent_search_start(relay_client, backend, para_id, scheduling_parent).await?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this function should just return the starting point. the included header if it's needed here should be retrieved here before the call to get_parent_search_start and then passed to it also as a param

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But it's probably not worth having this function now. Removed.

// Determine the starting point for the search.
let (start_hash, start_header) = match &maybe_pending {
Some((pending_header, pending_hash)) => {
Some((pending_hash, pending_header)) => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a needless check IMO. it guards against a potential severe bug of the relay chain (backing an invalid chain of candidates)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


// Search for the deepest valid parent starting from the pending/included block.
let best_parent_header =
find_deepest_valid_parent(backend, start_header, start_hash, &rp_ancestry);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name does not make it obvious that this is v2-specific

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't think of a good name. Any suggestion ?

let relay_parent = 'get_relay_parent: {
let digest = header.digest();

if let Some(relay_parent) = cumulus_primitives_core::extract_relay_parent(digest) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the control flow of this entire let relay_parent is hard to follow due to the usage of these named break statements.

why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?

Yes. At some point I added logs and checked this. I didn't count the exact number of occurrences, but it happened very often. I would say once in 2-3 slots there was a header that didn't have the relay parent digest and only had the storage root information.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find any place within the code that sets the relay parent digest item. Maybe it's something historical.

Although you mentioned that you also saw blocks that did have it?

@skunert do you know more about these?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although you mentioned that you also saw blocks that did have it?

Hmmm actually I didn't look specifically for that. I only put logs on the storage root path and they were triggered quite often. I was under the impression that the relay parent digest item is also present sometimes, but now that you mention it, I'm not so sure. I'll check.

Comment thread cumulus/client/consensus/common/src/parent_search.rs Outdated
break None;
};
if current_hash == para_best_hash {
para_best_header = Some(current_header.clone());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree about the simplification ideas.

moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash

let included_header = match v3_enabled {
false => parent_search_result.included_header,
true => {
let Ok(Some((_, included_header))) = fetch_included_from_relay_chain(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not at all obvious why we use a different included header here than what is returned by the find_parent for V3

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this also quite stumped me. It seems technically correct, but is far from obvious. It also feels like a hack, (ignoring the returned one, but fetching it our own ... why is the function not returning the correct one already? ..) TL;DR: This deserves some docs/comments + also some cleanup/simplification: Let the function already do the correct thing in an obviously correct way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent search result contains the included header at the scheduling parent. For the logic that follows this we need the included header at the relay parent since it will be used in can_build_upon(). Because some of the cheks done in can_build_upon() are also done by the runtime in the set_validation_data inherent. And the inherent uses the relay parent context.

Renamed this included to included_at_execution and the one in the parent search result to included_at_scheduling. Also added a comment. PTAL !

@alindima

alindima commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Let's also update the v3 throughput in zombienet tests

I don't see an improvement of the v3 throughput locally (tested with functional::scheduling_v3::scheduling_v2_and_v3_collator_with_v3_validators::case_1_non_zero_relay_parent_offset)

@serban300

serban300 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Let's also update the v3 throughput in zombienet tests

I don't see an improvement of the v3 throughput locally (tested with functional::scheduling_v3::scheduling_v2_and_v3_collator_with_v3_validators::case_1_non_zero_relay_parent_offset)

Yes, exactly. On my local machine the throughput was constantly 10 previously and now it's between 9-11 . So no change on average.

LE: I forgot to mention. With the previous approach + the cumulus test runtime fix the throughput would increase to 14

@alindima

Copy link
Copy Markdown
Contributor

With the previous approach

what do you mean by that?

on the current state of the PR, I don't get that increase in the throughput.

The purpose of this task was to fix the relay parent usage such that we manage to achieve the same throughput as with v2 (which is how I discovered the problems in the first place). As discussed, we should not be able to get to the same value as v2 without resubmissions (because we don't know the scheduling parent of candidates so that we can stop building on top of them if a session change occurs).

But the end goal of this PR would be to increase the throughput if possible and have an explanation for the remaining lag (ideally by validating the hypothesis via some temporary hack/workaround).

Otherwise, we don't know if we're fully solving the problem or not

@serban300

serban300 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

what do you mean by that?

on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

But the end goal of this PR would be to increase the throughput if possible and have an explanation for the remaining lag (ideally by validating the hypothesis via some temporary hack/workaround).

I will try to reapply your changes or something similar

@alindima

alindima commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

@iulianbarbu

iulianbarbu commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0?

// - As we are building on `RELAY_PARENT_OFFSET` old relay parents, the included block from the
// parachain is also `RELAY_PARENT_OFFSET` relay blocks older (one relay block may contains
// multiple parachain blocks).
block_processing_velocity() * (3 + relay_parent_offset())

@serban300 serban300 Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228
throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0?

Let's move this discussion to a thread for better tracking.

Another thing is that the branch with your old fixes lacked some later changes. For example later we added the logic in wait_for_scheduling_parent that is checking if the scheduling parent is in the same session as the relay tip.

Also block bundling wasn't merged yet.

@serban300 serban300 Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

If has_ancestor_relay_parent_info works correctly, than I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

If has_ancestor_relay_parent_info works correctly, than I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput

my code does this:

if the latest scheduling parent we want to pick is from a new session, it drops building on top of candidates whose scheduling session are different (going back to the included head indeed).
Without this, we keep building on top of them but they will never be backed (because they are not resubmitted with newer scheduling parents).

I just checked and the throughput is on par with v2 with relay parent offset 2 (so as good as it can get without resubmissions). This is because once we have a larger than zero relay parent offset, we also have a larger unincluded segment, so more candidates are dropped at session boundaries

I'm testing with a"max_relay_parent_session_age": 2

@alindima alindima Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I think this makse sense, but only if the rp-offset is smaller than the scheduling_lookahead.
Try setting the rp offset to 8 and the "max_relay_parent_session_age": 2. This PR with my hack should achieve better throughput

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T9-cumulus This PR/Issue is related to cumulus.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants