Cumulus v3: `find_parent()` adjustments by serban300 · Pull Request #12296 · paritytech/polkadot-sdk

serban300 · 2026-06-08T07:55:35Z

Related to #11624

Implements the changes discussed in #11624.

The plan is to add unit & integration tests in follow-up PRs because:

- incremental changes are easier to review
- the V2 logic is not changed and the V3 logic is not used yet
- it's good to have all the pieces in place as soon as possible for E2E testing

- avoid querying the relay_client for detecting session changes
- reorder some arguments
- avoid passing the ancestry_lookback as an argument (read it from the relay_client)

eskimor

Quick first pass. I am not sure yet personally on correctness, I find the code a bit confusing - left a few suggestions for improvements.

eskimor · 2026-06-08T13:55:43Z

+
+	let included_hash = included_header.hash();
+	// If the included block is not locally known, we can't do anything.
+	let Some(_) = get_para_header(backend, included_hash) else {


But is this a concern of this function? It is a bit weird to do an unrelated validation, where we are not even interested in the result.

If we need this validation, and we need it both when we fetch the included header at relay parent and at scheduling parent, I think it's good to do it here. Otherwise we would duplicate it.

eskimor · 2026-06-08T13:59:52Z

@@ -93,35 +100,27 @@ pub async fn find_parent_for_building<B: BlockT>(
 		// `OccupiedCoreAssumption::Included` so the candidate pending availability gets enacted
 		// before being returned to us.
 		let pending_header = relay_client


Super nit: I find it confusing that we use a helper for the included block, but none for the pending one, despite both fetches being almost 100% the same code. This either warrants a helper or not.

Reading a function, ideally the reader stays at one abstraction layer.

Tried to reorganize the logic. PTAL !
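
Editor's sketch of the "one helper for both fetches" idea from this thread. Everything here is a hypothetical stand-in: `Assumption` mimics `OccupiedCoreAssumption`, `RelayClient` is a synchronous toy trait rather than the real relay chain interface, and `fetch_para_head`, `fetch_heads` and `StaticClient` are names invented for illustration. It also assumes the included head is read with a timed-out assumption, while the pending head is read with `Included`, as the quoted comment describes.

```rust
/// Hypothetical stand-in for `OccupiedCoreAssumption`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Assumption {
    /// Treat the candidate pending availability as already included.
    Included,
    /// Treat the candidate pending availability as timed out.
    TimedOut,
}

/// Hypothetical, synchronous stand-in for the relay client.
pub trait RelayClient {
    fn para_head(&self, assumption: Assumption) -> Option<Vec<u8>>;
}

#[derive(Debug, PartialEq, Eq)]
pub struct ParaHeads {
    pub included: Vec<u8>,
    pub pending: Option<Vec<u8>>,
}

/// The single helper: both fetches go through the same code path and differ
/// only in the assumption they pass.
fn fetch_para_head<C: RelayClient>(client: &C, assumption: Assumption) -> Option<Vec<u8>> {
    client.para_head(assumption)
}

/// The caller stays at one abstraction level: two calls to the same helper.
pub fn fetch_heads<C: RelayClient>(client: &C) -> Option<ParaHeads> {
    let included = fetch_para_head(client, Assumption::TimedOut)?;
    // If nothing is pending, both assumptions yield the same head.
    let pending = fetch_para_head(client, Assumption::Included).filter(|head| *head != included);
    Some(ParaHeads { included, pending })
}

/// Toy client used only to exercise the sketch.
pub struct StaticClient {
    pub included: Vec<u8>,
    pub pending: Option<Vec<u8>>,
}

impl RelayClient for StaticClient {
    fn para_head(&self, assumption: Assumption) -> Option<Vec<u8>> {
        match assumption {
            Assumption::TimedOut => Some(self.included.clone()),
            Assumption::Included => {
                self.pending.clone().or_else(|| Some(self.included.clone()))
            }
        }
    }
}
```

With a `StaticClient` that has a pending head, `fetch_heads` returns both heads; without one, `pending` is `None`.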

eskimor · 2026-06-08T15:17:47Z

+					break None;
+				};
+				if current_hash == para_best_hash {
+					para_best_header = Some(current_header.clone());


Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?

In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.

Why not just initialize para_best_header before the loop already and make it immutable?

This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

agree about the simplification ideas.

moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash

I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.

Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?

In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.

My intention was to avoid duplication as much as possible. Every bit of logic that we get outside of the loop will be duplicated logic, because it's needed inside the loop as well.

Moved the para_best_header initialization outside the loop and also added some comments and did some small adjustments. Please let me know if it is easier to follow now.

Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

Apart from the code duplication, another problem with moving has_ancestor_relay_parent_info outside the loop is that it would be called twice unnecessarily when start + 1 == para_best_header number. That's why I would suggest keeping it as it is. But I can move it if there are strong preferences.

I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.

Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?

Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?

oh right. but what if they are not. the DFS we did for v2 would have handled that, no?
I mean if the best header is temporarily on some fork that will be discarded

If they are not, we return the start_header. Is that correct ? I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent

I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent

hmm why not? @iulianbarbu I think you proposed getting rid of the deepest valid parent search.
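
Editor's sketch of the fork check discussed in this thread: walk back from the para best header and return it only if the walk reaches the start header, otherwise fall back to the start header. This is a simplified illustration, not the PR's code. `Header`, `Hash` and `pick_parent` are hypothetical, headers live in an in-memory map, and the `has_ancestor_relay_parent_info` check is deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified hash type.
pub type Hash = u64;

/// Hypothetical, simplified parachain header.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Header {
    pub number: u32,
    pub parent: Hash,
}

/// Returns `best` if it descends from `start` (or is `start` itself),
/// otherwise returns `start`.
pub fn pick_parent(headers: &HashMap<Hash, Header>, start: Hash, best: Hash) -> Hash {
    let Some(start_header) = headers.get(&start) else {
        return start;
    };

    let mut current = best;
    // Bound the walk by the number of known headers so it always terminates.
    for _ in 0..=headers.len() {
        if current == start {
            // `best` sits on the same fork as `start`.
            return best;
        }
        let Some(header) = headers.get(&current) else {
            // Unknown ancestor: we cannot prove `best` descends from `start`.
            return start;
        };
        if header.number <= start_header.number {
            // We went at or below the height of `start` without meeting it,
            // so `best` is on a different fork.
            return start;
        }
        current = header.parent;
    }
    start
}
```

The open question in the thread (whether to search for the deepest valid parent instead of returning `start`) is not addressed by this sketch.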

iulianbarbu

LGTM, delta the comments from @eskimor related to get_para_header concern, or for the included/PA blocks fetching abstraction.

iulianbarbu · 2026-06-09T09:49:09Z

+					break None;
+				};
+				if current_hash == para_best_hash {
+					para_best_header = Some(current_header.clone());


Why not just initialize para_best_header before the loop already and make it immutable?

This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.

alindima

Let's also update the v3 throughput in zombienet tests

alindima · 2026-06-09T11:07:37Z

+	let scheduling_parent = *params.scheduling_parent();
+
+	let Some(ParentSearchStart { included: included_header, start: (start_hash, start_header) }) =
+		get_parent_search_start(relay_client, backend, para_id, scheduling_parent).await?


IMO this function should just return the starting point. the included header if it's needed here should be retrieved here before the call to get_parent_search_start and then passed to it also as a param

Agree. But it's probably not worth having this function now. Removed.

alindima · 2026-06-09T11:10:52Z

 	// Determine the starting point for the search.
 	let (start_hash, start_header) = match &maybe_pending {
-		Some((pending_header, pending_hash)) => {
+		Some((pending_hash, pending_header)) => {


this is a needless check IMO. it guards against a potential severe bug of the relay chain (backing an invalid chain of candidates)

alindima · 2026-06-09T11:26:26Z

+
+			// Search for the deepest valid parent starting from the pending/included block.
+			let best_parent_header =
+				find_deepest_valid_parent(backend, start_header, start_hash, &rp_ancestry);


the name does not make it obvious that this is v2-specific

Couldn't think of a good name. Any suggestion ?

alindima · 2026-06-09T11:36:56Z

+	let relay_parent = 'get_relay_parent: {
+		let digest = header.digest();
+
+		if let Some(relay_parent) = cumulus_primitives_core::extract_relay_parent(digest) {


the control flow of this entire let relay_parent is hard to follow due to the usage of these named break statements.

why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?

why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?

Yes. At some point I added logs and checked this. I didn't count the exact number of occurrences, but it happened very often. I would say once in 2-3 slots there was a header that didn't have the relay parent digest and only had the storage root information.

I could not find any place within the code that sets the relay parent digest item. Maybe it's something historical.

Although you mentioned that you also saw blocks that did have it?

@skunert do you know more about these?

Although you mentioned that you also saw blocks that did have it?

Hmmm actually I didn't look specifically for that. I only put logs on the storage root path and they were triggered quite often. I was under the impression that the relay parent digest item is also present sometimes, but now that you mention it, I'm not so sure. I'll check.
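
Editor's sketch of a flatter alternative to the labeled-block control flow criticized above, using `Option` combinators. The types are hypothetical stand-ins: `DigestItem` models only the two cases mentioned in the thread (a relay parent digest, or only the relay parent storage root information), and `hash_at` stands in for whatever lookup resolves a relay block number to a hash. None of this is the actual cumulus API.

```rust
/// Hypothetical, simplified hash type.
pub type Hash = u64;

/// Hypothetical digest items covering the two cases from the discussion.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum DigestItem {
    RelayParent(Hash),
    RelayParentStorageRoot { storage_root: Hash, number: u32 },
}

/// Returns the relay parent hash if the digest carries it directly.
fn extract_relay_parent(digest: &[DigestItem]) -> Option<Hash> {
    digest.iter().find_map(|item| match item {
        DigestItem::RelayParent(hash) => Some(*hash),
        _ => None,
    })
}

/// Returns the storage root and relay block number if present.
fn extract_storage_root(digest: &[DigestItem]) -> Option<(Hash, u32)> {
    digest.iter().find_map(|item| match item {
        DigestItem::RelayParentStorageRoot { storage_root, number } => {
            Some((*storage_root, *number))
        }
        _ => None,
    })
}

/// Tries the direct digest first, then falls back to resolving the relay
/// block number from the storage root item.
pub fn resolve_relay_parent(
    digest: &[DigestItem],
    hash_at: impl Fn(u32) -> Option<Hash>,
) -> Option<Hash> {
    extract_relay_parent(digest)
        .or_else(|| extract_storage_root(digest).and_then(|(_, number)| hash_at(number)))
}
```

The fallback order is visible in a single expression, with no named `break` needed.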

alindima · 2026-06-09T12:34:26Z

+					break None;
+				};
+				if current_hash == para_best_hash {
+					para_best_header = Some(current_header.clone());


agree about the simplification ideas.

moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash

alindima · 2026-06-09T12:38:42Z

+			let included_header = match v3_enabled {
+				false => parent_search_result.included_header,
+				true => {
+					let Ok(Some((_, included_header))) = fetch_included_from_relay_chain(


it's not at all obvious why we use a different included header here than what is returned by the find_parent for V3

Indeed, this also quite stumped me. It seems technically correct, but is far from obvious. It also feels like a hack (ignoring the returned one, but fetching our own ... why is the function not returning the correct one already?). TL;DR: This deserves some docs/comments plus some cleanup/simplification: let the function do the correct thing in an obviously correct way.

The parent search result contains the included header at the scheduling parent. For the logic that follows, we need the included header at the relay parent, since it will be used in can_build_upon(). That is because some of the checks done in can_build_upon() are also done by the runtime in the set_validation_data inherent, and the inherent uses the relay parent context.

Renamed this included to included_at_execution and the one in the parent search result to included_at_scheduling. Also added a comment. PTAL !
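
Editor's sketch of the distinction behind the rename. The struct and function names (`ParentSearchResult`, `included_for_can_build_upon`) and the closure standing in for the relay chain fetch are hypothetical; only `included_at_scheduling` and the `v3_enabled` switch come from the thread. The V3 value corresponds to what the thread calls `included_at_execution`.

```rust
/// Hypothetical, simplified header.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Header {
    pub number: u32,
}

/// Hypothetical parent search result, carrying the included header as seen
/// at the scheduling parent.
pub struct ParentSearchResult {
    pub included_at_scheduling: Header,
}

/// Picks the included header to feed into a `can_build_upon()`-style check.
///
/// For V3 the checks mirror what the runtime does in the relay parent
/// context, so the included header is fetched at the relay parent. For V2
/// the header from the parent search is used as is.
pub fn included_for_can_build_upon(
    v3_enabled: bool,
    search: &ParentSearchResult,
    fetch_included_at_relay_parent: impl FnOnce() -> Option<Header>,
) -> Option<Header> {
    if v3_enabled {
        fetch_included_at_relay_parent()
    } else {
        Some(search.included_at_scheduling.clone())
    }
}
```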

alindima · 2026-06-09T14:04:57Z

Let's also update the v3 throughput in zombienet tests

I don't see an improvement of the v3 throughput locally (tested with functional::scheduling_v3::scheduling_v2_and_v3_collator_with_v3_validators::case_1_non_zero_relay_parent_offset)

serban300 · 2026-06-09T18:36:44Z

Let's also update the v3 throughput in zombienet tests

I don't see an improvement of the v3 throughput locally (tested with functional::scheduling_v3::scheduling_v2_and_v3_collator_with_v3_validators::case_1_non_zero_relay_parent_offset)

Yes, exactly. On my local machine the throughput was constantly 10 previously and now it's between 9 and 11. So no change on average.

LE: I forgot to mention. With the previous approach + the cumulus test runtime fix the throughput would increase to 14

alindima · 2026-06-10T06:38:20Z

With the previous approach

what do you mean by that?

on the current state of the PR, I don't get that increase in the throughput.

The purpose of this task was to fix the relay parent usage such that we manage to achieve the same throughput as with v2 (which is how I discovered the problems in the first place). As discussed, we should not be able to get to the same value as v2 without resubmissions (because we don't know the scheduling parent of candidates so that we can stop building on top of them if a session change occurs).

But the end goal of this PR would be to increase the throughput if possible and have an explanation for the remaining lag (ideally by validating the hypothesis via some temporary hack/workaround).

Otherwise, we don't know if we're fully solving the problem or not

serban300 · 2026-06-10T06:41:15Z

what do you mean by that?

on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

But the end goal of this PR would be to increase the throughput if possible and have an explanation for the remaining lag (ideally by validating the hypothesis via some temporary hack/workaround).

I will try to reapply your changes or something similar

alindima · 2026-06-10T07:18:34Z

what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

iulianbarbu · 2026-06-10T07:22:44Z

what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0?

serban300 · 2026-06-10T07:35:04Z

+	// - As we are building on `RELAY_PARENT_OFFSET` old relay parents, the included block from the
+	//   parachain is also `RELAY_PARENT_OFFSET` relay blocks older (one relay block may contain
+	//   multiple parachain blocks).
+	block_processing_velocity() * (3 + relay_parent_offset())
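
Editor's worked example for the expression in the hunk above. The function name is hypothetical, and it is an assumption that the expression sizes the unincluded segment (the discussion further down ties a larger relay parent offset to a larger unincluded segment).

```rust
/// Mirrors `block_processing_velocity() * (3 + relay_parent_offset())`
/// with the two inputs passed as plain arguments.
pub fn unincluded_segment_capacity(block_processing_velocity: u32, relay_parent_offset: u32) -> u32 {
    block_processing_velocity * (3 + relay_parent_offset)
}
```

With a velocity of 1, an offset of 0 gives 3 and an offset of 2 gives 5. With a velocity of 3 and an offset of 2 the result is 15.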


what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I'm very confused by this

I will try to reapply your changes or something similar

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228
throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)

Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0?

Let's move this discussion to a thread for better tracking.

Another thing is that the branch with your old fixes lacked some later changes. For example later we added the logic in wait_for_scheduling_parent that is checking if the scheduling parent is in the same session as the relay tip.

Also block bundling wasn't merged yet.

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

If has_ancestor_relay_parent_info works correctly, then I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput.

I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228

If has_ancestor_relay_parent_info works correctly, then I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput.

my code does this:

if the latest scheduling parent we want to pick is from a new session, it drops building on top of candidates whose scheduling session are different (going back to the included head indeed).
Without this, we keep building on top of them but they will never be backed (because they are not resubmitted with newer scheduling parents).
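
Editor's sketch of the behaviour described in the two sentences above. This is not the code from commit 78fc228. `Candidate`, `parent_after_session_check` and the plain integer types are hypothetical, and it assumes the scheduling session of each unincluded candidate is known, which the thread says is not the case without a workaround.

```rust
/// Hypothetical, simplified types.
pub type Hash = u64;
pub type SessionIndex = u32;

/// Hypothetical unincluded candidate, tagged with its scheduling session.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Candidate {
    pub hash: Hash,
    pub scheduling_session: SessionIndex,
}

/// `unincluded` is ordered from the first block after the included head to
/// the tip. Keeps only the leading candidates scheduled in `scheduling_session`
/// and returns the last of them; if the first candidate already belongs to a
/// different session, falls back to the included head.
pub fn parent_after_session_check(
    included: Hash,
    unincluded: &[Candidate],
    scheduling_session: SessionIndex,
) -> Hash {
    unincluded
        .iter()
        .take_while(|candidate| candidate.scheduling_session == scheduling_session)
        .last()
        .map_or(included, |candidate| candidate.hash)
}
```

When the whole segment was scheduled in an older session, the function returns the included head, so nothing is built on candidates that would never be backed.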

I just checked and the throughput is on par with v2 with relay parent offset 2 (so as good as it can get without resubmissions). This is because once we have a larger than zero relay parent offset, we also have a larger unincluded segment, so more candidates are dropped at session boundaries

I'm testing with "max_relay_parent_session_age": 2

I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14

I think this makes sense, but only if the rp-offset is smaller than the scheduling_lookahead.
Try setting the rp offset to 8 and the "max_relay_parent_session_age": 2. This PR with my hack should achieve better throughput

serban300 self-assigned this Jun 8, 2026

serban300 added the T9-cumulus This PR/Issue is related to cumulus. label Jun 8, 2026

serban300 added 4 commits June 8, 2026 11:32

Define RelayChainInterface::ancestor_relay_parent_info

b73b185

cosmetics

7817750

- avoid querying the relay_client for detecting session changes
- reorder some arguments
- avoid passing the ancestry_lookback as an argument (read it from the relay_client)

cumulus test runtime fix

ce779ba

Add V3 logic

ef77928

serban300 force-pushed the cumulus-v3-find-parent branch from f0cfc16 to ef77928 Compare June 8, 2026 08:41

serban300 requested review from alindima and iulianbarbu June 8, 2026 09:25

eskimor reviewed Jun 8, 2026

skunert self-requested a review June 9, 2026 08:20

iulianbarbu reviewed Jun 9, 2026

alindima reviewed Jun 9, 2026

CR comments

5e05a97

serban300 commented Jun 10, 2026
