Cumulus v3: find_parent() adjustments#12296
Conversation
f0cfc16 to
ef77928
Compare
eskimor
left a comment
There was a problem hiding this comment.
Quick first pass. I am not sure yet personally on correctness, I find the code a bit confusing - left a few suggestions for improvements.
|
|
||
| let included_hash = included_header.hash(); | ||
| // If the included block is not locally known, we can't do anything. | ||
| let Some(_) = get_para_header(backend, included_hash) else { |
There was a problem hiding this comment.
But is this a concern of this function? It is a bit weird to do an unrelated validation, where we are not even interested in the result.
There was a problem hiding this comment.
If we need this validation, and we need it both when we fetch the included header at relay parent and at scheduling parent, I think it's good to do it here. Otherwise we would duplicate it.
| @@ -93,35 +100,27 @@ pub async fn find_parent_for_building<B: BlockT>( | |||
| // `OccupiedCoreAssumption::Included` so the candidate pending availability gets enacted | |||
| // before being returned to us. | |||
| let pending_header = relay_client | |||
There was a problem hiding this comment.
Super nit: I find it confusing that we use a helper for the included block, but none for the pending one, despite both fetches being almost 100% the same code. This either warrants a helper or not.
There was a problem hiding this comment.
Reading a function, ideally the reader stays at one abstraction layer.
There was a problem hiding this comment.
Tried to reorganize the logic. PTAL !
| break None; | ||
| }; | ||
| if current_hash == para_best_hash { | ||
| para_best_header = Some(current_header.clone()); |
There was a problem hiding this comment.
Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?
There was a problem hiding this comment.
In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.
There was a problem hiding this comment.
Why not just initialize para_best_header before the loop already and make it immutable?
This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.
There was a problem hiding this comment.
agree about the simplification ideas.
moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash
There was a problem hiding this comment.
I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.
There was a problem hiding this comment.
Ok, this is super confusing. Why not just initialize para_best_header before the loop already and make it immutable?
In general, this function is quite confusing. Can we get a clearer version that is easier to follow? Try to encode your intent as directly as possible.
My intention was to avoid duplication as much as possible. Every bit of logic that we get outside of the loop will be duplicated logic, because it's needed inside the loop as well.
Moved the para_best_header initialization outside the loop and also added some comments and did some small adjustments. Please let me know if it is easier to follow now.
Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.
Apart from the code duplication, another problem with moving has_ancestor_relay_parent_info outside the loop is calling it twice unnecessarily when the start + 1 == para_best_header number . That's why I would suggest to keep it as it is. But I can move it if there are strong preferences.
I think the loop is useful to check that best is not on some weird fork that doesn't end with the start header.
Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?
There was a problem hiding this comment.
Yes, exactly. The loop is there for ensuring that the start header and the para best header are on the same fork. Don't we need to check this ?
oh right. but what if they are not. the DFS we did for v2 would have handled that, no?
I mean if the best header is temporarily on some fork that will be discarded
There was a problem hiding this comment.
If they are not, we return the start_header. Is that correct ? I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent
There was a problem hiding this comment.
I understand that because of the resubmission logic it doesn't make sense to return the deepest valid parent
hmm why not? @iulianbarbu I think you proposed getting rid of the deepest valid parent search.
There was a problem hiding this comment.
LGTM, delta the comments from @eskimor related to get_para_header concern, or for the included/PA blocks fetching abstraction.
| break None; | ||
| }; | ||
| if current_hash == para_best_hash { | ||
| para_best_header = Some(current_header.clone()); |
There was a problem hiding this comment.
Why not just initialize para_best_header before the loop already and make it immutable?
This will make it simpler to read it, indeed. Also doing the has_ancestor_relay_parent_info call for it before the loop. The loop can iterate back to start, and do the appropriate has_ancestor_relay_parent_info on the direct descendant of start and return accordingly in the end. I think the rest of this fn reads clear to me.
alindima
left a comment
There was a problem hiding this comment.
Let's also update the v3 throughput in zombienet tests
| let scheduling_parent = *params.scheduling_parent(); | ||
|
|
||
| let Some(ParentSearchStart { included: included_header, start: (start_hash, start_header) }) = | ||
| get_parent_search_start(relay_client, backend, para_id, scheduling_parent).await? |
There was a problem hiding this comment.
IMO this function should just return the starting point. the included header if it's needed here should be retrieved here before the call to get_parent_search_start and then passed to it also as a param
There was a problem hiding this comment.
Agree. But it's probably not worth having this function now. Removed.
| // Determine the starting point for the search. | ||
| let (start_hash, start_header) = match &maybe_pending { | ||
| Some((pending_header, pending_hash)) => { | ||
| Some((pending_hash, pending_header)) => { |
There was a problem hiding this comment.
this is a needless check IMO. it guards against a potential severe bug of the relay chain (backing an invalid chain of candidates)
|
|
||
| // Search for the deepest valid parent starting from the pending/included block. | ||
| let best_parent_header = | ||
| find_deepest_valid_parent(backend, start_header, start_hash, &rp_ancestry); |
There was a problem hiding this comment.
the name does not make it obvious that this is v2-specific
There was a problem hiding this comment.
Couldn't think of a good name. Any suggestion ?
| let relay_parent = 'get_relay_parent: { | ||
| let digest = header.digest(); | ||
|
|
||
| if let Some(relay_parent) = cumulus_primitives_core::extract_relay_parent(digest) { |
There was a problem hiding this comment.
the control flow of this entire let relay_parent is hard to follow due to the usage of these named break statements.
why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?
There was a problem hiding this comment.
why do we even handle the possibility of cumulus_primitives_core::extract_relay_parent failing? is that a real possibility?
Yes. At some point I added logs and checked this. I didn't count the exact number of occurrences, but it happened very often. I would say once in 2-3 slots there was a header that didn't have the relay parent digest and only had the storage root information.
There was a problem hiding this comment.
I could not find any place within the code that sets the relay parent digest item. Maybe it's something historical.
Although you mentioned that you also saw blocks that did have it?
@skunert do you know more about these?
There was a problem hiding this comment.
Although you mentioned that you also saw blocks that did have it?
Hmmm actually I didn't look specifically for that. I only put logs on the storage root path and they were triggered quite often. I was under the impression that the relay parent digest item is also present sometimes, but now that you mention it, I'm not so sure. I'll check.
| break None; | ||
| }; | ||
| if current_hash == para_best_hash { | ||
| para_best_header = Some(current_header.clone()); |
There was a problem hiding this comment.
agree about the simplification ideas.
moreover, we don't even need a loop if we only do a check on the start_header + 1 and the best hash
| let included_header = match v3_enabled { | ||
| false => parent_search_result.included_header, | ||
| true => { | ||
| let Ok(Some((_, included_header))) = fetch_included_from_relay_chain( |
There was a problem hiding this comment.
it's not at all obvious why we use a different included header here than what is returned by the find_parent for V3
There was a problem hiding this comment.
Indeed, this also quite stumped me. It seems technically correct, but is far from obvious. It also feels like a hack, (ignoring the returned one, but fetching it our own ... why is the function not returning the correct one already? ..) TL;DR: This deserves some docs/comments + also some cleanup/simplification: Let the function already do the correct thing in an obviously correct way.
There was a problem hiding this comment.
The parent search result contains the included header at the scheduling parent. For the logic that follows this we need the included header at the relay parent since it will be used in can_build_upon(). Because some of the cheks done in can_build_upon() are also done by the runtime in the set_validation_data inherent. And the inherent uses the relay parent context.
Renamed this included to included_at_execution and the one in the parent search result to included_at_scheduling. Also added a comment. PTAL !
I don't see an improvement of the v3 throughput locally (tested with |
Yes, exactly. On my local machine the throughput was constantly 10 previously and now it's between 9-11 . So no change on average. LE: I forgot to mention. With the previous approach + the cumulus test runtime fix the throughput would increase to 14 |
what do you mean by that? on the current state of the PR, I don't get that increase in the throughput. The purpose of this task was to fix the relay parent usage such that we manage to achieve the same throughput as with v2 (which is how I discovered the problems in the first place). As discussed, we should not be able to get to the same value as v2 without resubmissions (because we don't know the scheduling parent of candidates so that we can stop building on top of them if a session change occurs). But the end goal of this PR would be to increase the throughput if possible and have an explanation for the remaining lag (ideally by validating the hypothesis via some temporary hack/workaround). Otherwise, we don't know if we're fully solving the problem or not |
I mean without the changes to
I will try to reapply your changes or something similar |
I'm very confused by this
I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228 throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that) |
Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0? |
| // - As we are building on `RELAY_PARENT_OFFSET` old relay parents, the included block from the | ||
| // parachain is also `RELAY_PARENT_OFFSET` relay blocks older (one relay block may contains | ||
| // multiple parachain blocks). | ||
| block_processing_velocity() * (3 + relay_parent_offset()) |
There was a problem hiding this comment.
what do you mean by that?
on the current state of the PR, I don't get that increase in the throughput.I mean without the changes to
find_parent()from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14I'm very confused by this
I will try to reapply your changes or something similar
I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228
throughput goes to 15. still confused why it's not 20 as with v2 (I believe my old fixes achieved that)Might be because you build with a relay parent offset of 2 and in the past not, or v2 was running with rpo 0?
Let's move this discussion to a thread for better tracking.
Another thing is that the branch with your old fixes lacked some later changes. For example later we added the logic in wait_for_scheduling_parent that is checking if the scheduling parent is in the same session as the relay tip.
Also block bundling wasn't merged yet.
There was a problem hiding this comment.
I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228
If has_ancestor_relay_parent_info works correctly, than I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput
There was a problem hiding this comment.
I added a hack for determining the scheduling session and stopping building if it's from an old session: 78fc228
If
has_ancestor_relay_parent_infoworks correctly, than I think your code will only do one thing: when the start header is the PA header, it will go back and return the included one instead. I need to check. But I wonder why this is increasing the throughput
my code does this:
if the latest scheduling parent we want to pick is from a new session, it drops building on top of candidates whose scheduling session are different (going back to the included head indeed).
Without this, we keep building on top of them but they will never be backed (because they are not resubmitted with newer scheduling parents).
I just checked and the throughput is on par with v2 with relay parent offset 2 (so as good as it can get without resubmissions). This is because once we have a larger than zero relay parent offset, we also have a larger unincluded segment, so more candidates are dropped at session boundaries
I'm testing with a"max_relay_parent_session_age": 2
There was a problem hiding this comment.
I mean without the changes to find_parent() from this PR. If you just cherry pick the cumulus test runtime fix and put it on master, you will get throughput 14
I think this makse sense, but only if the rp-offset is smaller than the scheduling_lookahead.
Try setting the rp offset to 8 and the "max_relay_parent_session_age": 2. This PR with my hack should achieve better throughput
Related to #11624
Implements the changes discussed in #11624 .
The plan is to add unit & integration tests in follow-up PRs because: