Skip to content

MOD-14439 Abort in-flight executions when cluster topology changes#87

Merged
galcohen-redislabs merged 4 commits into
masterfrom
gal-14439-abort-in-flight-executions-upon-topology-changes
Mar 31, 2026
Merged

MOD-14439 Abort in-flight executions when cluster topology changes#87
galcohen-redislabs merged 4 commits into
masterfrom
gal-14439-abort-in-flight-executions-upon-topology-changes

Conversation

@galcohen-redislabs

Copy link
Copy Markdown
Collaborator

No description provided.

@gabsow gabsow requested a review from TalBarYakar March 19, 2026 18:51
@gabsow

gabsow commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

looks like CI is failing even harder then befor

@gabsow

gabsow commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Let's add a test: cluster refresh / CLUSTER SET while a multi-shard execution is in flight → initiator gets done with “cluster topology changed” and no (or bounded) hang until max idle.

Comment thread src/mr.c

// Thread pool task: fires the done callback with a cluster topology change error
static void MR_ExecutionAbortedOnClusterChange(Execution* e, void* pd) {
e->errors = array_append(e->errors, MR_ErrorRecordCreate("cluster topology changed"));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we using new string ?
shoudent we use the exsiting SLOT_RANGES_ERROR = "Query requires unavailable slots" string ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will use it in ts. It is different than the slot-ranges error and from the timeout error, so deserves a new description.

Comment thread src/mr.c
ectx->err = MR_ErrorRecordCreate(error);
}

LIBMR_API void MR_AbortRunningExecutions(void) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MR_AbortRunningExecutions runs on the event loop and only queues MR_ExecutionAbortedOnClusterChange on the thread pool. MR_ClusterFree then proceeds synchronously and frees nodes/cluster state while worker threads may still run the user onDone callback afterward.

That’s the same async pattern as idle timeout, but it means onDone can run when the cluster is already torn down or partially rebuilt. Worth a short comment near MR_AbortRunningExecutions or MR_ClusterFree so callers know not to assume cluster topology is stable inside onDone for this error path (unless we later add a stronger synchronization barrier).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@gabsow

gabsow commented Mar 24, 2026

Copy link
Copy Markdown
Contributor

MR_AbortRunningExecutions only collects executions with ExecutionFlag_Initiator. Deserialized shard-side executions never get that flag, so they stay in executionsDict until max-idle timeout or a remote DROP_EXECUTION.

After a topology refresh we tear down nodes and connections; those follower executions may be stuck waiting for ACKs/invokes that will never arrive with the old graph.

Is it intentional to leave them out of this abort path? If yes, a one-line comment above the initiator check (or in the PR description) would make the scope clear. If not, we may need a follow-up to drop or abort non-initiator executions on refresh as well.

@gabsow gabsow left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wrote comments

@galcohen-redislabs

Copy link
Copy Markdown
Collaborator Author

Let's add a test: cluster refresh / CLUSTER SET while a multi-shard execution is in flight → initiator gets done with “cluster topology changed” and no (or bounded) hang until max idle.

I wanted to, but looks like all network tests (and specifically those that use CLUSTERSET) are skipped on cluster. I'm not sure why. I'll add such a test in ts (where we could also add actual data and validate the response is either the correct data or the expected error).

@galcohen-redislabs

Copy link
Copy Markdown
Collaborator Author

MR_AbortRunningExecutions only collects executions with ExecutionFlag_Initiator. Deserialized shard-side executions never get that flag, so they stay in executionsDict until max-idle timeout or a remote DROP_EXECUTION.

After a topology refresh we tear down nodes and connections; those follower executions may be stuck waiting for ACKs/invokes that will never arrive with the old graph.

Is it intentional to leave them out of this abort path? If yes, a one-line comment above the initiator check (or in the PR description) would make the scope clear. If not, we may need a follow-up to drop or abort non-initiator executions on refresh as well.

added

@galcohen-redislabs galcohen-redislabs merged commit bb13666 into master Mar 31, 2026
4 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants