MOD-14439 Abort in-flight executions when cluster topology changes by galcohen-redislabs · Pull Request #87 · RedisGears/LibMR

galcohen-redislabs · 2026-03-18T17:52:57Z

No description provided.

gabsow · 2026-03-23T14:15:49Z

Looks like CI is failing even harder than before.

gabsow · 2026-03-23T14:23:29Z

Let's add a test: cluster refresh / CLUSTER SET while a multi-shard execution is in flight → initiator gets done with “cluster topology changed” and no (or bounded) hang until max idle.

gabsow · 2026-03-24T14:28:13Z


+// Thread pool task: fires the done callback with a cluster topology change error
+static void MR_ExecutionAbortedOnClusterChange(Execution* e, void* pd) {
+    e->errors = array_append(e->errors, MR_ErrorRecordCreate("cluster topology changed"));


Why are we using a new string?
Shouldn't we use the existing SLOT_RANGES_ERROR = "Query requires unavailable slots" string?

We will use it in ts. It is different from both the slot-ranges error and the timeout error, so it deserves its own description.
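A minimal sketch of why a distinct message can help a consumer that only sees the error text. Only "cluster topology changed" and "Query requires unavailable slots" come from this thread; the timeout text, the `ErrKind` enum, and `classify_error` are hypothetical names for illustration and are not part of LibMR.

```c
#include <string.h>

/* The first two strings appear in this thread. The timeout text is a
 * placeholder for illustration, not LibMR's actual wording. */
#define TOPOLOGY_CHANGED_ERROR "cluster topology changed"
#define SLOT_RANGES_ERROR      "Query requires unavailable slots"
#define TIMEOUT_ERROR_EXAMPLE  "execution timed out" /* hypothetical */

typedef enum {
    ERR_KIND_TOPOLOGY_CHANGED,
    ERR_KIND_SLOTS_UNAVAILABLE,
    ERR_KIND_TIMEOUT,
    ERR_KIND_OTHER
} ErrKind;

/* Hypothetical consumer-side helper: map an error message to a category so
 * each failure mode can be handled (or retried) differently. */
static ErrKind classify_error(const char *msg) {
    if (msg == NULL) return ERR_KIND_OTHER;
    if (strcmp(msg, TOPOLOGY_CHANGED_ERROR) == 0) return ERR_KIND_TOPOLOGY_CHANGED;
    if (strcmp(msg, SLOT_RANGES_ERROR) == 0) return ERR_KIND_SLOTS_UNAVAILABLE;
    if (strcmp(msg, TIMEOUT_ERROR_EXAMPLE) == 0) return ERR_KIND_TIMEOUT;
    return ERR_KIND_OTHER;
}
```

If the abort path reused SLOT_RANGES_ERROR, a helper like this could not tell the two failure modes apart.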

gabsow · 2026-03-24T14:30:42Z

    ectx->err = MR_ErrorRecordCreate(error);
 }

+LIBMR_API void MR_AbortRunningExecutions(void) {


MR_AbortRunningExecutions runs on the event loop and only queues MR_ExecutionAbortedOnClusterChange on the thread pool. MR_ClusterFree then proceeds synchronously and frees nodes/cluster state while worker threads may still run the user onDone callback afterward.

That’s the same async pattern as idle timeout, but it means onDone can run when the cluster is already torn down or partially rebuilt. Worth a short comment near MR_AbortRunningExecutions or MR_ClusterFree so callers know not to assume cluster topology is stable inside onDone for this error path (unless we later add a stronger synchronization barrier).
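The ordering concern above can be illustrated with a single-threaded simulation. This is only a sketch: `Cluster`, `Execution`, `abort_execution`, `drain_pending`, and `on_done` are stand-ins invented here, not LibMR's types or functions, and the fixed-size array replaces the real thread pool. In real multi-threaded code, reading shared cluster state from a worker would also need synchronization.

```c
#include <stddef.h>

/* Stand-in types; the real LibMR structures are richer. */
typedef struct Cluster { int numNodes; } Cluster;
typedef struct Execution {
    const char *error;
    void (*onDone)(struct Execution *e);
    int doneCalls;
} Execution;

static Cluster *currentCluster = NULL; /* owned by the "event loop" */

#define MAX_TASKS 8
static Execution *pending[MAX_TASKS];  /* simulated thread-pool queue */
static size_t pendingCount = 0;

/* Event-loop side: record the error and queue the callback, nothing more. */
static int abort_execution(Execution *e) {
    if (pendingCount == MAX_TASKS) return -1;
    e->error = "cluster topology changed";
    pending[pendingCount++] = e;
    return 0;
}

/* Worker side: runs queued callbacks at some later point. */
static void drain_pending(void) {
    for (size_t i = 0; i < pendingCount; i++) {
        pending[i]->onDone(pending[i]);
    }
    pendingCount = 0;
}

static int sawClusterDuringDone = -1;

/* A defensive callback: it checks whether a cluster exists instead of
 * assuming the topology is still in place. */
static void on_done(Execution *e) {
    e->doneCalls++;
    sawClusterDuringDone = (currentCluster != NULL);
}
```

Queueing an abort, clearing `currentCluster`, and only then calling `drain_pending` reproduces the case described: the callback runs after teardown, so it should rely only on the execution's own data.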

gabsow · 2026-03-24T14:32:04Z

MR_AbortRunningExecutions only collects executions with ExecutionFlag_Initiator. Deserialized shard-side executions never get that flag, so they stay in executionsDict until max-idle timeout or a remote DROP_EXECUTION.

After a topology refresh we tear down nodes and connections; those follower executions may be stuck waiting for ACKs/invokes that will never arrive with the old graph.

Is it intentional to leave them out of this abort path? If yes, a one-line comment above the initiator check (or in the PR description) would make the scope clear. If not, we may need a follow-up to drop or abort non-initiator executions on refresh as well.
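For reference, the scope in question amounts to a flag filter like the one below. This is a sketch under assumptions: `FLAG_INITIATOR` stands in for ExecutionFlag_Initiator, and the `Execution` struct and `collect_initiators` are hypothetical, not the code in this PR.

```c
#include <stddef.h>

#define FLAG_INITIATOR (1u << 0) /* stand-in for ExecutionFlag_Initiator */

typedef struct Execution {
    unsigned flags;
    int id;
} Execution;

/* Copy pointers to initiator executions into `out` (at most `cap` entries)
 * and return how many were collected. Non-initiator executions are skipped,
 * so they remain registered until some other path removes them. */
static size_t collect_initiators(Execution *all, size_t n,
                                 Execution **out, size_t cap) {
    size_t count = 0;
    for (size_t i = 0; i < n && count < cap; i++) {
        if (all[i].flags & FLAG_INITIATOR) {
            out[count++] = &all[i];
        }
    }
    return count;
}
```

Anything the filter skips is exactly the set of shard-side executions the question above is about.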

gabsow

I wrote comments.

galcohen-redislabs · 2026-03-31T12:15:44Z

Let's add a test: cluster refresh / CLUSTER SET while a multi-shard execution is in flight → initiator gets done with “cluster topology changed” and no (or bounded) hang until max idle.

I wanted to, but it looks like all network tests (specifically those that use CLUSTERSET) are skipped on cluster; I'm not sure why. I'll add such a test in ts, where we can also add actual data and validate that the response is either the correct data or the expected error.

galcohen-redislabs · 2026-03-31T12:37:24Z

MR_AbortRunningExecutions only collects executions with ExecutionFlag_Initiator. Deserialized shard-side executions never get that flag, so they stay in executionsDict until max-idle timeout or a remote DROP_EXECUTION.

After a topology refresh we tear down nodes and connections; those follower executions may be stuck waiting for ACKs/invokes that will never arrive with the old graph.

Is it intentional to leave them out of this abort path? If yes, a one-line comment above the initiator check (or in the PR description) would make the scope clear. If not, we may need a follow-up to drop or abort non-initiator executions on refresh as well.

added

…topology-changes

MOD-14439 Abort in-flight executions when cluster topology changes

5085920

galcohen-redislabs requested a review from gabsow March 18, 2026 17:53

gabsow requested a review from TalBarYakar March 19, 2026 18:51

gabsow reviewed Mar 24, 2026

galcohen-redislabs added 2 commits March 31, 2026 12:31

Added a comment noting assumption on execution onDone

32ea3f0

Added comment re/ aborting only executions on initiator nodes

c2d9085

Merge branch 'master' into gal-14439-abort-in-flight-executions-upon-…

c3e96ab

…topology-changes

galcohen-redislabs requested a review from gabsow March 31, 2026 12:44

gabsow approved these changes Mar 31, 2026

galcohen-redislabs merged commit bb13666 into master Mar 31, 2026
4 of 7 checks passed

galcohen-redislabs merged 4 commits into master from gal-14439-abort-in-flight-executions-upon-topology-changes

2 participants
