MOD-14439 Abort in-flight executions when cluster topology changes #87
Looks like CI is failing even harder than before.
Let's add a test: trigger a cluster refresh / CLUSTER SET while a multi-shard execution is in flight, and check that the initiator's done callback fires with "cluster topology changed", with no hang (or at most a bounded one) waiting for max idle.
// Thread pool task: fires the done callback with a cluster topology change error
static void MR_ExecutionAbortedOnClusterChange(Execution* e, void* pd) {
    e->errors = array_append(e->errors, MR_ErrorRecordCreate("cluster topology changed"));
Why are we using a new string?
Shouldn't we use the existing SLOT_RANGES_ERROR = "Query requires unavailable slots" string?
We will use it in ts. It is different from both the slot-ranges error and the timeout error, so it deserves its own description.
    ectx->err = MR_ErrorRecordCreate(error);
}

LIBMR_API void MR_AbortRunningExecutions(void) {
MR_AbortRunningExecutions runs on the event loop and only queues MR_ExecutionAbortedOnClusterChange on the thread pool. MR_ClusterFree then proceeds synchronously and frees nodes/cluster state while worker threads may still run the user onDone callback afterward.
That’s the same async pattern as idle timeout, but it means onDone can run when the cluster is already torn down or partially rebuilt. Worth a short comment near MR_AbortRunningExecutions or MR_ClusterFree so callers know not to assume cluster topology is stable inside onDone for this error path (unless we later add a stronger synchronization barrier).
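To make the ordering concern concrete, here is a minimal single-threaded sketch of it. Everything in it is hypothetical (`DemoQueue`, `demo_refresh`, `demo_on_done`, the `demo_cluster_alive` flag); it does not use the real LibMR thread pool or cluster structures, and it only models "queue the callback, tear down synchronously, run the callback later".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not LibMR types. */
typedef void (*DemoTask)(void *arg);

#define DEMO_QUEUE_CAP 8

typedef struct {
    DemoTask fn[DEMO_QUEUE_CAP];
    void *arg[DEMO_QUEUE_CAP];
    size_t len;
} DemoQueue;

/* 1 while the (pretend) cluster state exists, 0 after it is torn down. */
static int demo_cluster_alive = 1;

/* What the callback observed: -1 = not run yet, otherwise the flag value. */
static int demo_saw_cluster_alive = -1;

/* Plays the role of the user's onDone callback. */
static void demo_on_done(void *arg) {
    (void)arg;
    demo_saw_cluster_alive = demo_cluster_alive;
}

/* Defers a task instead of running it, like handing it to a worker pool. */
static void demo_enqueue(DemoQueue *q, DemoTask fn, void *arg) {
    assert(q->len < DEMO_QUEUE_CAP);
    q->fn[q->len] = fn;
    q->arg[q->len] = arg;
    q->len++;
}

/* Runs every deferred task, like the workers eventually picking them up. */
static void demo_drain(DemoQueue *q) {
    for (size_t i = 0; i < q->len; i++) {
        q->fn[i](q->arg[i]);
    }
    q->len = 0;
}

/* Models the refresh path: the abort only queues the callback, then the
 * teardown happens synchronously before any queued task has run. */
static void demo_refresh(DemoQueue *q) {
    demo_enqueue(q, demo_on_done, NULL);
    demo_cluster_alive = 0;
}
```

In this model the callback always observes the torn-down state, which is why a comment telling callers not to rely on cluster topology inside onDone for this error path seems worthwhile.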
MR_AbortRunningExecutions only collects executions with ExecutionFlag_Initiator. Deserialized shard-side executions never get that flag, so they stay in executionsDict until max-idle timeout or a remote DROP_EXECUTION. After a topology refresh we tear down nodes and connections; those follower executions may be stuck waiting for ACKs/invokes that will never arrive with the old graph. Is it intentional to leave them out of this abort path? If yes, a one-line comment above the initiator check (or in the PR description) would make the scope clear. If not, we may need a follow-up to drop or abort non-initiator executions on refresh as well.
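A rough sketch of the scope question, assuming a flag-based filter like the one described above. The types and names here (`DemoExecution`, `DEMO_FLAG_INITIATOR`, `demo_abort_initiators`) are made up for illustration and are not the actual LibMR definitions; the point is only that anything without the initiator flag is left behind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real LibMR definitions. */
enum { DEMO_FLAG_INITIATOR = 1 << 0 };

typedef struct {
    int id;
    unsigned flags;
    int aborted; /* set to 1 once the execution has been aborted */
} DemoExecution;

/* Aborts every execution that carries the initiator flag and returns how
 * many executions were left untouched (the shard-side followers). */
static size_t demo_abort_initiators(DemoExecution *execs, size_t n) {
    assert(execs != NULL || n == 0);
    size_t skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (execs[i].flags & DEMO_FLAG_INITIATOR) {
            execs[i].aborted = 1;
        } else {
            skipped++; /* would wait for max-idle or a remote drop */
        }
    }
    return skipped;
}
```

If leaving followers out is intentional, the returned count is the set that still relies on the max-idle timeout; if not, the `else` branch is where a follow-up would hook in.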
I wanted to, but it looks like all network tests (specifically those that use CLUSTERSET) are skipped on cluster. I'm not sure why. I'll add such a test in ts (where we could also add actual data and validate that the response is either the correct data or the expected error).
Added.