Drain LibMR threads to a safe point before fork() (MOD-15307) by gabsow · Pull Request #97 · RedisGears/LibMR

gabsow · 2026-06-10T18:06:46Z

Summary

Adds MR_DrainForFork() / MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread), MR_DrainForFork():

parks the event-loop thread at a between-tasks safe point via a posted task, and
bounded-waits the worker pool to idle,

so no LibMR thread holds a libc lock at fork() (which would ghost-lock the child). MR_ResumeAfterFork() releases the parked event-loop thread (called after the fork, on success or cancel).

Cooperative, bounded, fail-open. Deliberately not the existing mr_thpool_pause (SIGUSR2), which can freeze a worker mid-malloc holding the arena lock — exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.

Why / depends on

Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new FORK_CHILD_PRE subevent (redis/redis#15327).

Pre-merge note

The two RedisModule_Log(..., "notice", ...) lines in MR_DrainForFork/MR_ResumeAfterFork should be downgraded to debug (kept at notice to confirm the drain fires during CI validation).

🤖 Generated with Claude Code

Note

High Risk
Changes pre-fork quiescence across the event loop and worker pool; a missed edge case could still fork with libc locks held and hang forked children (RDB/snapshots), though the design is bounded and fail-open.

Overview
Adds MR_DrainForFork() / MR_ResumeAfterFork() for the embedding module to call around Redis FORK_CHILD_PRE / post-fork: the event-loop thread is parked between tasks, then the execution pool is waited on (busy workers and queued jobs via new mr_thpool_num_jobs_in_queue) up to a 2s bounded deadline, then fork proceeds anyway if needed.

Separately improves MOD-14615 visibility when a multi-shard execution hits max-idle: per-execution sets of peers that ACK’d or sent NOTIFY_DONE, monotonic dispatch/progress timestamps, MR_ClusterFormatPendingPeers to list non-responding shards, and a warning log on timeout (plus debug on dispatch/drain).

^{Reviewed by Cursor Bugbot for commit 995f842. Bugbot is set up for automated code reviews on this repo. Configure here.}

Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2 mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked event-loop thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Logs the time spent quiescing the threads (excluding the fork that follows), plus whether the event-loop thread parked and the busy-worker count, so the pre-fork drain cost can be measured. Debug level so it is silent in production. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the coordinator only logged a fixed "execution max idle reached" string with no clue which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and endpoint(s), replies received vs expected, the elapsed wait, and time since last progress — at warning level so it is captured at the default loglevel (it coincides with the user-visible failure and is rate-bounded by nMaxIdleReached). Track responders by recording each ACK / NOTIFY_DONE sender node-id into a heap-strings set on the Execution (freed in MR_FreeExecution, the single final owner); pending = cluster peers minus that set, formatted by a new MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug. All new state is touched only on the event-loop thread, so no extra locking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0, ignoring jobs that are queued but not yet picked up by a worker. Such a job could be dequeued and start running (and allocating) immediately after the check, right as fork() runs -- defeating the drain. Add mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot "fork drain ignores queued jobs" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f6fe687. Configure here.}

cursor · 2026-06-18T03:14:46Z

+        }
+        struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000L }; /* 1ms */
+        nanosleep(&nap, NULL);
+    }


Shared fork drain timeout

Medium Severity

MR_DrainForFork uses one CLOCK_REALTIME deadline for both parking the event-loop thread and waiting for the worker pool. Time spent waiting for the park task (including a long event-loop backlog before it runs) is subtracted from the same budget, so phase two often gets little or no time even after a successful park, and after a park timeout the pool loop exits immediately because the deadline is already past.

^{Reviewed by Cursor Bugbot for commit f6fe687. Configure here.}

This one is intentional. MR_DrainForFork runs synchronously on redis's main thread before every fork, so the single MR_FORK_DRAIN_TIMEOUT_MS (2s) deadline deliberately bounds the total main-thread stall — we do not want a fork (RDB/AOF/ASM snapshot) to block the server longer than that.

In the normal case the park task runs in microseconds, so phase two effectively gets the full budget (measured median drain ~62us). If the park itself cannot complete within the budget, the event-loop thread is stuck off a safe point and the correct bounded fallback is to fork anyway — the drain is best-effort, not a hard barrier. Giving phase two its own fresh 2s budget would double the worst-case main-thread stall, which we deliberately avoid.

mr_thpool_num_jobs_in_queue() read jobqueue.len without holding jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data race that could make MR_DrainForFork see a stale (zero) queue and fork early. Take rwmutex for the read. Addresses the bugbot "unlocked job queue length read" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gabsow · 2026-06-18T05:17:14Z

@galcohen-redislabs wanted your opinion on the last commit (995f842) before we lean on it. It takes jobqueue.rwmutex when reading jobqueue.len in mr_thpool_num_jobs_in_queue, to address the Bugbot "unlocked job-queue length read" finding.

Two things I am not sure about:

Is it really necessary? The neighbouring mr_thpool_num_threads_working() reads num_threads_working unlocked as well, and MR_DrainForFork polls both in a 1ms loop that is best-effort (it forks anyway on timeout) — so a stale len self-corrects on the next tick. By that logic the unlocked read is arguably benign (same as the existing one), and we could just reply to the bot instead of adding a lock.
Isn't it dangerous? It adds a lock acquisition on the main thread, in the pre-fork drain path. As far as I can tell it's safe — the lock is taken and released inside the accessor (never held across fork()), and workers only hold rwmutex briefly in jobqueue_push/jobqueue_pull without needing the GIL, so there's no GIL/lock inversion or fork-time deadlock. But I would rather get your read before relying on it.

Happy to drop the lock and instead annotate the read as a benign race (matching num_threads_working) if you think that is cleaner/safer. What do you think?

This was referenced Jun 10, 2026

MOD-15307: drain LibMR threads before fork() (fixes MOD-14615 & MOD-14239) RedisTimeSeries/RedisTimeSeries#2055

Open

Add FORK_CHILD_PRE/CANCELLED subevents for pre-fork module coordination redis/redis#15327

Merged

gabsow and others added 2 commits June 11, 2026 13:20

gabsow marked this pull request as ready for review June 17, 2026 18:19

cursor Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/mr.c

gabsow and others added 2 commits June 17, 2026 21:22

Merge branch 'master' into mod-15307-prefork-drain

03aa31e

cursor Bot reviewed Jun 18, 2026

View reviewed changes

gabsow requested a review from galcohen-redislabs June 18, 2026 03:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drain LibMR threads to a safe point before fork() (MOD-15307)#97

Drain LibMR threads to a safe point before fork() (MOD-15307)#97
gabsow wants to merge 6 commits into
masterfrom
mod-15307-prefork-drain

gabsow commented Jun 10, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 18, 2026

Uh oh!

gabsow Jun 18, 2026

Uh oh!

Uh oh!

gabsow commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gabsow commented Jun 10, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why / depends on

Pre-merge note

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 18, 2026

Choose a reason for hiding this comment

Shared fork drain timeout

Uh oh!

gabsow Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gabsow commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gabsow commented Jun 10, 2026 •

edited by cursor Bot

Loading