Drain LibMR threads to a safe point before fork() (MOD-15307)#97
Drain LibMR threads to a safe point before fork() (MOD-15307)#97gabsow wants to merge 6 commits into
Conversation
Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2 mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked event-loop thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Logs the time spent quiescing the threads (excluding the fork that follows), plus whether the event-loop thread parked and the busy-worker count, so the pre-fork drain cost can be measured. Debug level so it is silent in production. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the coordinator only logged a fixed "execution max idle reached" string with no clue which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and endpoint(s), replies received vs expected, the elapsed wait, and time since last progress — at warning level so it is captured at the default loglevel (it coincides with the user-visible failure and is rate-bounded by nMaxIdleReached). Track responders by recording each ACK / NOTIFY_DONE sender node-id into a heap-strings set on the Execution (freed in MR_FreeExecution, the single final owner); pending = cluster peers minus that set, formatted by a new MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug. All new state is touched only on the event-loop thread, so no extra locking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0, ignoring jobs that are queued but not yet picked up by a worker. Such a job could be dequeued and start running (and allocating) immediately after the check, right as fork() runs -- defeating the drain. Add mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot "fork drain ignores queued jobs" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit f6fe687. Configure here.
| } | ||
| struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000L }; /* 1ms */ | ||
| nanosleep(&nap, NULL); | ||
| } |
There was a problem hiding this comment.
Shared fork drain timeout
Medium Severity
MR_DrainForFork uses one CLOCK_REALTIME deadline for both parking the event-loop thread and waiting for the worker pool. Time spent waiting for the park task (including a long event-loop backlog before it runs) is subtracted from the same budget, so phase two often gets little or no time even after a successful park, and after a park timeout the pool loop exits immediately because the deadline is already past.
Reviewed by Cursor Bugbot for commit f6fe687. Configure here.
There was a problem hiding this comment.
This one is intentional. MR_DrainForFork runs synchronously on redis's main thread before every fork, so the single MR_FORK_DRAIN_TIMEOUT_MS (2s) deadline deliberately bounds the total main-thread stall — we do not want a fork (RDB/AOF/ASM snapshot) to block the server longer than that.
In the normal case the park task runs in microseconds, so phase two effectively gets the full budget (measured median drain ~62us). If the park itself cannot complete within the budget, the event-loop thread is stuck off a safe point and the correct bounded fallback is to fork anyway — the drain is best-effort, not a hard barrier. Giving phase two its own fresh 2s budget would double the worst-case main-thread stall, which we deliberately avoid.
mr_thpool_num_jobs_in_queue() read jobqueue.len without holding jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data race that could make MR_DrainForFork see a stale (zero) queue and fork early. Take rwmutex for the read. Addresses the bugbot "unlocked job queue length read" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@galcohen-redislabs wanted your opinion on the last commit ( Two things I am not sure about:
Happy to drop the lock and instead annotate the read as a benign race (matching |


Summary
Adds
MR_DrainForFork()/MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread),MR_DrainForFork():so no LibMR thread holds a libc lock at
fork()(which would ghost-lock the child).MR_ResumeAfterFork()releases the parked event-loop thread (called after the fork, on success or cancel).Cooperative, bounded, fail-open. Deliberately not the existing
mr_thpool_pause(SIGUSR2), which can freeze a worker mid-mallocholding the arena lock — exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.Why / depends on
Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new
FORK_CHILD_PREsubevent (redis/redis#15327).Pre-merge note
The two
RedisModule_Log(..., "notice", ...)lines inMR_DrainForFork/MR_ResumeAfterForkshould be downgraded todebug(kept atnoticeto confirm the drain fires during CI validation).🤖 Generated with Claude Code
Note
High Risk
Changes pre-fork quiescence across the event loop and worker pool; a missed edge case could still fork with libc locks held and hang forked children (RDB/snapshots), though the design is bounded and fail-open.
Overview
Adds
MR_DrainForFork()/MR_ResumeAfterFork()for the embedding module to call around RedisFORK_CHILD_PRE/ post-fork: the event-loop thread is parked between tasks, then the execution pool is waited on (busy workers and queued jobs via newmr_thpool_num_jobs_in_queue) up to a 2s bounded deadline, then fork proceeds anyway if needed.Separately improves MOD-14615 visibility when a multi-shard execution hits max-idle: per-execution sets of peers that ACK’d or sent NOTIFY_DONE, monotonic dispatch/progress timestamps,
MR_ClusterFormatPendingPeersto list non-responding shards, and a warning log on timeout (plus debug on dispatch/drain).Reviewed by Cursor Bugbot for commit 995f842. Bugbot is set up for automated code reviews on this repo. Configure here.