Skip to content

Drain LibMR threads to a safe point before fork() (MOD-15307)#97

Open
gabsow wants to merge 6 commits into
masterfrom
mod-15307-prefork-drain
Open

Drain LibMR threads to a safe point before fork() (MOD-15307)#97
gabsow wants to merge 6 commits into
masterfrom
mod-15307-prefork-drain

Conversation

@gabsow

@gabsow gabsow commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds MR_DrainForFork() / MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread), MR_DrainForFork():

  • parks the event-loop thread at a between-tasks safe point via a posted task, and
  • bounded-waits the worker pool to idle,

so no LibMR thread holds a libc lock at fork() (which would ghost-lock the child). MR_ResumeAfterFork() releases the parked event-loop thread (called after the fork, on success or cancel).

Cooperative, bounded, fail-open. Deliberately not the existing mr_thpool_pause (SIGUSR2), which can freeze a worker mid-malloc holding the arena lock — exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.

Why / depends on

Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new FORK_CHILD_PRE subevent (redis/redis#15327).

Pre-merge note

The two RedisModule_Log(..., "notice", ...) lines in MR_DrainForFork/MR_ResumeAfterFork should be downgraded to debug (kept at notice to confirm the drain fires during CI validation).

🤖 Generated with Claude Code


Note

High Risk
Changes pre-fork quiescence across the event loop and worker pool; a missed edge case could still fork with libc locks held and hang forked children (RDB/snapshots), though the design is bounded and fail-open.

Overview
Adds MR_DrainForFork() / MR_ResumeAfterFork() for the embedding module to call around Redis FORK_CHILD_PRE / post-fork: the event-loop thread is parked between tasks, then the execution pool is waited on (busy workers and queued jobs via new mr_thpool_num_jobs_in_queue) up to a 2s bounded deadline, then fork proceeds anyway if needed.

Separately improves MOD-14615 visibility when a multi-shard execution hits max-idle: per-execution sets of peers that ACK’d or sent NOTIFY_DONE, monotonic dispatch/progress timestamps, MR_ClusterFormatPendingPeers to list non-responding shards, and a warning log on timeout (plus debug on dispatch/drain).

Reviewed by Cursor Bugbot for commit 995f842. Bugbot is set up for automated code reviews on this repo. Configure here.

Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's
FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via
a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a
libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2
mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked
event-loop thread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gabsow and others added 2 commits June 11, 2026 13:20
Logs the time spent quiescing the threads (excluding the fork that follows),
plus whether the event-loop thread parked and the busy-worker count, so the
pre-fork drain cost can be measured. Debug level so it is silent in production.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the
coordinator only logged a fixed "execution max idle reached" string with no clue
which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and
endpoint(s), replies received vs expected, the elapsed wait, and time since last
progress — at warning level so it is captured at the default loglevel (it
coincides with the user-visible failure and is rate-bounded by nMaxIdleReached).

Track responders by recording each ACK / NOTIFY_DONE sender node-id into a
heap-strings set on the Execution (freed in MR_FreeExecution, the single final
owner); pending = cluster peers minus that set, formatted by a new
MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug.
All new state is touched only on the event-loop thread, so no extra locking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gabsow gabsow marked this pull request as ready for review June 17, 2026 18:19
Comment thread src/mr.c
gabsow and others added 2 commits June 17, 2026 21:22
MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0,
ignoring jobs that are queued but not yet picked up by a worker. Such a job
could be dequeued and start running (and allocating) immediately after the
check, right as fork() runs -- defeating the drain. Add
mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue
to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot
"fork drain ignores queued jobs" finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f6fe687. Configure here.

Comment thread src/mr.c
}
struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000L }; /* 1ms */
nanosleep(&nap, NULL);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared fork drain timeout

Medium Severity

MR_DrainForFork uses one CLOCK_REALTIME deadline for both parking the event-loop thread and waiting for the worker pool. Time spent waiting for the park task (including a long event-loop backlog before it runs) is subtracted from the same budget, so phase two often gets little or no time even after a successful park, and after a park timeout the pool loop exits immediately because the deadline is already past.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f6fe687. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is intentional. MR_DrainForFork runs synchronously on redis's main thread before every fork, so the single MR_FORK_DRAIN_TIMEOUT_MS (2s) deadline deliberately bounds the total main-thread stall — we do not want a fork (RDB/AOF/ASM snapshot) to block the server longer than that.

In the normal case the park task runs in microseconds, so phase two effectively gets the full budget (measured median drain ~62us). If the park itself cannot complete within the budget, the event-loop thread is stuck off a safe point and the correct bounded fallback is to fork anyway — the drain is best-effort, not a hard barrier. Giving phase two its own fresh 2s budget would double the worst-case main-thread stall, which we deliberately avoid.

Comment thread src/utils/thpool.c
mr_thpool_num_jobs_in_queue() read jobqueue.len without holding
jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data
race that could make MR_DrainForFork see a stale (zero) queue and fork early.
Take rwmutex for the read. Addresses the bugbot "unlocked job queue length
read" finding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gabsow

gabsow commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@galcohen-redislabs wanted your opinion on the last commit (995f842) before we lean on it. It takes jobqueue.rwmutex when reading jobqueue.len in mr_thpool_num_jobs_in_queue, to address the Bugbot "unlocked job-queue length read" finding.

Two things I am not sure about:

  1. Is it really necessary? The neighbouring mr_thpool_num_threads_working() reads num_threads_working unlocked as well, and MR_DrainForFork polls both in a 1ms loop that is best-effort (it forks anyway on timeout) — so a stale len self-corrects on the next tick. By that logic the unlocked read is arguably benign (same as the existing one), and we could just reply to the bot instead of adding a lock.

  2. Isn't it dangerous? It adds a lock acquisition on the main thread, in the pre-fork drain path. As far as I can tell it's safe — the lock is taken and released inside the accessor (never held across fork()), and workers only hold rwmutex briefly in jobqueue_push/jobqueue_pull without needing the GIL, so there's no GIL/lock inversion or fork-time deadlock. But I would rather get your read before relying on it.

Happy to drop the lock and instead annotate the read as a benign race (matching num_threads_working) if you think that is cleaner/safer. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant