
feat(slurm): cleanup stale node-local state before launch #2331

Merged
samsja merged 3 commits into main from feat/slurm-cleanup-stale-node-state
Apr 19, 2026

Conversation

@mikasenghaas mikasenghaas commented Apr 19, 2026

Summary

  • Add a pre-workload srun step to multi_node_rl.sbatch.j2, multi_node_sft.sbatch.j2, and inference.sbatch.j2 that runs once per allocated node and wipes stale state left over from prior jobs.
  • Specifically kills orphan python/torchrun/vllm/prime_rl processes and removes IPC state under /dev/shm/vllm-*, /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*.
  • Prints one line per node so the sbatch log confirms procs=N gpu_mem=MMiB on startup.
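
A sketch of what this pre-workload step can look like in one of the templates — illustrative only, assembled from the fragments quoted later in this thread rather than the PR's exact diff; the `srun` flags here are assumptions:

```shell
# Illustrative per-node preflight cleanup (not the PR's exact diff).
# Runs once per allocated node, before the workload's own srun.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 bash -c '
    # Kill orphans left by a prior wedged job; safe only because the
    # templates set #SBATCH --exclusive, so every match is ours.
    pkill -9 -f "torchrun" 2>/dev/null
    pkill -9 -f "vllm" 2>/dev/null
    pkill -9 -f "prime_rl" 2>/dev/null
    sleep 2
    # Drop stale vLLM/torch IPC state.
    rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* \
           /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
    # One status line per node for the sbatch log.
    procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
    gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
          | awk "{s+=\$1} END {print s}")
    echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
```

Note that Bugbot's review further down flags a pitfall with exactly this shape: running `pkill -f` from inside a `bash -c` whose script text contains the pattern matches the cleanup shell itself, so a deployed version needs to guard against that (e.g. run the script from a file).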

Motivation

When a SLURM job wedges in the CG (completing) state for a long time after scancel — which happens regularly on our large multi-node RL runs that die mid-NCCL-collective — SLURM doesn't always reap every process. The next job scheduled onto the same nodes inherits:

  • orphan Python/vLLM/torchrun processes still holding GPU memory and sockets, and
  • stale /dev/shm/vllm-* segments and /tmp/torchelastic_* rendezvous files.

The concrete failure mode we hit in production was decode engines hanging on startup with:

Waiting for READY message from DP Coordinator...

for 1800s until the orchestrator aborted. A manual pdsh cleanup of the allocated nodes (identical to what this PR adds inside the sbatch templates) fixed it immediately on the next launch, and has continued to do so.

The cleanup is a no-op on clean nodes — it just prints procs=1 gpu_mem=0MiB.

Notes

  • Relies on #SBATCH --exclusive, which all three templates already set, so we're only killing our own user's stale processes and never touching other users' workloads.
  • Runs before srun's workload launch, so it can't race with the current job's own processes.
  • Intentionally broad pkill -9 -f "vllm" etc. — on an exclusively allocated node, anything matching prime_rl/torchrun/vllm in the command line is by definition a leftover.

🤖 Generated with Claude Code


Note

Medium Risk
Adds an aggressive per-node preflight cleanup (pkill -9 and rm -rf under /dev/shm and /tmp) to the SLURM templates; while intended for --exclusive nodes, incorrect matching or environment assumptions could still terminate unintended processes or remove needed temp state.

Overview
Adds a pre-workload srun cleanup step to the inference, multi_node_rl, and multi_node_sft SLURM sbatch templates to proactively remove stale node-local state left behind by prior jobs.

The new step force-kills leftover python/torchrun/vllm/prime_rl processes (including comm-name matches like vllm::...), deletes related IPC/temp files under /dev/shm and /tmp, and logs a per-node [node-cleanup] line with remaining process count and GPU memory usage.

Reviewed by Cursor Bugbot for commit 11a4e7a.

Add a pre-workload srun step to the multi-node RL, multi-node SFT and
inference sbatch templates. It runs once per node and:

- kills orphan python/torchrun/vllm/prime_rl processes left over from a
  prior job that wedged after scancel (SLURM doesn't always reap cleanly
  when a job sits in CG for hours)
- removes stale vLLM and torch IPC state under /dev/shm/vllm-*,
  /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*

Without this, decode engines on previously-used nodes can hang at
"Waiting for READY message from DP Coordinator" because the new vLLM
process finds a stale /dev/shm segment or port holder from the dead run.
Symptom we hit: a fresh job timing out after 1800s because 4 decode
engines never became READY; a manual pdsh cleanup of the same nodes
fixed it immediately.

Each node prints one line (hostname, residual proc count, total GPU
memory in use) so the sbatch log shows the nodes came up clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
Collaborator


can we add vllm router processes as well?

Member Author


Good call — added explicit vllm-router entries to both the pkill list and the procs-count regex in all three templates in 79e72b1. Functionally a no-op since the broader vllm pattern already matched vllm-router as a substring, but better to be explicit.

Member Author


Good point — vllm::router is the kernel comm (prctl'd process name), not the cmdline, so pkill -f misses it. Just pushed 11a4e7a adding pkill -9 "vllm" and pkill -9 "vllm::.*" (no -f) to match against comm, plus broadened the procs count to ps -eo comm,args so we see these in the per-node status line.
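
The comm-vs-cmdline distinction can be demonstrated in a few lines of shell. This is an illustrative sketch, not code from the PR; the rename below writes /proc/self/comm, which is the same field prctl(PR_SET_NAME) updates:

```shell
# A background shell renames its kernel comm (as prctl(PR_SET_NAME)
# would); its /proc/<pid>/cmdline is unchanged. The trailing `true`
# keeps bash from exec'ing the last command, which would reset the comm.
bash -c 'printf "vllm::%s" router > /proc/$$/comm; sleep 5; true' &
pid=$!
sleep 1   # give the child time to rename itself
# Build the search pattern at runtime so it never appears literally in
# any cmdline (otherwise -f would match this script itself).
pat='vllm::rou'; pat="${pat}ter"
if pgrep -f "$pat" | grep -qx "$pid"; then cmdline_hit=yes; else cmdline_hit=no; fi
if pgrep "vllm" | grep -qx "$pid"; then comm_hit=yes; else comm_hit=no; fi
echo "cmdline match: $cmdline_hit, comm match: $comm_hit"
kill "$pid" 2>/dev/null
```

`pgrep -f` never sees the renamed process because the string "vllm::router" exists only in its comm, while plain `pgrep` (comm matching) does — the same reason the fix adds `pkill -9 "vllm::.*"` without `-f`.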

mikasenghaas and others added 2 commits April 20, 2026 04:39
Address review feedback: add vllm-router to the pkill list and the
procs-count regex so the intent is explicit, even though the broader
"vllm" patterns already match it as a substring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkill -f only matches the command line, so the vllm router's worker
processes — which set their kernel process name (comm) to "vllm::router"
via prctl but keep a different cmdline — slip through. Add process-name
pkill for "vllm" and "vllm::.*" to catch them.

Also broaden the post-cleanup procs count to look at both comm and args
(ps -eo comm,args) so we see these if any survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@S1ro1 S1ro1 left a comment


lgtm 🐐

@samsja samsja marked this pull request as ready for review April 19, 2026 23:19
@samsja samsja merged commit c3a24c3 into main Apr 19, 2026
11 checks passed

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
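
For reference, the awk in that status line just totals the per-GPU memory.used figures, which nvidia-smi emits one per line in MiB. A quick illustration on sample values (the sample stands in for real nvidia-smi output):

```shell
# Sum per-GPU memory.used values the way the status line's awk does.
# "sample" is a stand-in for nvidia-smi's one-MiB-value-per-line output.
sample=$'1024\n2048\n512'
total=$(printf '%s\n' "$sample" | awk '{s+=$1} END {print s}')
echo "gpu_mem=${total}MiB"   # gpu_mem=3584MiB
```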

Cleanup script kills its own shell via pkill -f

High Severity

The srun bash -c '<script>' approach makes the entire script text part of the bash process's /proc/<pid>/cmdline. When pkill -9 -f "python.*prime_rl" (or any subsequent pkill -f pattern) runs as a child of that bash process, it matches the parent bash shell's own cmdline — which contains every pattern as a literal substring — and sends SIGKILL to it. This kills the cleanup shell on the very first pkill -f command, making the entire cleanup block non-functional. The job continues because there's no set -e, but no stale processes are killed, no temp files are removed, and no status line is printed.
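
A minimal reproduction of the self-match, plus a common mitigation — the character-class trick. The pattern names here are hypothetical placeholders, not the PR's actual fix:

```shell
# The script text passed to `bash -c` becomes part of that shell's own
# /proc/<pid>/cmdline, so an unguarded `pgrep -f` matches the shell itself:
unguarded=$(bash -c 'pgrep -f "demo_selfmatch_xyz" >/dev/null && echo "self-match"')
echo "$unguarded"   # self-match: the shell found its own cmdline

# Mitigation: a character class. The regex [d]emo_bracket_xyz matches a
# process actually named demo_bracket_xyz, but not the literal string
# "[d]emo_bracket_xyz" sitting in our own cmdline:
guarded=$(bash -c 'pgrep -f "[d]emo_bracket_xyz" >/dev/null || echo "no self-match"')
echo "$guarded"     # no self-match
```

With `pkill` instead of `pgrep`, the unguarded variant would SIGKILL the cleanup shell on its first line — exactly the failure Bugbot describes.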

Additional Locations (2)
