feat(slurm): cleanup stale node-local state before launch #2331
Conversation
Add a pre-workload srun step to the multi-node RL, multi-node SFT, and inference sbatch templates. It runs once per node and:

- kills orphan python/torchrun/vllm/prime_rl processes left over from a prior job that wedged after scancel (SLURM doesn't always reap cleanly when a job sits in CG for hours)
- removes stale vLLM and torch IPC state under /dev/shm/vllm-*, /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*

Without this, decode engines on previously-used nodes can hang at "Waiting for READY message from DP Coordinator" because the new vLLM process finds a stale /dev/shm segment or port holder from the dead run. Symptom we hit: a fresh job timing out after 1800s because 4 decode engines never became READY; a manual pdsh cleanup of the same nodes fixed it immediately.

Each node prints one line (hostname, residual proc count, total GPU memory in use) so the sbatch log shows the nodes came up clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
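Concretely, a per-node preflight of this shape can be embedded in the templates. This is a sketch, not the exact .sbatch.j2 text: the srun flags are assumptions for illustration, while the kill/rm/status lines mirror the diff excerpts quoted in the review threads below.

```shell
# Template fragment (sketch): run the cleanup once per allocated node.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 bash -c '
    # Kill orphans left by a prior wedged job. NB: as the Bugbot review on
    # this PR notes, bare pkill -f patterns can match this shell itself.
    pkill -9 -f "prime_rl" 2>/dev/null
    pkill -9 -f "torchrun" 2>/dev/null
    pkill -9 -f "vllm" 2>/dev/null
    sleep 2
    # Drop stale vLLM / torch IPC state.
    rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* \
           /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
    # One status line per node so the sbatch log shows the node came up clean.
    procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
    gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits |
          awk "{s+=\$1} END {print s}")
    echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
```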
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
can we add vllm router processes as well?
Good call — added explicit vllm-router entries to both the pkill list and the procs-count regex in all three templates in 79e72b1. Functionally a no-op since the broader vllm pattern already matched vllm-router as a substring, but better to be explicit.
Good point — vllm::router is the kernel comm (prctl'd process name), not the cmdline, so pkill -f misses it. Just pushed 11a4e7a adding pkill -9 "vllm" and pkill -9 "vllm::.*" (no -f) to match against comm, plus broadened the procs count to ps -eo comm,args so we see these in the per-node status line.
Address review feedback: add vllm-router to the pkill list and the procs-count regex so the intent is explicit, even though the broader "vllm" patterns already match it as a substring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkill -f only matches the command line, so the vllm router's worker processes — which set their kernel process name (comm) to "vllm::router" via prctl but keep a different cmdline — slip through. Add process-name pkill for "vllm" and "vllm::.*" to catch them. Also broaden the post-cleanup procs count to look at both comm and args (ps -eo comm,args) so we see these if any survive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
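The comm/cmdline distinction is easy to demonstrate without vLLM. bash's exec -a rewrites argv[0] (what pkill/pgrep -f see) while the kernel comm stays the binary name (what plain pkill/pgrep see); that is the mirror image of vllm::router, where prctl rewrites comm and the cmdline stays. A minimal sketch, with illustrative names:

```shell
# Background a process whose argv[0] ("fake_name_demo") differs from its
# kernel comm ("sleep"). exec -a changes argv[0] only; comm is set by the
# kernel to the basename of the executed binary.
bash -c 'exec -a fake_name_demo sleep 300' &
pid=$!
sleep 0.5   # give exec a moment to complete

comm=$(ps -o comm= -p "$pid" | tr -d ' ')

# pgrep without -f matches the comm; -x requires an exact name match.
by_comm=$(pgrep -x sleep | grep -c "^${pid}\$")

# pgrep -f matches the full cmdline, where the rewritten argv[0] lives.
by_args=$(pgrep -f fake_name_demo | grep -c "^${pid}\$")

echo "comm=$comm by_comm=$by_comm by_args=$by_args"
kill "$pid"
```

Both matchers find the process, but through different names: plain pgrep only via the comm ("sleep"), pgrep -f only via the rewritten cmdline. A process that prctl's its comm to vllm::router while keeping an unrelated cmdline is invisible to pkill -f "vllm" for the same reason.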
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 11a4e7a.
procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
Cleanup script kills its own shell via pkill -f
High Severity
The srun bash -c '<script>' approach makes the entire script text part of the bash process's /proc/<pid>/cmdline. When pkill -9 -f "python.*prime_rl" (or any subsequent pkill -f pattern) runs as a child of that bash process, it matches the parent bash shell's own cmdline — which contains every pattern as a literal substring — and sends SIGKILL to it. This kills the cleanup shell on the very first pkill -f command, making the entire cleanup block non-functional. The job continues because there's no set -e, but no stale processes are killed, no temp files are removed, and no status line is printed.
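The self-match is reproducible with pgrep and a harmless marker; the character-class trick (writing, say, orpha[n] for orphan) is one common fix, since the bracketed literal in the shell's own cmdline no longer satisfies the regex. A sketch with hypothetical names, none of which are from the PR:

```shell
# A pgrep/pkill -f pattern that appears in the calling shell's own cmdline
# matches that shell too. The marker embeds $$ so the expanded pattern can
# never occur as a literal in any script text.
marker="CLEANUP_SELFMATCH_$$"

# The inner bash's cmdline contains the expanded pattern, so pgrep -f
# matches it: the same mechanism that would SIGKILL the cleanup shell.
hits=$(bash -c "pgrep -f 'orphan.*$marker' | wc -l")
echo "self-match hits=$hits"

# Fix: a character class. The regex still matches a real victim whose
# cmdline contains the plain word "orphan", but the literal 'orpha[n]' in
# the matching shell's own cmdline no longer satisfies it.
safe=$(bash -c "pgrep -f 'orpha[n].*$marker' | wc -l")
echo "safe hits=$safe"
```

With no real process carrying the marker, the first count is nonzero (the inner shell matched itself) and the second is zero, which is exactly the behavior a bracketed pattern in the cleanup step would restore.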


Summary
- Adds a pre-workload srun step to multi_node_rl.sbatch.j2, multi_node_sft.sbatch.j2, and inference.sbatch.j2 that runs once per allocated node and wipes stale state left over from prior jobs.
- Force-kills stale python/torchrun/vllm/prime_rl processes and removes IPC state under /dev/shm/vllm-*, /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*.
- Each node prints procs=N gpu_mem=MMiB on startup.

Motivation
When a SLURM job wedges in the CG (completing) state for a long time after scancel — which happens to us regularly on large multi-node RL runs that die mid-NCCL-collective — SLURM doesn't always reap every process. The next job scheduled onto the same nodes inherits:

- orphan python/torchrun/vllm/prime_rl processes
- stale /dev/shm/vllm-* segments and /tmp/torchelastic_* rendezvous files

The concrete failure mode we hit in production was decode engines hanging on startup with:

Waiting for READY message from DP Coordinator

for 1800s until the orchestrator aborted. A manual pdsh cleanup of the allocated nodes (identical to what this PR adds inside the sbatch template) fixed it immediately on the next launch, and has continued to.

The cleanup is a no-op on clean nodes — it just prints procs=1 gpu_mem=0MiB.

Notes
- The cleanup relies on #SBATCH --exclusive, which all three templates already set, so we're only killing our own user's stale processes and never touching other users' workloads.
- It runs before srun's workload launch, so it can't race with the current job's own processes.
- The patterns are deliberately broad (pkill -9 -f "vllm" etc.) — anything matching prime_rl/torchrun/vllm in the process name on an exclusively-allocated node is by definition a leftover.

🤖 Generated with Claude Code
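As belt-and-braces on top of --exclusive (an addition sketched here, not something the PR does), kills could also be scoped to the submitting user with pkill -u. A self-contained demo against a disposable marker process:

```shell
# Sketch (assumption, not in the PR): restrict kills to the current user.
me=$(id -un)

# Disposable stand-in for a stale worker, tagged with a runtime-unique
# marker so nothing else on the machine can match.
marker="stale_worker_$$"
bash -c "exec -a $marker sleep 300" &
victim=$!
sleep 0.5

# -u limits matches to this user's processes; -f matches the cmdline,
# where the marker now appears as argv[0].
pkill -9 -u "$me" -f "$marker"

wait "$victim" 2>/dev/null
status=$?                        # 128 + 9 when the process died to SIGKILL
echo "wait status=$status"
```

On an exclusively-allocated node the -u filter changes nothing, which is why the PR can omit it; on any shared machine it is the difference between cleanup and an incident.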
Note
Medium Risk
Adds an aggressive per-node preflight cleanup (pkill -9 and rm -rf in /dev/shm and /tmp) to SLURM templates; while intended for --exclusive nodes, incorrect matching or environment assumptions could still terminate unintended processes or remove needed temp state.

Overview
Adds a pre-workload srun cleanup step to the inference, multi_node_rl, and multi_node_sft SLURM sbatch templates to proactively remove stale node-local state left behind by prior jobs.

The new step force-kills leftover python/torchrun/vllm/prime_rl processes (including comm-name matches like vllm::...), deletes related IPC/temp files under /dev/shm and /tmp, and logs a per-node [node-cleanup] line with remaining process count and GPU memory usage.