
feat(slurm): cleanup stale node-local state before launch #2331

Merged
samsja merged 3 commits into main from feat/slurm-cleanup-stale-node-state
Apr 19, 2026

Conversation

@mikasenghaas mikasenghaas commented Apr 19, 2026

Summary

  • Add a pre-workload srun step to multi_node_rl.sbatch.j2, multi_node_sft.sbatch.j2, and inference.sbatch.j2 that runs once per allocated node and wipes stale state left over from prior jobs.
  • Specifically kills orphan python/torchrun/vllm/prime_rl processes and removes IPC state under /dev/shm/vllm-*, /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*.
  • Prints one line per node so the sbatch log confirms procs=N gpu_mem=MMiB on startup.
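
A sketch of what this pre-workload step can look like in one of the templates — illustrative only, assembled from the fragments quoted later in this thread rather than the PR's exact diff; the `srun` flags here are assumptions:

```shell
# Illustrative per-node preflight cleanup (not the PR's exact diff).
# Runs once per allocated node, before the workload's own srun.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 bash -c '
    # Kill orphans left by a prior wedged job; safe only because the
    # templates set #SBATCH --exclusive, so every match is ours.
    pkill -9 -f "torchrun" 2>/dev/null
    pkill -9 -f "vllm" 2>/dev/null
    pkill -9 -f "prime_rl" 2>/dev/null
    sleep 2
    # Drop stale vLLM/torch IPC state.
    rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* \
           /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
    # One status line per node for the sbatch log.
    procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
    gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
          | awk "{s+=\$1} END {print s}")
    echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
```

Note that Bugbot's review further down flags a pitfall with exactly this shape: running `pkill -f` from inside a `bash -c` whose script text contains the pattern matches the cleanup shell itself, so a deployed version needs to guard against that (e.g. run the script from a file).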

Motivation

When a SLURM job wedges in the CG (completing) state for a long time after scancel — which happens regularly on our large multi-node RL runs that die mid-NCCL-collective — SLURM doesn't always reap every process. The next job scheduled onto the same nodes inherits:

  • orphan Python/vLLM/torchrun processes still holding GPU memory and sockets, and
  • stale /dev/shm/vllm-* segments and /tmp/torchelastic_* rendezvous files.

The concrete failure mode we hit in production was decode engines hanging on startup with:

Waiting for READY message from DP Coordinator...

for 1800s until the orchestrator aborted. A manual pdsh cleanup of the allocated nodes (identical to what this PR adds inside the sbatch templates) fixed it immediately on the next launch, and has continued to do so.

The cleanup is a no-op on clean nodes — it just prints procs=1 gpu_mem=0MiB.

Notes

  • Relies on #SBATCH --exclusive, which all three templates already set, so we're only killing our own user's stale processes and never touching other users' workloads.
  • Runs before srun's workload launch, so it can't race with the current job's own processes.
  • Intentionally broad pkill -9 -f "vllm" etc. — on an exclusively allocated node, anything matching prime_rl/torchrun/vllm in the command line is by definition a leftover.

🤖 Generated with Claude Code


Note

Medium Risk
Adds an aggressive per-node preflight cleanup (pkill -9 and rm -rf under /dev/shm and /tmp) to the SLURM templates; while intended for --exclusive nodes, incorrect matching or environment assumptions could still terminate unintended processes or remove needed temp state.

Overview
Adds a pre-workload srun cleanup step to the inference, multi_node_rl, and multi_node_sft SLURM sbatch templates to proactively remove stale node-local state left behind by prior jobs.

The new step force-kills leftover python/torchrun/vllm/prime_rl processes (including comm-name matches like vllm::...), deletes related IPC/temp files under /dev/shm and /tmp, and logs a per-node [node-cleanup] line with remaining process count and GPU memory usage.

Reviewed by Cursor Bugbot for commit 11a4e7a.

Add a pre-workload srun step to the multi-node RL, multi-node SFT and
inference sbatch templates. It runs once per node and:

- kills orphan python/torchrun/vllm/prime_rl processes left over from a
  prior job that wedged after scancel (SLURM doesn't always reap cleanly
  when a job sits in CG for hours)
- removes stale vLLM and torch IPC state under /dev/shm/vllm-*,
  /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*

Without this, decode engines on previously-used nodes can hang at
"Waiting for READY message from DP Coordinator" because the new vLLM
process finds a stale /dev/shm segment or port holder from the dead run.
Symptom we hit: a fresh job timing out after 1800s because 4 decode
engines never became READY; a manual pdsh cleanup of the same nodes
fixed it immediately.

Each node prints one line (hostname, residual proc count, total GPU
memory in use) so the sbatch log shows the nodes came up clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
Collaborator


can we add vllm router processes as well?

Member Author


Good call — added explicit vllm-router entries to both the pkill list and the procs-count regex in all three templates in 79e72b1. Functionally a no-op since the broader vllm pattern already matched vllm-router as a substring, but better to be explicit.

Member Author


Good point — vllm::router is the kernel comm (prctl'd process name), not the cmdline, so pkill -f misses it. Just pushed 11a4e7a adding pkill -9 "vllm" and pkill -9 "vllm::.*" (no -f) to match against comm, plus broadened the procs count to ps -eo comm,args so we see these in the per-node status line.
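
The comm-vs-cmdline distinction can be demonstrated in a few lines of shell. This is an illustrative sketch, not code from the PR; the rename below writes /proc/self/comm, which is the same field prctl(PR_SET_NAME) updates:

```shell
# A background shell renames its kernel comm (as prctl(PR_SET_NAME)
# would); its /proc/<pid>/cmdline is unchanged. The trailing `true`
# keeps bash from exec'ing the last command, which would reset the comm.
bash -c 'printf "vllm::%s" router > /proc/$$/comm; sleep 5; true' &
pid=$!
sleep 1   # give the child time to rename itself
# Build the search pattern at runtime so it never appears literally in
# any cmdline (otherwise -f would match this script itself).
pat='vllm::rou'; pat="${pat}ter"
if pgrep -f "$pat" | grep -qx "$pid"; then cmdline_hit=yes; else cmdline_hit=no; fi
if pgrep "vllm" | grep -qx "$pid"; then comm_hit=yes; else comm_hit=no; fi
echo "cmdline match: $cmdline_hit, comm match: $comm_hit"
kill "$pid" 2>/dev/null
```

`pgrep -f` never sees the renamed process because the string "vllm::router" exists only in its comm, while plain `pgrep` (comm matching) does — the same reason the fix adds `pkill -9 "vllm::.*"` without `-f`.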

mikasenghaas and others added 2 commits April 20, 2026 04:39
Address review feedback: add vllm-router to the pkill list and the
procs-count regex so the intent is explicit, even though the broader
"vllm" patterns already match it as a substring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkill -f only matches the command line, so the vllm router's worker
processes — which set their kernel process name (comm) to "vllm::router"
via prctl but keep a different cmdline — slip through. Add process-name
pkill for "vllm" and "vllm::.*" to catch them.

Also broaden the post-cleanup procs count to look at both comm and args
(ps -eo comm,args) so we see these if any survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@S1ro1 S1ro1 left a comment


lgtm 🐐

@samsja samsja marked this pull request as ready for review April 19, 2026 23:19
@samsja samsja merged commit c3a24c3 into main Apr 19, 2026
11 checks passed

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
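
For reference, the awk in that status line just totals the per-GPU memory.used figures, which nvidia-smi emits one per line in MiB. A quick illustration on sample values (the sample stands in for real nvidia-smi output):

```shell
# Sum per-GPU memory.used values the way the status line's awk does.
# "sample" is a stand-in for nvidia-smi's one-MiB-value-per-line output.
sample=$'1024\n2048\n512'
total=$(printf '%s\n' "$sample" | awk '{s+=$1} END {print s}')
echo "gpu_mem=${total}MiB"   # gpu_mem=3584MiB
```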

Cleanup script kills its own shell via pkill -f

High Severity

The srun bash -c '<script>' approach makes the entire script text part of the bash process's /proc/<pid>/cmdline. When pkill -9 -f "python.*prime_rl" (or any subsequent pkill -f pattern) runs as a child of that bash process, it matches the parent bash shell's own cmdline — which contains every pattern as a literal substring — and sends SIGKILL to it. This kills the cleanup shell on the very first pkill -f command, making the entire cleanup block non-functional. The job continues because there's no set -e, but no stale processes are killed, no temp files are removed, and no status line is printed.
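
A minimal reproduction of the self-match, plus a common mitigation — the character-class trick. The pattern names here are hypothetical placeholders, not the PR's actual fix:

```shell
# The script text passed to `bash -c` becomes part of that shell's own
# /proc/<pid>/cmdline, so an unguarded `pgrep -f` matches the shell itself:
unguarded=$(bash -c 'pgrep -f "demo_selfmatch_xyz" >/dev/null && echo "self-match"')
echo "$unguarded"   # self-match: the shell found its own cmdline

# Mitigation: a character class. The regex [d]emo_bracket_xyz matches a
# process actually named demo_bracket_xyz, but not the literal string
# "[d]emo_bracket_xyz" sitting in our own cmdline:
guarded=$(bash -c 'pgrep -f "[d]emo_bracket_xyz" >/dev/null || echo "no self-match"')
echo "$guarded"     # no self-match
```

With `pkill` instead of `pgrep`, the unguarded variant would SIGKILL the cleanup shell on its first line — exactly the failure Bugbot describes.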

Additional Locations (2)
