Merged
Changes from 1 commit
16 changes: 16 additions & 0 deletions src/prime_rl/templates/inference.sbatch.j2
@@ -107,6 +107,22 @@ echo "INFER_URLS=${INFER_URLS}"
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang (e.g. decode engines stuck at
# "Waiting for READY message from DP Coordinator"). Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
Collaborator:

can we add vllm router processes as well?

Member Author:

Good call — added explicit vllm-router entries to both the pkill list and the procs-count regex in all three templates in 79e72b1. Functionally a no-op since the broader vllm pattern already matched vllm-router as a substring, but better to be explicit.

Member Author:

Good point — vllm::router is the kernel comm (prctl'd process name), not the cmdline, so pkill -f misses it. Just pushed 11a4e7a adding pkill -9 "vllm" and pkill -9 "vllm::.*" (no -f) to match against comm, plus broadened the procs count to ps -eo comm,args so we see these in the per-node status line.
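The comm-vs-cmdline distinction above can be checked in isolation. A minimal sketch, assuming a Linux procfs (writing `/proc/self/comm` is how `prctl(PR_SET_NAME)` surfaces there); the `vllm::router` name below is just a renamed throwaway shell, not a real router process:

```shell
# Rename a background shell's kernel comm; its cmdline is untouched.
# (the trailing ":" keeps bash from exec-ing sleep, which would reset comm)
bash -c 'echo -n "vllm::router" > /proc/self/comm; sleep 5; :' &
pid=$!
sleep 1

# pgrep -f matches /proc/<pid>/cmdline, which still begins with "bash -c",
# so an anchored pattern on the comm name finds nothing:
cmdline_hits=$(pgrep -c -f "^vllm::router" || true)

# Plain pgrep matches the comm, so the renamed process is found:
comm_hits=$(pgrep -c -x "vllm::router" || true)

echo "cmdline_hits=$cmdline_hits comm_hits=$comm_hits"
kill "$pid" 2>/dev/null
```

This is why `pkill -f` misses a `prctl`'d name while plain `pkill` catches it.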

gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
Cursor Bugbot:

Cleanup script kills its own shell via pkill -f

High Severity

The srun bash -c '<script>' approach makes the entire script text part of the bash process's /proc/<pid>/cmdline. When pkill -9 -f "python.*prime_rl" (or any subsequent pkill -f pattern) runs as a child of that bash process, it matches the parent bash shell's own cmdline — which contains every pattern as a literal substring — and sends SIGKILL to it. This kills the cleanup shell on the very first pkill -f command, making the entire cleanup block non-functional. The job continues because there's no set -e, but no stale processes are killed, no temp files are removed, and no status line is printed.

Additional Locations (2)

Reviewed by Cursor Bugbot for commit 11a4e7a.
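The failure mode Bugbot describes is easy to reproduce outside SLURM, and a common mitigation (not part of this PR) is the character-class trick, which keeps the literal pattern out of the searching shell's own cmdline. A hedged sketch with `pgrep` standing in for `pkill` so nothing is killed, and `stale_job_marker` as a made-up pattern:

```shell
# Naive: the inner shell's cmdline contains the pattern, so pgrep -f
# matches the searcher itself even though no such process exists.
# (the trailing ":" stops bash from exec-ing pgrep directly)
naive=$(bash -c 'pgrep -c -f "stale_job_marker"; :')

# Bracketed: "[s]tale_job_marker" matches the same processes as a regex,
# but the shell's cmdline now holds "[s]tale...", which that regex
# does not match, so the searcher no longer counts itself.
bracketed=$(bash -c 'pgrep -c -f "[s]tale_job_marker"; :')

echo "naive=$naive bracketed=$bracketed"
```

With `pkill -9` in place of `pgrep -c`, the naive form SIGKILLs the cleanup shell on its first line, which is exactly the non-functional-cleanup behavior reported above.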


# Run inference
srun bash -c '
cd $PROJECT_DIR
16 changes: 16 additions & 0 deletions src/prime_rl/templates/multi_node_rl.sbatch.j2
@@ -151,6 +151,22 @@ uv sync --all-extras
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang (e.g. decode engines stuck at
# "Waiting for READY message from DP Coordinator"). Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'

# Run RL
srun bash -c '
# Source environment
15 changes: 15 additions & 0 deletions src/prime_rl/templates/multi_node_sft.sbatch.j2
@@ -59,6 +59,21 @@ cd $PROJECT_DIR && uv sync --all-extras
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang. Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
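The status line's GPU total is a plain awk column sum over nvidia-smi's one-integer-per-GPU output. A standalone sketch with `printf` supplying made-up per-GPU numbers in place of `nvidia-smi`:

```shell
# --query-gpu=memory.used --format=csv,noheader,nounits emits one MiB
# integer per line; awk accumulates field 1 and prints the total.
# (The templates write awk "{s+=\$1} END {print s}" with escaped $1
# only because the srun script body is already single-quoted.)
gpu=$(printf '%s\n' 11042 11038 512 0 | awk '{s+=$1} END {print s}')
echo "gpu_mem=${gpu}MiB"   # 22592MiB for these sample values
```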

# Run SFT
srun bash -c '
# Setup environment