Merged
21 changes: 21 additions & 0 deletions src/prime_rl/templates/inference.sbatch.j2
@@ -107,6 +107,27 @@ echo "INFER_URLS=${INFER_URLS}"
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang (e.g. decode engines stuck at
# "Waiting for READY message from DP Coordinator"). Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm-router" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
# Also match prctl-set process names (kernel comm) like "vllm::router"
# which pkill -f does not see because -f only matches the cmdline.
pkill -9 "vllm" 2>/dev/null
pkill -9 "vllm::.*" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'
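The script's inline comment distinguishes the two things `pkill` can match: without `-f` it matches the kernel comm (the 15-character process name, which prctl-renamed processes like `vllm::router` carry), with `-f` it matches the full `/proc/<pid>/cmdline`. A minimal standalone sketch of that difference (not part of the PR; it uses the non-destructive `pgrep`, which follows the same matching rules as `pkill`, and a bracketed pattern so the demo cannot match its own text):

```shell
# pgrep/pkill without -f match the kernel "comm" (the process name, at most
# 15 chars); with -f they match the full /proc/<pid>/cmdline. The argument
# "12345" therefore appears only in the cmdline, never in the comm.
sleep 12345 &
pid=$!

comm=$(cat "/proc/$pid/comm")              # → "sleep"
by_cmdline=$(pgrep -f "sleep 1234[5]")     # matches: args live in the cmdline
by_comm=$(pgrep "sleep 1234[5]" || true)   # empty: the comm is just "sleep"

echo "comm=$comm cmdline-match=$by_cmdline comm-match=${by_comm:-none}"
kill "$pid"
```

This is why the cleanup needs both forms: `pkill -f "vllm"` catches anything launched with `vllm` in its arguments, while the plain `pkill "vllm"` catches processes that renamed their comm.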
Cleanup script kills its own shell via pkill -f

High Severity

The srun bash -c '<script>' approach makes the entire script text part of the bash process's /proc/<pid>/cmdline. When pkill -9 -f "python.*prime_rl" (or any subsequent pkill -f pattern) runs as a child of that bash process, it matches the parent bash shell's own cmdline — which contains every pattern as a literal substring — and sends SIGKILL to it. This kills the cleanup shell on the very first pkill -f command, making the entire cleanup block non-functional. The job continues because there's no set -e, but no stale processes are killed, no temp files are removed, and no status line is printed.
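One conventional way to avoid the self-match (a sketch of the standard idiom, not the fix that was actually merged): wrap one character of each `-f` pattern in a bracket expression. The regex still matches the real target cmdlines, but the pattern's own literal text sitting in the wrapper shell's `/proc/<pid>/cmdline` no longer matches itself:

```shell
# Bracket-expression trick: the regex "[p]ython" matches "python", but the
# literal string "[p]ython" embedded in this shell's own cmdline does NOT
# match that regex, so pkill -f no longer SIGKILLs the cleanup shell itself.
# "|| true" keeps the script going when nothing matched (pkill exits 1 then).
pkill -9 -f "[p]ython.*prime_rl" 2>/dev/null || true
pkill -9 -f "[t]orchrun"         2>/dev/null || true
pkill -9 -f "[v]llm"             2>/dev/null || true
pkill -9 -f "[p]rime_rl"         2>/dev/null || true
# The comm-based kills were never at risk: the wrapper shell's comm is "bash".
pkill -9 "[v]llm"                2>/dev/null || true
```

An alternative fix is to write the cleanup to a temporary script file and `srun` that file, so the pattern text never appears in any process's cmdline at all.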

Additional Locations (2)
Reviewed by Cursor Bugbot for commit 11a4e7a.


# Run inference
srun bash -c '
cd $PROJECT_DIR
21 changes: 21 additions & 0 deletions src/prime_rl/templates/multi_node_rl.sbatch.j2
@@ -151,6 +151,27 @@ uv sync --all-extras
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang (e.g. decode engines stuck at
# "Waiting for READY message from DP Coordinator"). Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm-router" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
# Also match prctl-set process names (kernel comm) like "vllm::router"
# which pkill -f does not see because -f only matches the cmdline.
pkill -9 "vllm" 2>/dev/null
pkill -9 "vllm::.*" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'

# Run RL
srun bash -c '
# Source environment
20 changes: 20 additions & 0 deletions src/prime_rl/templates/multi_node_sft.sbatch.j2
@@ -59,6 +59,26 @@ cd $PROJECT_DIR && uv sync --all-extras
{{ pre_run_command }}
{% endif %}

# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
# termination and cause the next launch to hang. Harmless on clean nodes.
srun bash -c '
pkill -9 -f "python.*prime_rl" 2>/dev/null
pkill -9 -f "torchrun" 2>/dev/null
pkill -9 -f "vllm-router" 2>/dev/null
pkill -9 -f "vllm" 2>/dev/null
pkill -9 -f "prime_rl" 2>/dev/null
# Also match prctl-set process names (kernel comm) like "vllm::router"
# which pkill -f does not see because -f only matches the cmdline.
pkill -9 "vllm" 2>/dev/null
pkill -9 "vllm::.*" 2>/dev/null
sleep 2
rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
procs=$(ps -eo comm,args | grep -E "python|torchrun|vllm|vllm::" | grep -v grep | wc -l)
gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
'

# Run SFT
srun bash -c '
# Setup environment