feat(slurm): cleanup stale node-local state before launch #2331
```diff
@@ -107,6 +107,22 @@ echo "INFER_URLS=${INFER_URLS}"
 {{ pre_run_command }}
 {% endif %}

+# Cleanup stale node-local state from prior jobs. Orphan python/torchrun/vllm
+# processes and vLLM/torch IPC files under /dev/shm and /tmp can survive SLURM
+# termination and cause the next launch to hang (e.g. decode engines stuck at
+# "Waiting for READY message from DP Coordinator"). Harmless on clean nodes.
+srun bash -c '
+pkill -9 -f "python.*prime_rl" 2>/dev/null
+pkill -9 -f "torchrun" 2>/dev/null
+pkill -9 -f "vllm" 2>/dev/null
+pkill -9 -f "prime_rl" 2>/dev/null
+sleep 2
+rm -rf /dev/shm/vllm-* /dev/shm/vllm_* /tmp/vllm-* /tmp/vllm_* /tmp/torch-* /tmp/torchelastic_* 2>/dev/null
+procs=$(ps -ef | grep -E "python|torchrun|vllm" | grep -v grep | wc -l)
+gpu=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk "{s+=\$1} END {print s}")
+echo "[node-cleanup] $(hostname) procs=$procs gpu_mem=${gpu}MiB"
+'

 # Run inference
 srun bash -c '
 cd $PROJECT_DIR
```

Cleanup script kills its own shell via
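The review note about the cleanup shell matching its own `pkill` pattern can be reproduced outside SLURM: a `bash -c '... pkill -9 -f "vllm" ...'` shell carries the string `vllm` in its own cmdline, so the `-f` pattern matches the shell itself. Below is a minimal sketch assuming procps `pgrep` on Linux; the `vllm-dummy` process name and the `[v]llm` bracket-trick pattern are illustrative, not part of the PR.

```shell
# The bracket trick rewrites the pattern as "[v]llm": as a regex it still
# matches "vllm", but the literal text "[v]llm" in the probing shell's own
# cmdline does not contain the substring "vllm", so the shell is excluded.

# Dummy stand-in for a vLLM engine: argv[0] renamed to contain "vllm".
bash -c 'exec -a vllm-dummy sleep 30' &
dummy=$!
sleep 1

# Naive pattern: counts the dummy AND the probing bash -c shell itself
# (the trailing ":" keeps bash from exec-replacing itself with pgrep).
naive=$(bash -c 'pgrep -cf "vllm"; :')

# Bracket trick: the probing shell no longer matches its own pattern.
safe=$(bash -c 'pgrep -cf "[v]llm"; :')

echo "naive=$naive safe=$safe"
kill "$dummy" 2>/dev/null
```

The same rewrite would apply to each `pkill -9 -f` line in the template if the self-kill turns out to matter in practice (with `-9` the shell dies before reaching the `rm -rf` and status lines).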


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add vllm router processes as well?
Good call — added explicit `vllm-router` entries to both the pkill list and the procs-count regex in all three templates in 79e72b1. Functionally a no-op since the broader `vllm` pattern already matched `vllm-router` as a substring, but better to be explicit.
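The substring point is just unanchored regex matching; a one-line illustration (the process names here are examples, not taken from the PR):

```shell
# An unanchored ERE matches anywhere in the candidate string, so the
# pattern "vllm" already covers "vllm-router"; the explicit entry only
# documents intent. grep -Ec counts matching lines.
hits=$(printf '%s\n' vllm-router torchrun | grep -Ec "vllm")
echo "$hits"
```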
Good point — `vllm::router` is the kernel comm (prctl'd process name), not the cmdline, so `pkill -f` misses it. Just pushed 11a4e7a adding `pkill -9 "vllm"` and `pkill -9 "vllm::.*"` (no `-f`) to match against comm, plus broadened the procs count to `ps -eo comm,args` so we see these in the per-node status line.
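The comm-vs-cmdline distinction the fix relies on can be sketched standalone. This is Linux-only and assumes procps `pgrep` plus a writable `/proc/self/comm` (which shell code can use in place of `prctl(PR_SET_NAME)`); the renamed comm here is a made-up demo name, not vLLM's.

```shell
# A process can rename its kernel comm without touching its cmdline.
# Plain pgrep/pkill match against comm; the -f flag matches against the
# cmdline, so a renamed process slips past -f patterns entirely.

# Background shell that renames its own comm. The name is assembled from
# adjacent quoted parts so the contiguous string never appears in any
# cmdline (which would contaminate the -f probe below).
bash -c 'printf "%s" "rtr""-dummy" > /proc/self/comm; sleep 30; :' &
renamed=$!
sleep 1

pat='rtr'; pat="${pat}-dummy"    # same assembly trick for the probe pattern
comm_hits=$(pgrep -c "$pat")     # comm match: finds the renamed shell
cmd_hits=$(pgrep -cf "$pat")     # cmdline match: finds nothing

echo "comm_hits=$comm_hits cmd_hits=$cmd_hits"
kill "$renamed" 2>/dev/null
```

Note that comm is capped at 15 characters, so longer comm patterns (or `pgrep` without `-f` against long names) silently truncate — one more reason the template reports via `ps -eo comm,args` rather than relying on a single matching mode.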