[bug/perf] V4-Pro hangs ~60 min in post-shard-load weight materialization without --safetensors-load-strategy prefetch on EXT4
Summary
On an 8-Spark cluster (TP=8, 8× NVIDIA GB10 / sm_121 / 128 GiB unified memory each, EXT4 on local NVMe), serving deepseek-ai/DeepSeek-V4-Pro (1.6T MoE, 805 GiB checkpoint, ~102 GiB / rank shard) deterministically stalls in the post-shard-load weight materialization phase without --safetensors-load-strategy prefetch. With the flag, full cold-start completes in ~12 minutes and inference is coherent.
Repro
Image: built from nvcr.io/nvidia/pytorch:25.11-py3 + jasl/vllm@ds4-sm120-prototype + jasl/DeepGEMM@sm120 + tonyliu312's PR #40923 sm_12x Marlin patch + TORCH_CUDA_ARCH_LIST="12.0;12.1".
Observed: each rank reports Loaded shard X/8 (XX.X GiB) within ~3-5 minutes, then 5 of 8 workers complete weight materialization in ~6-9 min while 3 of 8 workers stall in safetensors._safetensors_rust.safe_open random reads for >60 min (no progress in iotop, but ~30-50 MB/s random-read load on the NVMe). The collective barrier never closes; vllm serve never reports Application startup complete.
Adding --safetensors-load-strategy prefetch to the same command unblocks it; full cold-start in ~12 min, coherent inference, ~7 tok/s decode.
Why I think prefetch should be the default for large models on POSIX filesystems
The current default appears to be "let safetensors lazy-load tensors via mmap, faulting in pages on demand." For V4-Pro this means each rank does ~13,000 random reads scattered across a 102 GiB shard during post-load weight registration (likely the quantization_config / per-expert pointer fixup phase). On EXT4, where no readahead window is large enough to cover MoE expert layouts, this shows up as multi-threaded random-read storms and I/O contention. Three of the eight ranks land in the latency tail and do not finish within an order of magnitude of the median.
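For a rough sense of scale, here is a back-of-the-envelope sketch of how far apart those reads are on average. It assumes the reads are spread evenly over the shard, and it assumes a 128 KiB readahead window (a common Linux read_ahead_kb default, not measured on this cluster). The function name is made up for illustration.

```python
GIB = 1024**3
MIB = 1024**2
KIB = 1024


def average_gap_bytes(shard_bytes: int, random_reads: int) -> float:
    """Average distance between reads if they were spread evenly over the shard."""
    return shard_bytes / random_reads


shard_bytes = 102 * GIB       # per-rank shard size from the report
random_reads = 13_000         # approximate per-rank read count from the report
readahead_bytes = 128 * KIB   # ASSUMPTION: typical default, not measured here

gap = average_gap_bytes(shard_bytes, random_reads)
print(f"average gap between reads: {gap / MIB:.1f} MiB")
print(f"gap / readahead window:    {gap / readahead_bytes:.0f}x")
```

Under those assumptions the average gap is around 8 MiB, far larger than the readahead window, which fits the picture of nearly every read being a fresh disk fault.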
Prefetch sidesteps this by sequentially streaming each shard fully into the page cache before materialization. The cost is ~one extra disk-bandwidth-bound pass; the benefit is eliminating per-tensor disk-fault tail latency.
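To make the idea concrete, here is a minimal sketch of warming the page cache with one sequential pass over a shard file. This is not vLLM's actual prefetch implementation; the function name and the chunk size are assumptions chosen for illustration.

```python
import os


def prefetch_into_page_cache(path: str, chunk_bytes: int = 16 * 1024 * 1024) -> int:
    """Read `path` once, front to back, so its pages land in the OS page cache.

    Returns the number of bytes read. The data itself is discarded; the point
    is only the side effect of populating the page cache sequentially.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint to the kernel that access will be sequential (Unix only).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```

Later mmap page faults on the same file can then be served from memory instead of disk, provided the pages have not been evicted in between.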
Suggested heuristic
A conservative auto-default could be:
if (
    estimated_per_rank_shard_bytes > 50 * 1024**3  # 50 GiB
    and checkpoint_filesystem_type in {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}
    and not _running_inside_tmpfs()
):
    default_load_strategy = "prefetch"
The threshold is the part I'm least sure about — happy to gather data points from V4-Flash (~37 GiB / rank, doesn't seem to need it) and intermediate sizes if useful.
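A self-contained sketch of how the heuristic could be evaluated on Linux is below. All names here (filesystem_type, choose_load_strategy, DISK_FILESYSTEMS) are hypothetical and not part of vLLM. The 50 GiB threshold is the tentative value from above. The tmpfs check is folded into the filesystem allow-list, since tmpfs is not in the set. Mount points containing escaped characters (for example spaces written as \040 in /proc/mounts) are not handled.

```python
import os

GIB = 1024**3
DISK_FILESYSTEMS = {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}


def filesystem_type(path, mounts_file="/proc/mounts"):
    """Return the fs type of the most specific mount containing `path` (Linux)."""
    real = os.path.realpath(path)
    best_mount, best_type = "", None
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            if real == mount_point or (real + "/").startswith(prefix):
                # Longest mount point wins; ties go to the later entry.
                if len(mount_point) >= len(best_mount):
                    best_mount, best_type = mount_point, fs_type
    return best_type


def choose_load_strategy(per_rank_shard_bytes, fs_type, threshold_bytes=50 * GIB):
    """Return "prefetch" when the heuristic fires, else None (keep the default)."""
    if per_rank_shard_bytes > threshold_bytes and fs_type in DISK_FILESYSTEMS:
        return "prefetch"
    return None
```

With the sizes reported here, choose_load_strategy(102 * GIB, "ext4") selects prefetch for V4-Pro, while choose_load_strategy(37 * GIB, "ext4") leaves V4-Flash on the existing default.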
Workaround documented
Until a better default lands, I've added a # REQUIRES --safetensors-load-strategy prefetch comment to our V4-Pro recipe template. Production V4-Pro has been stable with this flag across ~30 cold-start cycles.
cc @tonyliu312 (asked me to file this separately from PR #40899 thread)
Bonus data point from the same cluster
Running V4-Flash (~37 GiB / rank shard) on the same 8-Spark setup without prefetch does not reproduce the hang: it loads in ~290s via the default lazy mmap path, with no straggler and no failure. So the threshold is somewhere between V4-Flash's per-rank size and V4-Pro's. With the InstantTensor loader (--load-format instanttensor), V4-Flash loads in ~24s (~12× speedup vs prefetch), and the question of which default to pick, mmap or prefetch, becomes moot. But as long as InstantTensor isn't the universal default, the prefetch heuristic above seems like a reasonable safety net for the 100 GiB/rank+ regime.
Launch command for the repro (8 nodes, TP=8, Ray distributed executor):
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  -tp 8 --pipeline-parallel-size 1 --max-model-len 1024 \
  --gpu-memory-utilization 0.88 \
  --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice --reasoning-parser deepseek_v4 \
  --distributed-executor-backend ray --enforce-eager  # NOTE: NO --safetensors-load-strategy
Environment
v0.1.dev1+g1523228e6.d20260427 (jasl/vllm ds4-sm120-prototype + PR #40923 "[Kernel] Marlin MoE: include SM 12.x in default arch list" patch)
nvcr.io/nvidia/pytorch:25.11-py3 base
deepseek-ai/DeepSeek-V4-Pro (805 GiB FP8/MXFP4 mixed quantization)