
[bug/perf] V4-Pro hangs ~60 min in post-shard-load weight materialization without --safetensors-load-strategy prefetch on EXT4 #40988

Description

@idonati

Summary

On an 8-Spark cluster (TP=8, 8× NVIDIA GB10 / sm_121 / 128 GiB unified memory each, EXT4 on local NVMe), serving deepseek-ai/DeepSeek-V4-Pro (1.6T MoE, 805 GiB checkpoint, ~102 GiB / rank shard) deterministically stalls in the post-shard-load weight materialization phase without --safetensors-load-strategy prefetch. With the flag, full cold-start completes in ~12 minutes and inference is coherent.

Repro

Image: built from nvcr.io/nvidia/pytorch:25.11-py3 + jasl/vllm@ds4-sm120-prototype + jasl/DeepGEMM@sm120 + tonyliu312's PR #40923 sm_12x Marlin patch + TORCH_CUDA_ARCH_LIST="12.0;12.1".

Launch (8 nodes, TP=8, Ray distributed executor):

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  -tp 8 --pipeline-parallel-size 1 --max-model-len 1024 \
  --gpu-memory-utilization 0.88 \
  --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice --reasoning-parser deepseek_v4 \
  --distributed-executor-backend ray --enforce-eager
  # NOTE: NO --safetensors-load-strategy

Observed: each rank reports Loaded shard X/8 (XX.X GiB) within ~3-5 minutes. After that, 5 of 8 workers complete weight materialization in ~6-9 min, while the other 3 stall in safetensors._safetensors_rust.safe_open random reads for >60 min (iotop shows no progress, yet the NVMe carries ~30-50 MB/s of random-read load). The collective barrier never closes, and vllm serve never reports Application startup complete.

Adding --safetensors-load-strategy prefetch to the same command unblocks it; full cold-start in ~12 min, coherent inference, ~7 tok/s decode.

Why I think prefetch should be the default for large models on POSIX filesystems

The current default appears to be "let safetensors lazy-load tensors via mmap, faulting in pages on demand." For V4-Pro this means each rank does ~13,000 random reads scattered across a 102 GiB shard during post-load weight registration (likely the quantization_config / per-expert pointer fixup phase). On EXT4, where no readahead window is large enough to cover MoE expert layouts, this shows up as multi-threaded random-read storms and contention on the single NVMe device. Three of eight ranks land in the latency tail and do not finish within an order of magnitude of the median load time.
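As a rough sanity check on the numbers above, here is a back-of-envelope sketch. It assumes, purely for illustration, that a straggler ends up faulting in most of its ~102 GiB shard at the observed ~30-50 MB/s, and that those rates are decimal megabytes per second; neither assumption was measured directly.

```python
GIB = 1024**3
MB = 10**6  # assumption: the reported MB/s figures are decimal megabytes


def minutes_to_read(shard_gib: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read shard_gib GiB at a sustained rate_mb_per_s MB/s."""
    return shard_gib * GIB / (rate_mb_per_s * MB) / 60


# Per-rank shard size and random-read rates taken from the report above.
for rate in (30, 50):
    print(f"{rate} MB/s -> {minutes_to_read(102, rate):.0f} min")
```

Under those assumptions the estimate comes out at roughly 37 to 61 minutes per shard, which is in the same range as the >60 min stalls. It does not explain why only three ranks are affected.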

Prefetch sidesteps this by sequentially streaming each shard fully into the page cache before materialization. The cost is ~one extra disk-bandwidth-bound pass; the benefit is eliminating per-tensor disk-fault tail latency.
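To make the idea concrete, here is a minimal sketch of what "stream the shard into the page cache" can look like. This is not vLLM's implementation; the function name and chunk size are hypothetical, and it only illustrates one sequential pass over a file whose data is discarded.

```python
import os


def prefetch_into_page_cache(path: str, chunk_bytes: int = 16 * 1024 * 1024) -> int:
    """Read `path` front to back, discard the data, and return the bytes read.

    The read itself is what populates the OS page cache; nothing is kept
    in process memory beyond one reusable chunk buffer.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        # Hint sequential access where the platform supports it (e.g. Linux).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        buf = bytearray(chunk_bytes)
        while True:
            n = f.readinto(buf)
            if not n:  # 0 bytes read means end of file
                break
            total += n
    return total
```

Calling this on each safetensors shard before opening it would turn the later per-tensor page faults into page-cache hits, provided the node has enough free memory to keep the shard cached.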

Suggested heuristic

A conservative auto-default could be:

if (estimated_per_rank_shard_bytes > 50 * 1024**3  # 50 GiB
        and checkpoint_filesystem_type in {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}
        and not _running_inside_tmpfs()):
    load_strategy = "prefetch"
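For illustration, the heuristic above could be sketched in standalone Python as follows. The function names are hypothetical and none of this is existing vLLM code. The filesystem type is looked up from text in /proc/mounts format (Linux only), and the 50 GiB threshold is the tentative value from this issue.

```python
import os
from typing import Optional

GIB = 1024**3
DISK_BACKED_FS = {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the fs type of the longest mount point that contains `path`.

    `mounts_text` uses the /proc/mounts layout: device, mount point, fs type, ...
    Mount points containing escaped spaces (\\040) are not handled here.
    """
    path = os.path.abspath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), fs_type
    return best_type


def should_prefetch(shard_bytes: int, fs_type: Optional[str],
                    threshold_bytes: int = 50 * GIB) -> bool:
    """True when the per-rank shard is large and lives on a disk-backed fs."""
    return shard_bytes > threshold_bytes and fs_type in DISK_BACKED_FS
```

On a Linux node this could be called as should_prefetch(shard_bytes, filesystem_type(ckpt_dir, open("/proc/mounts").read())). Because tmpfs is not in the allow-list, a separate tmpfs check becomes redundant once the type is resolved for the checkpoint path itself.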

The threshold is the part I'm least sure about — happy to gather data points from V4-Flash (~37 GiB / rank, doesn't seem to need it) and intermediate sizes if useful.

Workaround documented

Until a better default lands, I've added a # REQUIRES --safetensors-load-strategy prefetch comment to our V4-Pro recipe template. Production V4-Pro has been stable with this flag across ~30 cold-start cycles.

Environment

  • Hardware: 8× NVIDIA DGX Spark (GB10, sm_121, 128 GiB unified memory, ARM64), dual-rail 200G RoCE multi-switch fabric
  • Storage: per-node local NVMe, EXT4, single-disk (no RAID)
  • vLLM version: v0.1.dev1+g1523228e6.d20260427 (jasl/vllm ds4-sm120-prototype + patch from PR #40923, "[Kernel] Marlin MoE: include SM 12.x in default arch list")
  • Container: nvcr.io/nvidia/pytorch:25.11-py3 base
  • Model: deepseek-ai/DeepSeek-V4-Pro (805 GiB FP8/MXFP4 mixed quantization)

cc @tonyliu312 (asked me to file this separately from PR #40899 thread)


Bonus data point from the same cluster

Running V4-Flash (~37 GiB / rank shard) on the same 8-Spark setup without prefetch does not reproduce the hang: it loads in ~290s via the default lazy mmap path, with no straggler and no failure. So the threshold is somewhere between V4-Flash's per-rank size and V4-Pro's. With the InstantTensor loader (--load-format instanttensor), V4-Flash loads in ~24s (~12× faster than the ~290s lazy mmap load), and the question of whether to default to mmap or prefetch becomes moot. But as long as InstantTensor isn't the universal default, the prefetch heuristic above seems like a reasonable safety net for the 100+ GiB/rank regime.
