# [bug/perf] V4-Pro hangs ~60 min in post-shard-load weight materialization without `--safetensors-load-strategy prefetch` on EXT4

## Summary

On an 8-Spark cluster (TP=8, 8× NVIDIA GB10 / sm_121 / 128 GiB unified memory each, EXT4 on local NVMe), serving `deepseek-ai/DeepSeek-V4-Pro` (1.6T-parameter MoE, 805 GiB checkpoint, ~102 GiB / rank shard) **deterministically stalls** for more than 60 minutes in the post-shard-load weight materialization phase unless `--safetensors-load-strategy prefetch` is set. With the flag, a full cold start completes in ~12 minutes and inference is coherent.

## Repro

Image: built from `nvcr.io/nvidia/pytorch:25.11-py3` + jasl/vllm@ds4-sm120-prototype + jasl/DeepGEMM@sm120 + tonyliu312's PR #40923 sm_12x Marlin patch + `TORCH_CUDA_ARCH_LIST="12.0;12.1"`.

Launch (8 nodes, TP=8, Ray distributed executor):

```bash
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  -tp 8 --pipeline-parallel-size 1 --max-model-len 1024 \
  --gpu-memory-utilization 0.88 \
  --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice --reasoning-parser deepseek_v4 \
  --distributed-executor-backend ray --enforce-eager
  # NOTE: NO --safetensors-load-strategy
```

Observed: each rank reports `Loaded shard X/8 (XX.X GiB)` within ~3-5 minutes, then 5 of 8 workers complete weight materialization in ~6-9 min while **3 of 8 workers stall in `safetensors._safetensors_rust.safe_open` random reads for >60 min** (no progress in `iotop`, but ~30-50 MB/s random-read load on the NVMe). The collective barrier never closes; `vllm serve` never reports `Application startup complete`.

Adding `--safetensors-load-strategy prefetch` to the same command unblocks it; full cold-start in ~12 min, coherent inference, ~7 tok/s decode.

## Why I think prefetch should be the default for large models on POSIX filesystems

The current default appears to be "let `safetensors` lazy-load tensors via mmap, faulting in pages on demand." For V4-Pro this means each rank does ~13,000 random reads scattered across a 102 GiB shard during post-load weight registration (likely the `quantization_config` / per-expert pointer fixup phase). On EXT4, where no readahead window is large enough to cover MoE expert layouts, this shows up as multi-threaded random-read storms and I/O queue contention. Three of eight ranks happen to draw a tail-latency distribution that does not finish within an order of magnitude of the median.
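To make the suspected access pattern concrete, here is a minimal sketch of on-demand mmap reads at scattered offsets. It only illustrates the pattern and is not the `safetensors` or vLLM code path; the function name, the read length, and the random offsets are hypothetical.

```python
import mmap
import random

def scattered_mmap_reads(path, num_reads, read_len=4096, seed=0):
    """Touch `num_reads` random regions of `path` through a read-only mmap.

    Each slice makes the kernel fault in the backing pages on demand, which
    becomes a random disk read whenever those pages are not cached yet.
    Returns the total number of bytes touched.
    """
    rng = random.Random(seed)
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            for _ in range(num_reads):
                offset = rng.randrange(0, max(1, size - read_len))
                touched += len(mm[offset:offset + read_len])
    return touched
```

Timing this against a cold shard file (page cache dropped first) versus a warm one is one way to check whether the per-read fault latency matches the stall seen on the slow ranks.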

Prefetch sidesteps this by sequentially streaming each shard fully into the page cache before materialization. The cost is ~one extra disk-bandwidth-bound pass; the benefit is eliminating per-tensor disk-fault tail latency.
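As a rough illustration of the idea (not vLLM's actual prefetch implementation, which I have not traced), a sequential pass that warms the page cache can be sketched like this. `warm_page_cache` and the 16 MiB chunk size are assumptions made for the example.

```python
def warm_page_cache(path, chunk_bytes=16 * 1024 * 1024):
    """Read `path` front to back so its pages end up in the OS page cache.

    The data itself is discarded; only the side effect of caching matters.
    Returns the number of bytes read.
    """
    total = 0
    buf = memoryview(bytearray(chunk_bytes))  # reused buffer, no per-chunk allocation
    # buffering=0 gives a raw FileIO, so reads are not copied through a Python-level buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            total += n
    return total
```

Large sequential reads like this let the kernel's readahead work as intended, so later mmap faults on the same file are served from memory as long as the shard fits in the page cache.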

## Suggested heuristic

A conservative auto-default could be:

```python
# Pseudocode sketch: these names are illustrative placeholders.
if (estimated_per_rank_shard_bytes > 50 * 1024**3  # 50 GiB
        and checkpoint_filesystem_type in {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}
        and not _running_inside_tmpfs()):
    load_strategy = "prefetch"
```
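A self-contained version of that check might look like the sketch below. It assumes Linux and reads `/proc/mounts` to find the filesystem type. `filesystem_type` and `choose_load_strategy` are hypothetical helpers, not existing vLLM functions, and returning `None` stands for "leave the current default unchanged". Because tmpfs is not in the allow-list, the separate tmpfs test is folded into the filesystem check.

```python
import os

GIB = 1024 ** 3
# Filesystem types from the allow-list proposed above.
DISK_BACKED_FS = {"ext2", "ext3", "ext4", "xfs", "btrfs", "zfs"}

def filesystem_type(path, mounts_file="/proc/mounts"):
    """Return the fs type of the longest mount point containing `path` (Linux only).

    Mount points containing escaped characters (e.g. spaces) are not handled.
    """
    real = os.path.realpath(path)
    best_len, best_type = -1, None
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            # ">=" lets a later mount on the same mount point win.
            if (real + "/").startswith(prefix) and len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type

def choose_load_strategy(per_rank_shard_bytes, fs_type, threshold_bytes=50 * GIB):
    """Return "prefetch" for large shards on disk-backed filesystems, else None."""
    if per_rank_shard_bytes > threshold_bytes and fs_type in DISK_BACKED_FS:
        return "prefetch"
    return None
```

With the sizes from this issue, `choose_load_strategy(102 * GIB, "ext4")` selects `"prefetch"`, while V4-Flash at ~37 GiB / rank stays on the default.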

The threshold is the part I'm least sure about. I'm happy to gather data points from V4-Flash (~37 GiB / rank, doesn't seem to need it) and intermediate sizes if useful.

## Workaround documented

Until a better default lands, I've added a `# REQUIRES --safetensors-load-strategy prefetch` comment to our V4-Pro recipe template. Production V4-Pro has been stable with this flag across ~30 cold-start cycles.

## Environment

- Hardware: 8× NVIDIA DGX Spark (GB10, sm_121, 128 GiB unified memory, ARM64), dual-rail 200G RoCE multi-switch fabric
- Storage: per-node local NVMe, EXT4, single-disk (no RAID)
- vLLM version: `v0.1.dev1+g1523228e6.d20260427` (jasl/vllm `ds4-sm120-prototype` + PR #40923 patch)
- Container: `nvcr.io/nvidia/pytorch:25.11-py3` base
- Model: `deepseek-ai/DeepSeek-V4-Pro` (805 GiB FP8/MXFP4 mixed quantization)

cc @tonyliu312 (asked me to file this separately from PR #40899 thread)

---

## Bonus data point from the same cluster

Running V4-Flash (~37 GiB / rank shard) on the same 8-Spark setup *without* `prefetch` does **not** reproduce the hang: it loads in ~290 s via the default lazy mmap path, with no straggler and no failure. So the threshold is somewhere between V4-Flash's per-rank size and V4-Pro's. With the InstantTensor loader (`--load-format instanttensor`), V4-Flash loads in ~24 s (~12× faster than the ~290 s default path), and the choice between mmap and prefetch as the default becomes moot. But as long as InstantTensor isn't the universal default, the prefetch heuristic above seems like a reasonable safety net for the 100+ GiB/rank regime.

