[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes by jasl · Pull Request #41834 · vllm-project/vllm

jasl · 2026-05-06T15:17:15Z

Summary

This PR enables DeepSeek V4 Flash on SM120/SM121 Blackwell client hardware by carrying the SM12x fallback and tuning stack needed for the current vLLM V1 path. It is intended for RTX PRO 6000 Blackwell Workstation Edition, RTX 5090-class SM120, and GB10 / DGX Spark SM121 users who cannot use SM100-only TMEM / tcgen05 kernels.

As of 2026-06-23 this branch is reconciled on top of the merged #43477 and provides the stock-deps path: DeepSeek V4 on SM120/121 that builds and serves on released FlashInfer / DeepGEMM wheels, complementing #43477's route that needs the unreleased FlashInfer #3395 + DeepGEMM #324 dependency branches. As of 2026-06-26 the branch is additionally synced onto current upstream/main (198 commits since the #43477 merge); latest validated head is tag sm120-pr-41834-stable-preview-20260626 (c766cbc6ff). See Update 2026-06-26 — upstream sync and Update 2026-06-23 — #43477 reconciliation below.

Change footprint — model kernels vs. core-vLLM touch points

The branch splits cleanly into model/kernel code and a small set of core-vLLM integration points (116 files, +15.5k/−0.4k, of which ~+3.5k is tests):

DeepSeek-V4 model + SM12x kernels (~49 files, ~+10.2k) — the enablement itself. Everything under vllm/models/deepseek_v4/** plus the SM12x sparse-MLA decode / indexer / DeepGEMM kernels that live in shared dirs (v1/attention/backends/mla/sparse_mla_kernels.py, model_executor/layers/sparse_attn_indexer.py, v1/attention/backends/mla/{indexer,sparse_swa}.py, utils/deep_gemm.py, kernels/mhc/tilelang.py), the new DSv4 reasoning parser / tokenizer, and device tuning JSONs.
C128A metadata device→host sync removed (models/deepseek_v4/sparse_mla.py, perf, 2026-06-26) — _c128a_effective_topk_width now takes the max position from the CPU-side CommonAttentionMetadata.max_seq_len instead of a per-step int(positions.max().item()) device sync, dropping a launch-stream stall on every C128A metadata step (reported via gdb native stacks by a GB10/TP4 user). Decode is identical (max_seq_len-1 == positions.max()); only chunked prefill sees a safe, slightly-wider 128-aligned top-k.
Core-vLLM integration (~36 files, +1.8k/−0.2k) — the hooks below. Almost all are gated by model architecture / quant config / an env flag and are inert for other models.

Subsystem	Files (≈ lines)	What it does
KV-cache core	`single_type_kv_cache_manager.py` (+243), `kv_cache_coordinator.py` (+67), `kv_cache_manager.py` (+60), `sched/scheduler.py` (+1)	prefix-cache correctness for DSv4 sparse-MLA + MTP: an MLA cache-manager with prompt-block protection, a hybrid-coordinator `cache_blocks` tail-block-reuse rewrite. (Our earlier `block_pool` stale-hash reset is dropped in the reconcile — subsumed by upstream's own unconditional reset, which arrived via the `upstream/main` merge.)
MTP spec-decode	`v1/spec_decode/llm_base_proposer.py` (+173)	DSv4 MTP probabilistic draft sampling + per-step MTP-layer routing in the shared proposer base
MoE quantization	`fused_moe.py` (+65), `oracle/mxfp4.py` (+43), `routed_experts.py` (+33), `experts/flashinfer_cutlass_moe.py` (+27), `quantization/mxfp4.py` (+12), `oracle/nvfp4.py` (+1)	MXFP4 / NVFP4 backend selection; the one-line NVFP4 fix (FLASHINFER_CUTLASS into the SwiGLU-clamp allow-list) lets DSv4-Flash-NVFP4 serve
FP8 / Marlin GEMM	`quantization/utils/fp8_utils.py` (+99), `linear/scaled_mm/{cutlass,marlin}.py` (+45/+16), `csrc/.../marlin_moe_wna16/ops.cu` (+10, the only C++)	SM12x e8m0→fp32 upcast + Marlin MoE SM12.0a cudagraph hardening (mirrors open upstream #43730 / #43722)
cudagraph / compile / config	`config/vllm.py` (+44), `compilation/breakable_cudagraph.py` (+22), `passes/utility/fix_functionalization.py` (+12), `config/compilation.py` (+11)	breakable-cudagraph auto-enable gate (MiniMax-only; DSv4 deliberately excluded), DSv4 custom-op defunctionalization + splitting-op registration
OpenAI entrypoints / parsers	`chat_completion/protocol.py` (+101), `serve/render/serving.py` (+28), `tool_parsers/structural_tag_registry.py` (+16), `chat_utils.py` (+11), `engine/protocol.py` (+9), `chat_completion/{serving,batch_serving}.py` (+8/+6), `reasoning/__init__.py` (+4)	expose DSv4 API semantics — `reasoning_content` / `thinking` param / tool-call streaming (jasl#19 instruction-following)
Kernel warmup	`model_executor/warmup/kernel_warmup.py` (+617)	additive DSv4 warmup (D512-split prefill precompile + MTP) to avoid JIT-during-inference wedges
Weight loading	`weight_utils.py` (+43), `default_loader.py` (+16)	fast-safetensors weight filter + EP-skip (lowers DSv4 load overhead on GB10)
env / utils	`envs.py` (+63), `utils/flashinfer.py` (+16), `utils/import_utils.py` (+9), `v1/worker/{gpu_model_runner,ubatch_utils}.py` (+12/+12)	`VLLM_DEEPSEEK_V4_*` flags + `has_cutedsl` / `has_flashinfer_trtllm_sparse_mla` probes

Two notes for review:

The most invasive generic edits were removed in the 2026-06-21 audit cleanup (below): the scheduler now carries a single +1-line change (the prefill-fairness heuristics were dropped) and the prefix-cache write-fence is gone.
A few hooks do touch code paths shared with non-DSv4 models and are the ones worth a closer look: the kv_cache_coordinator cache_blocks rewrite (affects hybrid-KV models; validated ≥ prior behavior), the MTP proposer base-class change, and the OpenAI-entrypoint plumbing. Everything else (MoE oracle, fp8_utils, cudagraph gate, warmup, envs) is arch / quant / env-gated and inert for other models.

Duplicate-work check

Open PR search was refreshed on 2026-06-12 for SM120 / SM12x / DeepSeek V4 / GB10 terms. The nearest open PRs are related but not duplicates:

PR	Difference
#43477	Merged 2026-06-22. Enables DeepSeek V4 + GLM-5.1 on SM120 via the FlashInfer-SM120 sparse-MLA route, but on its merged form requires the unreleased FlashInfer #3395 + DeepGEMM #324 dependency branches — on released/stock wheels its SM12x path raises at model construction (and the pinned DeepGEMM ref asserts on SM120). This PR is now reconciled on top of #43477 (merge `42657aca65`) and carries the stock-deps DSv4 SM120/121 path that runs on released wheels, complementing #43477's fork-deps route. See Update 2026-06-23 — #43477 reconciliation below.
#40929	Earlier WIP Triton fallback effort. This PR is the maintained replacement branch with the broader scheduler, prefix-cache, parser, quant, warmup, and harness-validated fixes carried forward.
#42856	Focused workspace-bound fix that explicitly depends on / references this PR; it is a subset-style bugfix, not the full DeepSeek V4 SM12x enablement branch.

Fixed preview tags

These tags are in jasl/vllm and give users stable pins while the PR is still moving:

Tag	Commit	Use
`sm120-pr-41834-stable-preview-20260626`	`c766cbc6ff`	latest validated head: synced onto `upstream/main` (198 commits since the #43477 merge; our NVFP4 `FLASHINFER_CUTLASS`-clamp fix landed upstream as #46492 → fork patch dropped) + the C128A metadata device-sync removal. 6 conflicts resolved; upstream's new `cooperative_topk` (#43008) gated off SM12x (capability family 120) to keep the validated decode path byte-identical. Validated dual-arch — RTX SM120 GSM8K-200 0.97 + #19 PASS; GB10 SM121 GSM8K-200 0.945 + arthur 64/64 + Forum53 PASS + llama-benchy prefill +80% / decode flat. See Update 2026-06-26 below.
`sm120-pr-41834-stable-preview-20260623`	`f7b4b425b0`	reconciled on top of the now-merged #43477 (merge `42657aca65` of `upstream/main`), keeping the stock-deps DSv4 SM120/121 path that runs on released wheels. Two reconcile regressions fixed — DeepGEMM no longer auto-enabled on SM120 (`a94657e601`, the pinned ref asserts), and #43477's prefill-SWA launch is gated + the kernel OOB clamped (`f7b4b425b0`). Validated dual-arch (see Update 2026-06-23 — #43477 reconciliation below).
`sm120-pr-41834-stable-preview-20260622b`	`5ba0f19f02`	pre-reconcile head: the 2026-06-21 audit head + two long-context (256k+) crash fixes — (1) packed-prefill output now sliced symmetrically with the query under MTP/cudagraph padding (fixes `output.size(0)==num_tokens (84 vs 83)` when `VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1`), (2) MTP draft logits cast to float32 before top-k/top-p sampling (fixes an engine-killing assertion on any MTP + non-greedy sampling, gate-independent). See Update 2026-06-22 below.
`sm120-pr-41834-stable-preview-20260621`	`72261a7af149fa5d3fe2ed2b9956e92590731012`	post-audit cleanup head — breakable-cudagraph default OFF (MiniMax-only gate), long-context recall fixed by the int64 block-offset cast (the redundant write-fence + scheduler prefill-fairness heuristics + NVFP4 b12x lever removed), on top of the jasl#19 + #45309-revert correctness fixes. Validated metrics-flat on SM120 + SM121.
`sm120-pr-41834-stable-preview-20260620`	`a743ef5dfbd16cad0b9a628773c0c1d1841f1790`	prior head (write-fence / COW recall approach, since superseded by the int64-cast fix)
`sm120-pr-41834-stable-preview-20260612075245`	`f32247a5a695fa8979d61837bf6b87da897dcb7d`	earlier validated rebased PR branch preview
`sm120-pr-41834-fallback-before-replacement-20260612053720`	`5d1584e2de2b3c64540e70dfc370b0211eb6b2fc`	fallback tag for the old PR head before branch replacement

Update 2026-06-26 — synced onto upstream/main + dual-arch revalidation

Re-synced the branch onto current upstream/main (merge c7a4386a45, then the C128A device-sync hoist → c766cbc6ff; 198 upstream commits since the #43477 merge). 6 conflicts resolved:

oracle/nvfp4.py: union the SwiGLU-clamp backend set to {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as #46492 ([Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend), so the fork patch is now redundant.
routed_experts.py: combine the two per-tensor-scale loaders into one helper (our e8m0 bitwise view and upstream's 0-D/shape-(1,) _to_scalar normalization).
serve/render/serving.py + renderers/online_renderer.py: upstream's #44285 ([Frontend] Split ServingRender into renderer and entrypoint) split ServingRender into renderer + entrypoint; our DSv4 thinking→template-kwargs threading is re-homed onto the new structure (sampling-params site in ServingRender.render_chat_request, prompt-render site in OnlineRenderer.render_chat).
sparse_attn_indexer.py: preserve our SM120 short-row / persistent top-k path, and add upstream's new cooperative_topk (#43008, [Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-latency scenarios) gated to exclude capability family 120, so SM12x decode is byte-identical to the validated path (enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment).
engine/protocol.py (keep both DeltaMessage hooks) and tests/models/test_deepseek_v4_mega_moe.py (keep CompilationConfig() fixture).

Inherited for free from the sync: deepseek_v2 redundant-clone removal (#46651), sampler int32-overflow fix (#46560), spec-decode correctness (#45956 / #46533).

Validation — full matrix, both arches, on c766cbc6ff:

arch	correctness	perf
RTX SM120 (TP=2)	GSM8K-200 0.97, #19 instruction-following PASS	throughput 8000×1000 captured
GB10 SM121 (2-node TP=2)	GSM8K-200 0.945, arthur coherence 64/64 + 24/24, Forum53 multi-user PASS	llama-benchy ctx_pp +80% (1709/1667/1588 t/s @ d8192/16384/32768 vs the prior pinned baseline 943/928/883, reproduced across two runs), decode flat (ctx_tg 38.6/38.5/37.4)

The MoE backend on both arches is Marlin (W4A16) — is_deep_gemm_supported() is False on stock SM12x wheels, so the DeepGEMM/W4A8 path isn't selected and Marlin is the default; it is GSM8K-correct on both SM120 and SM121. The prefill +80% is inherited from upstream's prefill / scheduler / block-pool work (not an SM12x change of ours); decode is flat because our SM12x decode path is preserved unchanged.

Update 2026-06-23 — #43477 reconciliation (stock-deps path) + dual-arch revalidation

Upstream merged #43477 (DeepSeek V4 + GLM-5.1 on SM120 via the FlashInfer-SM120 sparse-MLA route) on 2026-06-22. As merged it does not run on released wheels: its SM12x attention class raises at model construction unless the unreleased FlashInfer #3395 fork symbols are present, and it auto-enables a DeepGEMM MXFP4 path whose pinned ref (#324) asserts on SM120. This PR is now reconciled on top of it so the two coexist: #43477 is the fork-deps route; this PR is the stock-deps route that runs on released FlashInfer / DeepGEMM wheels.

Reconciliation is a merge of upstream/main into the PR branch (42657aca65, 6 conflicts resolved — kept our env/availability-gated SM120 decode route + both FlashInfer probes + #43477's gated prefill-SWA mechanism; dropped our now-redundant block_pool stale-hash reset in favour of upstream's), plus two fixes for regressions the merge introduced:

DeepGEMM no longer auto-enabled on SM120 (a94657e601). #43477 (Enable DeepSeek V4 and GLM-5.1 on SM120) added SM120 to support_deep_gemm, so the engine selected a DeepGEMM MXFP4 kernel whose pinned/released ref aborts at init (Assertion sf.size(-2)==ceil_div(mn,gran_mn)). SM120 now falls back to Marlin/cutlass as before (enabling the DeepGEMM path needs the unmerged DeepGEMM #324).
#43477's prefill-SWA launch gated + kernel OOB clamped (f7b4b425b0). The merged paged prefill-SWA index kernel launched unconditionally and computed block_table addresses for masked-off tail lanes of deep (32k) prefill rows, which SM12x + Triton 3.6 faults as an illegal address even though the load is masked → cudaErrorLaunchFailure under concurrent load. The launch is now gated behind VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL (default off → the stock decode-only path never launches it) and the kernel's masked lanes are clamped.
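The masked-lane clamp in the second fix can be pictured with a small host-side model. This is a sketch in plain Python, not the Triton kernel: `gather_block_ids`, the lane layout, and the `-1` sentinel are hypothetical names chosen for illustration. The point it shows is that a lane past the end of the row should still compute an in-bounds `block_table` index, even though its result is discarded.

```python
def gather_block_ids(block_table, row_len, lanes, block_size):
    """Return the block id for each lane position, or -1 for masked-off lanes.

    Hypothetical host-side model of the clamp: the index used to read
    block_table is clamped to the last valid position of the row, so the
    address stays in bounds even for lanes whose result is masked out.
    Assumes row_len >= 1.
    """
    out = []
    for pos in lanes:
        valid = pos < row_len
        safe_pos = min(pos, row_len - 1)  # clamp before computing the index
        block_id = block_table[safe_pos // block_size]
        out.append(block_id if valid else -1)
    return out


# Two 256-token blocks back a 300-token row; lanes 300 and 600 are masked off.
block_ids = gather_block_ids([7, 9], 300, [0, 255, 256, 299, 300, 600], 256)
```

Without the clamp, position 600 would index entry 2 of a two-entry table, which is the host-side analogue of the out-of-bounds address described above.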

Validation — metrics flat vs the pre-reconcile head (5ba0f19f02), both arches, DeepSeek-V4-Flash, fp8 KV, MTP=2:

Check	RTX SM120 (2× PRO 6000)	GB10 SM121 (2-node)
Builds + serves on stock deps	✅	✅ (2-node TP=2, NCCL `2.30.7`)
GSM8K strict	0.9537528 — bit-identical to pre-reconcile	0.945 (5-shot, limit 200)
Default-path baseline phases	identical exit codes to pre-reconcile	—
Prefill throughput (random 8192×512)	1638.4 tok/s — bit-identical	—
Long-context coherence (conc 2 / 12)	(RTX 352/352 prior)	24/24
`--async-scheduling` A/B (off vs on), incl. 64k concurrency + sampling stress	safe + equivalent, 0 crashes, 0 wedges	—

The earlier "--async-scheduling regression" note is withdrawn — that crash was this same prefill-SWA OOB, fixed above; with the fix, --async-scheduling on/off are equivalent and crash-free through a 64k concurrency-8 sampling soak.

Update 2026-06-23 — GB10 / DGX Spark (SM121) long-context frontier, 256k–1M

Cold-prefill capability sweep on 2× GB10 / DGX Spark (SM121), TP=2 over RoCE, max-model-len 1048576, gpu-memory-utilization 0.75, MTP=2, fp8 KV, EP-off, prefix-cache disabled, FULL_AND_PIECEWISE, greedy, C=1 (post-audit head). All five points complete cleanly (0 failures, no OOM / crash). KV cache = 1,868,754 tokens (5,964 bytes/token); a 1M-token request admits at 1.78× concurrency at this utilization.

Context	Prompt tokens	Cold TTFT
256k	261,588	392.6 s
384k	392,648	648.5 s
512k	523,728	963.7 s
768k	785,868	1769.0 s
1M	1,048,011	2789.0 s

The multi-minute cost is the cold prefill (TTFT), which is GPU-bound (GPU ~96% throughout each TTFT window) and scales super-linearly (~O(N^1.4): 256k→512k = 2.45×, 512k→1M = 2.89×).
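The scaling exponent can be checked directly from the table, as a small worked calculation. The sketch below fits a power law between pairs of rows using the prompt-token counts and cold TTFT values listed above; `scaling_exponent` is a helper name introduced here for illustration.

```python
import math

# (prompt tokens, cold TTFT in seconds), copied from the table above
POINTS = {
    "256k": (261_588, 392.6),
    "512k": (523_728, 963.7),
    "1M": (1_048_011, 2789.0),
}


def scaling_exponent(a, b):
    """Exponent k such that TTFT scales as N**k between rows a and b."""
    (n_a, t_a), (n_b, t_b) = POINTS[a], POINTS[b]
    return math.log(t_b / t_a) / math.log(n_b / n_a)


k_low = scaling_exponent("256k", "512k")   # first doubling
k_high = scaling_exponent("512k", "1M")    # second doubling
k_all = scaling_exponent("256k", "1M")     # across the whole range
```

The per-doubling exponent rises with depth, and the end-to-end value across 256k→1M is close to 1.4, consistent with the estimate quoted above.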

A dedicated decode-vs-context sweep (256-token generation at 16k / 64k / 256k / 512k) shows the opposite for generation: steady-state decode is essentially flat with depth — median inter-step latency ~61–69 ms across 16k→512k (≈30–45 tok/s effective with MTP), i.e. throughput does not meaningfully degrade as context grows. That matches the per-step cost being dominated by fixed MoE GEMM + the 2-node RoCE all-reduce rather than the depth-dependent (O(N)) indexer. (Decode rates from a 16-token TTFT-only run are not reliable — too short, plus MTP bundling — so they are not used here.)

So the long-context penalty is concentrated entirely in the one-time cold prefill (TTFT above), not in generation: prefill is GPU-bound compute/bandwidth (LPDDR5X ~273 GB/s) plus the per-chunk 2-node RoCE all-reduce (no inter-node NVLink), while decode stays ~constant per token. MTP keeps ~2.0 acceptance at depth. Practically: GB10 suits large-context-in → generation-out when the one-time cold first token (minutes at 384k+) is acceptable or amortized by prefix caching; once generating, throughput is depth-independent. The 1M cold TTFT is ~20% faster than the 2026-06-06 baseline (2789 s vs 3504 s).

Update 2026-06-22 — long-context (256k+) crash fixes (latest validated head)

Two distinct crashes were reported at long context (≥256k, MTP=2). Both are fixed on sm120-pr-41834-stable-preview-20260622b (5ba0f19f02); the rest of the branch is unchanged from the 2026-06-21 audit head:

Packed-prefill output slice (output.size(0)==num_tokens, 84 vs 83). With VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1, the packed _forward_prefill path sliced the query to num_prefill_tokens under MTP/cudagraph padding but passed the unsliced padded output to the kernel, which derives num_tokens from q.shape and asserts output.size(0)==num_tokens → crash, cascading to an illegal-memory-access. The output is now sliced symmetrically. This path is gated (default FlashMLA prefill loops over q.shape and has no such assert → was never affected). Interim workaround for older builds: set VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=0.
MTP draft-sampler float32 cast (engine-killing assertion on non-greedy sampling). The MTP probabilistic draft sampler fed bf16 draft-head logits into the Triton top-k/top-p kernel, which asserts logits.dtype==torch.float32 → AssertionError kills the worker (and cascades to CUDA errors on TP peers). This fires on any MTP + non-greedy (top-k/top-p/temperature) request and is independent of the FlashInfer gates — so it reproduces "even without FlashInfer". Greedy decoding returns before the sampler, which is why greedy GSM8K validation never surfaced it. The draft logits are now cast to float32 (matching the main sampler). Validated on 2x RTX PRO 6000 (SM120): a 256k sustained sampling soak (temperature 0.7 / top_p 0.9, 9 concurrent workers, MTP=2, EP) that crashed on the first sampled request now runs clean (200+ requests).
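The first fix amounts to keeping the query and output buffers the same length after padding is stripped. A minimal sketch, using Python lists as stand-ins for tensors; `slice_prefill_buffers` is a hypothetical helper and not the function in the PR:

```python
def slice_prefill_buffers(query, output, num_prefill_tokens):
    """Slice query and output to the same number of rows.

    Hypothetical stand-in for the packed-prefill fix: both buffers arrive
    padded (for MTP / cudagraph), and both are cut to num_prefill_tokens
    so a kernel that derives num_tokens from the query sees a matching output.
    """
    q = query[:num_prefill_tokens]
    out = output[:num_prefill_tokens]
    assert len(out) == len(q), "output rows must match query rows"
    return q, out


# Padded buffers hold 84 rows, but only 83 are real prefill tokens.
padded_query = list(range(84))
padded_output = [None] * 84
q, out = slice_prefill_buffers(padded_query, padded_output, 83)
```

Slicing only the query reproduces the 84 vs 83 mismatch from the report; slicing both removes it.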

Note for very long contexts (e.g. --max-model-len 500000): the sparse-MLA / indexer workspaces are sized by max_model_len and are not yet SM12x-arch-gated, so they consume a large fixed share of VRAM and leave a thin KV budget — see #42856 for the workspace-shrink fix. This is a memory-headroom concern, separate from the two crashes above.

Update 2026-06-21 — post-audit cleanup

This supersedes the 2026-06-20 and 2026-06-18 heads and the earlier validation data below. An audit of the SM12x branch against current upstream removed redundant, disproven, and experimental deltas; the cleaned head is validated metrics-flat (marginally better) on 2x RTX PRO 6000 Blackwell (SM120) and 2-node GB10 / DGX Spark (SM121), DeepSeek-V4-Flash, fp8 KV, MTP=2. The jasl#19 (instruction-following) and #45309 breakable-cudagraph-garbage revert (#45972) correctness fixes are retained. Five changes:

Breakable-cudagraph stays default OFF (FULL_AND_PIECEWISE). DeepSeek-V4 is deliberately excluded from breakable-cudagraph auto-enable — on real 2x GB10 MTP decode breakable regressed throughput and degraded as output length grew (≈31→19 tok/s at 400→800 max-tokens vs a flat ≈40). The gate is now a single MiniMax-only helper instead of a dead always-False stub; behavior is unchanged. (VLLM_USE_BREAKABLE_CUDAGRAPH=1 still opts in.)
The long-context recall fix is the int64 block-offset cast, not a cache fence. The 2026-06-20 head attributed the MTP high-concurrency recall/garble bug to a missing copy-on-write on writable caches and added a prefix-cache write-completion fence. Further investigation showed that hypothesis was wrong: the actual cause is an int32 overflow of the packed-KV block offset in the SM12x paged-MQA-logits indexer kernels, fixed by an int64 cast (retained). With the int64 fix in place the write fence is redundant — a fence-OFF recall gate holds 8/8 @ conc=8 and 16/16 @ conc=16 (0 miss) on RTX, and 8/8 on GB10. The write-completion fence and the COW broadening paired with it were therefore removed.
Removed the scheduler prefill-fairness heuristics (ungated, generic very-long-prefill / mixed-decode chunk-limiting). They targeted a decode cliff later re-diagnosed as MoE-GEMM + NCCL-all-reduce bound (not schedulable) and were not load-bearing: a cleanup-vs-prior A/B shows an identical mixed prefill/decode fairness ratio (0.716 vs 0.714) and equal inter-chunk latency.
Moved the experimental VLLM_NVFP4_GEMM_BACKEND b12x research lever out of the PR (off-by-default, unused on the shipped path) and dropped a tool-calling-env diff-reflow churn.
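The int32 overflow behind the recall bug can be illustrated without a GPU. The sketch below emulates 32-bit signed wraparound in plain Python; the block id and stride are made-up values chosen only so the product exceeds 2^31 - 1, and the helper names are hypothetical rather than taken from the indexer kernels.

```python
INT32_MIN = -2**31


def wrap_int32(value):
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def block_offset_int32(block_id, block_stride):
    """Offset computed entirely in 32-bit arithmetic (the buggy form)."""
    return wrap_int32(wrap_int32(block_id) * wrap_int32(block_stride))


def block_offset_int64(block_id, block_stride):
    """Offset computed after widening; Python ints stand in for int64 here."""
    return block_id * block_stride


# Hypothetical values: a large block id times a large per-block stride.
bad = block_offset_int32(3000, 1_000_000)
good = block_offset_int64(3000, 1_000_000)
```

With these values the 32-bit product wraps to a negative offset while the widened product stays at 3,000,000,000, which is why the failure only appears once the cache is deep enough for offsets to cross the 32-bit limit.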

Net vs the 2026-06-20 head: 6 files, −833 lines (the removed fence + scheduler heuristics + their tests). The decode/prefill CUDA kernels are byte-identical across the cleanup, so the gated-decode-optimization profile and the 2026-06-12 throughput baselines below are unchanged.

Validation, 2026-06-21

Trivial-prompt generation (cudagraph sanity), both platforms: 2+2 → 4, 7*8 → 56, capital of France → Paris — no garbage.

Default decode path, MTP=2:

Gate	RTX SM120	GB10 SM121
GSM8K strict (8-shot full · 5-shot limit-200 · limit-100)	0.954 (full) · 0.96 (l200)	0.96 (l100)
Long-context recall, fence OFF, conc 8 / 16, MTP2	8/8 + 16/16, 0 miss	8/8
Instruction-following (jasl#19)	pass (JSON-only)	—
tool-call (15-case suite)	87%	—
Scheduler-removal A/B — mixed prefill/decode fairness ratio (cleanup vs prior)	0.716 vs 0.714	—
random 8192×512 TPOT (cleanup vs prior, ms)	6.27 vs 6.5	—
indexed-D512 min-token gate 4096 vs 8192 — prefill @4k	9,687 vs 6,203 tok/s	—

The GB10 SM121 run is a from-scratch 2-node rebuild of the cleaned head (NCCL 2.30.7 re-pinned per node); arithmetic, GSM8K, and the long-context recall gate all pass, confirming the fence removal holds recall on SM121 as well. The recall fix is the int64 cast in the SM12x indexer kernel, so the 2026-06-12 throughput baselines below are unchanged.

llama-benchy (eugr format), GB10 2-node / SM121, MTP=2, prefix-cache on (GB10 MTP decode/prefill profile — unchanged across the 2026-06-21 cleanup, decode kernel byte-identical):

test	t/s	peak t/s	ttfr (ms)
pp2048 (cold)	1205.5 ± 22		1705
tg128 (C=1)	40.0 ± 0.4	45.7
ctx_pp @ d8192	1722.5 ± 5		4762
ctx_tg @ d8192	38.5 ± 1.5	43.3
ctx_pp @ d16384	1674.8 ± 2		9788
ctx_tg @ d16384	39.2 ± 2.3	44.3
ctx_pp @ d32768	1595.3 ± 1		20547
ctx_tg @ d32768	41.6 ± 1.5	46.3

Prefill 1595–1722 tok/s at depth; decode 40 tok/s @ C=1 holding 38–42 out to 32K context (no decode cliff); prefix-cache hit 42–46% under MTP.

Gated SM120 decode optimization (`VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE=1`)

The decode gate uses flashinfer.mla._sparse_mla_sm120 (in FlashInfer main / 0.6.13; absent from the 0.6.12 release). Installing it correctly matters: a bare pip install --upgrade flashinfer-python @ git+main bumps flashinfer-python but leaves a stale flashinfer-cubin / flashinfer-jit-cache, and FlashInfer then raises a version-mismatch error at startup (and re-JITs kernels). Uninstall the precompiled packages first, then upgrade:

pip uninstall -y flashinfer-jit-cache flashinfer-cubin
pip install --upgrade "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git"

For a reproducible pin instead of tracking moving main, install matching flashinfer-python + flashinfer-cubin nightlies (e.g. 0.6.13.dev20260619, a bit-identical decode kernel to the validated build) — again uninstalling flashinfer-jit-cache first.
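Before starting the server, it can help to check whether the three FlashInfer packages named above agree. The following is a sketch using only the standard library; the package names come from the text, while the helper names and the plain string comparison are assumptions (real builds may carry version suffixes that need looser matching).

```python
from importlib import metadata

FLASHINFER_PACKAGES = (
    "flashinfer-python",
    "flashinfer-cubin",
    "flashinfer-jit-cache",
)


def installed_versions(lookup=metadata.version):
    """Map each FlashInfer package to its installed version, or None if absent."""
    versions = {}
    for name in FLASHINFER_PACKAGES:
        try:
            versions[name] = lookup(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(versions):
    """Return companion packages whose version differs from flashinfer-python."""
    core = versions.get("flashinfer-python")
    return {
        name: version
        for name, version in versions.items()
        if name != "flashinfer-python" and version is not None and version != core
    }


stale = find_mismatches(
    {
        "flashinfer-python": "0.6.13",
        "flashinfer-cubin": "0.6.12",
        "flashinfer-jit-cache": None,
    }
)
```

An empty result from `find_mismatches(installed_versions())` suggests the stale-companion case described above is not present; a non-empty result lists the packages to uninstall or re-pin.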

RTX SM120, decode gate ON vs OFF, ctx0 decode (aggregate tok/s, 0 errors all rows):

C	gate OFF	gate ON	gain
1	189.7	201.4	+6%
2	311.3	334.7	+8%
4	483.1	531.6	+10%
8	707.6	801.9	+13%
16	990.5	1164.7	+18%
32	1545.0	1849.6	+20%
64	2132.5	2814.8	+32%

gate-ON @C64 = 2814.8 tok/s matches the community target (~2815). The decode CUDA kernel is byte-identical across the rebase, so this profile is unchanged.

The default path needs no FlashInfer update. With the gate off (default), the import is lazy/gated, so FlashInfer 0.6.12 (official) works unchanged. On GB10 / 2-node, also pin nvidia-nccl-cu13==2.30.7 (a rebuild reverts it; a per-node mismatch hangs the NCCL handshake).

Gated SM120 prefill optimization (`VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1`)

Symmetric to the decode gate, prefill has an opt-in packed FlashInfer sparse-MLA path: VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1 (default off; ~+5–6% single-stream prefill). With it off, prefill defers to the default FlashMLA indexed-D512 path. It routes through the same flashinfer.mla._sparse_mla_sm120 kernels as the decode gate, so it carries the identical FlashInfer version requirement — the install/pin steps above apply unchanged (FlashInfer main / 0.6.13; the default-off path needs no FlashInfer update and runs on 0.6.12). Decode and prefill share one FlashInfer build; there is no separate version to track for prefill.

Branch validation, 2026-06-12

Base and head:

upstream base: 8a91228dbe363d1d113deb2a82e289429130dd01
PR head: f32247a5a695fa8979d61837bf6b87da897dcb7d
branch range: 96 commits over upstream/main

Commands run on the final head:

Command	Result
`git diff --check upstream/main...HEAD`	pass
DCO scan over `upstream/main..HEAD`	pass; every commit has `Signed-off-by`
`VLLM_TARGET_DEVICE=empty .venv/bin/python -m compileall -q vllm/envs.py vllm/model_executor/warmup/kernel_warmup.py vllm/models/deepseek_v4 vllm/v1/core vllm/v1/attention/backends/mla vllm/reasoning/deepseek_v4_reasoning_parser.py tests/test_envs.py tests/v1/core/test_prefix_caching.py tests/v1/core/test_scheduler.py tests/reasoning/test_deepseekv4_reasoning_parser.py tests/quantization/test_sm12x_tuned_config_lookup.py`	pass
`.venv/bin/python -m pytest tests/test_envs.py::test_deepseek_v4_sparse_mla_stats_path_env -q` on the remote vLLM environment	`1 passed, 16 warnings`
`python3 -m pytest tests/test_scripts.py -q` in the public harness	`128 passed in 14.41s`

Local vLLM pytest/ruff were not run on the Mac checkout because its .venv does not currently include torch or ruff. GPU-path validation remains remote SM120/SM121-only.

Latest clean SM120 RTX PRO 6000 x2 data, 2026-06-12

Artifact roots:

artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_short_throughput_mtp_noep_20260612084721
artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_clean_mtp_noep_20260612080629

Short-throughput profile:

TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
max_model_len=131072, gpu_memory_utilization=0.975, max_num_batched_tokens=4096, max_num_seqs=24.
Prefix cache disabled, FULL_AND_PIECEWISE, 80 prompts per concurrency.
Phase exits: server_startup=0, bench_hf_mt_bench=0, bench_random_prefill_sweep=0.
Regression check: output/input throughput ratios are against the previous accepted same-profile EP-off reference; all are above the 0.95 floor.

HF MT-bench, 80 prompts:

C	output tok/s	ratio vs reference	mean TTFT ms	p99 ITL ms	MTP acceptance %
1	180.94	1.009	49.59	13.08	68.36
2	284.53	1.003	70.04	32.35	68.19
4	427.10	0.999	82.70	38.83	68.25
8	600.33	1.005	110.97	86.19	67.91
16	840.46	1.019	156.73	86.50	67.34
24	987.77	1.030	209.05	86.71	68.20

Random prefill sweep, C=1, output length 128, 8 requests per case:

Prompt / output tokens	input tok/s	ratio vs reference	mean TTFT ms	requests
4K / 128	3123.74	0.996	660.21	8 / 8
16K / 128	6209.00	1.005	2030.49	8 / 8
64K / 128	7049.72	0.999	8715.51	8 / 8
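The 0.95 regression floor can be applied mechanically to the ratio columns of the two tables above. A small sketch, with the ratios copied from those tables and `below_floor` as a helper name introduced here:

```python
FLOOR = 0.95

# "ratio vs reference" columns from the two tables above
MT_BENCH_RATIOS = {1: 1.009, 2: 1.003, 4: 0.999, 8: 1.005, 16: 1.019, 24: 1.030}
PREFILL_RATIOS = {"4K": 0.996, "16K": 1.005, "64K": 0.999}


def below_floor(ratios, floor=FLOOR):
    """Return the cases whose throughput ratio falls under the floor."""
    return {case: ratio for case, ratio in ratios.items() if ratio < floor}


mt_bench_failures = below_floor(MT_BENCH_RATIOS)
prefill_failures = below_floor(PREFILL_RATIOS)
```

Both results are empty for the data shown, which is what "all are above the 0.95 floor" means in practice.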

Correctness and reliability profile:

TP=2, MTP=2, expert parallel off, FP8 KV, prefix cache disabled, max_model_len=131072, max_num_seqs=4, max_num_batched_tokens=4096.
Phase exits: server_startup=0, bench_hf_mt_bench=0, eval_gsm8k=0, bench_random_prefill_sweep=0, bench_random_8000x1000=0, bench_random_256x256=0.
Post-run current-boot driver scan found no Xid, UVM, NV_ERR, GPU-lost, illegal-access, unspecified-launch, or fatal GPU signals; no vLLM compute processes were left running.

GSM8K 5-shot, limit-200, /v1/completions, MTP=2, concurrency 4:

Metric	Value	Floor	Result
flexible exact match	0.965	0.940	pass
strict exact match	0.940	0.925	pass

Additional 128K-profile random checks:

Shape	C	output tok/s	mean TTFT ms	p99 ITL ms	MTP acceptance %
8K / 1K	1	130.93	1367.03	13.44	52.56
8K / 1K	2	191.19	1586.64	17.44	50.28
8K / 1K	4	260.72	1666.96	199.75	51.76
256 / 256	1	153.07	88.80	13.17	51.46
256 / 256	4	369.86	127.80	84.44	52.50

Latest clean GB10 / SM121 data, 2026-06-12

Artifact root:

artifacts/codex_pr_stable_preview_f32247a/2x_gb10_sm121/gb10_forum53_mtp2_epoff_c2_gmem0685_mml81920/20260612074113

Profile:

TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
max_model_len=81920, max_num_seqs=2, max_num_batched_tokens=4096, gpu_memory_utilization=0.685.
Prefix cache enabled; Forum #53 C=2 shape: forum53_c2:2:2:3200:256.
This covers the 80K-token prompt case on the final PR head. Failed, interrupted, or driver-signal artifacts are intentionally excluded from this PR body.

Gate result:

Gate	Result
summary `ok`	`true`
`serve_start.exit_code`	`0`
`streaming_pressure.exit_code`	`0`
driver health	`ok=true`, signal count `0`
request failures	`0 / 4`
preemptions	`0`

Timing and runtime summary:

Metric	Value
max prompt tokens	80,127
max TTFT	124.045698 s
max elapsed	124.949141 s
avg inter-chunk latency	0.056711 s
p95 inter-chunk latency	0.064278 s
p99 inter-chunk latency	0.144954 s
max inter-chunk latency	0.144954 s
GPU KV usage avg / max	65.81% / 86.40%
prefix-cache hits / queries	79,872 / 3,444,165

Running the NVFP4 checkpoint

This branch also serves nvidia/DeepSeek-V4-Flash-NVFP4 on SM12x (RTX PRO 6000 / GB10). The NVFP4 MoE auto-selects the FlashInfer CUTLASS backend (the SwiGLU-clamp model gate now accepts it), so no --moe-backend flag is required, and no special FlashInfer build is needed (the 0.6.12 release works):

vllm serve nvidia/DeepSeek-V4-Flash-NVFP4 \
  --trust-remote-code --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --tokenizer-mode deepseek_v4

--kv-cache-dtype fp8 is mandatory: DeepSeek-V4's fp8_ds_mla attention asserts an fp8 KV layout, so the default auto fails at model construction (this is not NVFP4-specific). Expert-parallel off (plain TP) is the supported path.

Accuracy matches MXFP4 (GSM8K 8-shot ~0.96 on both SM120 and SM121). Note that on SM12x NVFP4 is not a memory or throughput win versus MXFP4: NVFP4 weights are ~4 GiB/GPU larger (~78 vs ~74 GiB), leaving less KV-cache room (lower max concurrency); single-stream prefill is marginally faster and aggregate decode marginally slower. Its value here is checkpoint availability / parity with the SM100 datacenter path, not an SM12x performance advantage — MXFP4 remains the better practical choice on consumer Blackwell.

AI assistance disclosure

AI assistants, including OpenAI Codex/GPT models and Anthropic Claude models, were used for code review, refactoring support, regression-script writing, and benchmark analysis. The branch was validated through human review plus the commands and harness artifacts listed above.

jasl · 2026-05-06T15:21:50Z

@zyongye
I've cleaned up the old PR, could you help review this one?

gemini-code-assist

Code Review

This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.

chatgpt-codex-connector · 2026-05-06T15:25:09Z

💡 Codex Review

vllm/vllm/model_executor/layers/sparse_attn_indexer.py

Lines 86 to 89 in 9596dbf

    
    def _sparse_indexer_requires_deep_gemm() -> bool:
        return current_platform.is_cuda() and not (
            current_platform.is_device_capability_family(120)
        )

Keep DeepGEMM requirement for SM120 FP4 indexer path

This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (fp8_fp4_*) because the new SM120 fallback only handles q_scale is None (FP8 Q). With use_fp4_cache=True on SM120 and no DeepGEMM installed, construction succeeds and the first prefill/decode call fails at runtime with the DeepGEMM _missing() error instead of being rejected up front.
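The review point above amounts to keeping the DeepGEMM requirement whenever the FP4 indexer cache path is active, so a missing dependency is rejected at construction time. A minimal sketch, assuming the decision can be expressed from three booleans; the function name and its parameters are hypothetical and do not mirror the actual vLLM helper signature:

```python
def sparse_indexer_requires_deep_gemm(
    is_cuda: bool,
    is_sm12x: bool,
    use_fp4_cache: bool,
) -> bool:
    """Hypothetical variant of the helper quoted in the review.

    Assumption: the SM12x fallback only covers the FP8-Q path, so the
    FP4 cache path still needs the DeepGEMM fp8_fp4_* kernels.
    """
    if not is_cuda:
        return False
    # Non-SM12x CUDA devices keep the original requirement.
    # SM12x drops it only when the FP4 cache path is not in use.
    return (not is_sm12x) or use_fp4_cache
```

With a check of this shape, `use_fp4_cache=True` on SM120 without DeepGEMM installed would fail up front rather than on the first prefill or decode call.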

vllm/vllm/model_executor/model_loader/default_loader.py

Lines 236 to 240 in 9596dbf

    
    if self.load_config.load_format == "fastsafetensors":
        weights_iterator = fastsafetensors_weights_iterator(
            hf_weights_files,
            self.load_config.use_tqdm_on_load,
        )

Propagate weight_name_filter to fast safetensor loaders

The new pre-load weight_name_filter is only wired into safetensors_weights_iterator; this branch still loads all tensors for fastsafetensors (and similarly other non-default safetensor iterators), so skipped tensors are still materialized. For DeepSeek V4 this defeats the intended early skip of MTP weights and can reintroduce high transient memory use/OOM when these load formats are enabled.
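The difference between filtering before and after materialization can be illustrated with a loader-agnostic wrapper. This is only a sketch: `iter_filtered_weights`, the `(name, load)` entry shape, the keep-if-True predicate convention, and the example weight names are all assumptions made for illustration, not the interfaces of the vLLM iterators.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_filtered_weights(
    entries: Iterable[Tuple[str, Callable[[], object]]],
    weight_name_filter: Optional[Callable[[str], bool]] = None,
) -> Iterator[Tuple[str, object]]:
    """Yield (name, tensor) pairs, loading only the names the filter keeps."""
    for name, load in entries:
        # Decide on the name alone, before the tensor is materialized.
        if weight_name_filter is not None and not weight_name_filter(name):
            continue
        yield name, load()


# Usage: record which tensors were actually loaded.
loaded = []


def make_loader(name: str) -> Callable[[], object]:
    def load() -> object:
        loaded.append(name)
        return f"tensor:{name}"
    return load


entries = [(n, make_loader(n)) for n in ("layers.0.weight", "mtp.0.weight")]
kept = list(iter_filtered_weights(entries, lambda n: not n.startswith("mtp.")))
```

Here the skipped entry is never loaded, which is the property the review asks the other safetensors iterators to preserve.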

… set Multi-dimensional review hardening of the DeepSeek-V4 SM12x path. All changes are pure-Python and no-ops on the canonical serve configs:

- flashmla: _prefill_workspace_topk_bound now reserves the C128A prefill top-k width from the full compressed region (c128a_max_compressed, the same bound the metadata builder allocates c128a_prefill_buffer with) instead of index_topk. Above ~262144 context the C128A effective_topk exceeds index_topk (2048); the locked prefill workspace previously fit only because the lightning indexer's much larger reservation incidentally absorbed the gap. No-op at max_model_len <= 262144.
- kernel_warmup: share _DEEPSEEK_V4_SPARSE_MLA_BACKENDS with flashinfer_sparse_mla_warmup. The local copy still listed the old "V4_FLASHMLA_SPARSE" backend name (renamed "FLASHMLA_SPARSE_DSV4" upstream); the warmup gate only kept matching via DEEPSEEK_SPARSE_SWA.
- config: drop the inert "vllm::deepseek_v4_fp8_einsum" splitting_ops entry (a plain function, never registered as a custom op, so it can never match an FX node).
- o_proj: correct the fp8-einsum recipe docstring (only SM100 takes the packed scale path; SM110 and SM12x use the legacy FP32 block-scale layout).
- flashmla: drop the dead prefill_gather_lens_cpu binding.

Signed-off-by: jasl <jasl9187@hotmail.com>

wingcomm · 2026-06-29T19:24:08Z

Built and deployed your candidate — here's arm #2, with an honest caveat on the bisect.

Arm #2 is live (patch + FlashInfer unchanged). Built 4214cea78b from source (parented exactly on 367ed7c66, so it's our current build + just your cu_base drop), FlashInfer held at 49f2abf. Running on the 4-node TP=4/EP=4 box now. Sanity: FlashInfer SM120 packed sparse-MLA decode runner engages cleanly (no Triton fallback), real-gen 200, 0 startup errors, all 4 ranks up.

Correctness at the trigger condition looks good. We hit it with concurrent unique long-context prefills (~48K tokens, conc=12 — unique prompts to defeat prefix-cache so every request is a real prefill → sustained num_prefills>1). Needle retrieval was 100% (9/9 completed) with 0% CJK — i.e. on exactly the concurrent-long-context path where the mis-index would corrupt req>0 indices, retrieval is correct. (For reference, unpatched 367ed7c66 scored 0.980 on nc=8 GSM8K, our best build to date.)

The honest caveat: we can't synthetically reproduce the freeze on either build, so this isn't a clean before/after yet.

Our aggressive hammer saturated rather than wedged: 12×48K against max_num_seqs=3 floods the queue, so most requests hit the client 300s timeout while waiting — but the engine stayed ADVANCING the whole time (prompt-tokens counter climbed 446K→1.6M, KV moving, no freeze, no restart, container never bounced).
The earlier nc=8 GSM8K on the unpatched build also ran clean. Incident-14 itself was a rare organic event (one freeze across ~35h then ~50h uptimes), so we don't have a reliable on-demand trigger.

So our signal is an extended organic soak on arm #2. If the cu_base double-subtract was the cause, the wedge should simply not recur on the patched build. We've got a monitor watching for recurrence and will report either a clean multi-day run or a fresh freeze (with py-spy + gdb across all ranks).

On the 3-arm bisect: because we can't trigger it synthetically, a single-shot of arm #1 (PREFILL=0) or arm #3 (FlashInfer ≥nightly-20260619) would be equally inconclusive on our side — a clean run could just be the intermittency. We're happy to run those as data points if useful, but flagging the same caveat. The live 4-node session offer stands once your switch is in — driving the EP/top-k interaction deterministically on the actual TP=4 topology is the cleanest way to nail the "bad indices vs top-k tolerance" split, since that's the variable we can't reproduce away from. Dumps preserved and available anytime.

wingcomm · 2026-06-30T05:33:14Z

Update from the soak — a significant one, plus some commit archaeology that narrows it.

The cu_base fix did not eliminate the hang class. Arm #2 (367ed7c66 + 4214cea78b, FlashInfer 49f2abf) ran clean ~16h, then froze with the identical silent-wedge signature (real-gen timeout, gen/iter counters frozen, shm_broadcast every 60s). Same intermittency / MTBF as before.

But it wedged in a different op. py-spy across all ranks caught:

mhc_fused_post_pre_tilelang   vllm/model_executor/kernels/mhc/tilelang.py:549
forward                       vllm/models/deepseek_v4/nvidia/model.py:940
... (inside vllm/compilation/cuda_graph.py:254)

the MHC prenorm GEMM via TileLang/TVM-FFI — not the _forward_prefill/cu_base/top-k path of incident-14. Native gdb is the same mechanism: cuLaunchKernelEx → __GI_sched_yield busy-spin (libtvm.so → libcuda), no ptxas/nvrtc/cuModuleLoad → a GPU kernel hang, not a JIT.

The reframe that matters most: the Python frame is the victim, not the culprit — it's just the next op trying to cuLaunchKernel when the GPU/stream is already hung. So _forward_prefill (inc-14) vs mhc_fused_post_pre_tilelang (inc-15) likely just reflects which op launched into an already-wedged stream. Your cu_base fix is still correct and may well have closed the prefill-path freeze, but some kernel is hanging on-device and stalling the stream, and the observed launch site varies.

Commit archaeology — where I think this lives. We scanned the PR history against the symptom:

The MHC commits (1381809 avoid MHC GEMM JIT per token count, 93400dd keep optimized MHC path, 16c1667e remove ineffective MHC warmup, aaef91a drop CustomOp wrapper) are all about compilation. Ours is a runtime hang with no JIT in the stack — so these don't apply, even though mhc_* is the victim frame.
The genuinely relevant lineage is the CUDA-graph + MTP hang work, all already in our build: 95fc9073b "Fix DeepSeek V4 MTP small-batch graph hangs", 285b542b8 "eager-break DeepSeek-V4 attention under FULL cudagraph for spec-decode", 6c92b0972 default FULL_AND_PIECEWISE. Our freeze is inside a CUDA-graph replay, under MTP spec-decode + FULL(_AND_PIECEWISE) cudagraph — exactly that regime. Those fixes are present and didn't fully close it, which points at a residual cudagraph + MTP + attention stream-ordering hang rather than a specific MHC or top-k kernel bug.
For completeness: nothing newer helps either. HEAD has moved 367ed7c66 → a5ebb5f66 (2 commits: a 06-29 upstream merge + "self-size C128A prefill workspace + de-dup warmup backend set") — neither touches the MHC/TileLang path or stream/launch ordering.

Net: the evidence points away from any single kernel (top-k or MHC) and toward a residual cudagraph + MTP + attention stream-ordering hang. The natural discriminators would be MTP-off vs enforce-eager/cudagraph_mode:NONE, plus a GPU-side capture (NCCL flight-recorder / CUDA trace) to name the actual in-flight kernel rather than the Python victim frame — happy to run whichever is most useful to you on the 4-node box, and the fresh py-spy + gdb dumps (worker + EngineCore, all ranks) are saved and available. The live 4-node session offer also stands once your switch is in.

jasl · 2026-06-30T09:34:55Z

Let me try! :)

jasl · 2026-06-30T10:14:59Z

Thanks for the incident-15 detail and for running arm #2 to ~16h — that's exactly the data point we needed. Two findings from tracing the EP path in a5ebb5f66b, one of which rules a hypothesis out.

We can rule out the cudagraph-mode / collective-desync angle. The tempting story was: at --enable-expert-parallel + DP=1, the per-layer combine is tensor_model_parallel_all_reduce (moe_runner.py:447-453, fires for ep_size>1); on 4 non-NVLink GB10s should_custom_ar() is False (custom_all_reduce.py:150,241) so it falls to raw pynccl on current_stream() (pynccl.py:186) and gets captured into the FULL graph; meanwhile the cross-rank cudagraph-mode MIN-sync is gated on data_parallel_size>1 (gpu_model_runner.py:3902, dp_utils.py:197) and skipped at your DP=1 — so a single divergent step would replay an unmatched collective and wedge. The structural facts all check out, but the trigger cannot fire: every cudagraph-dispatch input (num_tokens, num_reqs, uniform_decode, mode) is a pure function of the SchedulerOutput that's broadcast once and dequeued byte-identically by every worker (multiproc_executor.py:374/387), so all SPMD ranks pick the same (mode, BatchDescriptor) and issue the same captured collectives in the same order. The one architectural per-rank branch — cascade attention — is dead for MLA (use_cascade_attention() returns False in the base backend, no MLA backend overrides it). And MTP's draft loop is a static range(num_speculative_tokens-1) (llm_base_proposer.py:704); accepted/rejected counts only change masking contents, never the dispatch num_tokens or the collective count. So the dp>1 gate is genuinely not a correctness gap for EP — the divergence it guards against is created by independent DP scheduling, which DP=1 doesn't have. Net: not worth chasing the DP-sync / captured-collective-desync path.

What survives. With dispatch divergence out, the floating victim frame (your reframe is right — it's the next launch onto an already-dead stream) plus cuLaunchKernelEx → sched_yield, no JIT, and EP-only points at an on-device multi-CTA cooperative-kernel hang of the same class as FI #3615, but in a kernel #3615 didn't touch — triggered by the extra SM contention EP=4 adds. #3615 fixed the FlashInfer sampler radix-topk (which vLLM disables on SM120 anyway), but the same GB10 context-time-slice starvation (a peer CTA preempted past its slice while others spin on an arrival counter, no timeout) can hit any hand-rolled cooperative kernel. Two concrete candidates on the decode path: the NCCL collective kernels themselves (now far more numerous per step under EP), and our own lightning-indexer top-k persistent_topk.cuh (cooperative_topk is gated off for family-120, so decode top-k routes there; its GMEM spin-barrier has no co-residency guarantee or timeout). We can't name the kernel from code alone — it needs a GPU-side capture at an actual wedge.

To name it, the two cheap discriminators + the capture:

Does the wedge survive cudagraph_mode=PIECEWISE (or enforce-eager)? That removes the captured collectives from the FULL decode graph. If it still wedges, the hang is a plain on-device kernel race independent of capture (favors the cooperative-kernel hypothesis); if it stops, capture is implicated after all.
Does it survive with --enable-expert-parallel off (pure TP=4)? Isolates whether the EP combine collectives are the contention source vs. just a victim.
GPU-side capture to NAME it: NCCL flight recorder — TORCH_NCCL_TRACE_BUFFER_SIZE=20000, TORCH_NCCL_DUMP_ON_TIMEOUT=1, TORCH_NCCL_ENABLE_MONITORING=1, NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL — on a real collective deadlock the watchdog dump lists the last enqueued collective per rank and any op-count mismatch directly. Pair it with a health-freeze watchdog (poll /health + the iter counter; on 200-but-frozen, fire py-spy dump --native + gdb -p … 'thread apply all bt' + trigger the NCCL dump on all ranks). If the NCCL flight recorder shows all ranks with matched, completed collectives and one stuck compute kernel, that points at persistent_topk/an indexer cooperative kernel rather than NCCL.
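The health-freeze watchdog described in the last item can be sketched as a small state machine. This is an illustration under assumptions, not the watchdog either party runs: `FreezeDetector`, its fields, and the stall threshold are hypothetical, and the capture step (py-spy, gdb, NCCL dump) is left to the caller. The environment values are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Flight-recorder settings as listed in the comment above.
FLIGHT_RECORDER_ENV = {
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "20000",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
    "TORCH_NCCL_ENABLE_MONITORING": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "COLL",
}


@dataclass
class FreezeDetector:
    """Flags the '200-but-frozen' state: healthy endpoint, stalled counter."""

    stall_seconds: float
    _last_counter: Optional[int] = None
    _last_change: Optional[float] = None

    def observe(self, health_ok: bool, counter: int, now: float) -> bool:
        """Return True when capture should fire for this sample."""
        if self._last_counter is None or counter != self._last_counter:
            # First sample, or the engine advanced: reset the stall clock.
            self._last_counter = counter
            self._last_change = now
            return False
        stalled_for = now - self._last_change
        # A failing /health is a different failure mode, so only fire
        # when the endpoint still answers but the counter has not moved.
        return health_ok and stalled_for >= self.stall_seconds
```

An idle engine also leaves the counter unchanged, so a detector like this would need to be paired with a periodic real generation request (or only armed while requests are running) to avoid false positives.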

Could you confirm DP=1 (no --data-parallel-size), and if you get a cycle, the single most decisive data point is whether the wedge survives cudagraph_mode=PIECEWISE. We're bringing our own 4-node GB10 box up to TP=4/EP=4 now and will run the same capture; we'll share whatever the flight recorder names.

wingcomm · 2026-06-30T15:00:20Z

Confirmed: DP=1. We pass --tensor-parallel-size 4 --enable-expert-parallel with no --data-parallel-size — the distributed launcher only injects --nnodes 4 / --node-rank / --master-addr / --master-port, so it's a single 4-rank TP/EP group, DP=1.
So your SPMD reasoning applies directly and the dp>1 cudagraph-sync gate isn't in play for us.

Thanks for tracing the EP path and ruling the captured-collective-desync angle out so thoroughly — that's a useful elimination, and the shift to an EP-induced multi-CTA cooperative-kernel time-slice hang (same class as #3615, different kernel)
fits everything we've seen: floating victim frame, cuLaunchKernelEx → sched_yield, no JIT, and it only shows under real concurrency.

On the decisive cudagraph_mode=PIECEWISE discriminator — agreed it's the key data point, and we'll run it. Rather than force it now, we'll flip to PIECEWISE at the next wedge (it's intermittent — ~16h MTBF, and we couldn't trigger it
synthetically), so we keep a representative organic workload and get a clean before/after on the same traffic. We'll also arm the GPU-side capture so the next freeze is conclusive regardless: bigger flight-recorder buffer +
NCCL_DEBUG_SUBSYS=COLL, and fire the NCCL dump + gdb thread apply all bt on all ranks (our auto-capture currently does the head rank only; py-spy --native is broken on aarch64 — UNW_EBADREG — so gdb is our native path). If the flight recorder
comes back with all collectives matched/completed and one stuck compute kernel, that'll point at persistent_topk/the indexer cooperative kernel over NCCL, exactly as you laid out.

Will report the moment we catch one. Good luck bringing up your 4-node box — comparing the two captures should settle which kernel it is.

alexbi29 · 2026-07-01T05:30:16Z

@jasl I opened a stacked DSpark PR against your SM120 branch here:

jasl#25

It is intentionally based on jasl:codex/ds4-sm120-min-enable rather than upstream main, since DSpark depends on the DeepSeek V4 SM120 runtime/model support from this PR.

Validated on 2x RTX PRO 6000 Blackwell, TPS goes to ~310 vs ~200 before on 2048 tokens. Main caveat: it eats more VRAM, so there is less headroom for KV cache. Realistically max is ~512k on 2x RP6K.

Main thing to review is whether you want DSpark carried as a follow-on stacked branch after #41834, or reshaped before any upstream submission.

jasl · 2026-07-01T06:14:22Z

Cool! Thank you! I'm looking for DSpark as well

alexbi29 · 2026-07-01T07:39:07Z

@jasl cleaned up a few more buffers; the context situation got a little better, ~1M for 2x RP6K.
TPS did not change much: avg 304.8, median 304.8, min 299.9, max 309.1.

Having this model at 300 tps is a qualitative change for some tasks.

124 upstream commits. Conflicts resolved (4 files), preserving the SM12x stack while absorbing upstream DSv4 work:

- sparse_attn_indexer.py: absorb vllm-project#46076 dcp (inert unless dcp_world_size>1), keep SM120 short-row/persistent decode dispatch, dedupe the vllm-project#47164 family-120 cooperative-topk gate (upstream independently landed our fix).
- sparse_swa.py: absorb vllm-project#46995 DSpark non-causal path, keep our _init_reorder_batch_threshold(supports_spec_as_decode) MTP threshold.
- indexer.py: union has_prefilling_rows + dcp_local_seq_lens.
- protocol.py: union DSv4 thinking-mode sampling override + vllm-project#35076 stop_token_ids.

Signed-off-by: jasl <jasl9187@hotmail.com>

3 upstream bug fixes, none touching the DSv4/SM12x path:

- vllm-project#47305 don't read KV cache past seq_len in triton paged attn kernels (generic triton paged-attn; DSv4 uses sparse-MLA, unaffected)
- vllm-project#47308 warmup cross-attn properly in encoder-decoder case (decoder-only, no-op)
- vllm-project#46482 ROCm P/D MoRIIO proxy JSON Content-Type (no-op for us)

No conflicts; no DS4 files changed.

Signed-off-by: jasl <jasl9187@hotmail.com>

Enable the fused probabilistic Markov sampler by default for method=dspark and remove the VLLM_DSPARK_FUSED_MARKOV_SAMPLER environment variable. The config field remains available for explicit bisects. Reuse envs.env_bool for DSpark SpeculativeConfig gates and document the same-step aliasing invariant for the shared draft-probs no-copy fast path.

Document that the Python seed counter must not be frozen by any future CUDA graph capture around sampling. Avoid an unnecessary clone on the top-k/top-p fused sampler path when dtype conversion already produced a float32 copy, while preserving a clone for existing float32 inputs.

aligningmyself · 2026-07-02T04:00:41Z

Validated PR #41834 on a 6-node GB10 (DGX Spark, sm_121a) cluster serving GLM-5.2 — extends the
PR's own 2× GB10 validation to 6×. GlmMoeDsa sparse-MLA loads, autotunes, serves, and generates
correctly. Let me know if not helpful, will stop commenting.

Setup

Hardware: 6× DGX Spark (GB10, SM121), 200 GbE ConnectX-7 fabric (NCCL over TCP, NET_PLUGIN=none).
vLLM: main @ fa24813 + this PR applied via pull/41834.diff — applied cleanly, no conflicts
(19,137 lines / 118 files). FlashInfer main, transformers 5.x, TORCH_CUDA_ARCH_LIST=12.1a.
Model: GLM-5.2-NVFP4 (GlmMoeDsa, 753B total / 40B active, NVFP4 weights + bf16 shared expert).
Parallelism: --tensor-parallel-size 2 --pipeline-parallel-size 3 (world size 6), Ray executor.
Flags: --kv-cache-dtype fp8 --enforce-eager --gpu-memory-utilization 0.80 --max-model-len 4096.

What worked

Backend selection (per rank): FLASHINFER_MLA_SPARSE_SM120 decode + FLASH_ATTN MLA prefill +
FLASHINFER_CUTLASS NVFP4 MoE, fp8_ds_mla KV-cache format.
Autotuner ran and cached on SM120: sparse_mla_sm120_decode_dsv3_2, fp4_gemm,
trtllm::fused_moe::gemm{1,2} — config-cache hits on subsequent ranks.
Memory: ~71 GiB/node model weights; KV cache 1,748,160 tokens (≈427× concurrency @ 4096 ctx).
Correctness (greedy, temp 0):
- 17 × 23 → 340+51 → 391 ✅
- 60 mi / 1.5 h → 60/1.5 → 40 mph ✅
- capital of Australia → Canberra ✅ (clean CoT, respects "one word" / "just the number")
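The concurrency figure in the Memory line follows from dividing the KV-cache capacity by the configured context length. A short check, using only the two numbers reported above (the variable names are illustrative):

```python
kv_cache_tokens = 1_748_160   # reported KV-cache capacity in tokens
max_model_len = 4096          # --max-model-len used in this run

# Number of full-length sequences the cache can hold at once.
full_length_sequences = kv_cache_tokens / max_model_len
print(round(full_length_sequences))  # 427
```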

Throughput (enforce-eager, 200-tok completions):

concurrency	aggregate	per-request
1	8.0 tok/s	8.0
8	25.4 tok/s	3.2
16	25.5 tok/s	3.2
(saturates ~25 tok/s aggregate by C=8 — compute-bound in eager mode; consistent with the PR's
"~35–45 tok/s/step, depth-independent" note.)

Observations / questions for maintainers

--enforce-eager used deliberately to sidestep the reported unaligned-prefill (num_tokens % 16)
Triton-JIT-during-cudagraph issue on SM121. Happy to test the cudagraph path with the warmup-padding
workaround and report the throughput delta — likely a meaningful gain over the eager numbers above.
Slow weight load: ranks reported model-load times of ~160–515 s from local NVMe (not NFS). Larger
than expected for ~71 GiB/node NVFP4 — possibly dequant/indexer setup. Can profile if useful.
Ran at gpu-mem 0.80 (128 GiB unified per node) with comfortable headroom; 0.85 also reached startup
but was marginal near the autotune→serve transition on this unified-memory part.

Offer: I have 6× GB10 and am happy to run larger-context (up to 1M), longer-horizon, cudagraph-on,
or higher-TP/PP-sweep validation for this PR — just say what's useful. This is the first 6-node GB10
serve of a DSA model that I'm aware of.

jasl · 2026-07-02T04:18:21Z

Thank you — this is genuinely useful, please don't stop. A 6-node PP=3 GB10 serve of a second DSA model (GLM-5.2) on this backend is exactly the generalization signal I want for the merge case.

A few notes on your open questions:

You shouldn't need --enforce-eager. The sparse-MLA warmup (VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP, on by default) exists precisely to pre-compile the aligned prefill/decode shapes before cudagraph capture, so the num_tokens % 16 in-graph JIT shouldn't fire. Its gate keys off the attention backend, not the model, so GlmMoeDsa on FLASHINFER_MLA_SPARSE_SM120 should be covered. If you have cycles, please try dropping --enforce-eager — I'd expect a meaningful throughput jump over the eager ~25 tok/s (which is compute-bound as you noted). If it does JIT in-graph, that's a warmup-coverage gap on my side and your repro would let me fix it directly.
gpu-memory-utilization 0.80 is the right setting on GB10. The NVFP4 autotuner's transient workspaces during the autotune→serve transition aren't reflected in the util headroom, and on unified memory they share the same pool — so 0.85 being marginal is expected. --watermark is a cleaner lever than pushing util higher; you're not leaving much on the table at 0.80.
Slow load: the 160–515 s spread across ranks looks like NVMe/CPU contention plus per-rank NVFP4 dequant / indexer setup rather than raw I/O on the ~71 GiB itself. If you can profile disk-read vs dequant vs indexer-prep separately, that'd be interesting.

On your offer — the three most valuable to me:

The cudagraph-on run above (drop --enforce-eager) — the headline, and it doubles as a validation that the warmup generalizes to a second DSA model.
Single-sequence long-context TTFT at 256k → 1M — I can share a needle-recall coherence check so it validates correctness at extreme context, not just that it runs.
TP2 × PP3 correctness, since I mostly test TP — pipeline-parallel DSA coverage I don't otherwise exercise.

Any of those would be a big help. Thanks again for putting 6× GB10 behind this.

wingcomm · 2026-07-02T04:24:58Z

Two more wedges and the PIECEWISE discriminator is now live.

Incident-16 (~45h after the inc-15 recovery): same wedge, same frame. Recurred on the patched build (4214cea), and py-spy caught the identical mhc_fused_post_pre_tilelang victim frame as incident-15 — two
consecutive at MHC now (incident-14 was _forward_prefill). Same mechanism as always (cuLaunchKernelEx → sched_yield, no JIT).

Our hardened watchdog auto-recovered it cleanly this time — after inc-15's failed auto-restart (cancelled init from stale peer state), we made the restart verify 0 sparkrun containers across all 4 nodes
before relaunching; it worked first try, no manual intervention. So recovery is a solved problem even if the hang isn't.

PIECEWISE is running now (your discriminator). Flipped cudagraph_mode: FULL → PIECEWISE on the same 4214cea build — confirmed in the engine config, capturing only piecewise graphs (no FULL decode graph), so
the captured collectives are out of the decode path. Soaking now; I'll report whether the wedge survives it.

One caveat on running it: PIECEWISE costs us ~40% single-stream decode (≈23 tok/s vs ≈40 tok/s median at Running:1 from the pre-switch FULL snapshot). Expected here — our decode is interconnect-latency-bound
(GPUs spin-wait on NCCL), so the per-step launch overhead PIECEWISE reintroduces lands right on the critical path. So we're treating it as a time-boxed diagnostic: one MTBF window (~2 days). If the wedge
recurs under PIECEWISE → it's capture-independent (your on-device cooperative-kernel-race hypothesis), and we revert to FULL. If it stays clean that long → capture is implicated, but ~40% is too steep to
keep, so we'd still want a real fix.
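The quoted cost can be restated per token, which makes the launch-overhead explanation easier to see. A small calculation from the two approximate medians above (variable names are illustrative):

```python
full_tok_s = 40.0        # approx. median single-stream decode under FULL
piecewise_tok_s = 23.0   # approx. median single-stream decode under PIECEWISE

# Relative throughput loss: about 0.425, i.e. the "~40%" quoted above.
slowdown = (full_tok_s - piecewise_tok_s) / full_tok_s

# Extra wall-clock time per generated token, in milliseconds.
extra_ms_per_token = 1000.0 / piecewise_tok_s - 1000.0 / full_tok_s
print(slowdown, extra_ms_per_token)
```

That is roughly 18 ms of additional latency per decode step, consistent with per-step launch overhead landing on the critical path.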

On naming the kernel: we owe you the all-rank capture — the cron watchdog auto-recovered inc-16 faster than we could hold it, so we only have the head-rank py-spy+gdb (same MHC frame). We're wiring the
all-rank version now (bump TORCH_NCCL_TRACE_BUFFER_SIZE→20000, add NCCL_DEBUG_SUBSYS=COLL, fire the flight-recorder dump + gdb thread apply all bt on all 4 ranks, and hold-before-teardown) so the next freeze
names the stuck kernel — persistent_topk vs an NCCL collective. (FWIW the enforce-eager GB10 run you're discussing above is weak evidence in the same direction, though different model/topology.)

Add DeepSeek V4 DSpark proposer

@28k

…stence DSpark (PR #25) shipped a complete V1 DSparkProposer but was force-routed to the V2 runner, where DeepSeek-V4 long-context recall collapses under concurrency (arthur 3/16 @28k vs V1 16/16). Validated DSpark on the V1 runner on 2xRTX SM120: recall 16/16 @28k conc8, coherent, 218 tok/s (~1.3x MTP2).

- vllm/config/vllm.py: drop the dspark->V2 force so DSpark follows normal runner selection (V1 by default for DeepSeek-V4). V2 speculator remains reachable via VLLM_USE_V2_MODEL_RUNNER=1 for A/B.
- kernel_warmup.py: make _deepseek_v4_slot_mapping_warmup dual-interface (V1 input_batch/query_start_loc + V2 block_tables/input_buffers). PR #25 had rewritten it V2-only, so it silently no-opped on the V1 runner and reintroduced first-request JIT for ALL DeepSeek-V4 serving (MTP/DFlash/plain).
- nvidia/model.py: gate _is_dspark_runtime_layer on method=="dspark". The flag dspark_fused_shared_experts_quant defaults True on every SpeculativeConfig and the MTP block sits at layer_idx==num_hidden_layers, so without this gate the unvalidated fused FP8 shared-experts kernel would engage on production MTP.
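The gate described in the nvidia/model.py item can be sketched as follows. This is an illustration only: the standalone function, its signature, and the exact way the flag and layer index are combined are assumptions; only the names method, dspark_fused_shared_experts_quant, layer_idx and num_hidden_layers come from the commit message.

```python
def is_dspark_runtime_layer(
    method: str,
    dspark_fused_shared_experts_quant: bool,
    layer_idx: int,
    num_hidden_layers: int,
) -> bool:
    """Hypothetical stand-in for the gated check described above."""
    # Assumption: the draft/MTP block is the layer appended after the
    # regular stack, i.e. layer_idx == num_hidden_layers.
    is_appended_block = layer_idx == num_hidden_layers
    # The flag defaults to True on every speculative config, so the
    # method check is what keeps the fused path off for MTP.
    return (
        method == "dspark"
        and dspark_fused_shared_experts_quant
        and is_appended_block
    )
```

Under this sketch an MTP configuration with the default flag value never selects the fused path, which is the behavior the commit message describes.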

mergify · 2026-07-02T05:38:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jasl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

jasl requested review from 22quinn, LucasWilkinson, MatthewBonanni, ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners May 6, 2026 15:17

claude Bot reviewed May 6, 2026

mergify Bot added deepseek Related to DeepSeek models nvidia v1 labels May 6, 2026

github-project-automation Bot added this to NVIDIA May 6, 2026

jasl mentioned this pull request May 6, 2026

[DSv4][Nvidia] SM12x DeepSeek V4 support #40991

Closed

gemini-code-assist Bot reviewed May 6, 2026

jasl changed the title ~~[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash~~ [New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes May 6, 2026

jasl requested review from ApostaC, alexm-redhat, heheda12345, njhill, orozery and ywang96 as code owners May 6, 2026 15:54

jasl force-pushed the codex/ds4-sm120-min-enable branch from 042e366 to df2e6f8 Compare May 6, 2026 16:26

danielwoz mentioned this pull request Jun 30, 2026

[Backport][NVFP4] ds4-sm120-* hardcodes Mxfp4MoEMethod — DeepSeek-V4-Flash-NVFP4 fails to load on SM120 (works via ModelOpt→Marlin) jasl/vllm#24

Open

alexbi29 mentioned this pull request Jul 1, 2026

Add DeepSeek V4 DSpark proposer jasl/vllm#25

Merged

jasl and others added 13 commits July 2, 2026 01:29

Add DeepSeek V4 DSpark proposer

ae76acb

Share target embeddings with DSpark draft model

26a86f5

Make DSpark fast path the default

5f2f019

Trim unused speculative buffers

d55a877

Replace DSpark attention v1 kernel

71a0383

Fix DSpark rebase integration

c95e38c

Fix DSpark worker integration after rebase

1273219

Trim DSpark scratch buffers

575106c

Skip incomplete CUDA runtime stubs

ed7bdcc

jasl added 2 commits July 2, 2026 13:34

Merge pull request #25 from alexbi29/codex/dspark-pr-minimal-20260701

3d1b9b3

Add DeepSeek V4 DSpark proposer


20 participants
