[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes#41834

Open
jasl wants to merge 158 commits into vllm-project:main from jasl:codex/ds4-sm120-min-enable

jasl commented May 6, 2026

Contributor

Summary

This PR enables DeepSeek V4 Flash on SM120/SM121 Blackwell client hardware by carrying the SM12x fallback and tuning stack needed for the current vLLM V1 path. It is intended for RTX PRO 6000 Blackwell Workstation Edition, RTX 5090-class SM120, and GB10 / DGX Spark SM121 users who cannot use SM100-only TMEM / tcgen05 kernels.

As of 2026-06-23 this branch is reconciled on top of the merged #43477 and provides the stock-deps path: DeepSeek V4 on SM120/121 that builds and serves on released FlashInfer / DeepGEMM wheels, complementing #43477's route that needs the unreleased FlashInfer #3395 + DeepGEMM #324 dependency branches. As of 2026-06-26 the branch is additionally synced onto current upstream/main (198 commits since the #43477 merge); latest validated head is tag sm120-pr-41834-stable-preview-20260626 (c766cbc6ff). See Update 2026-06-26 — upstream sync and Update 2026-06-23 — #43477 reconciliation below.

Change footprint — model kernels vs. core-vLLM touch points

The branch splits cleanly into model/kernel code and a small set of core-vLLM integration points (116 files, +15.5k/−0.4k, of which ~+3.5k is tests):

  • DeepSeek-V4 model + SM12x kernels (~49 files, ~+10.2k) — the enablement itself. Everything under vllm/models/deepseek_v4/** plus the SM12x sparse-MLA decode / indexer / DeepGEMM kernels that live in shared dirs (v1/attention/backends/mla/sparse_mla_kernels.py, model_executor/layers/sparse_attn_indexer.py, v1/attention/backends/mla/{indexer,sparse_swa}.py, utils/deep_gemm.py, kernels/mhc/tilelang.py), the new DSv4 reasoning parser / tokenizer, and device tuning JSONs.
  • C128A metadata device→host sync removed (models/deepseek_v4/sparse_mla.py, perf, 2026-06-26) — _c128a_effective_topk_width now takes the max position from the CPU-side CommonAttentionMetadata.max_seq_len instead of a per-step int(positions.max().item()) device sync, dropping a launch-stream stall on every C128A metadata step (reported via gdb native stacks by a GB10/TP4 user). Decode is identical (max_seq_len-1 == positions.max()); only chunked prefill sees a safe, slightly-wider 128-aligned top-k.
  • Core-vLLM integration (~36 files, +1.8k/−0.2k) — the hooks below. Almost all are gated by model architecture / quant config / an env flag and are inert for other models.
| Subsystem | Files (≈ lines) | What it does |
| --- | --- | --- |
| KV-cache core | single_type_kv_cache_manager.py (+243), kv_cache_coordinator.py (+67), kv_cache_manager.py (+60), sched/scheduler.py (+1) | prefix-cache correctness for DSv4 sparse-MLA + MTP: an MLA cache-manager with prompt-block protection, a hybrid-coordinator cache_blocks tail-block-reuse rewrite. (Our earlier block_pool stale-hash reset is dropped in the reconcile; it is subsumed by upstream's own unconditional reset, which arrived via the upstream/main merge.) |
| MTP spec-decode | v1/spec_decode/llm_base_proposer.py (+173) | DSv4 MTP probabilistic draft sampling + per-step MTP-layer routing in the shared proposer base |
| MoE quantization | fused_moe.py (+65), oracle/mxfp4.py (+43), routed_experts.py (+33), experts/flashinfer_cutlass_moe.py (+27), quantization/mxfp4.py (+12), oracle/nvfp4.py (+1) | MXFP4 / NVFP4 backend selection; the one-line NVFP4 fix (FLASHINFER_CUTLASS into the SwiGLU-clamp allow-list) lets DSv4-Flash-NVFP4 serve |
| FP8 / Marlin GEMM | quantization/utils/fp8_utils.py (+99), linear/scaled_mm/{cutlass,marlin}.py (+45/+16), csrc/.../marlin_moe_wna16/ops.cu (+10, the only C++) | SM12x e8m0→fp32 upcast + Marlin MoE SM12.0a cudagraph hardening (mirrors open upstream #43730 / #43722) |
| cudagraph / compile / config | config/vllm.py (+44), compilation/breakable_cudagraph.py (+22), passes/utility/fix_functionalization.py (+12), config/compilation.py (+11) | breakable-cudagraph auto-enable gate (MiniMax-only; DSv4 deliberately excluded), DSv4 custom-op defunctionalization + splitting-op registration |
| OpenAI entrypoints / parsers | chat_completion/protocol.py (+101), serve/render/serving.py (+28), tool_parsers/structural_tag_registry.py (+16), chat_utils.py (+11), engine/protocol.py (+9), chat_completion/{serving,batch_serving}.py (+8/+6), reasoning/__init__.py (+4) | expose DSv4 API semantics: reasoning_content / thinking param / tool-call streaming (jasl#19 instruction-following) |
| Kernel warmup | model_executor/warmup/kernel_warmup.py (+617) | additive DSv4 warmup (D512-split prefill precompile + MTP) to avoid JIT-during-inference wedges |
| Weight loading | weight_utils.py (+43), default_loader.py (+16) | fast-safetensors weight filter + EP-skip (lowers DSv4 load overhead on GB10) |
| env / utils | envs.py (+63), utils/flashinfer.py (+16), utils/import_utils.py (+9), v1/worker/{gpu_model_runner,ubatch_utils}.py (+12/+12) | VLLM_DEEPSEEK_V4_* flags + has_cutedsl / has_flashinfer_trtllm_sparse_mla probes |
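
As a rough illustration of the C128A metadata change listed above, the sketch below derives a 128-aligned top-k width from a host-side maximum sequence length instead of reading the maximum position back from the device. The function name `aligned_topk_width`, the `align` default, and the toy sequence lengths are assumptions made for illustration; the branch's actual `_c128a_effective_topk_width` may be structured differently.

```python
def aligned_topk_width(max_seq_len: int, align: int = 128) -> int:
    """Round the largest possible position count up to a multiple of `align`.

    Hypothetical helper: `max_seq_len` stands in for the CPU-side
    CommonAttentionMetadata.max_seq_len, so no device sync is needed.
    """
    if max_seq_len <= 0:
        return 0
    return ((max_seq_len + align - 1) // align) * align


# Decode: each sequence's last position is seq_len - 1, so the host-side
# bound and the device-side maximum agree.
seq_lens = [300, 1025, 77]                     # toy values
positions_max = max(n - 1 for n in seq_lens)   # what positions.max() would return
max_seq_len = max(seq_lens)                    # already known on the host
assert max_seq_len - 1 == positions_max

width_host = aligned_topk_width(max_seq_len)
width_device = aligned_topk_width(positions_max + 1)
print(width_host, width_device)
```

In a chunked-prefill step the host-side bound can exceed the positions actually present in the chunk, which is why the PR describes the resulting width as safe but slightly wider.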

Two notes for review:

  • The most invasive generic edits were removed in the 2026-06-21 audit cleanup (below): the scheduler now carries a single +1-line change (the prefill-fairness heuristics were dropped) and the prefix-cache write-fence is gone.
  • A few hooks do touch code paths shared with non-DSv4 models and are the ones worth a closer look: the kv_cache_coordinator cache_blocks rewrite (affects hybrid-KV models; validated ≥ prior behavior), the MTP proposer base-class change, and the OpenAI-entrypoint plumbing. Everything else (MoE oracle, fp8_utils, cudagraph gate, warmup, envs) is arch / quant / env-gated and inert for other models.

Duplicate-work check

Open PR search was refreshed on 2026-06-12 for SM120 / SM12x / DeepSeek V4 / GB10 terms. The nearest open PRs are related but not duplicates:

| PR | Difference |
| --- | --- |
| #43477 | Merged 2026-06-22. Enables DeepSeek V4 + GLM-5.1 on SM120 via the FlashInfer-SM120 sparse-MLA route, but in its merged form requires the unreleased FlashInfer #3395 + DeepGEMM #324 dependency branches: on released/stock wheels its SM12x path raises at model construction (and the pinned DeepGEMM ref asserts on SM120). This PR is now reconciled on top of #43477 (merge 42657aca65) and carries the stock-deps DSv4 SM120/121 path that runs on released wheels, complementing #43477's fork-deps route. See Update 2026-06-23 — #43477 reconciliation below. |
| #40929 | Earlier WIP Triton fallback effort. This PR is the maintained replacement branch with the broader scheduler, prefix-cache, parser, quant, warmup, and harness-validated fixes carried forward. |
| #42856 | Focused workspace-bound fix that explicitly depends on / references this PR; it is a subset-style bugfix, not the full DeepSeek V4 SM12x enablement branch. |

Fixed preview tags

These tags are in jasl/vllm and give users stable pins while the PR is still moving:

| Tag | Commit | Use |
| --- | --- | --- |
| sm120-pr-41834-stable-preview-20260626 | c766cbc6ff | latest validated head: synced onto upstream/main (198 commits since the #43477 merge; our NVFP4 FLASHINFER_CUTLASS-clamp fix landed upstream as #46492 → fork patch dropped) + the C128A metadata device-sync removal. 6 conflicts resolved; upstream's new cooperative_topk (#43008) gated off SM12x (capability family 120) to keep the validated decode path byte-identical. Validated dual-arch: RTX SM120 GSM8K-200 0.97 + #19 PASS; GB10 SM121 GSM8K-200 0.945 + arthur 64/64 + Forum53 PASS + llama-benchy prefill +80% / decode flat. See Update 2026-06-26 below. |
| sm120-pr-41834-stable-preview-20260623 | f7b4b425b0 | reconciled on top of the now-merged #43477 (merge 42657aca65 of upstream/main), keeping the stock-deps DSv4 SM120/121 path that runs on released wheels. Two reconcile regressions fixed: DeepGEMM no longer auto-enabled on SM120 (a94657e601, the pinned ref asserts), and #43477's prefill-SWA launch is gated + the kernel OOB clamped (f7b4b425b0). Validated dual-arch (see Update 2026-06-23 — #43477 reconciliation below). |
| sm120-pr-41834-stable-preview-20260622b | 5ba0f19f02 | pre-reconcile head: the 2026-06-21 audit head + two long-context (256k+) crash fixes: (1) packed-prefill output now sliced symmetrically with the query under MTP/cudagraph padding (fixes output.size(0)==num_tokens (84 vs 83) when VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1), (2) MTP draft logits cast to float32 before top-k/top-p sampling (fixes an engine-killing assertion on any MTP + non-greedy sampling, gate-independent). See Update 2026-06-22 below. |
| sm120-pr-41834-stable-preview-20260621 | 72261a7af149fa5d3fe2ed2b9956e92590731012 | post-audit cleanup head: breakable-cudagraph default OFF (MiniMax-only gate), long-context recall fixed by the int64 block-offset cast (the redundant write-fence + scheduler prefill-fairness heuristics + NVFP4 b12x lever removed), on top of the jasl#19 + #45309-revert correctness fixes. Validated metrics-flat on SM120 + SM121. |
| sm120-pr-41834-stable-preview-20260620 | a743ef5dfbd16cad0b9a628773c0c1d1841f1790 | prior head (write-fence / COW recall approach, since superseded by the int64-cast fix) |
| sm120-pr-41834-stable-preview-20260612075245 | f32247a5a695fa8979d61837bf6b87da897dcb7d | earlier validated rebased PR branch preview |
| sm120-pr-41834-fallback-before-replacement-20260612053720 | 5d1584e2de2b3c64540e70dfc370b0211eb6b2fc | fallback tag for the old PR head before branch replacement |

Update 2026-06-26 — synced onto upstream/main + dual-arch revalidation

Re-synced the branch onto current upstream/main (merge c7a4386a45, then the C128A device-sync hoist → c766cbc6ff; 198 upstream commits since the #43477 merge). 6 conflicts resolved:

  • oracle/nvfp4.py: union the SwiGLU-clamp backend set to {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as #46492 ("[Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend"), so the fork patch is now redundant.
  • routed_experts.py — combine the two per-tensor-scale loaders into one helper (our e8m0 bitwise view and upstream's 0-D/shape-(1,) _to_scalar normalization).
  • serve/render/serving.py + renderers/online_renderer.py: upstream's #44285 ("[Frontend] Split ServingRender into renderer and entrypoint") split ServingRender into renderer + entrypoint; our DSv4 thinking→template-kwargs threading is re-homed onto the new structure (sampling-params site in ServingRender.render_chat_request, prompt-render site in OnlineRenderer.render_chat).
  • sparse_attn_indexer.py: preserve our SM120 short-row / persistent top-k path, and add upstream's new cooperative_topk (#43008, "[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-latency scenarios") gated to exclude capability family 120, so SM12x decode is byte-identical to the validated path (enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment).
  • engine/protocol.py (keep both DeltaMessage hooks) and tests/models/test_deepseek_v4_mega_moe.py (keep CompilationConfig() fixture).

Inherited for free from the sync: deepseek_v2 redundant-clone removal (#46651), sampler int32-overflow fix (#46560), spec-decode correctness (#45956 / #46533).

Validation — full matrix, both arches, on c766cbc6ff:

| arch | correctness | perf |
| --- | --- | --- |
| RTX SM120 (TP=2) | GSM8K-200 0.97, #19 instruction-following PASS | throughput 8000×1000 captured |
| GB10 SM121 (2-node TP=2) | GSM8K-200 0.945, arthur coherence 64/64 + 24/24, Forum53 multi-user PASS | llama-benchy ctx_pp +80% (1709/1667/1588 t/s @ d8192/16384/32768 vs the prior pinned baseline 943/928/883, reproduced across two runs), decode flat (ctx_tg 38.6/38.5/37.4) |

The MoE backend on both arches is Marlin (W4A16): is_deep_gemm_supported() is False on stock SM12x wheels, so the DeepGEMM/W4A8 path isn't selected and Marlin is the default; it is GSM8K-correct on both SM120 and SM121. The prefill +80% is inherited from upstream's prefill / scheduler / block-pool work (not an SM12x change of ours); decode is flat because our SM12x decode path is preserved unchanged.

Update 2026-06-23 — #43477 reconciliation (stock-deps path) + dual-arch revalidation

Upstream merged #43477 (DeepSeek V4 + GLM-5.1 on SM120 via the FlashInfer-SM120 sparse-MLA route) on 2026-06-22. As merged it does not run on released wheels: its SM12x attention class raises at model construction unless the unreleased FlashInfer #3395 fork symbols are present, and it auto-enables a DeepGEMM MXFP4 path whose pinned ref (#324) asserts on SM120. This PR is now reconciled on top of it so the two coexist: #43477 is the fork-deps route; this PR is the stock-deps route that runs on released FlashInfer / DeepGEMM wheels.

Reconciliation is a merge of upstream/main into the PR branch (42657aca65, 6 conflicts resolved — kept our env/availability-gated SM120 decode route + both FlashInfer probes + #43477's gated prefill-SWA mechanism; dropped our now-redundant block_pool stale-hash reset in favour of upstream's), plus two fixes for regressions the merge introduced:

  1. DeepGEMM no longer auto-enabled on SM120 (a94657e601). #43477 added SM120 to support_deep_gemm, so the engine selected a DeepGEMM MXFP4 kernel whose pinned/released ref aborts at init (Assertion sf.size(-2)==ceil_div(mn,gran_mn)). SM120 now falls back to Marlin/cutlass as before (enabling the DeepGEMM path needs the unmerged DeepGEMM #324).
  2. #43477's prefill-SWA launch gated + kernel OOB clamped (f7b4b425b0). The merged paged prefill-SWA index kernel launched unconditionally and computed block_table addresses for masked-off tail lanes of deep (32k) prefill rows, which SM12x + Triton 3.6 faults as an illegal address even though the load is masked, producing cudaErrorLaunchFailure under concurrent load. The launch is now gated behind VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL (default off, so the stock decode-only path never launches it) and the kernel's masked lanes are clamped.
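
The masked-lane clamp in fix 2 can be illustrated in plain Python rather than Triton. This is only a sketch under assumptions: `gather_block_ids`, its arguments, and the `-1` fill value are hypothetical and are not taken from the kernel in this PR. The point it shows is that the index is clamped before any address is formed, and the mask is applied to the result afterwards.

```python
def gather_block_ids(block_table, lane_offsets, num_valid):
    """Plain-Python stand-in for a masked gather over a block table.

    Lanes with offset >= num_valid are masked off, but their index is still
    clamped so the computed address always stays inside `block_table`.
    """
    last = len(block_table) - 1
    out = []
    for off in lane_offsets:
        in_bounds = off < num_valid
        safe_off = min(max(off, 0), last)       # clamp before forming the address
        value = block_table[safe_off]           # never reads out of range
        out.append(value if in_bounds else -1)  # mask applied to the result
    return out


table = [10, 11, 12, 13]
# Lanes 4..7 are tail lanes past the end of the row.
print(gather_block_ids(table, range(8), num_valid=4))
```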

Validation — metrics flat vs the pre-reconcile head (5ba0f19f02), both arches, DeepSeek-V4-Flash, fp8 KV, MTP=2:

Check RTX SM120 (2× PRO 6000) GB10 SM121 (2-node)
Builds + serves on stock deps ✅ (2-node TP=2, NCCL 2.30.7)
GSM8K strict 0.9537528 — bit-identical to pre-reconcile 0.945 (5-shot, limit 200)
Default-path baseline phases identical exit codes to pre-reconcile
Prefill throughput (random 8192×512) 1638.4 tok/s — bit-identical
Long-context coherence (conc 2 / 12) (RTX 352/352 prior) 24/24
--async-scheduling A/B (off vs on), incl. 64k concurrency + sampling stress safe + equivalent, 0 crashes, 0 wedges

The earlier "--async-scheduling regression" note is withdrawn — that crash was this same prefill-SWA OOB, fixed above; with the fix, --async-scheduling on/off are equivalent and crash-free through a 64k concurrency-8 sampling soak.

Update 2026-06-23 — GB10 / DGX Spark (SM121) long-context frontier, 256k–1M

Cold-prefill capability sweep on 2× GB10 / DGX Spark (SM121), TP=2 over RoCE, max-model-len 1048576, gpu-memory-utilization 0.75, MTP=2, fp8 KV, EP-off, prefix-cache disabled, FULL_AND_PIECEWISE, greedy, C=1 (post-audit head). All five points complete cleanly (0 failures, no OOM / crash). KV cache = 1,868,754 tokens (5,964 bytes/token); a 1M-token request admits at 1.78× concurrency at this utilization.
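
The KV-cache figures quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers in the paragraph; whether the token capacity is per GPU or aggregate is not stated in the PR, so the byte total below is simply the product of the two reported values.

```python
kv_cache_tokens = 1_868_754      # KV cache capacity reported above
bytes_per_token = 5_964          # per-token KV footprint reported above
max_model_len = 1_048_576        # --max-model-len used in the sweep

# How many full-length requests fit in the cache at once.
concurrency = kv_cache_tokens / max_model_len
# Implied cache size from the two reported values.
kv_bytes = kv_cache_tokens * bytes_per_token

print(f"full-length requests that fit: {concurrency:.2f}x")
print(f"implied KV cache size: {kv_bytes / 2**30:.2f} GiB")
```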

| Context | Prompt tokens | Cold TTFT |
| --- | --- | --- |
| 256k | 261,588 | 392.6 s |
| 384k | 392,648 | 648.5 s |
| 512k | 523,728 | 963.7 s |
| 768k | 785,868 | 1769.0 s |
| 1M | 1,048,011 | 2789.0 s |

The multi-minute cost is the cold prefill (TTFT), which is GPU-bound (GPU 96% throughout each TTFT window) and scales super-linearly (O(N^1.4): 256k→512k = 2.45×, 512k→1M = 2.89×).
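
The scaling figures above can be reproduced from the sweep table. The sketch below fits a single power-law exponent between two measured points (an end-to-end estimate, not a regression over all five) and recomputes the two doubling ratios; the helper name `scaling_exponent` is introduced here for illustration.

```python
import math

# (prompt tokens, cold TTFT in seconds) from the sweep table above
points = [
    (261_588, 392.6),
    (392_648, 648.5),
    (523_728, 963.7),
    (785_868, 1769.0),
    (1_048_011, 2789.0),
]


def scaling_exponent(a, b):
    """Exponent k such that ttft_b / ttft_a == (tokens_b / tokens_a) ** k."""
    (n_a, t_a), (n_b, t_b) = a, b
    return math.log(t_b / t_a) / math.log(n_b / n_a)


overall = scaling_exponent(points[0], points[-1])
ratio_256k_512k = points[2][1] / points[0][1]
ratio_512k_1m = points[4][1] / points[2][1]

print(f"256k -> 1M exponent: {overall:.2f}")
print(f"256k -> 512k TTFT ratio: {ratio_256k_512k:.2f}x")
print(f"512k -> 1M TTFT ratio: {ratio_512k_1m:.2f}x")
```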

A dedicated decode-vs-context sweep (256-token generation at 16k / 64k / 256k / 512k) shows the opposite for generation: steady-state decode is essentially flat with depth — median inter-step latency ~61–69 ms across 16k→512k (≈30–45 tok/s effective with MTP), i.e. throughput does not meaningfully degrade as context grows. That matches the per-step cost being dominated by fixed MoE GEMM + the 2-node RoCE all-reduce rather than the depth-dependent (O(N)) indexer. (Decode rates from a 16-token TTFT-only run are not reliable — too short, plus MTP bundling — so they are not used here.)

So the long-context penalty is concentrated entirely in the one-time cold prefill (TTFT above), not in generation: prefill is GPU-bound compute/bandwidth (LPDDR5X ~273 GB/s) plus the per-chunk 2-node RoCE all-reduce (no inter-node NVLink), while decode stays ~constant per token. MTP keeps ~2.0 acceptance at depth. Practically: GB10 suits large-context-in → generation-out when the one-time cold first token (minutes at 384k+) is acceptable or amortized by prefix caching; once generating, throughput is depth-independent. The 1M cold TTFT is ~20% faster than the 2026-06-06 baseline (2789 s vs 3504 s).

Update 2026-06-22 — long-context (256k+) crash fixes (latest validated head)

Two distinct crashes were reported at long context (≥256k, MTP=2). Both are fixed on sm120-pr-41834-stable-preview-20260622b (5ba0f19f02); the rest of the branch is unchanged from the 2026-06-21 audit head:

  1. Packed-prefill output slice (output.size(0)==num_tokens, 84 vs 83). With VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1, the packed _forward_prefill path sliced the query to num_prefill_tokens under MTP/cudagraph padding but passed the unsliced padded output to the kernel, which derives num_tokens from q.shape and asserts output.size(0)==num_tokens → crash, cascading to an illegal-memory-access. The output is now sliced symmetrically. This path is gated (default FlashMLA prefill loops over q.shape and has no such assert → was never affected). Interim workaround for older builds: set VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=0.

  2. MTP draft-sampler float32 cast (engine-killing assertion on non-greedy sampling). The MTP probabilistic draft sampler fed bf16 draft-head logits into the Triton top-k/top-p kernel, which asserts logits.dtype==torch.float32; the resulting AssertionError kills the worker (and cascades to CUDA errors on TP peers). This fires on any MTP + non-greedy (top-k/top-p/temperature) request and is independent of the FlashInfer gates, so it reproduces "even without FlashInfer". Greedy decoding returns before the sampler, which is why greedy GSM8K validation never surfaced it. The draft logits are now cast to float32 (matching the main sampler). Validated on 2x RTX PRO 6000 (SM120): a 256k sustained sampling soak (temperature 0.7 / top_p 0.9, 9 concurrent workers, MTP=2, EP) that crashed on the first sampled request now runs clean (200+ requests).
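
The first crash (the 84-vs-83 output mismatch) can be mimicked with plain Python lists. This is a sketch under assumptions, not the branch's code: `run_prefill_kernel` is a hypothetical stand-in for a kernel that derives `num_tokens` from the query and asserts that the output has the same leading size.

```python
def run_prefill_kernel(q_rows, out_rows):
    """Stand-in for a kernel that derives num_tokens from the query."""
    num_tokens = len(q_rows)
    assert len(out_rows) == num_tokens, (len(out_rows), num_tokens)
    for i, row in enumerate(q_rows):
        out_rows[i] = row * 2  # placeholder compute
    return out_rows


padded_tokens = 84          # length after MTP/cudagraph padding
num_prefill_tokens = 83     # real prefill tokens
q = list(range(padded_tokens))
output = [0] * padded_tokens

# Slicing only the query reproduces the 84-vs-83 mismatch.
try:
    run_prefill_kernel(q[:num_prefill_tokens], output)
    mismatch = None
except AssertionError as exc:
    mismatch = exc.args[0]

# Slicing both sides symmetrically satisfies the kernel's check.
# (A list slice is a copy; a tensor slice would be a view into `output`.)
result = run_prefill_kernel(q[:num_prefill_tokens], output[:num_prefill_tokens])
print(mismatch, len(result))
```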

Note for very long contexts (e.g. --max-model-len 500000): the sparse-MLA / indexer workspaces are sized by max_model_len and are not yet SM12x-arch-gated, so they consume a large fixed share of VRAM and leave a thin KV budget — see #42856 for the workspace-shrink fix. This is a memory-headroom concern, separate from the two crashes above.

Update 2026-06-21 — post-audit cleanup

This supersedes the 2026-06-20 and 2026-06-18 heads and the earlier validation data below. An audit of the SM12x branch against current upstream removed redundant, disproven, and experimental deltas; the cleaned head is validated metrics-flat (marginally better) on 2x RTX PRO 6000 Blackwell (SM120) and 2-node GB10 / DGX Spark (SM121), DeepSeek-V4-Flash, fp8 KV, MTP=2. The jasl#19 (instruction-following) and #45309 breakable-cudagraph-garbage revert (#45972) correctness fixes are retained. Five changes:

  1. Breakable-cudagraph stays default OFF (FULL_AND_PIECEWISE). DeepSeek-V4 is deliberately excluded from breakable-cudagraph auto-enable — on real 2x GB10 MTP decode breakable regressed throughput and degraded as output length grew (≈31→19 tok/s at 400→800 max-tokens vs a flat ≈40). The gate is now a single MiniMax-only helper instead of a dead always-False stub; behavior is unchanged. (VLLM_USE_BREAKABLE_CUDAGRAPH=1 still opts in.)

  2. The long-context recall fix is the int64 block-offset cast, not a cache fence. The 2026-06-20 head attributed the MTP high-concurrency recall/garble bug to a missing copy-on-write on writable caches and added a prefix-cache write-completion fence. Further investigation showed that hypothesis was wrong: the actual cause is an int32 overflow of the packed-KV block offset in the SM12x paged-MQA-logits indexer kernels, fixed by an int64 cast (retained). With the int64 fix in place the write fence is redundant — a fence-OFF recall gate holds 8/8 @ conc=8 and 16/16 @ conc=16 (0 miss) on RTX, and 8/8 on GB10. The write-completion fence and the COW broadening paired with it were therefore removed.

  3. Removed the scheduler prefill-fairness heuristics (ungated, generic very-long-prefill / mixed-decode chunk-limiting). They targeted a decode cliff later re-diagnosed as MoE-GEMM + NCCL-all-reduce bound (not schedulable) and were not load-bearing: a cleanup-vs-prior A/B shows an identical mixed prefill/decode fairness ratio (0.716 vs 0.714) and equal inter-chunk latency.

  4. Moved the experimental VLLM_NVFP4_GEMM_BACKEND b12x research lever out of the PR (off-by-default, unused on the shipped path) and dropped a tool-calling-env diff-reflow churn.
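
The int32 overflow behind change 2 is easy to reproduce outside a kernel. The sketch below uses `ctypes` to emulate 32-bit and 64-bit signed arithmetic; the block stride and block id are illustrative numbers chosen to exceed the int32 range, not values taken from the SM12x indexer kernels.

```python
import ctypes


def offset_int32(block_id: int, block_stride: int) -> int:
    """Multiply as a 32-bit signed integer would, wrapping on overflow."""
    return ctypes.c_int32(block_id * block_stride).value


def offset_int64(block_id: int, block_stride: int) -> int:
    """Same product held in a 64-bit signed integer, as after an int64 cast."""
    return ctypes.c_int64(block_id * block_stride).value


# Illustrative numbers only (not taken from the PR).
block_stride = 256 * 1024      # elements per packed-KV block
block_id = 10_000              # a block deep into a large cache

true_offset = block_id * block_stride
print(true_offset > 2**31 - 1)                   # exceeds the int32 range
print(offset_int32(block_id, block_stride))      # wrapped, points at the wrong data
print(offset_int64(block_id, block_stride) == true_offset)
```

A wrapped offset reads the wrong KV rows without necessarily faulting, which is consistent with the symptom being recall/garble at high concurrency rather than a crash.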

Net vs the 2026-06-20 head: 6 files, −833 lines (the removed fence + scheduler heuristics + their tests). The decode/prefill CUDA kernels are byte-identical across the cleanup, so the gated-decode-optimization profile and the 2026-06-12 throughput baselines below are unchanged.

Validation, 2026-06-21

Trivial-prompt generation (cudagraph sanity), both platforms: 2+2 → 4, 7*8 → 56, capital of France → Paris — no garbage.

Default decode path, MTP=2:

Gate RTX SM120 GB10 SM121
GSM8K strict (8-shot full · 5-shot limit-200 · limit-100) 0.954 (full) · 0.96 (l200) 0.96 (l100)
Long-context recall, fence OFF, conc 8 / 16, MTP2 8/8 + 16/16, 0 miss 8/8
Instruction-following (jasl#19) pass (JSON-only)
tool-call (15-case suite) 87%
Scheduler-removal A/B — mixed prefill/decode fairness ratio (cleanup vs prior) 0.716 vs 0.714
random 8192×512 TPOT (cleanup vs prior, ms) 6.27 vs 6.5
indexed-D512 min-token gate 4096 vs 8192 — prefill @4k 9,687 vs 6,203 tok/s

The GB10 SM121 run is a from-scratch 2-node rebuild of the cleaned head (NCCL 2.30.7 re-pinned per node); arithmetic, GSM8K, and the long-context recall gate all pass, confirming the fence removal holds recall on SM121 as well. The recall fix is the int64 cast in the SM12x indexer kernel, so the 2026-06-12 throughput baselines below are unchanged.

llama-benchy (eugr format), GB10 2-node / SM121, MTP=2, prefix-cache on (GB10 MTP decode/prefill profile — unchanged across the 2026-06-21 cleanup, decode kernel byte-identical):

| test | t/s | peak t/s | ttfr (ms) |
| --- | --- | --- | --- |
| pp2048 (cold) | 1205.5 ± 22 | | 1705 |
| tg128 (C=1) | 40.0 ± 0.4 | 45.7 | |
| ctx_pp @ d8192 | 1722.5 ± 5 | | 4762 |
| ctx_tg @ d8192 | 38.5 ± 1.5 | 43.3 | |
| ctx_pp @ d16384 | 1674.8 ± 2 | | 9788 |
| ctx_tg @ d16384 | 39.2 ± 2.3 | 44.3 | |
| ctx_pp @ d32768 | 1595.3 ± 1 | | 20547 |
| ctx_tg @ d32768 | 41.6 ± 1.5 | 46.3 | |

Prefill 1595–1722 tok/s at depth; decode 40 tok/s @ C=1 holding 38–42 out to 32K context (no decode cliff); prefix-cache hit 42–46% under MTP.

Gated SM120 decode optimization (VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE=1)

The decode gate uses flashinfer.mla._sparse_mla_sm120 (in FlashInfer main / 0.6.13; absent from the 0.6.12 release). Installing it correctly matters: a bare pip install --upgrade flashinfer-python @ git+main bumps flashinfer-python but leaves a stale flashinfer-cubin / flashinfer-jit-cache, and FlashInfer then raises a version-mismatch error at startup (and re-JITs kernels). Uninstall the precompiled packages first, then upgrade:

pip uninstall -y flashinfer-jit-cache flashinfer-cubin
pip install --upgrade "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git"

For a reproducible pin instead of tracking moving main, install matching flashinfer-python + flashinfer-cubin nightlies (e.g. 0.6.13.dev20260619, a bit-identical decode kernel to the validated build) — again uninstalling flashinfer-jit-cache first.
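
Before starting the server it can help to confirm that the FlashInfer distributions agree. The sketch below uses the standard-library `importlib.metadata` with the three distribution names from the commands above; treating "versions must be exactly equal" as the rule is an assumption here, since the exact comparison FlashInfer performs at startup is not described in this PR.

```python
from importlib import metadata

# Distribution names as used in the install commands above.
PACKAGES = ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache")


def installed_versions(lookup=metadata.version):
    """Map each FlashInfer distribution to its version, or None if absent."""
    found = {}
    for name in PACKAGES:
        try:
            found[name] = lookup(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(versions):
    """Return installed companion packages whose version differs from flashinfer-python."""
    base = versions.get("flashinfer-python")
    return {
        name: ver
        for name, ver in versions.items()
        if name != "flashinfer-python" and ver is not None and ver != base
    }


if __name__ == "__main__":
    versions = installed_versions()
    print(versions)
    print("mismatched:", find_mismatches(versions))
```

An empty "mismatched" result with `flashinfer-jit-cache` reported as `None` corresponds to the state the uninstall-then-upgrade steps aim for.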

RTX SM120, decode gate ON vs OFF, ctx0 decode (aggregate tok/s, 0 errors all rows):

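| C | gate OFF | gate ON | gain |
| --- | --- | --- | --- |
| 1 | 189.7 | 201.4 | +6% |
| 2 | 311.3 | 334.7 | +8% |
| 4 | 483.1 | 531.6 | +10% |
| 8 | 707.6 | 801.9 | +13% |
| 16 | 990.5 | 1164.7 | +18% |
| 32 | 1545.0 | 1849.6 | +20% |
| 64 | 2132.5 | 2814.8 | +32% |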

gate-ON @C64 = 2814.8 tok/s matches the community target (~2815). The decode CUDA kernel is byte-identical across the rebase, so this profile is unchanged.

The default path needs no FlashInfer update. With the gate off (default), the import is lazy/gated, so FlashInfer 0.6.12 (official) works unchanged. On GB10 / 2-node, also pin nvidia-nccl-cu13==2.30.7 (a rebuild reverts it; a per-node mismatch hangs the NCCL handshake).
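
The lazy, gated import described above follows a common pattern, sketched below. The environment variable and module path come from this PR; the helper names and the rule that only the string "1" enables the gate are assumptions, and vLLM's own env parsing may differ.

```python
import importlib
import os


def sm120_decode_gate_enabled(env=os.environ) -> bool:
    """True only when the opt-in flag from this PR is set to "1"."""
    return env.get("VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE", "0") == "1"


def load_sparse_mla_module(env=os.environ):
    """Import the FlashInfer module only when the gate is on.

    Returns None when the gate is off or the module is missing, so the
    default path never touches FlashInfer at import time.
    """
    if not sm120_decode_gate_enabled(env):
        return None
    try:
        return importlib.import_module("flashinfer.mla._sparse_mla_sm120")
    except ImportError:
        return None


print(load_sparse_mla_module({}))  # gate off: no import attempted
```

Because the import happens only inside the gated branch, an older FlashInfer release that lacks the module is never asked for it on the default path.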

Gated SM120 prefill optimization (VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1)

Symmetric to the decode gate, prefill has an opt-in packed FlashInfer sparse-MLA path: VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1 (default off; ~+5–6% single-stream prefill). With it off, prefill defers to the default FlashMLA indexed-D512 path. It routes through the same flashinfer.mla._sparse_mla_sm120 kernels as the decode gate, so it carries the identical FlashInfer version requirement — the install/pin steps above apply unchanged (FlashInfer main / 0.6.13; the default-off path needs no FlashInfer update and runs on 0.6.12). Decode and prefill share one FlashInfer build; there is no separate version to track for prefill.

Branch validation, 2026-06-12

Base and head:

  • upstream base: 8a91228dbe363d1d113deb2a82e289429130dd01
  • PR head: f32247a5a695fa8979d61837bf6b87da897dcb7d
  • branch range: 96 commits over upstream/main

Commands run on the final head:

| Command | Result |
| --- | --- |
| git diff --check upstream/main...HEAD | pass |
| DCO scan over upstream/main..HEAD | pass; every commit has Signed-off-by |
| VLLM_TARGET_DEVICE=empty .venv/bin/python -m compileall -q vllm/envs.py vllm/model_executor/warmup/kernel_warmup.py vllm/models/deepseek_v4 vllm/v1/core vllm/v1/attention/backends/mla vllm/reasoning/deepseek_v4_reasoning_parser.py tests/test_envs.py tests/v1/core/test_prefix_caching.py tests/v1/core/test_scheduler.py tests/reasoning/test_deepseekv4_reasoning_parser.py tests/quantization/test_sm12x_tuned_config_lookup.py | pass |
| .venv/bin/python -m pytest tests/test_envs.py::test_deepseek_v4_sparse_mla_stats_path_env -q on the remote vLLM environment | 1 passed, 16 warnings |
| python3 -m pytest tests/test_scripts.py -q in the public harness | 128 passed in 14.41s |

Local vLLM pytest/ruff were not run on the Mac checkout because its .venv does not currently include torch or ruff. GPU-path validation remains remote SM120/SM121-only.

Latest clean SM120 RTX PRO 6000 x2 data, 2026-06-12

Artifact roots:

  • artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_short_throughput_mtp_noep_20260612084721
  • artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_clean_mtp_noep_20260612080629

Short-throughput profile:

  • TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
  • max_model_len=131072, gpu_memory_utilization=0.975, max_num_batched_tokens=4096, max_num_seqs=24.
  • Prefix cache disabled, FULL_AND_PIECEWISE, 80 prompts per concurrency.
  • Phase exits: server_startup=0, bench_hf_mt_bench=0, bench_random_prefill_sweep=0.
  • Regression check: output/input throughput ratios are against the previous accepted same-profile EP-off reference; all are above the 0.95 floor.

HF MT-bench, 80 prompts:

| C | output tok/s | ratio vs reference | mean TTFT ms | p99 ITL ms | MTP acceptance % |
| --- | --- | --- | --- | --- | --- |
| 1 | 180.94 | 1.009 | 49.59 | 13.08 | 68.36 |
| 2 | 284.53 | 1.003 | 70.04 | 32.35 | 68.19 |
| 4 | 427.10 | 0.999 | 82.70 | 38.83 | 68.25 |
| 8 | 600.33 | 1.005 | 110.97 | 86.19 | 67.91 |
| 16 | 840.46 | 1.019 | 156.73 | 86.50 | 67.34 |
| 24 | 987.77 | 1.030 | 209.05 | 86.71 | 68.20 |

Random prefill sweep, C=1, output length 128, 8 requests per case:

| Prompt / output tokens | input tok/s | ratio vs reference | mean TTFT ms | requests |
| --- | --- | --- | --- | --- |
| 4K / 128 | 3123.74 | 0.996 | 660.21 | 8 / 8 |
| 16K / 128 | 6209.00 | 1.005 | 2030.49 | 8 / 8 |
| 64K / 128 | 7049.72 | 0.999 | 8715.51 | 8 / 8 |
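
The regression check described in the profile notes amounts to comparing each throughput ratio against the 0.95 floor. A minimal sketch using the "ratio vs reference" values from the two tables above (the helper name `below_floor` is introduced here; the harness's actual check is not shown in this PR):

```python
FLOOR = 0.95

# "ratio vs reference" columns from the two tables above
mt_bench_ratios = {1: 1.009, 2: 1.003, 4: 0.999, 8: 1.005, 16: 1.019, 24: 1.030}
prefill_ratios = {"4K": 0.996, "16K": 1.005, "64K": 0.999}


def below_floor(ratios, floor=FLOOR):
    """Return the entries whose throughput ratio falls under the floor."""
    return {key: r for key, r in ratios.items() if r < floor}


failures = {**below_floor(mt_bench_ratios), **below_floor(prefill_ratios)}
worst = min(list(mt_bench_ratios.values()) + list(prefill_ratios.values()))
print("failures:", failures)
print("worst ratio:", worst)
```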

Correctness and reliability profile:

  • TP=2, MTP=2, expert parallel off, FP8 KV, prefix cache disabled, max_model_len=131072, max_num_seqs=4, max_num_batched_tokens=4096.
  • Phase exits: server_startup=0, bench_hf_mt_bench=0, eval_gsm8k=0, bench_random_prefill_sweep=0, bench_random_8000x1000=0, bench_random_256x256=0.
  • Post-run current-boot driver scan found no Xid, UVM, NV_ERR, GPU-lost, illegal-access, unspecified-launch, or fatal GPU signals; no vLLM compute processes were left running.

GSM8K 5-shot, limit-200, /v1/completions, MTP=2, concurrency 4:

| Metric | Value | Floor | Result |
| --- | --- | --- | --- |
| flexible exact match | 0.965 | 0.940 | pass |
| strict exact match | 0.940 | 0.925 | pass |

Additional 128K-profile random checks:

| Shape | C | output tok/s | mean TTFT ms | p99 ITL ms | MTP acceptance % |
| --- | --- | --- | --- | --- | --- |
| 8K / 1K | 1 | 130.93 | 1367.03 | 13.44 | 52.56 |
| 8K / 1K | 2 | 191.19 | 1586.64 | 17.44 | 50.28 |
| 8K / 1K | 4 | 260.72 | 1666.96 | 199.75 | 51.76 |
| 256 / 256 | 1 | 153.07 | 88.80 | 13.17 | 51.46 |
| 256 / 256 | 4 | 369.86 | 127.80 | 84.44 | 52.50 |

Latest clean GB10 / SM121 data, 2026-06-12

Artifact root:

  • artifacts/codex_pr_stable_preview_f32247a/2x_gb10_sm121/gb10_forum53_mtp2_epoff_c2_gmem0685_mml81920/20260612074113

Profile:

  • TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
  • max_model_len=81920, max_num_seqs=2, max_num_batched_tokens=4096, gpu_memory_utilization=0.685.
  • Prefix cache enabled; Forum #53 C=2 shape: forum53_c2:2:2:3200:256.
  • This covers the 80K-token prompt case on the final PR head. Failed, interrupted, or driver-signal artifacts are intentionally excluded from this PR body.

Gate result:

| Gate | Result |
| --- | --- |
| summary ok | true |
| serve_start.exit_code | 0 |
| streaming_pressure.exit_code | 0 |
| driver health | ok=true, signal count 0 |
| request failures | 0 / 4 |
| preemptions | 0 |

Timing and runtime summary:

| Metric | Value |
| --- | --- |
| max prompt tokens | 80,127 |
| max TTFT | 124.045698 s |
| max elapsed | 124.949141 s |
| avg inter-chunk latency | 0.056711 s |
| p95 inter-chunk latency | 0.064278 s |
| p99 inter-chunk latency | 0.144954 s |
| max inter-chunk latency | 0.144954 s |
| GPU KV usage avg / max | 65.81% / 86.40% |
| prefix-cache hits / queries | 79,872 / 3,444,165 |

Running the NVFP4 checkpoint

This branch also serves nvidia/DeepSeek-V4-Flash-NVFP4 on SM12x (RTX PRO 6000 / GB10). The NVFP4 MoE auto-selects the FlashInfer CUTLASS backend (the SwiGLU-clamp model gate now accepts it), so no --moe-backend flag is required, and no special FlashInfer build is needed (the 0.6.12 release works):

vllm serve nvidia/DeepSeek-V4-Flash-NVFP4 \
  --trust-remote-code --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --tokenizer-mode deepseek_v4

--kv-cache-dtype fp8 is mandatory: DeepSeek-V4's fp8_ds_mla attention asserts an fp8 KV layout, so the default auto fails at model construction (this is not NVFP4-specific). Expert-parallel off (plain TP) is the supported path.
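
Once the server is up, a quick smoke test can go through the OpenAI-compatible `/v1/completions` route used for the GSM8K runs in this PR. The sketch below only builds the request; the host and port (`localhost:8000`), the prompt, and the sampling fields are assumptions for illustration and should be adjusted to the actual deployment.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST to the OpenAI-compatible completions route."""
    payload = {
        "model": "nvidia/DeepSeek-V4-Flash-NVFP4",
        "prompt": prompt,
        "max_tokens": 16,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("2+2=")
print(req.full_url)
# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```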

Accuracy matches MXFP4 (GSM8K 8-shot ~0.96 on both SM120 and SM121). Note that on SM12x NVFP4 is not a memory or throughput win versus MXFP4: NVFP4 weights are ~4 GiB/GPU larger (~78 vs ~74 GiB), leaving less KV-cache room (lower max concurrency); single-stream prefill is marginally faster and aggregate decode marginally slower. Its value here is checkpoint availability / parity with the SM100 datacenter path, not an SM12x performance advantage — MXFP4 remains the better practical choice on consumer Blackwell.

AI assistance disclosure

AI assistants, including OpenAI Codex/GPT models and Anthropic Claude models, were used for code review, refactoring support, regression-script writing, and benchmark analysis. The branch was validated through human review plus the commands and harness artifacts listed above.

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the deepseek, nvidia, and v1 labels May 6, 2026
@jasl

jasl commented May 6, 2026

Contributor Author

@zyongye
I've cleaned up the old PR. Could you help review this one?

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.

@chatgpt-codex-connector


💡 Codex Review

def _sparse_indexer_requires_deep_gemm() -> bool:
    return current_platform.is_cuda() and not (
        current_platform.is_device_capability_family(120)
    )

P1: Keep DeepGEMM requirement for SM120 FP4 indexer path

This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (fp8_fp4_*) because the new SM120 fallback only handles q_scale is None (FP8 Q). With use_fp4_cache=True on SM120 and no DeepGEMM installed, construction succeeds and the first prefill/decode call fails at runtime with the DeepGEMM _missing() error instead of being rejected up front.
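
One way to get the up-front rejection this comment asks for is to make the requirement depend on the cache format as well as the device family. The sketch below is hypothetical: `sparse_indexer_requires_deep_gemm`, `check_indexer_config`, and their plain-argument signatures are illustrative stand-ins for the helper quoted above, not the PR's code, and it assumes the SM120 fallback covers only the FP8-Q case as the comment describes.

```python
def sparse_indexer_requires_deep_gemm(
    is_cuda: bool,
    is_sm120_family: bool,
    use_fp4_cache: bool,
) -> bool:
    """Hypothetical stand-in for _sparse_indexer_requires_deep_gemm()."""
    if not is_cuda:
        return False
    # Assumption taken from the review comment: the SM120 fallback handles
    # only FP8 Q (q_scale is None), so the FP4 cache path still needs DeepGEMM.
    if is_sm120_family and not use_fp4_cache:
        return False
    return True


def check_indexer_config(
    is_cuda: bool,
    is_sm120_family: bool,
    use_fp4_cache: bool,
    has_deep_gemm: bool,
) -> None:
    """Fail at construction time rather than on the first prefill/decode call."""
    required = sparse_indexer_requires_deep_gemm(
        is_cuda, is_sm120_family, use_fp4_cache
    )
    if required and not has_deep_gemm:
        raise RuntimeError(
            "DeepGEMM is required for this sparse indexer configuration"
        )
```

With this shape, SM120 with an FP8 cache still runs without DeepGEMM, while SM120 with `use_fp4_cache=True` and no DeepGEMM is rejected before any kernel is launched.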


if self.load_config.load_format == "fastsafetensors":
    weights_iterator = fastsafetensors_weights_iterator(
        hf_weights_files,
        self.load_config.use_tqdm_on_load,
    )

P2: Propagate weight_name_filter to fast safetensor loaders

The new pre-load weight_name_filter is only wired into safetensors_weights_iterator; this branch still loads all tensors for fastsafetensors (and similarly other non-default safetensor iterators), so skipped tensors are still materialized. For DeepSeek V4 this defeats the intended early skip of MTP weights and can reintroduce high transient memory use/OOM when these load formats are enabled.
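
The point of the comment is that the filter has to run before a tensor is read, not after. A minimal sketch of that ordering follows; `filtered_weights_iterator` and `load_tensor` are hypothetical names for illustration and do not correspond to the fastsafetensors API, and the convention that the filter returns True for names to skip is an assumption.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def filtered_weights_iterator(
    names: Iterable[str],
    load_tensor: Callable[[str], object],
    weight_name_filter: Optional[Callable[[str], bool]] = None,
) -> Iterator[Tuple[str, object]]:
    """Yield (name, tensor) pairs, dropping filtered names before loading.

    weight_name_filter returns True for names that should be skipped
    (an assumed convention for this sketch).
    """
    for name in names:
        if weight_name_filter is not None and weight_name_filter(name):
            # load_tensor is never called, so nothing is materialized.
            continue
        yield name, load_tensor(name)
```

Any loader that enumerates tensor names before reading them could apply the same check, which is what keeps skipped MTP weights from contributing to transient memory use.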

@jasl jasl changed the title from "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash" to "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes" May 6, 2026
@jasl jasl force-pushed the codex/ds4-sm120-min-enable branch from 042e366 to df2e6f8 Compare May 6, 2026 16:26
… set

Multi-dimensional review hardening of the DeepSeek-V4 SM12x path. All
changes are pure-Python and no-ops on the canonical serve configs:

- flashmla: _prefill_workspace_topk_bound now reserves the C128A prefill
  top-k width from the full compressed region (c128a_max_compressed, the
  same bound the metadata builder allocates c128a_prefill_buffer with)
  instead of index_topk. Above ~262144 context the C128A effective_topk
  exceeds index_topk (2048); the locked prefill workspace previously fit
  only because the lightning indexer's much larger reservation incidentally
  absorbed the gap. No-op at max_model_len <= 262144.
- kernel_warmup: share _DEEPSEEK_V4_SPARSE_MLA_BACKENDS with
  flashinfer_sparse_mla_warmup. The local copy still listed the old
  "V4_FLASHMLA_SPARSE" backend name (renamed "FLASHMLA_SPARSE_DSV4"
  upstream); the warmup gate only kept matching via DEEPSEEK_SPARSE_SWA.
- config: drop the inert "vllm::deepseek_v4_fp8_einsum" splitting_ops entry
  (a plain function, never registered as a custom op, so it can never match
  an FX node).
- o_proj: correct the fp8-einsum recipe docstring (only SM100 takes the
  packed scale path; SM110 and SM12x use the legacy FP32 block-scale layout).
- flashmla: drop the dead prefill_gather_lens_cpu binding.
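
The 262144 threshold in the first item can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the C128A compressed region holds one entry per 128 tokens (read from the "C128A" name) and that the reserved width is the larger of that and index_topk. The function names are hypothetical and are not the PR's `_prefill_workspace_topk_bound`.

```python
INDEX_TOPK = 2048            # value quoted in the commit message
C128A_COMPRESS_RATIO = 128   # assumption: one compressed entry per 128 tokens


def c128a_max_compressed(max_model_len: int) -> int:
    """Ceiling division without going through floats."""
    return -(-max_model_len // C128A_COMPRESS_RATIO)


def prefill_workspace_topk_bound(max_model_len: int) -> int:
    """Assumed shape of the bound: never smaller than index_topk."""
    return max(INDEX_TOPK, c128a_max_compressed(max_model_len))


for ctx in (131072, 262144, 1048576):
    print(ctx, prefill_workspace_topk_bound(ctx))
```

Under these assumptions the bound stays at 2048 up to a 262144-token context and grows beyond it, which is consistent with the "no-op at max_model_len <= 262144" note.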

Signed-off-by: jasl <jasl9187@hotmail.com>
@wingcomm


Built and deployed your candidate — here's arm #2, with an honest caveat on the bisect.

Arm #2 is live (patch + FlashInfer unchanged). Built 4214cea78b from source (parented exactly on 367ed7c66, so it's our current build + just your cu_base drop), FlashInfer held at 49f2abf. Running on the 4-node TP=4/EP=4 box now. Sanity: FlashInfer SM120 packed sparse-MLA decode runner engages cleanly (no Triton fallback), real-gen 200, 0 startup errors, all 4 ranks up.

Correctness at the trigger condition looks good. We hit it with concurrent unique long-context prefills (~48K tokens, conc=12 — unique prompts to defeat prefix-cache so every request is a real prefill → sustained num_prefills>1). Needle retrieval was 100% (9/9 completed) with 0% CJK — i.e. on exactly the concurrent-long-context path where the mis-index would corrupt req>0 indices, retrieval is correct. (For reference, unpatched 367ed7c66 scored 0.980 on nc=8 GSM8K, our best build to date.)

The honest caveat: we can't synthetically reproduce the freeze on either build, so this isn't a clean before/after yet.

  • Our aggressive hammer saturated rather than wedged: 12×48K against max_num_seqs=3 floods the queue, so most requests hit the client 300s timeout while waiting — but the engine stayed ADVANCING the whole time (prompt-tokens counter climbed 446K→1.6M, KV moving, no freeze, no restart, container never bounced).
  • The earlier nc=8 GSM8K on the unpatched build also ran clean. Incident-14 itself was a rare organic event (one freeze across ~35h then ~50h uptimes), so we don't have a reliable on-demand trigger.

So our signal is an extended organic soak on arm #2. If the cu_base double-subtract was the cause, the wedge should simply not recur on the patched build. We've got a monitor watching for recurrence and will report either a clean multi-day run or a fresh freeze (with py-spy + gdb across all ranks).

On the 3-arm bisect: because we can't trigger it synthetically, a single-shot of arm #1 (PREFILL=0) or arm #3 (FlashInfer ≥nightly-20260619) would be equally inconclusive on our side — a clean run could just be the intermittency. We're happy to run those as data points if useful, but flagging the same caveat. The live 4-node session offer stands once your switch is in — driving the EP/top-k interaction deterministically on the actual TP=4 topology is the cleanest way to nail the "bad indices vs top-k tolerance" split, since that's the variable we can't reproduce away from. Dumps preserved and available anytime.

@wingcomm

wingcomm commented Jun 30, 2026


Update from the soak — a significant one, plus some commit archaeology that narrows it.

The cu_base fix did not eliminate the hang class. Arm #2 (367ed7c66 + 4214cea78b, FlashInfer 49f2abf) ran clean ~16h, then froze with the identical silent-wedge signature (real-gen timeout, gen/iter counters frozen, shm_broadcast every 60s). Same intermittency / MTBF as before.

But it wedged in a different op. py-spy across all ranks caught:

mhc_fused_post_pre_tilelang   vllm/model_executor/kernels/mhc/tilelang.py:549
forward                       vllm/models/deepseek_v4/nvidia/model.py:940
... (inside vllm/compilation/cuda_graph.py:254)

That is the MHC prenorm GEMM via TileLang/TVM-FFI, not the _forward_prefill/cu_base/top-k path of incident-14. Native gdb shows the same mechanism: cuLaunchKernelEx → __GI_sched_yield busy-spin (libtvm.so → libcuda), with no ptxas/nvrtc/cuModuleLoad in the stack, so this is a GPU kernel hang, not a JIT.

The reframe that matters most: the Python frame is the victim, not the culprit — it's just the next op trying to cuLaunchKernel when the GPU/stream is already hung. So _forward_prefill (inc-14) vs mhc_fused_post_pre_tilelang (inc-15) likely just reflects which op launched into an already-wedged stream. Your cu_base fix is still correct and may well have closed the prefill-path freeze, but some kernel is hanging on-device and stalling the stream, and the observed launch site varies.

Commit archaeology — where I think this lives. We scanned the PR history against the symptom:

  • The MHC commits (1381809 avoid MHC GEMM JIT per token count, 93400dd keep optimized MHC path, 16c1667e remove ineffective MHC warmup, aaef91a drop CustomOp wrapper) are all about compilation. Ours is a runtime hang with no JIT in the stack — so these don't apply, even though mhc_* is the victim frame.
  • The genuinely relevant lineage is the CUDA-graph + MTP hang work, all already in our build: 95fc9073b "Fix DeepSeek V4 MTP small-batch graph hangs", 285b542b8 "eager-break DeepSeek-V4 attention under FULL cudagraph for spec-decode", 6c92b0972 default FULL_AND_PIECEWISE. Our freeze is inside a CUDA-graph replay, under MTP spec-decode + FULL(_AND_PIECEWISE) cudagraph — exactly that regime. Those fixes are present and didn't fully close it, which points at a residual cudagraph + MTP + attention stream-ordering hang rather than a specific MHC or top-k kernel bug.
  • For completeness: nothing newer helps either. HEAD has moved 367ed7c66 → a5ebb5f66 (2 commits: a 06-29 upstream merge + "self-size C128A prefill workspace + de-dup warmup backend set") — neither touches the MHC/TileLang path or stream/launch ordering.

Net: the evidence points away from any single kernel (top-k or MHC) and toward a residual cudagraph + MTP + attention stream-ordering hang. The natural discriminators would be MTP-off vs enforce-eager/cudagraph_mode:NONE, plus a GPU-side capture (NCCL flight-recorder / CUDA trace) to name the actual in-flight kernel rather than the Python victim frame — happy to run whichever is most useful to you on the 4-node box, and the fresh py-spy + gdb dumps (worker + EngineCore, all ranks) are saved and available. The live 4-node session offer also stands once your switch is in.

@jasl

jasl commented Jun 30, 2026

Contributor Author

Let me try! :)

@jasl

jasl commented Jun 30, 2026

Contributor Author

Thanks for the incident-15 detail and for running arm #2 to ~16h — that's exactly the data point we needed. Two findings from tracing the EP path in a5ebb5f66b, one of which rules a hypothesis out.

We can rule out the cudagraph-mode / collective-desync angle. The tempting story was: at --enable-expert-parallel + DP=1, the per-layer combine is tensor_model_parallel_all_reduce (moe_runner.py:447-453, fires for ep_size>1); on 4 non-NVLink GB10s should_custom_ar() is False (custom_all_reduce.py:150,241) so it falls to raw pynccl on current_stream() (pynccl.py:186) and gets captured into the FULL graph; meanwhile the cross-rank cudagraph-mode MIN-sync is gated on data_parallel_size>1 (gpu_model_runner.py:3902, dp_utils.py:197) and skipped at your DP=1 — so a single divergent step would replay an unmatched collective and wedge. The structural facts all check out, but the trigger cannot fire: every cudagraph-dispatch input (num_tokens, num_reqs, uniform_decode, mode) is a pure function of the SchedulerOutput that's broadcast once and dequeued byte-identically by every worker (multiproc_executor.py:374/387), so all SPMD ranks pick the same (mode, BatchDescriptor) and issue the same captured collectives in the same order. The one architectural per-rank branch — cascade attention — is dead for MLA (use_cascade_attention() returns False in the base backend, no MLA backend overrides it). And MTP's draft loop is a static range(num_speculative_tokens-1) (llm_base_proposer.py:704); accepted/rejected counts only change masking contents, never the dispatch num_tokens or the collective count. So the dp>1 gate is genuinely not a correctness gap for EP — the divergence it guards against is created by independent DP scheduling, which DP=1 doesn't have. Net: not worth chasing the DP-sync / captured-collective-desync path.

What survives. With dispatch divergence out, the floating victim frame (your reframe is right — it's the next launch onto an already-dead stream) plus cuLaunchKernelEx → sched_yield, no JIT, and EP-only points at an on-device multi-CTA cooperative-kernel hang of the same class as FI #3615, but in a kernel #3615 didn't touch — triggered by the extra SM contention EP=4 adds. #3615 fixed the FlashInfer sampler radix-topk (which vLLM disables on SM120 anyway), but the same GB10 context-time-slice starvation (a peer CTA preempted past its slice while others spin on an arrival counter, no timeout) can hit any hand-rolled cooperative kernel. Two concrete candidates on the decode path: the NCCL collective kernels themselves (now far more numerous per step under EP), and our own lightning-indexer top-k persistent_topk.cuh (cooperative_topk is gated off for family-120, so decode top-k routes there; its GMEM spin-barrier has no co-residency guarantee or timeout). We can't name the kernel from code alone — it needs a GPU-side capture at an actual wedge.

To name it, the two cheap discriminators + the capture:

  1. Does the wedge survive cudagraph_mode=PIECEWISE (or enforce-eager)? That removes the captured collectives from the FULL decode graph. If it still wedges, the hang is a plain on-device kernel race independent of capture (favors the cooperative-kernel hypothesis); if it stops, capture is implicated after all.
  2. Does it survive with --enable-expert-parallel off (pure TP=4)? Isolates whether the EP combine collectives are the contention source vs. just a victim.
  3. GPU-side capture to NAME it: NCCL flight recorder — TORCH_NCCL_TRACE_BUFFER_SIZE=20000, TORCH_NCCL_DUMP_ON_TIMEOUT=1, TORCH_NCCL_ENABLE_MONITORING=1, NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL — on a real collective deadlock the watchdog dump lists the last enqueued collective per rank and any op-count mismatch directly. Pair it with a health-freeze watchdog (poll /health + the iter counter; on 200-but-frozen, fire py-spy dump --native + gdb -p … 'thread apply all bt' + trigger the NCCL dump on all ranks). If the NCCL flight recorder shows all ranks with matched, completed collectives and one stuck compute kernel, that points at persistent_topk/an indexer cooperative kernel rather than NCCL.
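
The health-freeze watchdog in item 3 can be sketched as a pure decision function over polled samples. This is a minimal illustration, not the watchdog either party runs: the `Sample` fields, the `running` signal, and the three-sample window are assumptions, and how the iteration counter is scraped is left open. The environment variables and values are the ones listed above.

```python
from dataclasses import dataclass
from typing import List

# Flight-recorder settings as listed in the comment above.
NCCL_FLIGHT_RECORDER_ENV = {
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "20000",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
    "TORCH_NCCL_ENABLE_MONITORING": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "COLL",
}


@dataclass
class Sample:
    health_status: int  # HTTP status returned by /health
    iter_count: int     # engine iteration counter at poll time
    running: int        # in-flight requests (assumed to be available)


def is_wedged(samples: List[Sample], window: int = 3) -> bool:
    """True when /health answers 200 but the engine has stopped iterating.

    Requires `window` consecutive polls that are healthy, have pending
    work, and show an unchanged iteration counter.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False
    healthy = all(s.health_status == 200 for s in recent)
    busy = all(s.running > 0 for s in recent)
    frozen = len({s.iter_count for s in recent}) == 1
    return healthy and busy and frozen
```

A poller would append one `Sample` per interval and, once `is_wedged` returns True, run the capture steps from item 3 on every rank before any restart.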

Could you confirm DP=1 (no --data-parallel-size), and if you get a cycle, the single most decisive data point is whether the wedge survives cudagraph_mode=PIECEWISE. We're bringing our own 4-node GB10 box up to TP=4/EP=4 now and will run the same capture; we'll share whatever the flight recorder names.

@wingcomm


Confirmed: DP=1. We pass --tensor-parallel-size 4 --enable-expert-parallel with no --data-parallel-size — the distributed launcher only injects --nnodes 4 / --node-rank / --master-addr / --master-port, so it's a single 4-rank TP/EP group, DP=1.
So your SPMD reasoning applies directly and the dp>1 cudagraph-sync gate isn't in play for us.

Thanks for tracing the EP path and ruling the captured-collective-desync angle out so thoroughly — that's a useful elimination, and the shift to an EP-induced multi-CTA cooperative-kernel time-slice hang (same class as #3615, different kernel)
fits everything we've seen: floating victim frame, cuLaunchKernelEx → sched_yield, no JIT, and it only shows under real concurrency.

On the decisive cudagraph_mode=PIECEWISE discriminator — agreed it's the key data point, and we'll run it. Rather than force it now, we'll flip to PIECEWISE at the next wedge (it's intermittent — ~16h MTBF, and we couldn't trigger it
synthetically), so we keep a representative organic workload and get a clean before/after on the same traffic. We'll also arm the GPU-side capture so the next freeze is conclusive regardless: bigger flight-recorder buffer +
NCCL_DEBUG_SUBSYS=COLL, and fire the NCCL dump + gdb thread apply all bt on all ranks (our auto-capture currently does the head rank only; py-spy --native is broken on aarch64 — UNW_EBADREG — so gdb is our native path). If the flight recorder
comes back with all collectives matched/completed and one stuck compute kernel, that'll point at persistent_topk/the indexer cooperative kernel over NCCL, exactly as you laid out.

Will report the moment we catch one. Good luck bringing up your 4-node box — comparing the two captures should settle which kernel it is.

@alexbi29

alexbi29 commented Jul 1, 2026

Contributor

@jasl I opened a stacked DSpark PR against your SM120 branch here:

jasl#25

It is intentionally based on jasl:codex/ds4-sm120-min-enable rather than upstream main, since DSpark depends on the DeepSeek V4 SM120 runtime/model support from this PR.

Validated on 2x RTX PRO 6000 Blackwell, TPS goes to ~310 vs ~200 before on 2048 tokens. Main caveat: it eats more VRAM, so there is less headroom for KV cache. Realistically max is ~512k on 2x RP6K.

Main thing to review is whether you want DSpark carried as a follow-on stacked branch after #41834, or reshaped before any upstream submission.

@jasl

jasl commented Jul 1, 2026

Contributor Author

Cool! Thank you! I'm looking for DSpark as well

@alexbi29

alexbi29 commented Jul 1, 2026

Contributor

@jasl I cleaned up a few more buffers, and the context situation got a little better: ~1M for 2x RP6K.
TPS did not change much: avg 304.8, median 304.8, min 299.9, max 309.1.

Having this model at 300 TPS is a qualitative change for some tasks.

jasl and others added 13 commits July 2, 2026 01:29
124 upstream commits. Conflicts resolved (4 files), preserving the SM12x stack
while absorbing upstream DSv4 work:
- sparse_attn_indexer.py: absorb vllm-project#46076 dcp (inert unless dcp_world_size>1),
  keep SM120 short-row/persistent decode dispatch, dedupe the vllm-project#47164 family-120
  cooperative-topk gate (upstream independently landed our fix).
- sparse_swa.py: absorb vllm-project#46995 DSpark non-causal path, keep our
  _init_reorder_batch_threshold(supports_spec_as_decode) MTP threshold.
- indexer.py: union has_prefilling_rows + dcp_local_seq_lens.
- protocol.py: union DSv4 thinking-mode sampling override + vllm-project#35076 stop_token_ids.

Signed-off-by: jasl <jasl9187@hotmail.com>
3 upstream bug fixes, none touching the DSv4/SM12x path:
- vllm-project#47305 don't read KV cache past seq_len in triton paged attn kernels
  (generic triton paged-attn; DSv4 uses sparse-MLA, unaffected)
- vllm-project#47308 warmup cross-attn properly in encoder-decoder case (decoder-only, no-op)
- vllm-project#46482 ROCm P/D MoRIIO proxy JSON Content-Type (no-op for us)
No conflicts; no DS4 files changed.

Signed-off-by: jasl <jasl9187@hotmail.com>
Enable the fused probabilistic Markov sampler by default for method=dspark and remove the VLLM_DSPARK_FUSED_MARKOV_SAMPLER environment variable. The config field remains available for explicit bisects.

Reuse envs.env_bool for DSpark SpeculativeConfig gates and document the same-step aliasing invariant for the shared draft-probs no-copy fast path.
Document that the Python seed counter must not be frozen by any future CUDA graph capture around sampling.

Avoid an unnecessary clone on the top-k/top-p fused sampler path when dtype conversion already produced a float32 copy, while preserving a clone for existing float32 inputs.
@aligningmyself


Validated PR #41834 on a 6-node GB10 (DGX Spark, sm_121a) cluster serving GLM-5.2, which extends the PR's own 2× GB10 validation to 6×. GlmMoeDsa sparse-MLA loads, autotunes, serves, and generates correctly. Let me know if this is not helpful and I will stop commenting.

Setup

  • Hardware: 6× DGX Spark (GB10, SM121), 200 GbE ConnectX-7 fabric (NCCL over TCP, NET_PLUGIN=none).
  • vLLM: main @ fa24813 + this PR applied via pull/41834.diff → applied cleanly, no conflicts
    (19,137 lines / 118 files). FlashInfer main, transformers 5.x, TORCH_CUDA_ARCH_LIST=12.1a.
  • Model: GLM-5.2-NVFP4 (GlmMoeDsa, 753B total / 40B active, NVFP4 weights + bf16 shared expert).
  • Parallelism: --tensor-parallel-size 2 --pipeline-parallel-size 3 (world size 6), Ray executor.
  • Flags: --kv-cache-dtype fp8 --enforce-eager --gpu-memory-utilization 0.80 --max-model-len 4096.

What worked

  • Backend selection (per rank): FLASHINFER_MLA_SPARSE_SM120 decode + FLASH_ATTN MLA prefill +
    FLASHINFER_CUTLASS NVFP4 MoE, fp8_ds_mla KV-cache format.
  • Autotuner ran and cached on SM120: sparse_mla_sm120_decode_dsv3_2, fp4_gemm,
    trtllm::fused_moe::gemm{1,2} — config-cache hits on subsequent ranks.
  • Memory: ~71 GiB/node model weights; KV cache 1,748,160 tokens (≈427× concurrency @ 4096 ctx).
  • Correctness (greedy, temp 0):
    • 17 × 23 → 340+51 → 391
    • 60 mi / 1.5 h → 60/1.5 → 40 mph
    • capital of Australia → Canberra ✅ (clean CoT, respects "one word" / "just the number")
  • Throughput (enforce-eager, 200-tok completions):
    concurrency   aggregate    per-request
    1             8.0 tok/s    8.0
    8             25.4 tok/s   3.2
    16            25.5 tok/s   3.2
    (saturates ~25 tok/s aggregate by C=8 — compute-bound in eager mode; consistent with the PR's
    "~35–45 tok/s/step, depth-independent" note.)
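
The "≈427× concurrency" figure in the memory item follows from dividing the reported KV-cache capacity by the per-request context length. A small sketch of that arithmetic, using only the two numbers from this comment (the function name is illustrative):

```python
def max_full_context_requests(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many requests at full context length fit in the KV cache."""
    return kv_cache_tokens / max_model_len


ratio = max_full_context_requests(1_748_160, 4096)
print(f"{ratio:.1f}x")  # 426.8x, reported above as ≈427×
```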

Observations / questions for maintainers

  • --enforce-eager used deliberately to sidestep the reported unaligned-prefill (num_tokens % 16)
    Triton-JIT-during-cudagraph issue on SM121. Happy to test the cudagraph path with the warmup-padding
    workaround and report the throughput delta — likely a meaningful gain over the eager numbers above.

  • Slow weight load: ranks reported model-load times of ~160–515 s from local NVMe (not NFS). Larger
    than expected for ~71 GiB/node NVFP4 — possibly dequant/indexer setup. Can profile if useful.

  • Ran at gpu-mem 0.80 (128 GiB unified per node) with comfortable headroom; 0.85 also reached startup
    but was marginal near the autotune→serve transition on this unified-memory part.

Offer: I have 6× GB10 and am happy to run larger-context (up to 1M), longer-horizon, cudagraph-on,
or higher-TP/PP-sweep validation for this PR — just say what's useful. This is the first 6-node GB10
serve of a DSA model that I'm aware of.

@jasl

jasl commented Jul 2, 2026

Contributor Author

Thank you — this is genuinely useful, please don't stop. A 6-node PP=3 GB10 serve of a second DSA model (GLM-5.2) on this backend is exactly the generalization signal I want for the merge case.

A few notes on your open questions:

  • You shouldn't need --enforce-eager. The sparse-MLA warmup (VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP, on by default) exists precisely to pre-compile the aligned prefill/decode shapes before cudagraph capture, so the num_tokens % 16 in-graph JIT shouldn't fire. Its gate keys off the attention backend, not the model, so GlmMoeDsa on FLASHINFER_MLA_SPARSE_SM120 should be covered. If you have cycles, please try dropping --enforce-eager — I'd expect a meaningful throughput jump over the eager ~25 tok/s (which is compute-bound as you noted). If it does JIT in-graph, that's a warmup-coverage gap on my side and your repro would let me fix it directly.
  • gpu-memory-utilization 0.80 is the right setting on GB10. The NVFP4 autotuner's transient workspaces during the autotune→serve transition aren't reflected in the util headroom, and on unified memory they share the same pool — so 0.85 being marginal is expected. --watermark is a cleaner lever than pushing util higher; you're not leaving much on the table at 0.80.
  • Slow load: the 160–515 s spread across ranks looks like NVMe/CPU contention plus per-rank NVFP4 dequant / indexer setup rather than raw I/O on the ~71 GiB itself. If you can profile disk-read vs dequant vs indexer-prep separately, that'd be interesting.

On your offer — the three most valuable to me:

  1. The cudagraph-on run above (drop --enforce-eager) — the headline, and it doubles as a validation that the warmup generalizes to a second DSA model.
  2. Single-sequence long-context TTFT at 256k → 1M — I can share a needle-recall coherence check so it validates correctness at extreme context, not just that it runs.
  3. TP2 × PP3 correctness, since I mostly test TP — pipeline-parallel DSA coverage I don't otherwise exercise.

Any of those would be a big help. Thanks again for putting 6× GB10 behind this.

@wingcomm

wingcomm commented Jul 2, 2026


Two more wedges and the PIECEWISE discriminator is now live.

Incident-16 (~45h after the inc-15 recovery): same wedge, same frame. Recurred on the patched build (4214cea), and py-spy caught the identical mhc_fused_post_pre_tilelang victim frame as incident-15 — two
consecutive at MHC now (incident-14 was _forward_prefill). Same mechanism as always (cuLaunchKernelEx → sched_yield, no JIT).

Our hardened watchdog auto-recovered it cleanly this time — after inc-15's failed auto-restart (cancelled init from stale peer state), we made the restart verify 0 sparkrun containers across all 4 nodes
before relaunching; it worked first try, no manual intervention. So recovery is a solved problem even if the hang isn't.

PIECEWISE is running now (your discriminator). Flipped cudagraph_mode: FULL → PIECEWISE on the same 4214cea build — confirmed in the engine config, capturing only piecewise graphs (no FULL decode graph), so
the captured collectives are out of the decode path. Soaking now; I'll report whether the wedge survives it.

One caveat on running it: PIECEWISE costs us ~40% single-stream decode (≈23 tok/s vs ≈40 tok/s median at Running:1 from the pre-switch FULL snapshot). Expected here — our decode is interconnect-latency-bound
(GPUs spin-wait on NCCL), so the per-step launch overhead PIECEWISE reintroduces lands right on the critical path. So we're treating it as a time-boxed diagnostic: one MTBF window (~2 days). If the wedge
recurs under PIECEWISE → it's capture-independent (your on-device cooperative-kernel-race hypothesis), and we revert to FULL. If it stays clean that long → capture is implicated, but ~40% is too steep to
keep, so we'd still want a real fix.

On naming the kernel: we owe you the all-rank capture — the cron watchdog auto-recovered inc-16 faster than we could hold it, so we only have the head-rank py-spy+gdb (same MHC frame). We're wiring the
all-rank version now (bump TORCH_NCCL_TRACE_BUFFER_SIZE→20000, add NCCL_DEBUG_SUBSYS=COLL, fire the flight-recorder dump + gdb thread apply all bt on all 4 ranks, and hold-before-teardown) so the next freeze
names the stuck kernel — persistent_topk vs an NCCL collective. (FWIW the enforce-eager GB10 run you're discussing above is weak evidence in the same direction, though different model/topology.)

jasl added 2 commits July 2, 2026 13:34
…stence

DSpark (PR #25) shipped a complete V1 DSparkProposer but was force-routed to
the V2 runner, where DeepSeek-V4 long-context recall collapses under
concurrency (arthur 3/16 @28k vs V1 16/16). Validated DSpark on the V1 runner
on 2xRTX SM120: recall 16/16 @28k conc8, coherent, 218 tok/s (~1.3x MTP2).

- vllm/config/vllm.py: drop the dspark->V2 force so DSpark follows normal
  runner selection (V1 by default for DeepSeek-V4). V2 speculator remains
  reachable via VLLM_USE_V2_MODEL_RUNNER=1 for A/B.
- kernel_warmup.py: make _deepseek_v4_slot_mapping_warmup dual-interface
  (V1 input_batch/query_start_loc + V2 block_tables/input_buffers). PR #25 had
  rewritten it V2-only, so it silently no-opped on the V1 runner and
  reintroduced first-request JIT for ALL DeepSeek-V4 serving (MTP/DFlash/plain).
- nvidia/model.py: gate _is_dspark_runtime_layer on method=="dspark". The flag
  dspark_fused_shared_experts_quant defaults True on every SpeculativeConfig and
  the MTP block sits at layer_idx==num_hidden_layers, so without this gate the
  unvalidated fused FP8 shared-experts kernel would engage on production MTP.
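
The third item describes a predicate that was too permissive because the flag defaults to True for every speculative method. A hypothetical sketch of the gated version follows; the function and its arguments are illustrative stand-ins inferred from the commit message, not the code in nvidia/model.py.

```python
def is_dspark_runtime_layer(
    method: str,
    fused_shared_experts_quant: bool,
    layer_idx: int,
    num_hidden_layers: int,
) -> bool:
    """Hypothetical stand-in for _is_dspark_runtime_layer.

    The extra speculative block sits at layer_idx == num_hidden_layers for
    MTP as well, so the method check is what keeps the fused kernel off
    production MTP.
    """
    is_extra_block = layer_idx == num_hidden_layers
    return method == "dspark" and fused_shared_experts_quant and is_extra_block
```

Without the `method == "dspark"` term, an MTP configuration with the default flag value would satisfy the remaining two conditions, which is the failure mode the commit message describes.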
@mergify

mergify Bot commented Jul 2, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jasl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
