[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes#41834
jasl wants to merge 158 commits into
@zyongye |
Code Review
This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.
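The "top-k indices without materializing full logits" idea in the summary above can be illustrated with a minimal pure-Python sketch. This is not the PR's Triton kernel: `streaming_topk` and the chunked score input are hypothetical names chosen for illustration, and the sketch only shows the memory pattern (keep k candidates, discard each chunk after scanning it).

```python
import heapq


def streaming_topk(score_chunks, k):
    """Return the indices of the k largest scores, best first.

    Scores arrive chunk by chunk, so only a k-element heap is kept in
    memory instead of the full score (logit) vector.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (score, global_index); heap[0] is the weakest kept item
    base = 0   # global index of the first element of the current chunk
    for chunk in score_chunks:
        for offset, score in enumerate(chunk):
            item = (score, base + offset)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # Replace the weakest kept candidate with the better one.
                heapq.heapreplace(heap, item)
        base += len(chunk)
    return [index for _, index in sorted(heap, reverse=True)]


top = streaming_topk([[0.1, 0.9, 0.3], [0.8, 0.2], [0.95]], k=3)
```

Ties are broken toward the larger index here only because tuples compare element by element; a real kernel may choose differently.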
💡 Codex Review

vllm/vllm/model_executor/layers/sparse_attn_indexer.py, lines 86 to 89 in 9596dbf: This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (

vllm/vllm/model_executor/model_loader/default_loader.py, lines 236 to 240 in 9596dbf: The new pre-load
042e366 to
df2e6f8
Compare
… set

Multi-dimensional review hardening of the DeepSeek-V4 SM12x path. All changes are pure-Python and no-ops on the canonical serve configs:

- flashmla: _prefill_workspace_topk_bound now reserves the C128A prefill top-k width from the full compressed region (c128a_max_compressed, the same bound the metadata builder allocates c128a_prefill_buffer with) instead of index_topk. Above ~262144 context the C128A effective_topk exceeds index_topk (2048); the locked prefill workspace previously fit only because the lightning indexer's much larger reservation incidentally absorbed the gap. No-op at max_model_len <= 262144.
- kernel_warmup: share _DEEPSEEK_V4_SPARSE_MLA_BACKENDS with flashinfer_sparse_mla_warmup. The local copy still listed the old "V4_FLASHMLA_SPARSE" backend name (renamed "FLASHMLA_SPARSE_DSV4" upstream); the warmup gate only kept matching via DEEPSEEK_SPARSE_SWA.
- config: drop the inert "vllm::deepseek_v4_fp8_einsum" splitting_ops entry (a plain function, never registered as a custom op, so it can never match an FX node).
- o_proj: correct the fp8-einsum recipe docstring (only SM100 takes the packed scale path; SM110 and SM12x use the legacy FP32 block-scale layout).
- flashmla: drop the dead prefill_gather_lens_cpu binding.

Signed-off-by: jasl <jasl9187@hotmail.com>
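The ~262144 threshold in the first flashmla bullet can be checked with simple arithmetic. The sketch below assumes, as the C128A name suggests but the commit message does not spell out, that the compressed region holds one entry per 128 context tokens. `c128a_compressed_len` and `prefill_topk_bound` are hypothetical helpers, not the functions in the PR.

```python
INDEX_TOPK = 2048        # index_topk value quoted in the commit message
C128A_COMPRESSION = 128  # assumption: one compressed entry per 128 tokens


def c128a_compressed_len(context_len: int) -> int:
    """Number of compressed entries needed to cover context_len tokens."""
    return -(-context_len // C128A_COMPRESSION)  # ceiling division


def prefill_topk_bound(max_model_len: int) -> int:
    """Workspace top-k width: never below index_topk, wider once the
    compressed region outgrows it."""
    return max(INDEX_TOPK, c128a_compressed_len(max_model_len))


bound_at_256k = prefill_topk_bound(262_144)  # still equal to index_topk
bound_at_1m = prefill_topk_bound(1_048_576)  # compressed region dominates
```

Under that assumption the bound only changes above 262144 tokens, which matches the "no-op at max_model_len <= 262144" statement.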
|
Built and deployed your candidate — here's arm #2, with an honest caveat on the bisect. Arm #2 is live (patch + FlashInfer unchanged). Built Correctness at the trigger condition looks good. We hit it with concurrent unique long-context prefills (~48K tokens, conc=12 — unique prompts to defeat prefix-cache so every request is a real prefill → sustained The honest caveat: we can't synthetically reproduce the freeze on either build, so this isn't a clean before/after yet.
So our signal is an extended organic soak on arm #2. If the On the 3-arm bisect: because we can't trigger it synthetically, a single-shot of arm #1 ( |
|
Update from the soak — a significant one, plus some commit archaeology that narrows it. The But it wedged in a different op. py-spy across all ranks caught: the MHC prenorm GEMM via TileLang/TVM-FFI — not the The reframe that matters most: the Python frame is the victim, not the culprit — it's just the next op trying to Commit archaeology — where I think this lives. We scanned the PR history against the symptom:
Net: the evidence points away from any single kernel (top-k or MHC) and toward a residual cudagraph + MTP + attention stream-ordering hang. The natural discriminators would be MTP-off vs |
Let me try! :) |
|
Thanks for the incident-15 detail and for running arm #2 to ~16h — that's exactly the data point we needed. Two findings from tracing the EP path in We can rule out the cudagraph-mode / collective-desync angle. The tempting story was: at What survives. With dispatch divergence out, the floating victim frame (your reframe is right — it's the next launch onto an already-dead stream) plus To name it, the two cheap discriminators + the capture:
Could you confirm DP=1 (no |
|
Confirmed: DP=1. We pass --tensor-parallel-size 4 --enable-expert-parallel with no --data-parallel-size — the distributed launcher only injects --nnodes 4 / --node-rank / --master-addr / --master-port, so it's a single 4-rank TP/EP group, DP=1. Thanks for tracing the EP path and ruling the captured-collective-desync angle out so thoroughly — that's a useful elimination, and the shift to an EP-induced multi-CTA cooperative-kernel time-slice hang (same class as #3615, different kernel) On the decisive cudagraph_mode=PIECEWISE discriminator — agreed it's the key data point, and we'll run it. Rather than force it now, we'll flip to PIECEWISE at the next wedge (it's intermittent — ~16h MTBF, and we couldn't trigger it Will report the moment we catch one. Good luck bringing up your 4-node box — comparing the two captures should settle which kernel it is. |
|
@jasl I opened a stacked DSpark PR against your SM120 branch here: It is intentionally based on Validated on 2x RTX PRO 6000 Blackwell, TPS goes to ~310 vs ~200 before on 2048 tokens. Main caveat: it eats more VRAM, so there is less headroom for KV cache. Realistically max is ~512k on 2x RP6K. Main thing to review is whether you want DSpark carried as a follow-on stacked branch after #41834, or reshaped before any upstream submission. |
Cool! Thank you! I'm looking for DSpark as well |
|
@jasl cleaned up a bit more buffers, ctx situation got a little better ~ 1m for 2x RP6K. Having this model at 300 tps is qualitative change for some tasks. |
124 upstream commits. Conflicts resolved (4 files), preserving the SM12x stack while absorbing upstream DSv4 work:

- sparse_attn_indexer.py: absorb vllm-project#46076 dcp (inert unless dcp_world_size>1), keep SM120 short-row/persistent decode dispatch, dedupe the vllm-project#47164 family-120 cooperative-topk gate (upstream independently landed our fix).
- sparse_swa.py: absorb vllm-project#46995 DSpark non-causal path, keep our _init_reorder_batch_threshold(supports_spec_as_decode) MTP threshold.
- indexer.py: union has_prefilling_rows + dcp_local_seq_lens.
- protocol.py: union DSv4 thinking-mode sampling override + vllm-project#35076 stop_token_ids.

Signed-off-by: jasl <jasl9187@hotmail.com>
3 upstream bug fixes, none touching the DSv4/SM12x path:

- vllm-project#47305: don't read KV cache past seq_len in triton paged attn kernels (generic triton paged-attn; DSv4 uses sparse-MLA, unaffected)
- vllm-project#47308: warmup cross-attn properly in encoder-decoder case (decoder-only, no-op)
- vllm-project#46482: ROCm P/D MoRIIO proxy JSON Content-Type (no-op for us)

No conflicts; no DS4 files changed.

Signed-off-by: jasl <jasl9187@hotmail.com>
Enable the fused probabilistic Markov sampler by default for method=dspark and remove the VLLM_DSPARK_FUSED_MARKOV_SAMPLER environment variable. The config field remains available for explicit bisects. Reuse envs.env_bool for DSpark SpeculativeConfig gates and document the same-step aliasing invariant for the shared draft-probs no-copy fast path.
Document that the Python seed counter must not be frozen by any future CUDA graph capture around sampling. Avoid an unnecessary clone on the top-k/top-p fused sampler path when dtype conversion already produced a float32 copy, while preserving a clone for existing float32 inputs.
|
Validated PR #41834 on a 6-node GB10 (DGX Spark, Setup
What worked
Observations / questions for maintainers
Offer: I have 6× GB10 and am happy to run larger-context (up to 1M), longer-horizon, cudagraph-on, |
|
Thank you — this is genuinely useful, please don't stop. A 6-node PP=3 GB10 serve of a second DSA model (GLM-5.2) on this backend is exactly the generalization signal I want for the merge case. A few notes on your open questions:
On your offer — the three most valuable to me:
Any of those would be a big help. Thanks again for putting 6× GB10 behind this. |
|
Two more wedges and the PIECEWISE discriminator is now live. Incident-16 (~45h after the inc-15 recovery): same wedge, same frame. Recurred on the patched build (4214cea), and py-spy caught the identical mhc_fused_post_pre_tilelang victim frame as incident-15 — two Our hardened watchdog auto-recovered it cleanly this time — after inc-15's failed auto-restart (cancelled init from stale peer state), we made the restart verify 0 sparkrun containers across all 4 nodes PIECEWISE is running now (your discriminator). Flipped cudagraph_mode: FULL → PIECEWISE on the same 4214cea build — confirmed in the engine config, capturing only piecewise graphs (no FULL decode graph), so One caveat on running it: PIECEWISE costs us ~40% single-stream decode (≈23 tok/s vs ≈40 tok/s median at Running:1 from the pre-switch FULL snapshot). Expected here — our decode is interconnect-latency-bound On naming the kernel: we owe you the all-rank capture — the cron watchdog auto-recovered inc-16 faster than we could hold it, so we only have the head-rank py-spy+gdb (same MHC frame). We're wiring the |
Add DeepSeek V4 DSpark proposer
…stence

DSpark (PR #25) shipped a complete V1 DSparkProposer but was force-routed to the V2 runner, where DeepSeek-V4 long-context recall collapses under concurrency (arthur 3/16 @28k vs V1 16/16). Validated DSpark on the V1 runner on 2xRTX SM120: recall 16/16 @28k conc8, coherent, 218 tok/s (~1.3x MTP2).

- vllm/config/vllm.py: drop the dspark->V2 force so DSpark follows normal runner selection (V1 by default for DeepSeek-V4). V2 speculator remains reachable via VLLM_USE_V2_MODEL_RUNNER=1 for A/B.
- kernel_warmup.py: make _deepseek_v4_slot_mapping_warmup dual-interface (V1 input_batch/query_start_loc + V2 block_tables/input_buffers). PR #25 had rewritten it V2-only, so it silently no-opped on the V1 runner and reintroduced first-request JIT for ALL DeepSeek-V4 serving (MTP/DFlash/plain).
- nvidia/model.py: gate _is_dspark_runtime_layer on method=="dspark". The flag dspark_fused_shared_experts_quant defaults True on every SpeculativeConfig and the MTP block sits at layer_idx==num_hidden_layers, so without this gate the unvalidated fused FP8 shared-experts kernel would engage on production MTP.
|
This pull request has merge conflicts that must be resolved before it can be |

Summary
This PR enables DeepSeek V4 Flash on SM120/SM121 Blackwell client hardware by carrying the SM12x fallback and tuning stack needed for the current vLLM V1 path. It is intended for RTX PRO 6000 Blackwell Workstation Edition, RTX 5090-class SM120, and GB10 / DGX Spark SM121 users who cannot use SM100-only TMEM / tcgen05 kernels.

As of 2026-06-23 this branch is reconciled on top of the merged #43477 and provides the stock-deps path: DeepSeek V4 on SM120/121 that builds and serves on released FlashInfer / DeepGEMM wheels, complementing #43477's route that needs the unreleased FlashInfer #3395 + DeepGEMM #324 dependency branches. As of 2026-06-26 the branch is additionally synced onto current upstream/main (198 commits since the #43477 merge); the latest validated head is tag sm120-pr-41834-stable-preview-20260626 (c766cbc6ff). See Update 2026-06-26 (upstream sync) and Update 2026-06-23 (#43477 reconciliation) below.

Change footprint: model kernels vs. core-vLLM touch points
The branch splits cleanly into model/kernel code and a small set of core-vLLM integration points (116 files, +15.5k/−0.4k, of which ~+3.5k is tests):

- vllm/models/deepseek_v4/** plus the SM12x sparse-MLA decode / indexer / DeepGEMM kernels that live in shared dirs (v1/attention/backends/mla/sparse_mla_kernels.py, model_executor/layers/sparse_attn_indexer.py, v1/attention/backends/mla/{indexer,sparse_swa}.py, utils/deep_gemm.py, kernels/mhc/tilelang.py), the new DSv4 reasoning parser / tokenizer, and device tuning JSONs.
- models/deepseek_v4/sparse_mla.py, perf, 2026-06-26): _c128a_effective_topk_width now takes the max position from the CPU-side CommonAttentionMetadata.max_seq_len instead of a per-step int(positions.max().item()) device sync, dropping a launch-stream stall on every C128A metadata step (reported via gdb native stacks by a GB10/TP4 user). Decode is identical (max_seq_len-1 == positions.max()); only chunked prefill sees a safe, slightly-wider 128-aligned top-k.
- single_type_kv_cache_manager.py (+243), kv_cache_coordinator.py (+67), kv_cache_manager.py (+60), sched/scheduler.py (+1): cache_blocks tail-block-reuse rewrite. (Our earlier block_pool stale-hash reset is dropped in the reconcile, subsumed by upstream's own unconditional reset, which arrived via the upstream/main merge.)
- v1/spec_decode/llm_base_proposer.py (+173)
- fused_moe.py (+65), oracle/mxfp4.py (+43), routed_experts.py (+33), experts/flashinfer_cutlass_moe.py (+27), quantization/mxfp4.py (+12), oracle/nvfp4.py (+1)
- quantization/utils/fp8_utils.py (+99), linear/scaled_mm/{cutlass,marlin}.py (+45/+16), csrc/.../marlin_moe_wna16/ops.cu (+10, the only C++)
- config/vllm.py (+44), compilation/breakable_cudagraph.py (+22), passes/utility/fix_functionalization.py (+12), config/compilation.py (+11)
- chat_completion/protocol.py (+101), serve/render/serving.py (+28), tool_parsers/structural_tag_registry.py (+16), chat_utils.py (+11), engine/protocol.py (+9), chat_completion/{serving,batch_serving}.py (+8/+6), reasoning/__init__.py (+4): reasoning_content / thinking param / tool-call streaming (jasl#19 instruction-following)
- model_executor/warmup/kernel_warmup.py (+617)
- weight_utils.py (+43), default_loader.py (+16)
- envs.py (+63), utils/flashinfer.py (+16), utils/import_utils.py (+9), v1/worker/{gpu_model_runner,ubatch_utils}.py (+12/+12): VLLM_DEEPSEEK_V4_* flags + has_cutedsl / has_flashinfer_trtllm_sparse_mla probes

Two notes for review:

- kv_cache_coordinator cache_blocks rewrite (affects hybrid-KV models; validated ≥ prior behavior), the MTP proposer base-class change, and the OpenAI-entrypoint plumbing. Everything else (MoE oracle, fp8_utils, cudagraph gate, warmup, envs) is arch / quant / env-gated and inert for other models.

Duplicate-work check
Open PR search was refreshed on 2026-06-12 for SM120 / SM12x / DeepSeek V4 / GB10 terms. The nearest open PRs are related but not duplicates:
42657aca65) and carries the stock-deps DSv4 SM120/121 path that runs on released wheels, complementing #43477's fork-deps route. See Update 2026-06-23 (#43477 reconciliation) below.

Fixed preview tags
These tags are in jasl/vllm and give users stable pins while the PR is still moving:

- sm120-pr-41834-stable-preview-20260626 (c766cbc6ff): upstream/main (198 commits since the #43477 merge; our NVFP4 FLASHINFER_CUTLASS-clamp fix landed upstream as #46492 → fork patch dropped) + the C128A metadata device-sync removal. 6 conflicts resolved; upstream's new cooperative_topk (#43008) gated off SM12x (capability family 120) to keep the validated decode path byte-identical. Validated dual-arch: RTX SM120 GSM8K-200 0.97 + #19 PASS; GB10 SM121 GSM8K-200 0.945 + arthur 64/64 + Forum53 PASS + llama-benchy prefill +80% / decode flat. See Update 2026-06-26 below.
- sm120-pr-41834-stable-preview-20260623 (f7b4b425b0): 42657aca65 of upstream/main), keeping the stock-deps DSv4 SM120/121 path that runs on released wheels. Two reconcile regressions fixed: DeepGEMM no longer auto-enabled on SM120 (a94657e601, the pinned ref asserts), and #43477's prefill-SWA launch is gated + the kernel OOB clamped (f7b4b425b0). Validated dual-arch (see Update 2026-06-23, #43477 reconciliation, below).
- sm120-pr-41834-stable-preview-20260622b (5ba0f19f02): output.size(0)==num_tokens (84 vs 83) when VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1), (2) MTP draft logits cast to float32 before top-k/top-p sampling (fixes an engine-killing assertion on any MTP + non-greedy sampling, gate-independent). See Update 2026-06-22 below.
- sm120-pr-41834-stable-preview-20260621 (72261a7af149fa5d3fe2ed2b9956e92590731012)
- sm120-pr-41834-stable-preview-20260620 (a743ef5dfbd16cad0b9a628773c0c1d1841f1790)
- sm120-pr-41834-stable-preview-20260612075245 (f32247a5a695fa8979d61837bf6b87da897dcb7d)
- sm120-pr-41834-fallback-before-replacement-20260612053720 (5d1584e2de2b3c64540e70dfc370b0211eb6b2fc)

Update 2026-06-26: synced onto upstream/main + dual-arch revalidation
Re-synced the branch onto current upstream/main (merge c7a4386a45, then the C128A device-sync hoist → c766cbc6ff; 198 upstream commits since the #43477 merge). 6 conflicts resolved:

- oracle/nvfp4.py: union the SwiGLU-clamp backend set to {TRTLLM, CUTLASS, MARLIN}. Our FLASHINFER_CUTLASS clamp fix landed upstream as #46492 ([Bugfix] Allow flashinfer_cutlass as a clamped NVFP4 MoE backend), so the fork patch is now redundant.
- routed_experts.py: combine the two per-tensor-scale loaders into one helper (our e8m0 bitwise view and upstream's 0-D/shape-(1,) _to_scalar normalization).
- serve/render/serving.py + renderers/online_renderer.py: upstream's #44285 ([Frontend] Split ServingRender into renderer and entrypoint) split ServingRender into renderer + entrypoint; our DSv4 thinking → template-kwargs threading is re-homed onto the new structure (sampling-params site in ServingRender.render_chat_request, prompt-render site in OnlineRenderer.render_chat).
- sparse_attn_indexer.py: preserve our SM120 short-row / persistent top-k path, and add upstream's new cooperative_topk (#43008, [Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-latency scenarios) gated to exclude capability family 120, so SM12x decode is byte-identical to the validated path (enabling cooperative_topk on SM12x is a separate, to-be-validated perf experiment).
- engine/protocol.py (keep both DeltaMessage hooks) and tests/models/test_deepseek_v4_mega_moe.py (keep CompilationConfig() fixture).

Inherited for free from the sync: deepseek_v2 redundant-clone removal (#46651), sampler int32-overflow fix (#46560), spec-decode correctness (#45956 / #46533).
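The capability-family gate on cooperative_topk mentioned above can be pictured with a small sketch. It assumes the "family" is the compute-capability major version times ten, so that SM120 (12.0) and SM121 (12.1) both map to 120, consistent with how this PR uses the term. `capability_family` and `allow_cooperative_topk` are hypothetical names, and the sketch models only the exclusion, not upstream's full enable condition.

```python
BLOCKED_FAMILIES = frozenset({120})  # SM12x stays on the validated decode path


def capability_family(major: int, minor: int) -> int:
    """Map a compute capability to its family, e.g. (12, 1) -> 120.

    Assumption: the minor version does not affect the family.
    """
    return major * 10


def allow_cooperative_topk(major: int, minor: int) -> bool:
    """True unless the device belongs to an excluded capability family."""
    return capability_family(major, minor) not in BLOCKED_FAMILIES


sm120_allowed = allow_cooperative_topk(12, 0)  # RTX PRO 6000 / RTX 5090 class
sm121_allowed = allow_cooperative_topk(12, 1)  # GB10 / DGX Spark
sm100_allowed = allow_cooperative_topk(10, 0)  # datacenter Blackwell
```

Gating on the family rather than on an exact (major, minor) pair is what keeps SM120 and SM121 behaving identically.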
Validation: full matrix, both arches, on c766cbc6ff:

The MoE backend on both arches is Marlin (W4A16): is_deep_gemm_supported() is False on stock SM12x wheels, so the DeepGEMM/W4A8 path isn't selected and Marlin is the default; it is GSM8K-correct on both SM120 and SM121. The prefill +80% is inherited from upstream's prefill / scheduler / block-pool work (not an SM12x change of ours); decode is flat because our SM12x decode path is preserved unchanged.

Update 2026-06-23: #43477 reconciliation (stock-deps path) + dual-arch revalidation
Upstream merged #43477 (DeepSeek V4 + GLM-5.1 on SM120 via the FlashInfer-SM120 sparse-MLA route) on 2026-06-22. As merged it does not run on released wheels: its SM12x attention class raises at model construction unless the unreleased FlashInfer #3395 fork symbols are present, and it auto-enables a DeepGEMM MXFP4 path whose pinned ref (#324) asserts on SM120. This PR is now reconciled on top of it so the two coexist: #43477 is the fork-deps route; this PR is the stock-deps route that runs on released FlashInfer / DeepGEMM wheels.

Reconciliation is a merge of upstream/main into the PR branch (42657aca65, 6 conflicts resolved: kept our env/availability-gated SM120 decode route + both FlashInfer probes + #43477's gated prefill-SWA mechanism; dropped our now-redundant block_pool stale-hash reset in favour of upstream's), plus two fixes for regressions the merge introduced:

- DeepGEMM no longer auto-enabled on SM120 (a94657e601). #43477 (Enable DeepSeek V4 and GLM-5.1 on SM120) added SM120 to support_deep_gemm, so the engine selected a DeepGEMM MXFP4 kernel whose pinned/released ref aborts at init (Assertion sf.size(-2)==ceil_div(mn,gran_mn)). SM120 now falls back to Marlin/cutlass as before (enabling DeepGEMM there needs the unmerged DeepGEMM #324).
- Prefill-SWA launch gated and kernel OOB clamped (f7b4b425b0). The merged paged prefill-SWA index kernel launched unconditionally and computed block_table addresses for masked-off tail lanes of deep (32k) prefill rows, which SM12x + Triton 3.6 faults as an illegal address even though the load is masked → cudaErrorLaunchFailure under concurrent load. The launch is now gated behind VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL (default off → the stock decode-only path never launches it) and the kernel's masked lanes are clamped.
5ba0f19f02), both arches, DeepSeek-V4-Flash, fp8 KV, MTP=2:2.30.7)--async-schedulingA/B (off vs on), incl. 64k concurrency + sampling stressThe earlier "
--async-schedulingregression" note is withdrawn — that crash was this same prefill-SWA OOB, fixed above; with the fix,--async-schedulingon/off are equivalent and crash-free through a 64k concurrency-8 sampling soak.Update 2026-06-23 — GB10 / DGX Spark (SM121) long-context frontier, 256k–1M
Cold-prefill capability sweep on 2× GB10 / DGX Spark (SM121), TP=2 over RoCE, max-model-len 1048576, gpu-memory-utilization 0.75, MTP=2, fp8 KV, EP-off, prefix-cache disabled, FULL_AND_PIECEWISE, greedy, C=1 (post-audit head). All five points complete cleanly (0 failures, no OOM / crash). KV cache = 1,868,754 tokens (5,964 bytes/token); a 1M-token request admits at 1.78× concurrency at this utilization.

The multi-minute cost is the cold prefill (TTFT), which is GPU-bound (GPU 96% throughout each TTFT window) and scales super-linearly (O(N^1.4): 256k→512k = 2.45×, 512k→1M = 2.89×).

A dedicated decode-vs-context sweep (256-token generation at 16k / 64k / 256k / 512k) shows the opposite for generation: steady-state decode is essentially flat with depth, with median inter-step latency ~61–69 ms across 16k→512k (≈30–45 tok/s effective with MTP), i.e. throughput does not meaningfully degrade as context grows. That matches the per-step cost being dominated by fixed MoE GEMM + the 2-node RoCE all-reduce rather than the depth-dependent (O(N)) indexer. (Decode rates from a 16-token TTFT-only run are not reliable, being too short and subject to MTP bundling, so they are not used here.)

So the long-context penalty is concentrated entirely in the one-time cold prefill (TTFT above), not in generation: prefill is GPU-bound compute/bandwidth (LPDDR5X ~273 GB/s) plus the per-chunk 2-node RoCE all-reduce (no inter-node NVLink), while decode stays ~constant per token. MTP keeps ~2.0 acceptance at depth. Practically: GB10 suits large-context-in → generation-out when the one-time cold first token (minutes at 384k+) is acceptable or amortized by prefix caching; once generating, throughput is depth-independent. The 1M cold TTFT is ~20% faster than the 2026-06-06 baseline (2789 s vs 3504 s).
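The headline figures in this update can be re-derived from the numbers quoted above. The short check below uses only those quoted values; the helper name `local_exponent` is introduced here for illustration, and averaging the two per-doubling exponents is just one simple way to arrive at the stated O(N^1.4).

```python
import math


def local_exponent(time_ratio: float, length_ratio: float = 2.0) -> float:
    """Exponent p such that time scales as length**p over one step."""
    return math.log(time_ratio) / math.log(length_ratio)


# TTFT growth per context doubling, as quoted above.
p_256k_to_512k = local_exponent(2.45)
p_512k_to_1m = local_exponent(2.89)
mean_exponent = (p_256k_to_512k + p_512k_to_1m) / 2

# KV-cache capacity relative to one max-length (1,048,576-token) request.
kv_concurrency = 1_868_754 / 1_048_576

# Fractional reduction in 1M cold TTFT versus the 2026-06-06 baseline.
ttft_reduction = 1 - 2789 / 3504

print(f"exponents: {p_256k_to_512k:.2f}, {p_512k_to_1m:.2f}, mean {mean_exponent:.2f}")
print(f"KV concurrency at 1M: {kv_concurrency:.2f}x")
print(f"1M cold TTFT reduction: {ttft_reduction:.1%}")
```

The two per-doubling exponents differ noticeably, so the single O(N^1.4) figure is best read as an average over this range rather than a fitted law.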
Update 2026-06-22: long-context (256k+) crash fixes (latest validated head)

Two distinct crashes were reported at long context (≥256k, MTP=2). Both are fixed on sm120-pr-41834-stable-preview-20260622b (5ba0f19f02); the rest of the branch is unchanged from the 2026-06-21 audit head:

- Packed-prefill output slice (output.size(0)==num_tokens, 84 vs 83). With VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1, the packed _forward_prefill path sliced the query to num_prefill_tokens under MTP/cudagraph padding but passed the unsliced padded output to the kernel, which derives num_tokens from q.shape and asserts output.size(0)==num_tokens → crash, cascading to an illegal-memory-access. The output is now sliced symmetrically. This path is gated (the default FlashMLA prefill loops over q.shape and has no such assert, so it was never affected). Interim workaround for older builds: set VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=0.
- MTP draft-sampler float32 cast (engine-killing assertion on non-greedy sampling). The MTP probabilistic draft sampler fed bf16 draft-head logits into the Triton top-k/top-p kernel, which asserts logits.dtype==torch.float32 → AssertionError kills the worker (and cascades to CUDA errors on TP peers). This fires on any MTP + non-greedy (top-k/top-p/temperature) request and is independent of the FlashInfer gates, so it reproduces "even without FlashInfer". Greedy decoding returns before the sampler, which is why greedy GSM8K validation never surfaced it. The draft logits are now cast to float32 (matching the main sampler). Validated on 2x RTX PRO 6000 (SM120): a 256k sustained sampling soak (temperature 0.7 / top_p 0.9, 9 concurrent workers, MTP=2, EP) that crashed on the first sampled request now runs clean (200+ requests).

Note for very long contexts (e.g. --max-model-len 500000): the sparse-MLA / indexer workspaces are sized by max_model_len and are not yet SM12x-arch-gated, so they consume a large fixed share of VRAM and leave a thin KV budget; see #42856 for the workspace-shrink fix. This is a memory-headroom concern, separate from the two crashes above.

Update 2026-06-21: post-audit cleanup
This supersedes the 2026-06-20 and 2026-06-18 heads and the earlier validation data below. An audit of the SM12x branch against current upstream removed redundant, disproven, and experimental deltas; the cleaned head is validated metrics-flat (marginally better) on 2x RTX PRO 6000 Blackwell (SM120) and 2-node GB10 / DGX Spark (SM121), DeepSeek-V4-Flash, fp8 KV, MTP=2. The jasl#19 (instruction-following) and #45309 breakable-cudagraph-garbage revert (#45972) correctness fixes are retained. Five changes:

- Breakable-cudagraph stays default OFF (FULL_AND_PIECEWISE). DeepSeek-V4 is deliberately excluded from breakable-cudagraph auto-enable: on real 2x GB10 MTP decode, breakable regressed throughput and degraded as output length grew (≈31→19 tok/s at 400→800 max-tokens vs a flat ≈40). The gate is now a single MiniMax-only helper instead of a dead always-False stub; behavior is unchanged. (VLLM_USE_BREAKABLE_CUDAGRAPH=1 still opts in.)
- The long-context recall fix is the int64 block-offset cast, not a cache fence. The 2026-06-20 head attributed the MTP high-concurrency recall/garble bug to a missing copy-on-write on writable caches and added a prefix-cache write-completion fence. Further investigation showed that hypothesis was wrong: the actual cause is an int32 overflow of the packed-KV block offset in the SM12x paged-MQA-logits indexer kernels, fixed by an int64 cast (retained). With the int64 fix in place the write fence is redundant: a fence-OFF recall gate holds 8/8 @ conc=8 and 16/16 @ conc=16 (0 miss) on RTX, and 8/8 on GB10. The write-completion fence and the COW broadening paired with it were therefore removed.
- Removed the scheduler prefill-fairness heuristics (ungated, generic very-long-prefill / mixed-decode chunk-limiting). They targeted a decode cliff later re-diagnosed as MoE-GEMM + NCCL-all-reduce bound (not schedulable) and were not load-bearing: a cleanup-vs-prior A/B shows an identical mixed prefill/decode fairness ratio (0.716 vs 0.714) and equal inter-chunk latency.
- Moved the experimental VLLM_NVFP4_GEMM_BACKEND b12x research lever out of the PR (off-by-default, unused on the shipped path) and dropped a tool-calling-env diff-reflow churn.

Net vs the 2026-06-20 head: 6 files, −833 lines (the removed fence + scheduler heuristics + their tests). The decode/prefill CUDA kernels are byte-identical across the cleanup, so the gated-decode-optimization profile and the 2026-06-12 throughput baselines below are unchanged.
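The int32 block-offset overflow named in the list above is easy to reproduce in isolation. The sketch below emulates 32-bit signed arithmetic in plain Python; the block stride and block ids are hypothetical round numbers chosen to land exactly on the 2^31 boundary, not values taken from the indexer kernels.

```python
def wrap_int32(value: int) -> int:
    """Reduce an integer to what a signed 32-bit register would hold."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def block_offset_int32(block_id: int, block_stride: int) -> int:
    """Offset computed entirely in 32-bit arithmetic (the buggy form)."""
    return wrap_int32(block_id * block_stride)


def block_offset_int64(block_id: int, block_stride: int) -> int:
    """Offset after widening block_id first (the retained int64 cast)."""
    return block_id * block_stride  # Python ints do not overflow


STRIDE = 65_536  # hypothetical elements per packed-KV block

ok_32 = block_offset_int32(32_767, STRIDE)       # last block that still fits
wrapped_32 = block_offset_int32(32_768, STRIDE)  # wraps to a negative offset
correct_64 = block_offset_int64(32_768, STRIDE)
```

A wrapped offset points at the wrong memory instead of raising an error, which fits the symptom described here: recall degradation and garbled output at high concurrency rather than an immediate crash.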
Validation, 2026-06-21

Trivial-prompt generation (cudagraph sanity), both platforms: 2+2 → 4, 7*8 → 56, capital of France → Paris; no garbage.

Default decode path, MTP=2:

The GB10 SM121 run is a from-scratch 2-node rebuild of the cleaned head (NCCL 2.30.7 re-pinned per node); arithmetic, GSM8K, and the long-context recall gate all pass, confirming the fence removal holds recall on SM121 as well. The recall fix is the int64 cast in the SM12x indexer kernel, so the 2026-06-12 throughput baselines below are unchanged.

llama-benchy (eugr format), GB10 2-node / SM121, MTP=2, prefix-cache on (GB10 MTP decode/prefill profile, unchanged across the 2026-06-21 cleanup, decode kernel byte-identical):

Prefill 1595–1722 tok/s at depth; decode 40 tok/s @ C=1 holding 38–42 out to 32K context (no decode cliff); prefix-cache hit 42–46% under MTP.
Gated SM120 decode optimization (VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE=1)

The decode gate uses flashinfer.mla._sparse_mla_sm120 (in FlashInfer main / 0.6.13; absent from the 0.6.12 release). Installing it correctly matters: a bare pip install --upgrade flashinfer-python @ git+main bumps flashinfer-python but leaves a stale flashinfer-cubin / flashinfer-jit-cache, and FlashInfer then raises a version-mismatch error at startup (and re-JITs kernels). Uninstall the precompiled packages first, then upgrade:

    pip uninstall -y flashinfer-jit-cache flashinfer-cubin
    pip install --upgrade "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git"

For a reproducible pin instead of tracking moving main, install matching flashinfer-python + flashinfer-cubin nightlies (e.g. 0.6.13.dev20260619, a bit-identical decode kernel to the validated build), again uninstalling flashinfer-jit-cache first.

RTX SM120, decode gate ON vs OFF, ctx0 decode (aggregate tok/s, 0 errors all rows):

gate-ON @C64 = 2814.8 tok/s matches the community target (~2815). The decode CUDA kernel is byte-identical across the rebase, so this profile is unchanged.
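Before turning the decode gate on, it can help to confirm that the FlashInfer packages named in the install note above are consistent with each other. The following is a minimal sketch using only the standard library. It assumes that "matching" means equal version strings once any local "+..." suffix is ignored, which is a simplification and not FlashInfer's own startup check. `installed_versions` and `find_mismatches` are hypothetical helpers.

```python
from importlib import metadata

# Package names as they appear in the install instructions above.
PACKAGES = ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache")


def installed_versions(names=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(versions, reference="flashinfer-python"):
    """List installed packages whose version differs from the reference.

    Assumption: versions are compared after dropping any local '+...'
    suffix; absent packages are not reported as mismatches.
    """
    ref = versions.get(reference)
    if ref is None:
        return []
    ref_base = ref.split("+")[0]
    return [
        name
        for name, version in versions.items()
        if name != reference
        and version is not None
        and version.split("+")[0] != ref_base
    ]


stale = find_mismatches(installed_versions())
print("stale FlashInfer packages:", stale or "none")
```

An empty list only says the installed versions agree (or that the precompiled packages are absent); it does not confirm that the sparse-MLA SM120 kernels are present in that build.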
The default path needs no FlashInfer update. With the gate off (default), the import is lazy/gated, so FlashInfer 0.6.12 (official) works unchanged. On GB10 / 2-node, also pin nvidia-nccl-cu13==2.30.7 (a rebuild reverts it; a per-node mismatch hangs the NCCL handshake).

Gated SM120 prefill optimization (VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1)

Symmetric to the decode gate, prefill has an opt-in packed FlashInfer sparse-MLA path: VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1 (default off; ~+5–6% single-stream prefill). With it off, prefill defers to the default FlashMLA indexed-D512 path. It routes through the same flashinfer.mla._sparse_mla_sm120 kernels as the decode gate, so it carries the identical FlashInfer version requirement: the install/pin steps above apply unchanged (FlashInfer main / 0.6.13; the default-off path needs no FlashInfer update and runs on 0.6.12). Decode and prefill share one FlashInfer build; there is no separate version to track for prefill.

Branch validation, 2026-06-12
Base and head:

- Base: 8a91228dbe363d1d113deb2a82e289429130dd01
- Head: f32247a5a695fa8979d61837bf6b87da897dcb7d

Commands run on the final head:

- git diff --check upstream/main...HEAD
- upstream/main..HEAD Signed-off-by
- VLLM_TARGET_DEVICE=empty .venv/bin/python -m compileall -q vllm/envs.py vllm/model_executor/warmup/kernel_warmup.py vllm/models/deepseek_v4 vllm/v1/core vllm/v1/attention/backends/mla vllm/reasoning/deepseek_v4_reasoning_parser.py tests/test_envs.py tests/v1/core/test_prefix_caching.py tests/v1/core/test_scheduler.py tests/reasoning/test_deepseekv4_reasoning_parser.py tests/quantization/test_sm12x_tuned_config_lookup.py
- .venv/bin/python -m pytest tests/test_envs.py::test_deepseek_v4_sparse_mla_stats_path_env -q on the remote vLLM environment: 1 passed, 16 warnings
- python3 -m pytest tests/test_scripts.py -q in the public harness: 128 passed in 14.41s

Local vLLM pytest/ruff were not run on the Mac checkout because its .venv does not currently include torch or ruff. GPU-path validation remains remote SM120/SM121-only.

Latest clean SM120 RTX PRO 6000 x2 data, 2026-06-12
Artifact roots:

- artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_short_throughput_mtp_noep_20260612084721
- artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_clean_mtp_noep_20260612080629

Short-throughput profile:

- max_model_len=131072, gpu_memory_utilization=0.975, max_num_batched_tokens=4096, max_num_seqs=24.
- FULL_AND_PIECEWISE, 80 prompts per concurrency.
- server_startup=0, bench_hf_mt_bench=0, bench_random_prefill_sweep=0.

HF MT-bench, 80 prompts:

Random prefill sweep, C=1, output length 128, 8 requests per case:

Correctness and reliability profile:

- max_model_len=131072, max_num_seqs=4, max_num_batched_tokens=4096.
- server_startup=0, bench_hf_mt_bench=0, eval_gsm8k=0, bench_random_prefill_sweep=0, bench_random_8000x1000=0, bench_random_256x256=0.

GSM8K 5-shot, limit-200, /v1/completions, MTP=2, concurrency 4:

Additional 128K-profile random checks:
Latest clean GB10 / SM121 data, 2026-06-12
Artifact root:

- artifacts/codex_pr_stable_preview_f32247a/2x_gb10_sm121/gb10_forum53_mtp2_epoff_c2_gmem0685_mml81920/20260612074113

Profile:

- max_model_len=81920, max_num_seqs=2, max_num_batched_tokens=4096, gpu_memory_utilization=0.685.
- forum53_c2:2:2:3200:256.

Gate result:

- ok: true
- serve_start.exit_code: 0
- streaming_pressure.exit_code: 0
- ok=true, signal count 0
- 0 / 40

Timing and runtime summary:
Running the NVFP4 checkpoint
This branch also serves nvidia/DeepSeek-V4-Flash-NVFP4 on SM12x (RTX PRO 6000 / GB10). The NVFP4 MoE auto-selects the FlashInfer CUTLASS backend (the SwiGLU-clamp model gate now accepts it), so no --moe-backend flag is required, and no special FlashInfer build is needed (the 0.6.12 release works):

--kv-cache-dtype fp8 is mandatory: DeepSeek-V4's fp8_ds_mla attention asserts an fp8 KV layout, so the default auto fails at model construction (this is not NVFP4-specific). Expert-parallel off (plain TP) is the supported path.

Accuracy matches MXFP4 (GSM8K 8-shot ~0.96 on both SM120 and SM121). Note that on SM12x NVFP4 is not a memory or throughput win versus MXFP4: NVFP4 weights are ~4 GiB/GPU larger (~78 vs ~74 GiB), leaving less KV-cache room (lower max concurrency); single-stream prefill is marginally faster and aggregate decode marginally slower. Its value here is checkpoint availability / parity with the SM100 datacenter path, not an SM12x performance advantage; MXFP4 remains the better practical choice on consumer Blackwell.
AI assistance disclosure
AI assistants, including OpenAI Codex/GPT models and Anthropic Claude models, were used for code review, refactoring support, regression-script writing, and benchmark analysis. The branch was validated through human review plus the commands and harness artifacts listed above.