[Feature] TRITON_MLA_SPARSE backend for SM8x/11x/12x DSA Sparse MLA Support#38476

Open
haosdent wants to merge 1 commit into
vllm-project:mainfrom
haosdent:fix-38006

Conversation

@haosdent

@haosdent haosdent commented Mar 29, 2026

Contributor

Purpose

Closes #38006. Enables sparse MLA models (GLM-5, DeepSeek-V3.2) on SM80 (A100/A800) and SM121 (GB10/DGX Spark), where DeepGEMM / FlashMLA-Sparse / FlashInfer-MLA-Sparse are unavailable.

Changes

  1. Dispatch guard. is_deep_gemm_supported() (SM90+ check) replaces has_deep_gemm() in sparse_attn_indexer.py / indexer.py. Stops DeepGEMM kernels from being invoked on SM80/SM121.
  2. Triton fp8_mqa_logits for the indexer. mqa_logits_triton.py reproduces DeepGEMM's prefill + paged MQA logits. Prefill takes bf16 q/k (pre-decoded from FP8 in the Python wrapper) and feeds a straight tl.dot; paged decode keeps a 256-entry bf16 LUT for in-kernel FP8 decode. K-side scale applied to the fp32 dot output, per-row K-tile early-exit on the chunked-prefill path.
  3. TRITON_MLA_SPARSE backend. triton_mla_sparse_kernel.py adds a split-KV decode with N-way online-softmax merge plus a single-pass fast path. Autotune is warmed at init using indexer-derived (n_head, head_dim). Masked-out sentinel is -1e30 to avoid NaN from (-inf) − (-inf) on all-masked tiles.
  4. Cudagraph support. TritonMLASparseMetadataBuilder advertises AttentionCGSupport.UNIFORM_BATCH; flips A100 TP=8 back to FULL_AND_PIECEWISE.
  5. MXFP4 link stubs. mxfp4_experts_quant / silu_and_mul_mxfp4_experts_quant stubs in nvfp4_quant_entry.cu. Real impls are SM10.x-only in CMake but torch_bindings.cpp references them unconditionally, which breaks source builds on SM 8.x.
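The finite sentinel in item 3 can be illustrated with a scalar online-softmax update. This is a minimal pure-Python sketch, not the PR's Triton kernel; `online_softmax_step` and `NEG_SENTINEL` are illustrative names.

```python
import math

NEG_INF = float("-inf")
NEG_SENTINEL = -1e30  # finite stand-in for "masked out", value taken from the PR text


def online_softmax_step(m_prev, l_prev, tile_scores):
    """Fold one tile of scores into the running (max, normalizer) pair."""
    m_new = max(m_prev, max(tile_scores))
    # Rescale the old normalizer to the new running max.
    alpha = math.exp(m_prev - m_new)
    l_new = l_prev * alpha + sum(math.exp(s - m_new) for s in tile_scores)
    return m_new, l_new


# All-masked first tile with -inf: (-inf) - (-inf) is NaN and poisons the sum.
_, l_bad = online_softmax_step(NEG_INF, 0.0, [NEG_INF, NEG_INF])

# Same tile with the finite sentinel: every intermediate stays finite.
m, l = online_softmax_step(NEG_SENTINEL, 0.0, [NEG_SENTINEL, NEG_SENTINEL])
# A later tile with real scores rescales the masked contributions to zero.
m, l = online_softmax_step(m, l, [1.0, 2.0])
```

With the sentinel, the masked tile's weight underflows to exactly zero once a real score raises the running max, so the final normalizer only reflects unmasked entries.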

Benchmarks

8×A100 SXM TP=8, lukealonso/GLM-5.1-NVFP4, single prompt, decode 200 tokens. cold = first request on a fresh prompt; warm = repeat (prefix cache hit):

| context | prefix | TTFT | TPOT median | output tok/s |
| --- | --- | --- | --- | --- |
| short (1,744 in) | cold | 0.72 s | 21.7 ms | 39.7 |
| short (1,744 in) | warm | 0.18 s | 21.8 ms | 44.3 |
| mid (65,536 in) | cold | 37.0 s | 24.3 ms | 4.8 |
| long (127,744 in) | cold | 98.1 s | 25.2 ms | 1.9 |
| long (127,744 in) | warm | 0.60 s | 25.3 ms | 35.5 |
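The output tok/s column appears consistent with dividing the 200 decoded tokens by TTFT plus 199 inter-token intervals at the median TPOT. The check below is one reading of the reported numbers, not the benchmark script itself; `end_to_end_tok_s` is an illustrative name.

```python
def end_to_end_tok_s(ttft_s, tpot_ms, n_tokens=200):
    """Tokens per second over the whole request: first token, then n-1 decode steps."""
    total_s = ttft_s + (n_tokens - 1) * tpot_ms / 1000.0
    return n_tokens / total_s


# (label, TTFT in seconds, median TPOT in ms, reported output tok/s)
rows = [
    ("short cold", 0.72, 21.7, 39.7),
    ("short warm", 0.18, 21.8, 44.3),
    ("mid cold", 37.0, 24.3, 4.8),
    ("long cold", 98.1, 25.2, 1.9),
    ("long warm", 0.60, 25.3, 35.5),
]

derived = {label: end_to_end_tok_s(ttft, tpot) for label, ttft, tpot, _ in rows}
for label, _, _, reported in rows:
    print(f"{label}: derived {derived[label]:.2f}, reported {reported}")
```

Under this reading, the cold long-context rows are dominated by prefill time (TTFT), while the per-token decode cost barely changes with context length.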

Tests

  • tests/kernels/attention/test_mqa_logits_triton.py — 41 cases (DeepGEMM reference + clean/dirty clean_logits + 256-byte FP8 decode).
  • tests/kernels/attention/test_triton_mla_sparse_kernel.py — 53 cases (split vs single-pass + auto-heuristic + short-prefill no-NaN).
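The 256-entry FP8 decode table from item 2 of the changes, which the first test file exercises, can be sketched in pure Python. This assumes the FP8 format is E4M3 in its "fn" variant (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities), which the PR text does not state. The PR's table holds bf16 values on the GPU, whereas this sketch uses Python floats; `decode_e4m3fn`, `FP8_LUT`, and `dequantize` are illustrative names.

```python
import math


def decode_e4m3fn(byte):
    """Decode one FP8 E4M3 (fn variant) byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return math.nan  # the only NaN pattern per sign; the format has no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


# One entry per possible byte value, so decoding becomes a table lookup.
FP8_LUT = [decode_e4m3fn(b) for b in range(256)]


def dequantize(raw_bytes, scale):
    """Look up each byte and apply a per-block scale."""
    return [FP8_LUT[b] * scale for b in raw_bytes]
```

Every E4M3 value fits exactly in bf16 (3 mantissa bits against 7), so storing the table in bf16 loses nothing under this assumption.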

Limitations (follow-up): BF16 KV cache only on SM80/SM121; VLLM_BATCH_INVARIANT should force num_kv_splits=1 — not wired.
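The batch-invariance limitation plausibly follows from how split-KV results are combined: each split yields a partial (max, normalizer, weighted sum), and merging them reorders floating-point additions relative to a single pass. The sketch below is a scalar-valued pure-Python model with illustrative names, not the PR's kernel. It shows the merge matching single-pass softmax attention only up to rounding, which is presumably why a bitwise-reproducible mode would want a single split.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted sum over one KV split, kept unnormalized."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    acc = sum(w * v for w, v in zip(weights, values))
    return m, l, acc


def merge_partials(partials):
    """N-way merge: rescale every split to the global max, then normalize once."""
    m_all = max(m for m, _, _ in partials)
    l_all = sum(l * math.exp(m - m_all) for m, l, _ in partials)
    acc_all = sum(acc * math.exp(m - m_all) for m, _, acc in partials)
    return acc_all / l_all


scores = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

m, l, acc = partial_attention(scores, values)
single_pass = acc / l

splits = [partial_attention(scores[i:i + 2], values[i:i + 2]) for i in range(0, 6, 2)]
merged = merge_partials(splits)
```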

@mergify

mergify Bot commented Mar 29, 2026

Contributor

Documentation preview: https://vllm--38476.org.readthedocs.build/en/38476/

@mergify mergify Bot added the documentation, nvidia, rocm, and v1 labels Mar 29, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 29, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the TRITON_MLA_SPARSE attention backend, providing a Triton-based fallback for sparse MLA on GPUs like NVIDIA Ampere. It also refactors FP8 MQA logit fallbacks into a new module and updates the sparse attention indexer to use these PyTorch implementations when DeepGEMM is unsupported. A review comment suggests moving a module-level import to the top of the file to comply with PEP 8 guidelines.

Logits tensor of shape [B * next_n, max_model_len], dtype
`torch.float32`.
"""
from vllm.utils.math_utils import cdiv

Contributor

high

To adhere to PEP 8 guidelines, module-level imports should be placed at the top of the file. Please move this import statement to the top of the module, for example, after the from vllm.platforms import current_platform import. This improves code readability and consistency.

References
  1. PEP 8: E402 module level import not at top of file. Imports should be at the top of the file, just after any module comments and docstrings, and before module globals and constants.

@ZJY0516

ZJY0516 commented Mar 29, 2026

Member

> Add PyTorch fallback for indexer MQA logits: Created a shared mqa_logits_fallback.py module with fp8_mqa_logits_torch and fp8_paged_mqa_logits_torch implementations (extracted from existing ROCm fallback code). The sparse attention indexer now dispatches to these fallbacks when DeepGEMM is not supported.

FYI, we don’t plan to support a torch native mqa_logits implementation. I also question whether it’s necessary to support sparse MLA on SM80.

@ehfd

ehfd commented Mar 30, 2026

Contributor

> > Add PyTorch fallback for indexer MQA logits: Created a shared mqa_logits_fallback.py module with fp8_mqa_logits_torch and fp8_paged_mqa_logits_torch implementations (extracted from existing ROCm fallback code). The sparse attention indexer now dispatches to these fallbacks when DeepGEMM is not supported.
>
> FYI, we don’t plan to support a torch native mqa_logits implementation. I also question whether it’s necessary to support sparse MLA on SM80.

@ZJY0516

Update: This PR is no longer a torch-native mqa_logits implementation; it now uses Triton instead.

Ampere is still extremely commonly used... We need this for DS3.2 or GLM-5. TRITON_MLA_SPARSE is possible, so we need an option for that.

@ehfd

ehfd commented Mar 30, 2026

Contributor

> PyTorch fallback for indexer logits is slower than DeepGEMM; a Triton-based variant can follow

Perhaps we can integrate Triton now?

@workcode-del

workcode-del commented Mar 31, 2026

Hello, after I modified the code according to your PR, the GLM5 model service started normally. However, responses are very slow, at only about 3 tokens per second. My device is also an A800 with 80 GB of memory. Is this normal?
@haosdent

@ehfd

ehfd commented Mar 31, 2026

Contributor

@workcode-del I believe that it's only expected to be anywhere remotely fast when no PyTorch fallbacks exist.

@workcode-del

> @workcode-del I believe that it's only expected to be anywhere remotely fast when no PyTorch fallbacks exist.

Could you please explain how to achieve the condition where there are no PyTorch fallbacks?

@mergify

mergify Bot commented Apr 1, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 1, 2026
@ehfd

ehfd commented Apr 1, 2026

Contributor

@workcode-del
Further development is necessary.

@ehfd

ehfd commented Apr 2, 2026

Contributor

#37968, #35271, #38076, #36519

These probably contain the reason why needs-rebase was added.

@ianlevesque

I was able to use this patch to run GLM-5.1 on an 8-node DGX Spark cluster. Performance is obviously not stellar (~5 t/s) but it's a great first step with compatibility.

@haosdent

Contributor Author

Thanks @ianlevesque, your 8 x DGX Spark setup is incredible! I just added new Triton kernels to try to address the performance issue. Could you test again when you are available?

@ianlevesque

@haosdent retried with the new patch, it did improve to 10 t/s or so.

  Depth=0 (fresh context)

  ┌────────┬────────────────────────┬───────────────┐
  │  Test  │    Throughput (t/s)    │   TTFT (ms)   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp512  │ 425–452                │ 1,059–1,091   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp2048 │ 606–630                │ 2,892–2,971   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp8192 │ 723–730                │ 10,041–10,090 │
  ├────────┼────────────────────────┼───────────────┤
  │ tg32   │ 9.91–10.29 (peak 11.0) │ —             │
  ├────────┼────────────────────────┼───────────────┤
  │ tg128  │ 9.91–10.29 (peak 11.0) │ —             │
  ├────────┼────────────────────────┼───────────────┤
  │ tg512  │ 9.85–10.23 (peak 11.0) │ —             │
  └────────┴────────────────────────┴───────────────┘

  Depth=2048 (warm context)

  ┌────────┬──────────────────┬───────────────┐
  │  Test  │ Throughput (t/s) │   TTFT (ms)   │
  ├────────┼──────────────────┼───────────────┤
  │ pp512  │ 667–697          │ 3,250–3,403   │
  ├────────┼──────────────────┼───────────────┤
  │ pp2048 │ 656–689          │ 5,282–5,618   │
  ├────────┼──────────────────┼───────────────┤
  │ pp8192 │ 717–731          │ 12,373–12,800 │
  ├────────┼──────────────────┼───────────────┤
  │ tg32   │ 9.18–9.77        │ —             │
  ├────────┼──────────────────┼───────────────┤
  │ tg128  │ 9.49–9.79        │ —             │
  ├────────┼──────────────────┼───────────────┤
  │ tg512  │ 9.51–9.70        │ —             │
  └────────┴──────────────────┴───────────────┘

@songshu0427-lgtm

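@ehfd @songshu0427-lgtm My problem is solved.

Initially, I ran Claude Code through claude-code-router, with GLM-5.1 served by vLLM. Whenever Claude entered plan mode, I hit the following error:

Failed to parse the model's tool call (retry also failed).

I also found an issue that seems related: #42400

Later, I switched to running Claude Code through agentfw, with the coding model still GLM-5.1 served by vLLM. After the switch everything worked, with no more tool-call errors.

Setup:

npm install -g @openguardrails/agentfw
agentfw claude

AgentFW: https://github.com/openguardrails/agentfw

Some screenshots are attached below for reference.

  1. Configure vllm-glm5.1 as the text model (I also configured Qwen3.6 as the vision model to give glm5.1 "eyes").
  2. Run agentfw claude.
  3. Check the run, including the tool calls and tool-call results.

I have tracked down my problem as well.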

@zt1024

zt1024 commented Jun 12, 2026

Does this patch support turboquant-vllm? Since turboquant-vllm enables KV cache compression, I'm interested in its compatibility. I actually opened this issue: https://github.com/varjoranta/turboquant-vllm/issues/56. Could you please help me look into it? @haosdent

@ehfd

ehfd commented Jun 17, 2026

Contributor

@haosdent Can we add support for GLM-5.2-NVFP4? It seems that it has a specific shape that needs to be added manually (576 MLA heads?).

https://huggingface.co/lukealonso/GLM-5.2-NVFP4

@ehfd

ehfd commented Jun 17, 2026

Contributor

I think merging the main branch might solve it.

@ghostplant

ghostplant commented Jun 17, 2026

env: A800-SXM + haosdent/vllm-nightly:fix-38006-6 (https://hub.docker.com/r/haosdent/vllm-nightly/tags)
deploy:

sudo docker run -d \
    --name glm5_1_vllm \
    --gpus all \
    --network host \
    --ipc host \
    --shm-size 32g \
    --security-opt label=disable \
    --restart no \
    -v models:models:ro \
    haosdent/vllm-nightly:fix-38006-6 \
    models/cyankiwi/GLM-5.1-AWQ-4bit \
    --served-model-name glm5.1 \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --trust-remote-code \
    --enable-expert-parallel \
    --tensor-parallel-size 8 \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --chat-template-content-format=string

bench:

vllm bench serve \
    --backend openai-chat \
    --base-url http://127.0.0.1:12345 \
    --endpoint /v1/chat/completions \
    --model glm5.1 \
    --tokenizer models/cyankiwi/GLM-5.1-AWQ-4bit \
    --trust-remote-code \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 128 \
    --max-concurrency 32 \
    --request-rate inf \
    --seed 42 \
    --save-result \
    --result-dir workspace_2026_4/deploy/logs \
    --result-filename bench_glm5_1_p128_c32_20260521-092828.json

result: c=32:

============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  84.09     
Total input tokens:                      27829     
Total generated tokens:                  30153     
Request throughput (req/s):              1.52      
Output token throughput (tok/s):         358.57    
Peak output token throughput (tok/s):    512.00    
Peak concurrent requests:                36.00     
Total token throughput (tok/s):          689.50    
---------------Time to First Token----------------
Mean TTFT (ms):                          912.35    
Median TTFT (ms):                        320.07    
P99 TTFT (ms):                           2867.16   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.80     
Median TPOT (ms):                        72.76     
P99 TPOT (ms):                           95.51     
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.81     
Median ITL (ms):                         64.35     
P99 ITL (ms):                            296.68    
==================================================

c=96:

============ Serving Benchmark Result ============
Successful requests:                     384       
Failed requests:                         0         
Maximum request concurrency:             96        
Benchmark duration (s):                  119.69    
Total input tokens:                      83110     
Total generated tokens:                  80182     
Request throughput (req/s):              3.21      
Output token throughput (tok/s):         669.92    
Peak output token throughput (tok/s):    1056.00   
Peak concurrent requests:                105.00    
Total token throughput (tok/s):          1364.30   
---------------Time to First Token----------------
Mean TTFT (ms):                          570.01    
Median TTFT (ms):                        455.85    
P99 TTFT (ms):                           1251.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.69    
Median TPOT (ms):                        120.11    
P99 TPOT (ms):                           189.35    
---------------Inter-token Latency----------------
Mean ITL (ms):                           114.45    
Median ITL (ms):                         95.28     
P99 ITL (ms):                            353.53    
==================================================

I have reproduced these results, and they are credible.

Does a single 8×A100 SXM machine work at 96 concurrency?

@Kasempiternal

Kasempiternal commented Jun 18, 2026

Thanks for this backend, @haosdent — it's the missing piece for sparse-MLA on Ampere. We used it to bring up GLM-5.2 (glm_moe_dsa) on 32×A100 (sm80, BF16, TP=4×PP=8) as a dataset-generation teacher, and it works: healthy server, CUDA-graph optimized, coherent output, ~264.6 tok/s aggregate (64-concurrent). Sharing what GLM-5.2 needed on top of this PR, in case it's useful for broader DSA support.

1. GLM-5.2 needs IndexShare (this PR builds the indexer per-layer)

GLM-5.2 uses a per-layer config.indexer_types pattern (full,full,full,shared,shared,shared,full,...). shared layers don't run their own indexer — they reuse the previous full layer's top-k indices and ship no indexer weights in the checkpoint. Reusing the DeepSeek-V3.2 path (one indexer per layer) makes vLLM (a) fail to load (ValueError: ...indexer.k_norm not initialized on the shared layers) and (b) be numerically wrong even if forced (uninitialized indexers). (This mirrors transformers' own GLM-5 skip-topk sharing, huggingface/transformers#46372.)

The good news: the machinery already exists here — mla.py gates the indexer on not skip_topk, and topk_indices_buffer is allocated once and shared across layers. We just drove skip_topk from indexer_types. In DeepseekV2MLAAttention.__init__ (deepseek_v2.py), right after the existing use_index_cache block:

# GLM-5.2 IndexShare: layers with indexer_types[i] == "shared" reuse the previous
# "full" layer's top-k indices (via the shared topk_indices_buffer) instead of
# running their own indexer. Matches modeling_glm_moe_dsa.py:
#   self.skip_topk = config.indexer_types[layer_idx] == "shared"
indexer_types = getattr(config, "indexer_types", None)
if indexer_types is not None:
    _layer_idx = extract_layer_index(prefix)
    if 0 <= _layer_idx < len(indexer_types) and indexer_types[_layer_idx] == "shared":
        _skip_topk = True

extract_layer_index is already imported. Layers 0-2 (full) write the buffer; 3-5 (shared) reuse layer 2; layer 6 (full) overwrites; 7-9 reuse layer 6 — bit-identical to modeling_glm_moe_dsa.py. The shared layers' indexer submodules are still constructed (for buffer access) but never called, so their checkpoint-absent weights stay unused (we ran with --model-loader-extra-config.enable_weights_track=false; a cleaner variant skips building the submodule for shared layers). We verified shared-layer output equals reusing the prior full layer's indices on a dummy-weights model with two shared groups.
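The sharing rule described above (each shared layer consumes the top-k indices written by the most recent full layer) can be checked with a small standalone helper. This is a sketch for reasoning about the pattern, not code from the PR; `topk_source_layers` is a hypothetical name, and the ten-layer pattern is the example from the comment.

```python
def topk_source_layers(indexer_types):
    """For each layer, return the index of the layer whose top-k indices it consumes."""
    sources = []
    last_full = None
    for layer_idx, kind in enumerate(indexer_types):
        if kind == "full":
            last_full = layer_idx  # this layer runs its own indexer and overwrites the buffer
        elif kind == "shared":
            if last_full is None:
                raise ValueError(f"layer {layer_idx} is shared but no full layer precedes it")
        else:
            raise ValueError(f"unknown indexer type {kind!r} at layer {layer_idx}")
        sources.append(last_full)
    return sources


pattern = ["full"] * 3 + ["shared"] * 3 + ["full"] + ["shared"] * 3
sources = topk_source_layers(pattern)
skip_topk = [kind == "shared" for kind in pattern]
```

A forward-equivalence test for the shared-layer path could compare each shared layer's indices against the layer this helper names.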

2. Heads-up: a transformers config-alias bug (already fixed on main)

Older transformers releases (≤5.8.1) alias attribute_map = {"head_dim": "qk_rope_head_dim"}, and GLM-5.2's config.json ships both head_dim: 192 and qk_rope_head_dim: 64, so head_dim (192, == qk_nope) clobbers the true qk_rope_head_dim (64) — inflating the MLA head 576→704 and breaking weight load (fused_qkv_a_proj narrow 704 vs 576). This is already fixed on transformers main (huggingface/transformers#46338 removed the alias and sets head_dim = qk_rope_head_dim in __post_init__), so it only affects pinned older releases — there, --hf-overrides '{"qk_rope_head_dim": 64}' is the workaround. Flagging only because GLM-5.2 users on a pinned transformers will still hit it.
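The 576 versus 704 figures follow from simple arithmetic if the fused projection width is the compressed KV rank plus the rope dimension. The rank of 512 below is inferred as 576 - 64 and is not stated in the comment; the variable and function names are illustrative.

```python
QK_ROPE_HEAD_DIM = 64     # true rope dim from config.json, per the comment
HEAD_DIM = 192            # config.json's head_dim, which the alias writes over the rope dim
MLA_WIDTH_EXPECTED = 576  # width the checkpoint expects, per the comment

# Inferred, not quoted: the non-rope part of the fused width.
kv_rank = MLA_WIDTH_EXPECTED - QK_ROPE_HEAD_DIM


def mla_width(rank, rope_dim):
    """Width of the fused KV projection output under the stated assumption."""
    return rank + rope_dim


correct = mla_width(kv_rank, QK_ROPE_HEAD_DIM)  # rope dim left intact
clobbered = mla_width(kv_rank, HEAD_DIM)        # rope dim overwritten by head_dim
```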

3. CUDA graphs on sm80 (Triton indexer fallback)

vllm::sparse_attn_indexer is in splitting_ops, but cudagraph_mode=FULL_AND_PIECEWISE captures the whole decode and pulls the autotuned Triton indexer into the capture stream → CUDA error: operation not permitted when stream is capturing. Serving with --compilation-config.cudagraph_mode=PIECEWISE (indexer runs eager at the split) + --disable-custom-all-reduce works and keeps graph acceleration for the dense/MoE path. Might be worth auto-downgrading FULL→PIECEWISE when a split op is non-capturable, or a docs note for non-Hopper DSA.


I've opened the IndexShare change above as a small PR onto fix-38006 (haosdent#7), DCO-signed — happy to restructure however you'd like it folded in, and I can add a forward-equivalence test for the shared-layer path. Thanks again for driving this.

@ehfd

ehfd commented Jun 18, 2026

Contributor

@Kasempiternal Thank you so much for the contribution!

@timinar

timinar commented Jun 20, 2026

Validation run on our 8×A100 box; investigation + write-up done with Claude Code.

Independent confirmation: GLM-5.2 (glm_moe_dsa / DeepSeek Sparse Attention) running on 8x A100-80GB thanks to this PR. Huge thanks to @haosdent — the TRITON_MLA_SPARSE backend plus the bf16/Triton mqa_logits indexer fallback are exactly what unblocks sm_80, where stock vLLM dies with "FlashMLA Sparse is only supported on Hopper and Blackwell devices" (the issue here, plus #35021 / #41525).

Setup

  • Model: cyankiwi/GLM-5.2-AWQ-INT4 (753B MoE, compressed-tensors WNA16 INT4, indexer kept bf16, ~411 GiB on disk).
  • vLLM base: main d272418 (0.23.1rc1.dev209+gd272418f4); this PR cherry-picks cleanly as a single commit on top.
  • One conflict in sparse_attn_indexer.py (main since added an XPU branch): resolved into a three-way is_xpu() / use_deep_gemm / Triton-fallback dispatch, and took the PR's logger.warning_once over main's hard RuntimeError in __init__ so sm_80 routes through Triton instead of aborting.
  • I dropped the PR's lone .cu file (nvfp4_quant_entry.cu, an SM100 stub) — on this base it is never referenced at link time on an sm_80 build, which keeps the changeset python-only. That meant no CUDA recompile: I installed the patched tree editable with VLLM_USE_PRECOMPILED=1 over the d272418 cu130 nightly wheel. vllm --version shows ...+gd272418f4.precompiled.
  • 8x A100-SXM4-80GB, TP=8, bf16 KV cache (--kv-cache-dtype auto, not fp8), --no-async-scheduling, --gpu-memory-utilization 0.90, --max-model-len 32768. Weights load at 54.45 GiB/rank; GPU KV cache = 151,040 tokens.

Proof it routes correctly (server startup log):

WARNING [sparse_attn_indexer.py:482] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer.
INFO    [cuda.py:457] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE'].

is_deep_gemm_supported() returns False on A100, so the indexer takes the Triton path as intended.
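The three-way dispatch from the conflict resolution can be summarized as a pure function. This is a hedged sketch: `pick_indexer_path` is a hypothetical helper, and the DeepGEMM condition (SM 9.0 or the SM 10.x family) is an assumption chosen to match the thread's statement that SM80 and SM121 are excluded. It is not a copy of `is_deep_gemm_supported()`.

```python
def pick_indexer_path(is_xpu, capability):
    """Return which MQA-logits implementation the indexer would use.

    capability is a (major, minor) compute-capability tuple, e.g. (8, 0) for A100.
    """
    major, minor = capability
    if is_xpu:
        return "xpu"
    # Assumed support set: Hopper (9.0) and the 10.x Blackwell family.
    if (major, minor) == (9, 0) or major == 10:
        return "deep_gemm"
    # Everything else, including SM80 and SM121, takes the Triton fallback.
    return "triton"


for cap in [(8, 0), (9, 0), (10, 0), (12, 1)]:
    print(cap, "->", pick_indexer_path(False, cap))
```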

Throughput (512-tok prompt / 128-tok greedy gen, ignore_eos):

| Concurrency | Aggregate decode tok/s | Total tok/s | Mean TTFT (s) |
| --- | --- | --- | --- |
| 1 | 56.4 | 281.9 | 0.069 |
| 4 | 185.6 | 928.1 | 0.156 |
| 8 | 334.8 | 1674.1 | 0.226 |
| 16 | 548.0 | 2740.1 | 0.388 |
| 32 | 625.4 | 3127.1 | 0.562 |

~56 tok/s single-stream decode, plateauing near 625 tok/s aggregate at 32-way concurrency, TTFT sub-second throughout. For reference, llama.cpp (GGUF UD-Q4_K_XL) on the same 8x A100 does ~24.5 tok/s single-stream and saturates around 70 tok/s aggregate — so this path is ~2.3x single-stream and ~9x aggregate.

Output is coherent (verified on arithmetic, Rayleigh scattering, and a "who wrote Hamlet" prompt that closed its <think> block and returned a clean final sentence). Note GLM-5.2 is a reasoning model and the default max reasoning effort can eat the whole token budget, so size max_tokens accordingly.

Full write-up (weights -> cherry-pick + conflict resolution -> precompiled overlay install -> serve -> benchmarks): https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2.

@ghostplant

Is 32-way concurrency meaningful on SM_8x? A 1M-token context is unlikely to work at that concurrency.

@timinar

timinar commented Jun 20, 2026

> Is 32-way concurrency meaningful on SM_8x? A 1M-token context is unlikely to work at that concurrency.

Not really. It was max 32k context, I believe. So too short for multi-turn agentic tasks, but could be useful for some other types of local work.

@ehfd

ehfd commented Jun 21, 2026

Contributor

Actually, pipeline parallelism multiplies the available context by the number of GPUs, because GLM-5.x uses MLA.

@ghostplant

ghostplant commented Jun 21, 2026

> > Is 32-way concurrency meaningful on SM_8x? A 1M-token context is unlikely to work at that concurrency.
>
> Not really. It was max 32k context, I believe. So too short for multi-turn agentic tasks, but could be useful for some other types of local work.

I realized this thread assumes INT4 (Q4) rather than NVFP4. With any 4-bit solution, a 1M context window on an 8×A100 SXM node is still possible. However, vLLM would additionally need to implement context parallelism rather than relying on pipeline or tensor parallelism. Don't be discouraged, and keep the enhancements coming.

The AIME benchmark requires at least 160K context to complete reasoning, so 32K isn't long enough. My current difficulty is reproducing their 99% score on AIME-2026.

@ehfd

ehfd commented Jun 21, 2026

Contributor

In MLA models, the number of KV cache heads is effectively 1, so tensor parallelism duplicates the KV cache on every rank. Decode context parallelism de-duplicates it, but so far that is implemented only for dense MLA, not for Sparse MLA.

But in pipeline parallelism, there is no KV cache head duplication, so the KV cache gets multiplied by the number of GPUs.
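A rough capacity model makes the argument concrete. The sketch below assumes each GPU has the same memory left for KV cache under either scheme and ignores the indexer's own key cache; the 576-wide bf16 latent comes from this thread, while the layer count, memory budget, and the name `kv_token_capacity` are hypothetical.

```python
def kv_token_capacity(kv_bytes_per_gpu, bytes_per_token_per_layer, num_layers, pp_size):
    """Tokens that fit when each GPU stores KV for num_layers / pp_size layers.

    TP size is deliberately absent: with a single KV head there is nothing to
    shard, so every TP rank keeps a full copy and capacity does not grow with TP.
    """
    layers_per_gpu = num_layers // pp_size
    return kv_bytes_per_gpu // (bytes_per_token_per_layer * layers_per_gpu)


KV_BYTES_PER_GPU = 16 * 2**30    # hypothetical memory left for KV cache per GPU
BYTES_PER_TOKEN_LAYER = 576 * 2  # 576-wide latent stored as bf16 (2 bytes each)
NUM_LAYERS = 80                  # hypothetical, chosen to divide evenly by 8

tp8 = kv_token_capacity(KV_BYTES_PER_GPU, BYTES_PER_TOKEN_LAYER, NUM_LAYERS, pp_size=1)
pp8 = kv_token_capacity(KV_BYTES_PER_GPU, BYTES_PER_TOKEN_LAYER, NUM_LAYERS, pp_size=8)
print(f"TP=8: {tp8} tokens, PP=8: {pp8} tokens, ratio {pp8 / tp8:.2f}")
```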

@ghostplant

> In MLA models, the number of KV cache heads is effectively 1, so tensor parallelism duplicates the KV cache on every rank. Decode context parallelism de-duplicates it, but so far that is implemented only for dense MLA, not for Sparse MLA.
>
> But in pipeline parallelism, there is no KV cache head duplication, so the KV cache gets multiplied by the number of GPUs.

Pipeline parallelism supports 1M context but suffers from low cross-GPU utilization. Tutel images recently added support for paged Context-Sparse MLA: https://hub.docker.com/r/tutelgroup/deepseek-671b

@ehfd

ehfd commented Jun 22, 2026

Contributor

Is there a way to use proper tool calling and reasoning parsers with Tutel?

@ghostplant

ghostplant commented Jun 22, 2026

> Is there a way to use proper tool calling and reasoning parsers with Tutel?

Claude Code connecting to Tutel will directly trigger GLM's tool calling.
But OpenAI chat completion interfaces cannot use tools at this point.

Without tool calls, I get only 90% on AIME-26 with GLM-5.2-NVFP4, whereas GLM-5-NVFP4 and GLM-5.1-NVFP4 scored above 96%. I don't know whether NVFP4 quantization hurts GLM-5.2 in particular, so I wonder what the AIME-26 score would be with Q4.

Maybe the officially claimed 99% on AIME-26 is not reproducible. However, I don't have a strong enough GPU environment (Hxx/Bxx) to evaluate GLM-5.2-BF16 or GLM-5.2-FP8.

@songshu0427-lgtm

tool call parse fix
glm47_stateful_tool_parser_remote_image.patch

Comment thread vllm/platforms/cuda.py
AttentionBackendEnum.FLASHMLA,
AttentionBackendEnum.FLASHINFER_MLA,
AttentionBackendEnum.TRITON_MLA,
AttentionBackendEnum.TRITON_MLA_SPARSE,

Member

This backend shouldn't be prioritized over FlashMLA sparse; doing so would hurt SM90 performance. Please swap these two.
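The concern is about first-match selection over an ordered list. The toy model below shows why order matters on hardware that supports both backends; the backend name strings, the support sets, and `select_backend` are illustrative assumptions, not vLLM's actual selection code.

```python
def select_backend(priority, supported):
    """Return the first backend in priority order that the platform supports."""
    for name in priority:
        if name in supported:
            return name
    raise RuntimeError("no supported sparse MLA backend")


# Assumed support sets for illustration only.
SM90_SUPPORTED = {"FLASHMLA_SPARSE", "TRITON_MLA_SPARSE"}
SM80_SUPPORTED = {"TRITON_MLA_SPARSE"}

requested_order = ["FLASHMLA_SPARSE", "TRITON_MLA_SPARSE"]
swapped_order = ["TRITON_MLA_SPARSE", "FLASHMLA_SPARSE"]

on_sm90 = select_backend(requested_order, SM90_SUPPORTED)
on_sm90_swapped = select_backend(swapped_order, SM90_SUPPORTED)
on_sm80 = select_backend(requested_order, SM80_SUPPORTED)
```

With the order the reviewer asks for, SM90 keeps its native backend while SM80 still falls through to the Triton one.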

@halexan

halexan commented Jun 25, 2026

Great job!

@halexan

halexan commented Jun 26, 2026

Hi @timinar

How is the accuracy recovery of cyankiwi/GLM-5.2-AWQ-INT4?

Is it as smart as the non-quantized version of the model?

@ghostplant

> Hi @timinar
>
> How is the accuracy recovery of cyankiwi/GLM-5.2-AWQ-INT4?
>
> Is it as smart as the non-quantized version of the model?

I tested the NVFP4 version instead of INT4. Without tool use, AIME-2026 reaches only 90-93%; enabling tools improves it to above 99%.

@RefalMachine

RefalMachine commented Jun 27, 2026

Has anyone encountered the following issue when trying to deploy GLM-5.2 on 8x A100 GPUs with Nvidia 535 drivers (cu129)?

  1. I am following this guide to enable the Triton fallback sparse MLA path: https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2
    with vllm/vllm-openai:cu129-nightly-a346d589f5932d4234bf5bf8718f10e26d187021 container as a base

  2. During the CUDA graph profiling stage (profile_cudagraph_memory), the engine crashes with the following error:
    RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
    This originates from vllm/v1/attention/ops/mqa_logits_triton.py (inside the fp8_paged_mqa_logits_triton execution path).

  3. To bypass this, I had to implement a manual warm-up sequence in the init method of SparseAttentionIndexer:
    warmup_fp8_mqa_logits_triton(num_heads, self.head_dim, device)
    warmup_fp8_paged_mqa_logits_triton(num_heads, self.head_dim, block_size, device)
    Additionally, I patched the @triton.jit decorators for both prefill and decode kernels in mqa_logits_triton.py to include do_not_specialize for all tensor stride arguments (to prevent Triton JIT re-compilation during graph profiling under different batch sizes):
    @triton.jit(do_not_specialize=[
    "stride_q_b", "stride_q_n", "stride_q_h", "stride_q_d",
    "stride_kvf_block", "stride_kvf_s", "stride_kvf_d",
    ...
    ])

  4. Now the model successfully boots up, but only if I increase --gpu-memory-utilization to 0.95. With the reference 0.90 (from the gist), it raises ValueError: No available memory for the cache blocks even for a small 32768 context window.
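The reasoning behind the do_not_specialize patch in step 3 can be illustrated with a toy model of JIT specialization. It assumes, as a simplification of Triton's behavior, that an integer argument is specialized on being equal to 1 or divisible by 16; the function names and stride values are hypothetical, and nothing here calls Triton.

```python
def specialization_key(int_args, do_not_specialize=()):
    """Toy cache key: one tag per integer argument."""
    key = []
    for name, value in sorted(int_args.items()):
        if name in do_not_specialize:
            key.append((name, "generic"))  # opted out: every value shares one variant
        elif value == 1:
            key.append((name, "is_one"))
        elif value % 16 == 0:
            key.append((name, "div16"))
        else:
            key.append((name, "generic"))
    return tuple(key)


def count_compiles(calls, do_not_specialize=()):
    """Number of distinct kernel variants the toy JIT would build."""
    return len({specialization_key(c, do_not_specialize) for c in calls})


# Hypothetical stride values seen across different batch shapes.
calls = [{"stride_q_b": s} for s in (1, 24, 64, 96, 100)]

specialized = count_compiles(calls)
unspecialized = count_compiles(calls, do_not_specialize=("stride_q_b",))
```

Each extra variant is a compilation that could land inside graph capture, which is consistent with the reported "operation not permitted when stream is capturing" error when a new shape first appears during profiling.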

This workaround feels too hacky and diverges significantly from the reference behavior in the gist (where things reportedly work out-of-the-box on 0.90 memory limit without Triton capturing crashes).

Has anyone successfully run this without such JIT-warmup patches on the newer vLLM V1 engine? Any insights on why the memory footprint/graph profiling behavior differs so much from the reference would be greatly appreciated!

UPD: The problem was with the container/main versions. Here's the working Dockerfile.

# Nightly image from June 22, 2026 — functionally identical to commit 435f82d61
# (the commit from the rumor). Diff between 9037498c2 and 435f82d61 is 1 test-only file.
# This is the exact window when @timinar's guide was written (June 20-24).
FROM vllm/vllm-openai:cu129-nightly-9037498c22891e55b594f567fb91d9b4efbf3e99

USER root

# 1. Install minimal git
RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

WORKDIR /vllm-workspace

# 2. Clone the repo, pin commit 435f82d61 (the one from the rumor), and cherry-pick PR #38476
#    PR #38476 adds TRITON_MLA_SPARSE attention backend for SM80/SM121 sparse MLA support.
#    On this commit, the only conflicts are in sparse_attn_indexer.py and platforms/cuda.py.
#    Unlike newer nightlies (June 26+), no CUDA graph capture hacks are needed here because
#    the Triton MoE regression (#46142/#46254) is not yet in main.
RUN git config --global user.email "auto-resolver@vllm.ai" && \
    git config --global user.name "Auto Resolver" && \
    git clone https://github.com/vllm-project/vllm && \
    cd vllm && \
    git checkout 435f82d61a1eddb84854ca59a008a8e4d97ab439 && \
    git fetch origin pull/38476/head:pr-38476 && \
    (git cherry-pick pr-38476 || echo "Merge conflict expected, proceeding to patch...")

WORKDIR /vllm-workspace/vllm

# 3.1. Patch sparse_attn_indexer.py — Conflict 1: fp8_mqa_logits
#      Resolve to a three-way dispatch: XPU (from main) → DeepGEMM (from PR) → Triton fallback (from PR).
#      On SM80/A100: is_deep_gemm_supported()=False, so Triton fallback is used automatically.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+if current_platform\.is_xpu\(\):.*?fp8_mqa_logits_triton.*?>>>>>>> \w+.*?[^\n]*"; rep="            if current_platform.is_xpu():\n                if q_scale_slice is not None:\n                    raise RuntimeError(\"XPU fp8_mqa_logits does not support FP4 Q\")\n                logits = torch.ops.vllm.xpu_fp8_mqa_logits(\n                    q_slice_cast,\n                    k_quant_cast,\n                    k_scale_cast,\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                )\n            elif use_deep_gemm:\n                logits = fp8_fp4_mqa_logits(\n                    (q_slice_cast, q_scale_slice),\n                    (k_quant_cast, k_scale_cast),\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                    clean_logits=False,\n                )\n            else:\n                logits = fp8_mqa_logits_triton(\n                    q_slice_cast,\n                    (k_quant_cast, k_scale_cast),\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                    clean_logits=False,\n                )"; c=re.sub(pat, rep, c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 3.2. Patch sparse_attn_indexer.py — Conflict 2: fp8_paged_mqa_logits
#      Same three-way dispatch pattern: XPU → DeepGEMM → Triton fallback.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+if current_platform\.is_xpu\(\):.*?fp8_paged_mqa_logits_triton.*?>>>>>>> \w+.*?[^\n]*"; rep="        if current_platform.is_xpu():\n            if padded_q_scale is not None:\n                raise RuntimeError(\"XPU fp8_paged_mqa_logits does not support FP4 Q\")\n            seq_lens_xpu = (\n                seq_lens[:, -1].contiguous() if seq_lens.ndim == 2 else seq_lens\n            )\n            logits = torch.ops.vllm.xpu_fp8_paged_mqa_logits(\n                padded_q_quant_cast,\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens_xpu,\n                decode_metadata.block_table,\n                decode_metadata.schedule_metadata,\n                max_model_len,\n            )\n        elif use_deep_gemm:\n            logits = fp8_fp4_paged_mqa_logits(\n                (padded_q_quant_cast, padded_q_scale),\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens,\n                decode_metadata.block_table,\n                decode_metadata.schedule_metadata,\n                max_model_len=max_model_len,\n                clean_logits=False,\n            )\n        else:\n            # SM80/SM121 Triton fallback.\n            active_max_model_len = attn_metadata_narrowed.max_seq_len\n            logits = fp8_paged_mqa_logits_triton(\n                padded_q_quant_cast,\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens,\n                decode_metadata.block_table,\n                max_model_len=active_max_model_len,\n                clean_logits=False,\n            )"; c=re.sub(pat, rep, c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 3.3. Patch sparse_attn_indexer.py — Remaining conflicts: take PR version (theirs)
#      This resolves the __init__ conflict: replaces main's hard RuntimeError (when DeepGEMM
#      is missing) with the PR's warn-and-fallback on `not is_deep_gemm_supported()`.
#      No JIT warmup hack needed on this base image.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+.*?=======\s+(.*?)>>>>>>> \w+.*?[^\n]*"; c=re.sub(pat, r"\1", c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 4. Patch platforms/cuda.py — if conflict, take PR version (theirs).
#    Adds TRITON_MLA_SPARSE to the supported backends list for SM80/SM121.
RUN python3 -c 'import re; p="vllm/platforms/cuda.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+.*?=======\s+(.*?)>>>>>>> \w+.*?[^\n]*"; c=re.sub(pat, r"\1", c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 5. Revert nvfp4_quant_entry.cu to HEAD to avoid rebuilding _C_stable_libtorch.abi3.so
#    (the PR's stub would break the .so link on SM8x — see @timinar's guide)
RUN git checkout HEAD -- csrc/libtorch_stable/quantization/fp4/nvfp4_quant_entry.cu

# 6. Verify no conflict markers remain, then finalize the cherry-pick
RUN python3 -c 'import sys; files=["vllm/model_executor/layers/sparse_attn_indexer.py","vllm/platforms/cuda.py"]; [sys.exit(1) for f in files for m in ["<<<<<<< HEAD","=======",">>>>>>>"] if m in open(f).read()]' && \
    git add vllm/model_executor/layers/sparse_attn_indexer.py && \
    git add vllm/platforms/cuda.py && \
    git add csrc/libtorch_stable/quantization/fp4/nvfp4_quant_entry.cu && \
    (git add docs/design/attention_backends.md 2>/dev/null || true) && \
    git -c core.editor=true cherry-pick --continue

# 7. Overlay the patched Python code onto the system vLLM installation
RUN cp -r /vllm-workspace/vllm/vllm/* /usr/local/lib/python3.12/dist-packages/vllm/

# Clean up temp repo
RUN rm -rf /vllm-workspace/vllm

# The backend auto-selects on SM80, but this env is harmless and matches @timinar's guide
ENV VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE

# vLLM entrypoint
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
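Steps 3.3, 4 and 6 of the Dockerfile pack the "take theirs" resolution and the leftover-marker check into python3 -c one-liners. For readability, here is a minimal standalone sketch of the same idea. The names take_theirs and has_conflict_markers are made up for this illustration, and the pattern assumes git's default two-sided conflict style (no ||||||| base section as produced by diff3).

```python
import re

# One conflict hunk: "ours" sits between <<<<<<< and =======,
# "theirs" between ======= and >>>>>>>. Markers must start a line.
CONFLICT_RE = re.compile(
    r"^<<<<<<< [^\n]*\n(?P<ours>.*?)^=======\n(?P<theirs>.*?)^>>>>>>> [^\n]*\n?",
    flags=re.DOTALL | re.MULTILINE,
)


def take_theirs(text):
    """Replace every conflict hunk with its 'theirs' side (the cherry-picked PR)."""
    return CONFLICT_RE.sub(lambda m: m.group("theirs"), text)


def has_conflict_markers(text):
    """Return True if any line is still a conflict marker."""
    return any(
        line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "======="
        for line in text.splitlines()
    )


sample = (
    "a = 1\n"
    "<<<<<<< HEAD\n"
    "raise RuntimeError('DeepGEMM missing')\n"
    "=======\n"
    "warn('falling back to Triton')\n"
    ">>>>>>> abc123 (cherry-picked commit)\n"
    "b = 2\n"
)
resolved = take_theirs(sample)
print(resolved)
```

Anchoring the markers to line starts is slightly stricter than the substring check in step 6, which would also trip on a line that merely contains a run of "=======" characters.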

@ehfd

ehfd commented Jun 28, 2026

Contributor

@RefalMachine Thank you so much for sharing the Dockerfile!

@RefalMachine


The current code deploys the model successfully for me. However, after some minutes of operation under high load, NaN values start appearing in the logits at random (it’s unclear where or why), and the output degenerates into continuous '!!!' tokens. Only a restart clears it. Unfortunately, I don't know how to fix this yet. If anyone has encountered this before, I would be grateful for any leads.

@Ph0enix89


I am able to run it on 40 GB A100 GPUs, using the latest main with the PR on top of it. Details below. There are two main issues:

  1. Tool calling doesn't work. Queries via curl work fine. However, whenever I try to use it in Hermes, there is eventually a timeout. The logs show: Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered. Has anyone run into this? Any recommendations? I tried launching with --chat-template-content-format string, but that doesn't help. There are some reports that tool calling works without pipeline parallelism, but these come from AMD ROCm systems, so it's not clear whether they are relevant.
  2. KV cache quantization is not available. It's been mentioned that the problem is that fp8e4nv is hardcoded and not supported on SM80. An AI assistant suggested adjusting the code to use fp8e5 for SM80, but it's not clear how much effort that would take. Current main also contains commit b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel"), which uses hardcoded fp8e4nv for some optimizations. It has to be reverted; otherwise vLLM fails during startup on A100 (when trying to load GLM-5.2). Since the A100s I have access to have only 40 GB, doubling the KV cache capacity would be very handy; that's probably the biggest throughput bottleneck.

As for the steps, I had to combine a bunch of scattered pieces to make it work on my system:

  1. Check out main
  2. Revert b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel")
  3. Follow steps in this guide to modify the code
  4. Add this patch
  5. Install the patched version with VLLM_USE_PRECOMPILED=1 uv pip install -e .
  6. Install runai-streamer with pip3 install vllm[runai]. Not strictly necessary but without it loading the model takes forever.
  7. Finally launch it with:
vllm serve "$MODEL" \
        --served-model-name glm-5.2 \
        --kv-cache-dtype auto \
        --tensor-parallel-size ${TP_SIZE} \
        --pipeline-parallel-size ${PP_SIZE} \
        --distributed-executor-backend mp \
        --nnodes ${NNODES} \
        --node-rank "${PROCID}" \
        --master-addr "$HEAD_HOST_IP" \
        --no-async-scheduling \
        --tool-call-parser glm47 \
        --enable-auto-tool-choice \
        --reasoning-parser glm45 \
        --gpu-memory-utilization 0.93 \
        --load-format runai_streamer \
        --compilation-config.cudagraph_mode=PIECEWISE \
        --disable-custom-all-reduce \
        --chat-template-content-format string \
        --block-size 128 \
        ${HEADLESS_FLAG}
It took a while to figure out that --block-size 128 is necessary. Without it, vLLM fails to start with RuntimeError: Worker failed with error 'No common block size for 16. ', please check the stack trace above for the root cause.

These are my steps that result in a working setup without tool calling support. Perhaps it helps some people. Any tips are welcome.
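On the fp8e4nv point in issue 2: assuming Triton only accepts fp8e4nv on CUDA devices with compute capability 8.9 or newer, that would explain the startup failure on A100 (8.0). A minimal sketch of a capability-based choice follows. pick_triton_fp8_dtype is a hypothetical helper, not something that exists in vLLM or in this PR, and the 8.9 threshold is an assumption that should be checked against the installed Triton version.

```python
def pick_triton_fp8_dtype(capability):
    """Return a Triton fp8 dtype name for a (major, minor) compute capability.

    Assumption: fp8e4nv (E4M3) needs compute capability >= 8.9; older parts
    such as A100 (8.0) fall back to fp8e5 (E5M2).
    """
    major, minor = capability
    return "fp8e4nv" if (major, minor) >= (8, 9) else "fp8e5"


a100 = pick_triton_fp8_dtype((8, 0))
h100 = pick_triton_fp8_dtype((9, 0))
print(a100, h100)
```

In real code the tuple would come from torch.cuda.get_device_capability(). Note that E5M2 trades mantissa bits for exponent range, so swapping the dtype name alone is not enough: the cache layout, scales, and accuracy would all need to be revisited.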

@ghostplant


Do you run this with 16 A100s?

@Ph0enix89


  1. 16 nodes with 4 GPUs each. Not the most efficient config but it is what it is.
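For scale, that layout is 64 GPUs rather than 16. Below is a quick sanity check of how the launch flags have to multiply out, assuming the usual vLLM rule that the world size equals tensor-parallel size times pipeline-parallel size. The TP_SIZE=4 / PP_SIZE=16 split is purely a guess for illustration, since the actual values were not posted, and both helper names are hypothetical.

```python
def world_size(nnodes, gpus_per_node):
    """Total number of GPUs across all nodes."""
    return nnodes * gpus_per_node


def layout_is_valid(nnodes, gpus_per_node, tp_size, pp_size):
    """Check that TP x PP covers exactly the available GPUs."""
    return tp_size * pp_size == world_size(nnodes, gpus_per_node)


total = world_size(16, 4)
# Hypothetical split: tensor parallel within a node, pipeline parallel across nodes.
ok = layout_is_valid(16, 4, tp_size=4, pp_size=16)
print(total, ok)
```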

@ehfd

ehfd commented Jun 29, 2026

Contributor

#43477 was merged and a maintainer said on Slack that they were open to merging this one after that PR.

Integration of GLM-5.2 and DeepSeek V4 support, as well as a rebase, is desired.

@haosdent

@lzf-tech


It seems @haosdent hasn’t been active on this project for some time. Really looking forward to the GLM-5.2 and DeepSeek V4 support once the rebase is done.


Labels

documentation, intel-gpu, needs-rebase, nvidia, rocm, v1

Development

Successfully merging this pull request may close these issues.

[Feature]: Implement TRITON_MLA_SPARSE backend for sm80/120/121 support of Sparse MLA