[Feature] TRITON_MLA_SPARSE backend for SM8x/11x/12x DSA Sparse MLA Support by haosdent · Pull Request #38476 · vllm-project/vllm

haosdent · 2026-03-29T15:23:23Z

Purpose

Closes #38006. Enables sparse MLA models (GLM-5, DeepSeek-V3.2) on SM80 (A100/A800) and SM121 (GB10/DGX Spark), where DeepGEMM / FlashMLA-Sparse / FlashInfer-MLA-Sparse are unavailable.

Changes

Dispatch guard. is_deep_gemm_supported() (SM90+ check) replaces has_deep_gemm() in sparse_attn_indexer.py / indexer.py. Stops DeepGEMM kernels from being invoked on SM80/SM121.
Triton fp8_mqa_logits for the indexer. mqa_logits_triton.py reproduces DeepGEMM's prefill + paged MQA logits. Prefill takes bf16 q/k (pre-decoded from FP8 in the Python wrapper) and feeds a straight tl.dot; paged decode keeps a 256-entry bf16 LUT for in-kernel FP8 decode. K-side scale applied to the fp32 dot output, per-row K-tile early-exit on the chunked-prefill path.
TRITON_MLA_SPARSE backend. triton_mla_sparse_kernel.py adds a split-KV decode with N-way online-softmax merge plus a single-pass fast path. Autotune is warmed at init using indexer-derived (n_head, head_dim). Masked-out sentinel is -1e30 to avoid NaN from (-inf) − (-inf) on all-masked tiles.
Cudagraph support. TritonMLASparseMetadataBuilder advertises AttentionCGSupport.UNIFORM_BATCH; flips A100 TP=8 back to FULL_AND_PIECEWISE.
MXFP4 link stubs. mxfp4_experts_quant / silu_and_mul_mxfp4_experts_quant stubs in nvfp4_quant_entry.cu. Real impls are SM10.x-only in CMake but torch_bindings.cpp references them unconditionally, which breaks source builds on SM 8.x.

Benchmarks

8×A100 SXM TP=8, lukealonso/GLM-5.1-NVFP4, single prompt, decode 200 tokens. cold = first request on a fresh prompt; warm = repeat (prefix cache hit):

context	prefix	TTFT	TPOT median	output tok/s
short (1,744 in)	cold	0.72 s	21.7 ms	39.7
short (1,744 in)	warm	0.18 s	21.8 ms	44.3
mid (65,536 in)	cold	37.0 s	24.3 ms	4.8
long (127,744 in)	cold	98.1 s	25.2 ms	1.9
long (127,744 in)	warm	0.60 s	25.3 ms	35.5

Tests

tests/kernels/attention/test_mqa_logits_triton.py — 41 cases (DeepGEMM reference + clean/dirty clean_logits + 256-byte FP8 decode).
tests/kernels/attention/test_triton_mla_sparse_kernel.py — 53 cases (split vs single-pass + auto-heuristic + short-prefill no-NaN).

Limitations (follow-up): BF16 KV cache only on SM80/SM121; VLLM_BATCH_INVARIANT should force num_kv_splits=1 — not wired.

mergify · 2026-03-29T15:24:04Z

Documentation preview: https://vllm--38476.org.readthedocs.build/en/38476/

gemini-code-assist

Code Review

This pull request introduces the TRITON_MLA_SPARSE attention backend, providing a Triton-based fallback for sparse MLA on GPUs like NVIDIA Ampere. It also refactors FP8 MQA logit fallbacks into a new module and updates the sparse attention indexer to use these PyTorch implementations when DeepGEMM is unsupported. A review comment suggests moving a module-level import to the top of the file to comply with PEP 8 guidelines.

gemini-code-assist · 2026-03-29T15:25:01Z

+        Logits tensor of shape [B * next_n, max_model_len], dtype
+        `torch.float32`.
+    """
+    from vllm.utils.math_utils import cdiv


To adhere to PEP 8 guidelines, module-level imports should be placed at the top of the file. Please move this import statement to the top of the module, for example, after the from vllm.platforms import current_platform import. This improves code readability and consistency.

References

PEP 8: E402 module level import not at top of file. Imports should be at the top of the file, just after any module comments and docstrings, and before module globals and constants. ^(link)

ZJY0516 · 2026-03-29T17:59:16Z

Add PyTorch fallback for indexer MQA logits — Created a shared mqa_logits_fallback.py module with fp8_mqa_logits_torch and fp8_paged_mqa_logits_torch implementations (extracted from existing ROCm fallback code). The sparse attention indexer now dispatches to these fallbacks when DeepGEMM is not supported.

FYI, we don’t plan to support a torch native mqa_logits implementation. I also question whether it’s necessary to support sparse MLA on SM80.

ehfd · 2026-03-30T14:55:52Z

Add PyTorch fallback for indexer MQA logits — Created a shared mqa_logits_fallback.py module with fp8_mqa_logits_torch and fp8_paged_mqa_logits_torch implementations (extracted from existing ROCm fallback code). The sparse attention indexer now dispatches to these fallbacks when DeepGEMM is not supported.

FYI, we don’t plan to support a torch native mqa_logits implementation. I also question whether it’s necessary to support sparse MLA on SM80.

@ZJY0516

Update: This PR is no longer a torch-native mqa_logits implementation and instead use Triton.

Ampere is still extremely commonly used... We need this for DS3.2 or GLM-5. TRITON_MLA_SPARSE is possible, so we need an option for that.

ehfd · 2026-03-30T15:00:04Z

PyTorch fallback for indexer logits is slower than DeepGEMM — a Triton-based variant can follow

Perhaps we can integrate Triton now?

workcode-del · 2026-03-31T09:27:32Z

Hello, after I modified the code according to your PR, the GLM5 model service started normally. However, the response speed is very slow, with only about 3 tokens being responded to per second. My device is also an A800 with 80G of storage capacity. Is this normal?
@haosdent

ehfd · 2026-03-31T09:49:20Z

@workcode-del I believe that it's only expected to be anywhere remotely fast when no PyTorch fallbacks exist.

workcode-del · 2026-04-01T01:09:47Z

@workcode-del I believe that it's only expected to be anywhere remotely fast when no PyTorch fallbacks exist.

Could you please explain how to achieve the condition where there are no PyTorch fallbacks?

mergify · 2026-04-01T05:01:13Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ehfd · 2026-04-01T05:39:01Z

@workcode-del
Further development is necessary.

ehfd · 2026-04-02T17:08:26Z

#37968, #35271, #38076, #36519

Probably containing the reason why needs-rebase was added.

ianlevesque · 2026-04-08T14:07:17Z

I was able to use this patch to run GLM-5.1 on an 8-node DGX Spark cluster. Performance is obviously not stellar (~5 t/s) but it's a great first step with compatibility.

haosdent · 2026-04-18T16:39:06Z

Thanks @ianlevesque , your 8 x DGX Spark is incredible! I just add new triton kernels to try to address the performance issue, may you test again when you are available?

ianlevesque · 2026-04-18T23:28:51Z

@haosdent retried with the new patch, it did improve to 10 t/s or so.

  Depth=0 (fresh context)

  ┌────────┬────────────────────────┬───────────────┐
  │  Test  │    Throughput (t/s)    │   TTFT (ms)   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp512  │ 425–452                │ 1,059–1,091   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp2048 │ 606–630                │ 2,892–2,971   │
  ├────────┼────────────────────────┼───────────────┤
  │ pp8192 │ 723–730                │ 10,041–10,090 │
  ├────────┼────────────────────────┼───────────────┤
  │ tg32   │ 9.91–10.29 (peak 11.0) │ —             │
  ├────────┼────────────────────────┼───────────────┤
  │ tg128  │ 9.91–10.29 (peak 11.0) │ —             │
  ├────────┼────────────────────────┼───────────────┤
  │ tg512  │ 9.85–10.23 (peak 11.0) │ —             │
  └────────┴────────────────────────┴───────────────┘

  Depth=2048 (warm context)

  ┌────────┬──────────────────┬───────────────┐
  │  Test  │ Throughput (t/s) │   TTFT (ms)   │
  ├────────┼──────────────────┼───────────────┤
  │ pp512  │ 667–697          │ 3,250–3,403   │
  ├────────┼──────────────────┼───────────────┤
  │ pp2048 │ 656–689          │ 5,282–5,618   │
  ├────────┼──────────────────┼───────────────┤
  │ pp8192 │ 717–731          │ 12,373–12,800 │
  ├────────┼──────────────────┼───────────────┤
  │ tg32   │ 9.18–9.77        │ —             │
  ├────────┼──────────────────┼───────────────┤
  │ tg128  │ 9.49–9.79        │ —             │
  ├────────┼──────────────────┼───────────────┤
  │ tg512  │ 9.51–9.70        │ —             │
  └────────┴──────────────────┴───────────────┘

songshu0427-lgtm · 2026-06-10T04:38:41Z

@ehfd @songshu0427-lgtm 我的问题已解决。

最初，我使用 claude-code-router 运行 Claude Code，GLM-5.1 由 vLLM 提供服务。当 Claude 进入规划模式时，我总是遇到以下错误：

无法解析模型的工具调用（重试也失败了）。

我还发现了一个似乎相关的问题： #42400
后来，我改用agentfw运行 Claude Code ，编码模型仍然使用 vLLM 提供的 GLM-5.1 模型。切换后，一切正常，不再出现任何工具调用错误。

设置：
npm install -g @openguardrails/agentfw
agentfw claude
AgentFW： https://github.com/openguardrails/agentfw

下面附上一些截图供您参考。

将 vllm-glm5.1 配置为文本模型（我还将 Qwen3.6 配置为视觉模型，以便为 glm5.1 提供“眼睛”）。

2. 跑步
agentfw claude
3. 检查运行刀具调用和刀具调用结果

我的问题也定位到了
I

zt1024 · 2026-06-12T02:26:54Z

"Does this patch support turboquant-vllm? Since turboquant-vllm enables KV cache compression, I'm interested in its compatibility. I actually opened this issue: https://github.com/varjoranta/turboquant-vllm/issues/56. Could you please help me look into it?" @haosdent

ehfd · 2026-06-17T15:35:13Z

@haosdent Can we add support for GLM-5.2-NVFP4? It seems that it has a specific shape that needs to be added manually (576 MLA heads?).

https://huggingface.co/lukealonso/GLM-5.2-NVFP4

ehfd · 2026-06-17T17:02:35Z

I think merging the main branch might solve it.

ghostplant · 2026-06-17T17:54:22Z

env: A800-SXM + haosdent/vllm-nightly:fix-38006-6[https://hub.docker.com/r/haosdent/vllm-nightly/tags]
deploy:

sudo docker run -d \
    --name glm5_1_vllm \
    --gpus all \
    --network host \
    --ipc host \
    --shm-size 32g \
    --security-opt label=disable \
    --restart no \
    -v models:models:ro \
    haosdent/vllm-nightly:fix-38006-6 \
    models/cyankiwi/GLM-5.1-AWQ-4bit \
    --served-model-name glm5.1 \
    --host 0.0.0.0 \
    --port 12345 \
    --max-model-len 65536 \
    --trust-remote-code \
    --enable-expert-parallel \
    --tensor-parallel-size 8 \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --chat-template-content-format=string

bench:

vllm bench serve \
    --backend openai-chat \
    --base-url http://127.0.0.1:12345 \
    --endpoint /v1/chat/completions \
    --model glm5.1 \
    --tokenizer models/cyankiwi/GLM-5.1-AWQ-4bit \
    --trust-remote-code \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 128 \
    --max-concurrency 32 \
    --request-rate inf \
    --seed 42 \
    --save-result \
    --result-dir workspace_2026_4/deploy/logs \
    --result-filename bench_glm5_1_p128_c32_20260521-092828.json

result: c=32:

============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  84.09     
Total input tokens:                      27829     
Total generated tokens:                  30153     
Request throughput (req/s):              1.52      
Output token throughput (tok/s):         358.57    
Peak output token throughput (tok/s):    512.00    
Peak concurrent requests:                36.00     
Total token throughput (tok/s):          689.50    
---------------Time to First Token----------------
Mean TTFT (ms):                          912.35    
Median TTFT (ms):                        320.07    
P99 TTFT (ms):                           2867.16   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.80     
Median TPOT (ms):                        72.76     
P99 TPOT (ms):                           95.51     
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.81     
Median ITL (ms):                         64.35     
P99 ITL (ms):                            296.68    
==================================================

c=96:

============ Serving Benchmark Result ============
Successful requests:                     384       
Failed requests:                         0         
Maximum request concurrency:             96        
Benchmark duration (s):                  119.69    
Total input tokens:                      83110     
Total generated tokens:                  80182     
Request throughput (req/s):              3.21      
Output token throughput (tok/s):         669.92    
Peak output token throughput (tok/s):    1056.00   
Peak concurrent requests:                105.00    
Total token throughput (tok/s):          1364.30   
---------------Time to First Token----------------
Mean TTFT (ms):                          570.01    
Median TTFT (ms):                        455.85    
P99 TTFT (ms):                           1251.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.69    
Median TPOT (ms):                        120.11    
P99 TPOT (ms):                           189.35    
---------------Inter-token Latency----------------
Mean ITL (ms):                           114.45    
Median ITL (ms):                         95.28     
P99 ITL (ms):                            353.53    
==================================================

I have reproduced these results, and they are credible.

Does single machine A100x8 SMX work for 96 concurrency?

Kasempiternal · 2026-06-18T07:54:57Z

Thanks for this backend, @haosdent — it's the missing piece for sparse-MLA on Ampere. We used it to bring up GLM-5.2 (glm_moe_dsa) on 32×A100 (sm80, BF16, TP=4×PP=8) as a dataset-generation teacher, and it works: healthy server, CUDA-graph optimized, coherent output, ~264.6 tok/s aggregate (64-concurrent). Sharing what GLM-5.2 needed on top of this PR, in case it's useful for broader DSA support.

1. GLM-5.2 needs IndexShare (this PR builds the indexer per-layer)

GLM-5.2 uses a per-layer config.indexer_types pattern (full,full,full,shared,shared,shared,full,...). shared layers don't run their own indexer — they reuse the previous full layer's top-k indices and ship no indexer weights in the checkpoint. Reusing the DeepSeek-V3.2 path (one indexer per layer) makes vLLM (a) fail to load (ValueError: ...indexer.k_norm not initialized on the shared layers) and (b) be numerically wrong even if forced (uninitialized indexers). (This mirrors transformers' own GLM-5 skip-topk sharing, huggingface/transformers#46372.)

The good news: the machinery already exists here — mla.py gates the indexer on not skip_topk, and topk_indices_buffer is allocated once and shared across layers. We just drove skip_topk from indexer_types. In DeepseekV2MLAAttention.__init__ (deepseek_v2.py), right after the existing use_index_cache block:

# GLM-5.2 IndexShare: layers with indexer_types[i] == "shared" reuse the previous
# "full" layer's top-k indices (via the shared topk_indices_buffer) instead of
# running their own indexer. Matches modeling_glm_moe_dsa.py:
#   self.skip_topk = config.indexer_types[layer_idx] == "shared"
indexer_types = getattr(config, "indexer_types", None)
if indexer_types is not None:
    _layer_idx = extract_layer_index(prefix)
    if 0 <= _layer_idx < len(indexer_types) and indexer_types[_layer_idx] == "shared":
        _skip_topk = True

extract_layer_index is already imported. Layers 0-2 (full) write the buffer; 3-5 (shared) reuse layer 2; layer 6 (full) overwrites; 7-9 reuse layer 6 — bit-identical to modeling_glm_moe_dsa.py. The shared layers' indexer submodules are still constructed (for buffer access) but never called, so their checkpoint-absent weights stay unused (we ran with --model-loader-extra-config.enable_weights_track=false; a cleaner variant skips building the submodule for shared layers). We verified shared-layer output equals reusing the prior full layer's indices on a dummy-weights model with two shared groups.

2. Heads-up: a `transformers` config-alias bug (already fixed on `main`)

Older transformers releases (≤5.8.1) alias attribute_map = {"head_dim": "qk_rope_head_dim"}, and GLM-5.2's config.json ships both head_dim: 192 and qk_rope_head_dim: 64, so head_dim (192, == qk_nope) clobbers the true qk_rope_head_dim (64) — inflating the MLA head 576→704 and breaking weight load (fused_qkv_a_proj narrow 704 vs 576). This is already fixed on transformers main (huggingface/transformers#46338 removed the alias and sets head_dim = qk_rope_head_dim in __post_init__), so it only affects pinned older releases — there, --hf-overrides '{"qk_rope_head_dim": 64}' is the workaround. Flagging only because GLM-5.2 users on a pinned transformers will still hit it.

3. CUDA graphs on sm80 (Triton indexer fallback)

vllm::sparse_attn_indexer is in splitting_ops, but cudagraph_mode=FULL_AND_PIECEWISE captures the whole decode and pulls the autotuned Triton indexer into the capture stream → CUDA error: operation not permitted when stream is capturing. Serving with --compilation-config.cudagraph_mode=PIECEWISE (indexer runs eager at the split) + --disable-custom-all-reduce works and keeps graph acceleration for the dense/MoE path. Might be worth auto-downgrading FULL→PIECEWISE when a split op is non-capturable, or a docs note for non-Hopper DSA.

I've opened the IndexShare change above as a small PR onto fix-38006 (haosdent#7), DCO-signed — happy to restructure however you'd like it folded in, and I can add a forward-equivalence test for the shared-layer path. Thanks again for driving this.

ehfd · 2026-06-18T09:44:05Z

@Kasempiternal Thank you so much for the contribution!

timinar · 2026-06-20T19:30:48Z

Validation run on our 8×A100 box; investigation + write-up done with Claude Code.

Independent confirmation: GLM-5.2 (glm_moe_dsa / DeepSeek Sparse Attention) running on 8x A100-80GB thanks to this PR. Huge thanks to @haosdent — the TRITON_MLA_SPARSE backend plus the bf16/Triton mqa_logits indexer fallback are exactly what unblocks sm_80, where stock vLLM dies with "FlashMLA Sparse is only supported on Hopper and Blackwell devices" (the issue here, plus #35021 / #41525).

Setup

Model: cyankiwi/GLM-5.2-AWQ-INT4 (753B MoE, compressed-tensors WNA16 INT4, indexer kept bf16, ~411 GiB on disk).
vLLM base: main d272418 (0.23.1rc1.dev209+gd272418f4); this PR cherry-picks cleanly as a single commit on top.
One conflict in sparse_attn_indexer.py (main since added an XPU branch): resolved into a three-way is_xpu() / use_deep_gemm / Triton-fallback dispatch, and took the PR's logger.warning_once over main's hard RuntimeError in __init__ so sm_80 routes through Triton instead of aborting.
I dropped the PR's lone .cu file (nvfp4_quant_entry.cu, an SM100 stub) — on this base it is never referenced at link time on an sm_80 build, which keeps the changeset python-only. That meant no CUDA recompile: I installed the patched tree editable with VLLM_USE_PRECOMPILED=1 over the d272418 cu130 nightly wheel. vllm --version shows ...+gd272418f4.precompiled.
8x A100-SXM4-80GB, TP=8, bf16 KV cache (--kv-cache-dtype auto, not fp8), --no-async-scheduling, --gpu-memory-utilization 0.90, --max-model-len 32768. Weights load at 54.45 GiB/rank; GPU KV cache = 151,040 tokens.

Proof it routes correctly (server startup log):

WARNING [sparse_attn_indexer.py:482] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer.
INFO    [cuda.py:457] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE'].

is_deep_gemm_supported() returns False on A100, so the indexer takes the Triton path as intended.

Throughput (512-tok prompt / 128-tok greedy gen, ignore_eos):

Concurrency	Aggregate decode tok/s	Total tok/s	Mean TTFT (s)
1	56.4	281.9	0.069
4	185.6	928.1	0.156
8	334.8	1674.1	0.226
16	548.0	2740.1	0.388
32	625.4	3127.1	0.562

~56 tok/s single-stream decode, plateauing near 625 tok/s aggregate at 32-way concurrency, TTFT sub-second throughout. For reference, llama.cpp (GGUF UD-Q4_K_XL) on the same 8x A100 does ~24.5 tok/s single-stream and saturates around 70 tok/s aggregate — so this path is ~2.3x single-stream and ~9x aggregate.

Output is coherent (verified on arithmetic, Rayleigh scattering, and a "who wrote Hamlet" prompt that closed its <think> block and returned a clean final sentence). Note GLM-5.2 is a reasoning model and the default max reasoning effort can eat the whole token budget, so size max_tokens accordingly.

Full write-up (weights -> cherry-pick + conflict resolution -> precompiled overlay install -> serve -> benchmarks): https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2.

ghostplant · 2026-06-20T20:51:54Z

Validation run on our 8×A100 box; investigation + write-up done with Claude Code.

Independent confirmation: GLM-5.2 (glm_moe_dsa / DeepSeek Sparse Attention) running on 8x A100-80GB thanks to this PR. Huge thanks to @haosdent — the TRITON_MLA_SPARSE backend plus the bf16/Triton mqa_logits indexer fallback are exactly what unblocks sm_80, where stock vLLM dies with "FlashMLA Sparse is only supported on Hopper and Blackwell devices" (the issue here, plus #35021 / #41525).

Setup

Model: cyankiwi/GLM-5.2-AWQ-INT4 (753B MoE, compressed-tensors WNA16 INT4, indexer kept bf16, ~411 GiB on disk).

vLLM base: main d272418 (0.23.1rc1.dev209+gd272418f4); this PR cherry-picks cleanly as a single commit on top.

One conflict in sparse_attn_indexer.py (main since added an XPU branch): resolved into a three-way is_xpu() / use_deep_gemm / Triton-fallback dispatch, and took the PR's logger.warning_once over main's hard RuntimeError in __init__ so sm_80 routes through Triton instead of aborting.

I dropped the PR's lone .cu file (nvfp4_quant_entry.cu, an SM100 stub) — on this base it is never referenced at link time on an sm_80 build, which keeps the changeset python-only. That meant no CUDA recompile: I installed the patched tree editable with VLLM_USE_PRECOMPILED=1 over the d272418 cu130 nightly wheel. vllm --version shows ...+gd272418f4.precompiled.

8x A100-SXM4-80GB, TP=8, bf16 KV cache (--kv-cache-dtype auto, not fp8), --no-async-scheduling, --gpu-memory-utilization 0.90, --max-model-len 32768. Weights load at 54.45 GiB/rank; GPU KV cache = 151,040 tokens.

Proof it routes correctly (server startup log):
WARNING [sparse_attn_indexer.py:482] DeepGEMM not supported on this platform; using Triton fallback for sparse attention indexer.
INFO    [cuda.py:457] Using TRITON_MLA_SPARSE attention backend out of potential backends: ['TRITON_MLA_SPARSE'].
is_deep_gemm_supported() returns False on A100, so the indexer takes the Triton path as intended.

Throughput (512-tok prompt / 128-tok greedy gen, ignore_eos):

Concurrency Aggregate decode tok/s Total tok/s Mean TTFT (s)
1 56.4 281.9 0.069
4 185.6 928.1 0.156
8 334.8 1674.1 0.226
16 548.0 2740.1 0.388
32 625.4 3127.1 0.562
~56 tok/s single-stream decode, plateauing near 625 tok/s aggregate at 32-way concurrency, TTFT sub-second throughout. For reference, llama.cpp (GGUF UD-Q4_K_XL) on the same 8x A100 does ~24.5 tok/s single-stream and saturates around 70 tok/s aggregate — so this path is ~2.3x single-stream and ~9x aggregate.

Output is coherent (verified on arithmetic, Rayleigh scattering, and a "who wrote Hamlet" prompt that closed its <think> block and returned a clean final sentence). Note GLM-5.2 is a reasoning model and the default max reasoning effort can eat the whole token budget, so size max_tokens accordingly.

Full write-up (weights -> cherry-pick + conflict resolution -> precompiled overlay install -> serve -> benchmarks): https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2.

Is it meaningful for 32 concurrency for SM_8x? The 1M-context does not likely work under such concurrency size.

timinar · 2026-06-20T22:42:05Z

Is it meaningful for 32 concurrency for SM_8x? The 1M-context does not likely work under such concurrency size.

Not really. It was max 32k context, I believe. So too short for multi-turn agentic tasks, but could be useful for some other types of local work.

ehfd · 2026-06-21T02:28:26Z

Actually, pipeline parallelism will multiply the context to the number of GPUs, due to GLM-5.x being MLA.

ghostplant · 2026-06-21T03:07:57Z

Is it meaningful for 32 concurrency for SM_8x? The 1M-context does not likely work under such concurrency size.

Not really. It was max 32k context, I believe. So too short for multi-turn agentic tasks, but could be useful for some other types of local work.

I realized this topic assumes INT4 (Q4) rather than NVFP4. Based on any 4-bit solution, 1M context window on an 8xA100 SXM node is still possible. However, vLLM needs to additionally implement context parallelism rather than relying on pipeline parallelism or tensor parallelism. Don't be upset and keep up more enhancements.

AIME benchmark requires at least 160K context to complete reasoning and 32K isn't long enough. My current trouble falls into how to reproduce their 99% scores on AIME-2026.

ehfd · 2026-06-21T09:19:03Z

In MLA models, the number of KV cache heads is considered 1.
This duplicates the KV cache head in tensor parallelism. Decode context parallelism is used to de-duplicate the KV cache heads. However, it is not yet implemented for Sparse MLA; only in MLA.

But in pipeline parallelism, there is no KV cache head duplication, so the KV cache gets multiplied by the number of GPUs.

ghostplant · 2026-06-21T23:08:35Z

In MLA models, the number of KV cache heads is considered 1. This duplicates the KV cache head in tensor parallelism. Decode context parallelism is used to de-duplicate the KV cache heads. However, it is not yet implemented for Sparse MLA; only in MLA.

But in pipeline parallelism, there is no KV cache head duplication, so the KV cache gets multiplied by the number of GPUs.

Pipeline parallel supports 1M context but suffers from low cross-GPU utilization. Tutel images just supported paged Context-Sparse MLA recently: https://hub.docker.com/r/tutelgroup/deepseek-671b

ehfd · 2026-06-22T05:14:47Z

Is there a way to use proper tool calling and reasoning parsers with Tutel?

ghostplant · 2026-06-22T05:43:52Z

Is there a way to use proper tool calling and reasoning parsers with Tutel?

Claude Code connecting to Tutel will directly trigger GLM's tool calling.
But OpenAI chat completion interfaces cannot use tools at this point.

Without tool calls, I get only 90% for AIME-26 over GLM5.2-NVFP4, but it was >96% for GLM-5-NVFP4 and GLM-5.1-NVFP4. I don't know if NVFP4 dislike GLM-5.2 or not, so I wonder what the score would be if evaluating AIME-26 over Q4?

Maybe officially claimed 99% for AIME-26 is never reproducible, however, I don't have enough-strong GPU environment (Hxx/Bxx) to evaluate GLM-5.2-BF16 or GLM-5.2-FP8.

songshu0427-lgtm · 2026-06-23T02:31:05Z

tool call parse fix
glm47_stateful_tool_parser_remote_image.patch

MatthewBonanni · 2026-06-24T20:06:35Z

                AttentionBackendEnum.FLASHMLA,
                AttentionBackendEnum.FLASHINFER_MLA,
                AttentionBackendEnum.TRITON_MLA,
+                AttentionBackendEnum.TRITON_MLA_SPARSE,


This backend shouldn't be prioritized over FlashMLA sparse, this will hurt SM90 performance. Please swap these two.

halexan · 2026-06-25T02:05:23Z

Great job!

halexan · 2026-06-26T06:51:35Z

Hi @timinar

How about cyankiwi/GLM-5.2-AWQ-INT4 accuracy recovery?

Is it as smart as the non-quantized version of the model?

ghostplant · 2026-06-26T11:47:10Z

Hi @timinar

How about cyankiwi/GLM-5.2-AWQ-INT4 accuracy recovery?

Is it as smart as the non-quantized version of the model?

I tested NVFP4 instead of INT4 version, without tool use, aime-2026 gets 90-93% only. Enabling tools improves it to > 99%.

RefalMachine · 2026-06-27T09:50:51Z

Has anyone encountered the following issue when trying to deploy GLM-5.2 on 8x A100 GPUs with Nvidia 535 drivers (cu129)?

I am following this guide to enable the Triton fallback sparse MLA path: https://gist.github.com/timinar/c8d2eca4e2ea7d11db57a1e6e62d06a2
with vllm/vllm-openai:cu129-nightly-a346d589f5932d4234bf5bf8718f10e26d187021 container as a base
During the CUDA graph profiling stage (profile_cudagraph_memory), the engine crashes with the following error:
RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
This originates from vllm/v1/attention/ops/mqa_logits_triton.py (inside the fp8_paged_mqa_logits_triton execution path).
To bypass this, I had to implement a manual warm-up sequence in the init method of SparseAttentionIndexer:
warmup_fp8_mqa_logits_triton(num_heads, self.head_dim, device)
warmup_fp8_paged_mqa_logits_triton(num_heads, self.head_dim, block_size, device)
Additionally, I patched the @triton.jit decorators for both prefill and decode kernels in mqa_logits_triton.py to include do_not_specialize for all tensor stride arguments (to prevent Triton JIT re-compilation during graph profiling under different batch sizes):
@triton.jit(do_not_specialize=[
"stride_q_b", "stride_q_n", "stride_q_h", "stride_q_d",
"stride_kvf_block", "stride_kvf_s", "stride_kvf_d",
...
])
Now the model successfully boots up, but only if I increase --gpu-memory-utilization to 0.95. With the reference 0.90 (from the gist), it raises ValueError: No available memory for the cache blocks even for a small 32768 context window.

This workaround feels too hacky and diverges significantly from the reference behavior in the gist (where things reportedly work out-of-the-box on 0.90 memory limit without Triton capturing crashes).

Has anyone successfully run this without such JIT-warmup patches on the newer vLLM V1 engine? Any insights on why the memory footprint/graph profiling behavior differs so much from the reference would be greatly appreciated!

UPD: The problem was with the container/main versions. Here's the working Dockerfile.

# Nightly image from June 22, 2026 — functionally identical to commit 435f82d61
# (the commit from the rumor). Diff between 9037498c2 and 435f82d61 is 1 test-only file.
# This is the exact window when @timinar's guide was written (June 20-24).
FROM vllm/vllm-openai:cu129-nightly-9037498c22891e55b594f567fb91d9b4efbf3e99

USER root

# 1. Install minimal git
RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

WORKDIR /vllm-workspace

# 2. Clone the repo, pin commit 435f82d61 (the one from the rumor), and cherry-pick PR #38476
#    PR #38476 adds TRITON_MLA_SPARSE attention backend for SM80/SM121 sparse MLA support.
#    On this commit, the only conflicts are in sparse_attn_indexer.py and platforms/cuda.py.
#    Unlike newer nightlies (June 26+), no CUDA graph capture hacks are needed here because
#    the Triton MoE regression (#46142/#46254) is not yet in main.
RUN git config --global user.email "auto-resolver@vllm.ai" && \
    git config --global user.name "Auto Resolver" && \
    git clone https://github.com/vllm-project/vllm && \
    cd vllm && \
    git checkout 435f82d61a1eddb84854ca59a008a8e4d97ab439 && \
    git fetch origin pull/38476/head:pr-38476 && \
    (git cherry-pick pr-38476 || echo "Merge conflict expected, proceeding to patch...")

WORKDIR /vllm-workspace/vllm

# 3.1. Patch sparse_attn_indexer.py — Conflict 1: fp8_mqa_logits
#      Resolve to a three-way dispatch: XPU (from main) → DeepGEMM (from PR) → Triton fallback (from PR).
#      On SM80/A100: is_deep_gemm_supported()=False, so Triton fallback is used automatically.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+if current_platform\.is_xpu\(\):.*?fp8_mqa_logits_triton.*?>>>>>>> \w+.*?[^\n]*"; rep="            if current_platform.is_xpu():\n                if q_scale_slice is not None:\n                    raise RuntimeError(\"XPU fp8_mqa_logits does not support FP4 Q\")\n                logits = torch.ops.vllm.xpu_fp8_mqa_logits(\n                    q_slice_cast,\n                    k_quant_cast,\n                    k_scale_cast,\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                )\n            elif use_deep_gemm:\n                logits = fp8_fp4_mqa_logits(\n                    (q_slice_cast, q_scale_slice),\n                    (k_quant_cast, k_scale_cast),\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                    clean_logits=False,\n                )\n            else:\n                logits = fp8_mqa_logits_triton(\n                    q_slice_cast,\n                    (k_quant_cast, k_scale_cast),\n                    weights[chunk.token_start : chunk.token_end],\n                    chunk.cu_seqlen_ks,\n                    chunk.cu_seqlen_ke,\n                    clean_logits=False,\n                )"; c=re.sub(pat, rep, c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 3.2. Patch sparse_attn_indexer.py — Conflict 2: fp8_paged_mqa_logits
#      Same three-way dispatch pattern: XPU → DeepGEMM → Triton fallback.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+if current_platform\.is_xpu\(\):.*?fp8_paged_mqa_logits_triton.*?>>>>>>> \w+.*?[^\n]*"; rep="        if current_platform.is_xpu():\n            if padded_q_scale is not None:\n                raise RuntimeError(\"XPU fp8_paged_mqa_logits does not support FP4 Q\")\n            seq_lens_xpu = (\n                seq_lens[:, -1].contiguous() if seq_lens.ndim == 2 else seq_lens\n            )\n            logits = torch.ops.vllm.xpu_fp8_paged_mqa_logits(\n                padded_q_quant_cast,\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens_xpu,\n                decode_metadata.block_table,\n                decode_metadata.schedule_metadata,\n                max_model_len,\n            )\n        elif use_deep_gemm:\n            logits = fp8_fp4_paged_mqa_logits(\n                (padded_q_quant_cast, padded_q_scale),\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens,\n                decode_metadata.block_table,\n                decode_metadata.schedule_metadata,\n                max_model_len=max_model_len,\n                clean_logits=False,\n            )\n        else:\n            # SM80/SM121 Triton fallback.\n            active_max_model_len = attn_metadata_narrowed.max_seq_len\n            logits = fp8_paged_mqa_logits_triton(\n                padded_q_quant_cast,\n                kv_cache,\n                weights[:num_padded_tokens],\n                seq_lens,\n                decode_metadata.block_table,\n                max_model_len=active_max_model_len,\n                clean_logits=False,\n            )"; c=re.sub(pat, rep, c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 3.3. Patch sparse_attn_indexer.py — Remaining conflicts: take PR version (theirs)
#      This resolves the __init__ conflict: replaces main's hard RuntimeError (when DeepGEMM
#      is missing) with the PR's warn-and-fallback on `not is_deep_gemm_supported()`.
#      No JIT warmup hack needed on this base image.
RUN python3 -c 'import re; p="vllm/model_executor/layers/sparse_attn_indexer.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+.*?=======\s+(.*?)>>>>>>> \w+.*?[^\n]*"; c=re.sub(pat, r"\1", c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 4. Patch platforms/cuda.py — if conflict, take PR version (theirs).
#    Adds TRITON_MLA_SPARSE to the supported backends list for SM80/SM121.
RUN python3 -c 'import re; p="vllm/platforms/cuda.py"; f=open(p,"r"); c=f.read(); f.close(); pat=r"<<<<<<< HEAD\s+.*?=======\s+(.*?)>>>>>>> \w+.*?[^\n]*"; c=re.sub(pat, r"\1", c, flags=re.DOTALL); f=open(p,"w"); f.write(c); f.close()'

# 5. Revert nvfp4_quant_entry.cu to HEAD to avoid rebuilding _C_stable_libtorch.abi3.so
#    (the PR's stub would break the .so link on SM8x — see @timinar's guide)
RUN git checkout HEAD -- csrc/libtorch_stable/quantization/fp4/nvfp4_quant_entry.cu

# 6. Verify no conflict markers remain, then finalize the cherry-pick
RUN python3 -c 'import sys; files=["vllm/model_executor/layers/sparse_attn_indexer.py","vllm/platforms/cuda.py"]; [sys.exit(1) for f in files for m in ["<<<<<<< HEAD","=======",">>>>>>>"] if m in open(f).read()]'; \
    git add vllm/model_executor/layers/sparse_attn_indexer.py && \
    git add vllm/platforms/cuda.py && \
    git add csrc/libtorch_stable/quantization/fp4/nvfp4_quant_entry.cu && \
    (git add docs/design/attention_backends.md 2>/dev/null || true) && \
    git -c core.editor=true cherry-pick --continue

# 7. Overlay the patched Python code onto the system vLLM installation
RUN cp -r /vllm-workspace/vllm/vllm/* /usr/local/lib/python3.12/dist-packages/vllm/

# Clean up temp repo
RUN rm -rf /vllm-workspace/vllm

# The backend auto-selects on SM80, but this env is harmless and matches @timinar's guide
ENV VLLM_ATTENTION_BACKEND=TRITON_MLA_SPARSE

# vLLM entrypoint
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

ehfd · 2026-06-28T09:33:12Z

@RefalMachine Thank you so much for sharing the Dockerfile!

RefalMachine · 2026-06-28T14:25:43Z

@RefalMachine Thank you so much for sharing the Dockerfile!

The current code successfully deploys the model for me, however, after N minutes of operation under high load, NaN values randomly start appearing in the logits (it’s unclear where or why), resulting in the output degenerating into continuous '!!!' tokens. This can only be resolved by restarting. Unfortunately, I don't know how to fix this issue yet. If anyone has encountered this before, I would be grateful for any leads.

Ph0enix89 · 2026-06-28T14:49:36Z

I am able to run it on 40 GB A100 GPUs. Using the latest main with the PR on top of it. Details below. There are two main issues:

Tool calling doesn't work. Queries via curl work fine. However whenever I try to use it in Hermes eventually there's a timeout. In the logs there is Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered. Has anyone run into it? Any recommendations? I tried launching with --chat-template-content-format string but that doesn't help. There are some reports that tool calling works without pipeline parallelism but these come from AMD ROCM systems so it's not clear if it's relevant.
KV cache quantization is not available. It's been mentioned that the problem is that fp8e4nv is hardcoded and not supported. AI suggested that the code should be adjusted to use 'fp8e5 for SM80. However it's not clear how much effort it would take to implement. In the current main there is also commit b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel") that also uses hardcoded fp8e4nv for some optimizations. It has to be reverted, otherwise vllm fails during startup on A100 (when trying to load GLM-5.2). Since the A100s I have access to have only 40 GBs doubling KV cache would be very handy, that's probably the biggest throughput bottleneck.

As for the steps I had to combine a bunch of scattered pieces to make it work on my system:

Checkout main
Revert b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel")
Follow steps in this guide to modify the code
Add this patch
Install the patched version with VLLM_USE_PRECOMPILED=1 uv pip install -e .
Install runai-streamer with pip3 install vllm[runai]. Not strictly necessary but without it loading the model takes forever.
Finally launch it with:

vllm serve "$MODEL" \
        --served-model-name glm-5.2 \
        --kv-cache-dtype auto \
        --tensor-parallel-size ${TP_SIZE} \
        --pipeline-parallel-size ${PP_SIZE} \
        --distributed-executor-backend mp \
        --nnodes ${NNODES} \
        --node-rank "${PROCID}" \
        --master-addr "$HEAD_HOST_IP" \
        --no-async-scheduling \
        --tool-call-parser glm47 \
        --enable-auto-tool-choice \
        --reasoning-parser glm45 \
        --gpu-memory-utilization 0.93 \
        --load-format runai_streamer \
        --compilation-config.cudagraph_mode=PIECEWISE \
        --disable-custom-all-reduce \
        --chat-template-content-format string \
        --block-size 128 \
        ${HEADLESS_FLAG}
7) It took a while to figure out that `--block-size 128` is necessary. Without it vllm fails to start with `RuntimeError: Worker failed with error 'No common block size for 16. ', please check the stack trace above for the root cause`.

These are my steps that result in a working setup without tool calling support. Perhaps it helps some people. Any tips are welcome.

ghostplant · 2026-06-28T16:30:34Z

I am able to run it on 40 GB A100 GPUs. Using the latest main with the PR on top of it. Details below. There are two main issues:

Tool calling doesn't work. Queries via curl work fine. However whenever I try to use it in Hermes eventually there's a timeout. In the logs there is Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered. Has anyone run into it? Any recommendations? I tried launching with --chat-template-content-format string but that doesn't help. There are some reports that tool calling works without pipeline parallelism but these come from AMD ROCM systems so it's not clear if it's relevant.

KV cache quantization is not available. It's been mentioned that the problem is that fp8e4nv is hardcoded and not supported. AI suggested that the code should be adjusted to use 'fp8e5 for SM80. However it's not clear how much effort it would take to implement. In the current main there is also commit b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel") that also uses hardcoded fp8e4nv for some optimizations. It has to be reverted, otherwise vllm fails during startup on A100 (when trying to load GLM-5.2). Since the A100s I have access to have only 40 GBs doubling KV cache would be very handy, that's probably the biggest throughput bottleneck.

As for the steps I had to combine a bunch of scattered pieces to make it work on my system:

Checkout main

Revert b588f66dc ("[GLM5.2 Perf] fused_indexer_q_rope_quant triton kernel")

Follow steps in this guide to modify the code

Add this patch

Install the patched version with VLLM_USE_PRECOMPILED=1 uv pip install -e .

Install runai-streamer with pip3 install vllm[runai]. Not strictly necessary but without it loading the model takes forever.

Finally launch it with:
vllm serve "$MODEL" \
        --served-model-name glm-5.2 \
        --kv-cache-dtype auto \
        --tensor-parallel-size ${TP_SIZE} \
        --pipeline-parallel-size ${PP_SIZE} \
        --distributed-executor-backend mp \
        --nnodes ${NNODES} \
        --node-rank "${PROCID}" \
        --master-addr "$HEAD_HOST_IP" \
        --no-async-scheduling \
        --tool-call-parser glm47 \
        --enable-auto-tool-choice \
        --reasoning-parser glm45 \
        --gpu-memory-utilization 0.93 \
        --load-format runai_streamer \
        --compilation-config.cudagraph_mode=PIECEWISE \
        --disable-custom-all-reduce \
        --chat-template-content-format string \
        --block-size 128 \
        ${HEADLESS_FLAG}
7) It took a while to figure out that `--block-size 128` is necessary. Without it vllm fails to start with `RuntimeError: Worker failed with error 'No common block size for 16. ', please check the stack trace above for the root cause`.

These are my steps that result in a working setup without tool calling support. Perhaps it helps some people. Any tips are welcome.

Do you run this with 16 A100s?

Ph0enix89 · 2026-06-28T19:49:12Z

Do you run this with 16 A100s?

16 nodes with 4 GPUs each. Not the most efficient config but it is what it is.

ehfd · 2026-06-29T03:15:28Z

#43477 was merged and a maintainer said on Slack that they were open to merging this one after that PR.

Integration of GLM-5.2 and DeepSeek V4 support, as well as a rebase is desired.

@haosdent

lzf-tech · 2026-06-30T02:04:55Z

#43477 was merged and a maintainer said on Slack that they were open to merging this one after that PR.

Integration of GLM-5.2 and DeepSeek V4 support, as well as a rebase is desired.

@haosdent

It seems @haosdent hasn’t been active on this project for some time. Really looking forward to the GLM-5.2 and DeepSeek V4 support once the rebase is done.

mergify Bot added documentation Improvements or additions to documentation nvidia rocm Related to AMD ROCm v1 labels Mar 29, 2026

github-project-automation Bot added this to AMD and NVIDIA Mar 29, 2026

github-project-automation Bot moved this to Todo in AMD Mar 29, 2026

gemini-code-assist Bot reviewed Mar 29, 2026

View reviewed changes

mergify Bot added the needs-rebase label Apr 1, 2026

ehfd mentioned this pull request Apr 18, 2026

[bugfix]: support deepseek sparse attention on unsupported targets #39594

Open

5 tasks

haosdent force-pushed the fix-38006 branch from ba113ec to af801ac Compare April 18, 2026 15:50

mergify Bot removed the needs-rebase label Apr 18, 2026

haosdent force-pushed the fix-38006 branch 3 times, most recently from 17f68af to 6760f0c Compare April 18, 2026 16:09

mergify Bot added the intel-gpu Related to Intel GPU label Apr 18, 2026

haosdent force-pushed the fix-38006 branch from 6760f0c to 696f4ef Compare April 18, 2026 16:24

Kasempiternal mentioned this pull request Jun 18, 2026

[Model] GLM-5.2 (glm_moe_dsa) IndexShare: shared indexer layers reuse previous top-k haosdent/vllm#7

Open

MatthewBonanni reviewed Jun 24, 2026

View reviewed changes

Uh oh!

Uh oh!

Conversation

haosdent commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Benchmarks

Tests

Uh oh!

mergify Bot commented Mar 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

ZJY0516 commented Mar 29, 2026

Uh oh!

ehfd commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ehfd commented Mar 30, 2026

Uh oh!

workcode-del commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ehfd commented Mar 31, 2026

Uh oh!

workcode-del commented Apr 1, 2026

Uh oh!

mergify Bot commented Apr 1, 2026

Uh oh!

ehfd commented Apr 1, 2026

Uh oh!

ehfd commented Apr 2, 2026

Uh oh!

ianlevesque commented Apr 8, 2026

Uh oh!

haosdent commented Apr 18, 2026

Uh oh!

ianlevesque commented Apr 18, 2026

Uh oh!

songshu0427-lgtm commented Jun 10, 2026

Uh oh!

zt1024 commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ehfd commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ehfd commented Jun 17, 2026

Uh oh!

ghostplant commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kasempiternal commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. GLM-5.2 needs IndexShare (this PR builds the indexer per-layer)

2. Heads-up: a transformers config-alias bug (already fixed on main)

3. CUDA graphs on sm80 (Triton indexer fallback)

Uh oh!

ehfd commented Jun 18, 2026

Uh oh!

timinar commented Jun 20, 2026

Uh oh!

ghostplant commented Jun 20, 2026

Uh oh!

timinar commented Jun 20, 2026

Uh oh!

ehfd commented Jun 21, 2026

Uh oh!

ghostplant commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ehfd commented Jun 21, 2026

Uh oh!

ghostplant commented Jun 21, 2026

haosdent commented Mar 29, 2026 •

edited

Loading

ehfd commented Mar 30, 2026 •

edited

Loading

workcode-del commented Mar 31, 2026 •

edited

Loading

zt1024 commented Jun 12, 2026 •

edited

Loading

ehfd commented Jun 17, 2026 •

edited

Loading

ghostplant commented Jun 17, 2026 •

edited

Loading

Kasempiternal commented Jun 18, 2026 •

edited

Loading

2. Heads-up: a `transformers` config-alias bug (already fixed on `main`)

ghostplant commented Jun 21, 2026 •

edited

Loading

ghostplant commented Jun 22, 2026 •

edited

Loading

halexan commented Jun 26, 2026 •

edited

Loading

RefalMachine commented Jun 27, 2026 •

edited

Loading

ehfd commented Jun 29, 2026 •

edited

Loading