[bugfix]: support deepseek sparse attention on unsupported targets by koush · Pull Request #39594 · vllm-project/vllm

koush · 2026-04-12T01:08:19Z

This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA.

Purpose

Blackwell RTX 6000 Pro is sm120, which FlashMLA does not support, so none of the models that use DSA (DeepSeek Sparse Attention) will run. This includes DeepSeek Exp, 3.2, and GLM 5+.
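
As a rough illustration of the gating described here, the sketch below reads `VLLM_MLA_FORCE_DENSE` and decides whether the sparse path would be used. The helper names (`force_dense_requested`, `should_use_sparse`) and the set of accepted truthy values are assumptions made for this example; they are not taken from the patch or from vLLM's code.

```python
import os


def force_dense_requested(environ=None) -> bool:
    """Return True when VLLM_MLA_FORCE_DENSE is set to a truthy value.

    Hypothetical helper: accepting "1" and "true" is an assumption.
    """
    env = os.environ if environ is None else environ
    value = env.get("VLLM_MLA_FORCE_DENSE", "0")
    return value.strip().lower() in ("1", "true")


def should_use_sparse(model_has_sparse_indexer: bool, environ=None) -> bool:
    """Use the sparse path only if the model has one and dense is not forced."""
    return model_has_sparse_indexer and not force_dense_requested(environ)


if __name__ == "__main__":
    # A DSA model with the override set falls back to dense attention.
    print(should_use_sparse(True, {"VLLM_MLA_FORCE_DENSE": "1"}))
```

Passing the environment as a parameter keeps the decision easy to test without mutating the process environment.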

Test Plan

Run:

VLLM_MLA_FORCE_DENSE=1 vllm serve koushd/GLM-5.1-NVFP4 -tp 8

Test Result

Model loads and performs inference correctly, whereas it would crash on load before.

With this patch the following end up being used for attention computation:

(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [cuda.py:366] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [mla_attention.py:2147] Using FlashAttention prefill for MLA
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [init.py:652] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM


This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA. Signed-off-by: Koushik Dutta <koushd@gmail.com>

gemini-code-assist

Code Review

This pull request introduces the VLLM_MLA_FORCE_DENSE environment variable to allow forcing dense attention and disabling the sparse attention indexer, which is useful for architectures where DeepGEMM is unsupported. Additionally, it replaces has_deep_gemm() with is_deep_gemm_supported() throughout the codebase to ensure better compatibility checks. Feedback was provided to improve the robustness of weight loading in the DeepSeek-V2 model by checking the model's internal state rather than the environment variable directly.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Koushik Dutta <koush@koushikdutta.com>

SaittaGrowing · 2026-04-16T07:05:01Z

Hi @koush,
I applied this patch, but it doesn't work in my case.

My machine's environment is shown below:

-----------GPU INFO---------------
index, name, compute_cap, driver_version, memory.total [MiB]
0, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
1, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
2, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
3, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
4, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
5, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
6, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
7, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB

-------- LIB VERSION-----------
vllm 0.19.1.dev1+g43a9b1afb.cu130
flashinfer-cubin 0.6.6
flashinfer-jit-cache 0.6.6+cu130
flashinfer-python 0.6.6

--------LAUNCH COMMAND-------
VLLM_MLA_FORCE_DENSE=1 vllm serve zai-org/GLM-5.1-FP8 --tensor-parallel-size=32 --gpu-memory-utilization=0.85 --speculative-config.method=mtp --speculative-config.num_speculative_tokens=3 --tool-call-parser=glm47 --reasoning-parser=glm45 --enable-auto-tool-choice --chat-template-content-format=string --served-model-name=infvllm-eam8cbnjju-glm-5-1-fp8-v1 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray

-------------LOG------------------
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [gpu_model_runner.py:4735] Starting to load model zai-org/GLM-5.1-FP8...
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [init.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [cuda.py:334] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [mla_attention.py:2137] Using FlashAttention prefill for MLA
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] EngineCore failed to start.
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] super().init(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self._init_executor()
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 96, in _init_executor
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self._init_workers_ray(placement_group)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 389, in _init_workers_ray
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.collective_rpc("load_model")
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 516, in collective_rpc
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return fn(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2981, in get
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] values, debugger_breakpoint = worker.get_objects(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1012, in get_objects
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise value.as_instanceof_cause()
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=294743,, actor_id=fd40d5728b4f26b26ea43b0111000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f643d44fef0>)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 75, in execute_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise e
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 65, in execute_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return run_method(self, method, args, kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] model = initialize_model(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1339, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model = self.model_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 364, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] old_init(self, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1176, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] + get_offloader().wrap_modules(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return list(modules_generator)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 653, in
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1178, in
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] lambda prefix: DeepseekV2DecoderLayer(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1050, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.self_attn = attn_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 984, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.mla_attn = MultiHeadLatentAttentionWrapper(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 95, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.mla_attn = MLAAttention(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 332, in init
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.attn_backend = get_attn_backend(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py", line 92, in get_attn_backend
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return _cached_get_attn_backend(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py", line 107, in _cached_get_attn_backend
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] attention_cls = current_platform.get_attn_backend_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 298, in get_attn_backend_cls
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise ValueError(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
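
The final `ValueError` is easier to read once the per-backend rejection reasons are separated out. The sketch below is a small standalone parser, not part of vLLM; the function name `parse_backend_reasons` is hypothetical, and the message is the one from the log with the selector config abbreviated. It shows that `TRITON_MLA` was rejected only because sparse attention was still requested (`use_sparse=True`) in this run.

```python
import re


def parse_backend_reasons(message: str) -> dict[str, list[str]]:
    """Map each backend name to its list of rejection reasons.

    Hypothetical helper: it assumes the "Reasons: {NAME: [a, b], ...}" layout
    seen in the log above and returns an empty dict if that part is missing.
    """
    _, _, reasons = message.partition("Reasons: ")
    return {
        name: [reason.strip() for reason in body.split(",")]
        for name, body in re.findall(r"(\w+): \[([^\]]*)\]", reasons)
    }


# Error text from the log above; the AttentionSelectorConfig fields are abbreviated.
error_message = (
    "No valid attention backend found for cuda with AttentionSelectorConfig("
    "use_mla=True, use_sparse=True). "
    "Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not "
    "supported, FlashAttention MLA not supported on this device], "
    "FLASHMLA: [sparse not supported, compute capability not supported, "
    "FlashMLA Sparse is only supported on Hopper and Blackwell devices.], "
    "FLASHINFER_MLA: [sparse not supported, compute capability not supported], "
    "TRITON_MLA: [sparse not supported], "
    "FLASHMLA_SPARSE: [compute capability not supported]}."
)

reasons = parse_backend_reasons(error_message)
# Backends whose only objection is that sparse attention was requested.
only_sparse = [name for name, why in reasons.items() if why == ["sparse not supported"]]

if __name__ == "__main__":
    print(only_sparse)
```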

ehfd · 2026-04-18T15:05:14Z

@koush How does this PR interact with #38476? Is your solution better or more performant than that one?

ehfd · 2026-04-18T15:46:13Z

ba113ec

The PyTorch fallback for fp8_mqa_logits / fp8_paged_mqa_logits added in
the previous commit for SM80 compatibility uses Python-level nested
loops, which also activates on GB10/SM121 (DGX Spark) since DeepGEMM
requires SM90 or SM100. A user reported ~5 t/s running GLM-5.1 on an
8-node DGX Spark cluster due to this.

Micro-benchmark on GB10 (before this patch):
fp8_paged_mqa_logits_torch (ctx=2048): 10.05 ms/call
fp8_mqa_logits_torch (M=N=2048): 11.46 ms/call

Adds two Triton kernels matching DeepGEMM's signatures:

fp8_paged_mqa_logits_triton (decode)
fp8_mqa_logits_triton (prefill)

Both dequant FP8 on-the-fly, compute Q@K^T, apply relu + per-head
weights, and sum across heads in a single kernel launch.

Benchmark after this patch:
fp8_paged_mqa_logits_triton (ctx=2048): 0.09 ms/call (105x faster)
fp8_mqa_logits_triton (M=N=2048): 0.54 ms/call (21x faster)

At ctx=2048 across 60 sparse layers, this saves ~597 ms/token,
projecting 5 t/s -> ~30+ t/s on DGX Spark.

The PyTorch implementations remain in mqa_logits_fallback.py as
correctness references for tests.

Signed-off-by: srt180 <srt180@users.noreply.github.com>

Signed-off-by: haosdent <haosdent@gmail.com>
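
The per-token saving quoted in the commit message can be rechecked from its own micro-benchmark numbers. The sketch below only redoes that arithmetic; the function names are made up for this example, and because the quoted timings are rounded, the computed decode ratio need not match the commit's "105x" exactly.

```python
SPARSE_LAYERS = 60  # number of sparse layers quoted in the commit message

# (before_ms, after_ms) per call, as quoted for GB10
PAGED_DECODE = (10.05, 0.09)  # fp8_paged_mqa_logits, ctx=2048
PREFILL = (11.46, 0.54)       # fp8_mqa_logits, M=N=2048


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old per-call time to the new one."""
    return before_ms / after_ms


def saved_ms_per_token(before_ms: float, after_ms: float, layers: int) -> float:
    """Time saved per decoded token if the kernel runs once per layer."""
    return (before_ms - after_ms) * layers


if __name__ == "__main__":
    print(f"decode speedup:  {speedup(*PAGED_DECODE):.1f}x")
    print(f"prefill speedup: {speedup(*PREFILL):.1f}x")
    print(f"saved per token: {saved_ms_per_token(*PAGED_DECODE, SPARSE_LAYERS):.1f} ms")
```

The decode saving comes out at roughly 597.6 ms per token, in line with the "~597 ms/token" figure above.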

ehfd · 2026-04-18T16:14:22Z

We need token-throughput figures to tell whether Triton sparse DSA attention or forced dense attention performs better.

ianlevesque · 2026-04-20T01:46:57Z

I did a very very rough benchmark with GLM-5.1 (nvfp4) on 8x DGX Sparks:

pp	tg	pp t/s	tg t/s	peak tg t/s	ttfr ms	est_ppt ms
512	32	540.81 ± 12.70	8.59 ± 0.37	9.33 ± 0.47	796.74 ± 21.58	795.45 ± 21.58
512	128	554.42 ± 3.15	8.26 ± 0.06	9.33 ± 0.47	859.81 ± 6.77	858.52 ± 6.77
2048	32	892.66 ± 6.31	8.45 ± 0.15	9.00 ± 0.00	1991.23 ± 53.66	1989.94 ± 53.66
2048	128	883.56 ± 25.91	8.11 ± 0.11	9.00 ± 0.00	2072.54 ± 20.28	2071.24 ± 20.28
8192	32	1109.09 ± 1.06	8.18 ± 0.13	9.00 ± 0.00	6514.83 ± 82.47	6513.53 ± 82.47
8192	128	1106.67 ± 2.64	8.07 ± 0.25	9.00 ± 0.00	6562.78 ± 183.77	6561.49 ± 183.77

vllm config:

port: 8001
tensor-parallel-size: 8
trust-remote-code: true
gpu-memory-utilization: 0.84
max-model-len: 200704
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true}}'
load-format: fastsafetensors
kv-cache-dtype: auto
enable-prefix-caching: true
enable-chunked-prefill: true
enable-prompt-tokens-details: true
enable-auto-tool-choice: true
tool-call-parser: glm47
reasoning-parser: glm45
mm-encoder-tp-mode: data
served-model-name: lukealonso/GLM-5.1-NVFP4

env vars:

VLLM_ENGINE_ITERATION_TIMEOUT_S=600
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=600
VLLM_LOGGING_LEVEL=INFO
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_MLA_FORCE_DENSE=1
NCCL_CROSS_NIC=1
NCCL_CUMEM_ENABLE=0
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_SOCKET_IFNAME=enp1s0f0np0
NCCL_VERSION=2.28.3-1

vllm serve lukealonso/GLM-5.1-NVFP4 --config /etc/vllm/config.yaml --nnodes 8

I don't really know what I'm doing, as reflected in it still crashing when sending parallel requests. It can complete one full request of 200k of context though.

mergify · 2026-05-23T08:34:21Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @koush.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…elf.model.is_v32 PR vllm-project#39594's second commit (gemini-bot suggestion) changed the indexer-weight skip from 'envs.VLLM_MLA_FORCE_DENSE' to 'not self.model.is_v32', but this code is inside DeepseekV2Model.load_weights where self is the model itself (has self.is_v32), not a wrapper with .model -> AttributeError on the force-dense path (the only path where the skip activates). Fixes GLM-5.2 force-dense weight load.
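
The failure described in this commit message can be reproduced in miniature. The sketch below uses a hypothetical stand-in class, not vLLM's `DeepseekV2Model`, and assumes for illustration that indexer weights can be recognised by the substring `indexer` in their names. It contrasts reading the flag from `self.is_v32` with reading it through a `.model` attribute that does not exist.

```python
class TinyModel:
    """Hypothetical stand-in for a model class that owns `is_v32` directly."""

    def __init__(self, is_v32: bool):
        self.is_v32 = is_v32

    def load_weights(self, weights):
        """Return the names that would be loaded, skipping indexer weights
        when the sparse (v3.2-style) path is disabled."""
        loaded = []
        for name, _tensor in weights:
            # Correct form: the flag lives on the model itself.
            if not self.is_v32 and "indexer" in name:  # name match is an assumption
                continue
            loaded.append(name)
        return loaded


def broken_skip_check(model) -> bool:
    """The mistaken form: `model` has no `.model` attribute, so this raises."""
    return not model.model.is_v32


if __name__ == "__main__":
    dense = TinyModel(is_v32=False)
    weights = [("layers.0.indexer.weight", None), ("layers.0.mlp.weight", None)]
    print(dense.load_weights(weights))
    try:
        broken_skip_check(dense)
    except AttributeError as exc:
        print("AttributeError:", exc)
```

The error only appears on the force-dense path because that is the only case where the skip condition is evaluated against indexer weights.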

[bugfix]: support deepseek sparse attention on unsupported targets

b95c63f


koush requested review from mgoin, njhill and pavanimajety as code owners April 12, 2026 01:08

mergify Bot added the deepseek, v1, and bug labels Apr 12, 2026

gemini-code-assist Bot reviewed Apr 12, 2026


Comment thread vllm/model_executor/models/deepseek_v2.py Outdated

Update vllm/model_executor/models/deepseek_v2.py

5e42a8c


koush mentioned this pull request Apr 12, 2026

[Bug]: vLLM does not support DeepSeek series on RTX PRO 6000/SM120 #26211

Open

1 task

sfbemerk mentioned this pull request Apr 14, 2026

[SM120][GLM-5.1] NVFP4 DCP/MTP stack tracker #37113

Open

ehfd mentioned this pull request Apr 18, 2026

[Feature]: Implement TRITON_MLA_SPARSE backend for sm80/120/121 support of Sparse MLA #38006

Open

1 task

ehfd mentioned this pull request Apr 19, 2026

[Feature] TRITON_MLA_SPARSE backend for SM8x/11x/12x DSA Sparse MLA Support #38476

Open

idonati mentioned this pull request May 7, 2026

[Bug]: DeepSeek-V4-Flash hangs after ~6 requests with cudagraph_mode=FULL_AND_PIECEWISE + chunked prefill on SM 12.x (GB10) #40969

Open

1 task

mergify Bot added the needs-rebase label May 23, 2026

ehfd mentioned this pull request Jun 20, 2026

[Bug]: GLM-5 (Sparse MLA / DSA model) cannot run on sm80 GPUs (A100/A800): DeepGemm is a hard dependency with no fallback #35021

Closed

koush wants to merge 2 commits into vllm-project:main from koush:mla-force-dense
