
[bugfix]: support deepseek sparse attention on unsupported targets #39594

Open
koush wants to merge 2 commits into vllm-project:main from koush:mla-force-dense

@koush koush commented Apr 12, 2026

Contributor

This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA.

Purpose

The Blackwell RTX 6000 Pro is sm120, which FlashMLA does not support, so none of the models that use DSA (DeepSeek sparse attention) will run. This includes DeepSeek Exp, 3.2 and GLM 5+.
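
A minimal sketch of the gating described here, not vLLM's actual code. Only the `VLLM_MLA_FORCE_DENSE` variable name comes from this PR; the helper names, the truthy-value parsing, and the set of sparse-capable compute-capability majors are assumptions made for illustration.

```python
import os

# Assumption: sparse kernels exist only for SM90 (Hopper) and SM100
# (datacenter Blackwell); consumer Blackwell reports SM120.
SPARSE_CAPABLE_MAJORS = {9, 10}


def force_dense_requested(environ=os.environ) -> bool:
    """True when VLLM_MLA_FORCE_DENSE is set to a truthy value (assumed parsing)."""
    value = environ.get("VLLM_MLA_FORCE_DENSE", "0")
    return value.strip().lower() in {"1", "true"}


def select_attention_mode(model_is_sparse, capability, environ=os.environ) -> str:
    """Hypothetical helper: return "sparse" or "dense" for an MLA model."""
    if not model_is_sparse:
        return "dense"
    if force_dense_requested(environ):
        # The override wins before any hardware check is made.
        return "dense"
    major, minor = capability
    if major not in SPARSE_CAPABLE_MAJORS:
        raise ValueError(
            f"sparse attention is not supported on sm{major}{minor}; "
            "set VLLM_MLA_FORCE_DENSE=1 to fall back to dense attention"
        )
    return "sparse"
```

Without the override, an sm120 device raises at load time; with it, the same model takes the dense path.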

Test Plan

Run:

VLLM_MLA_FORCE_DENSE=1 vllm serve koushd/GLM-5.1-NVFP4 -tp 8

Test Result

Model loads and performs inference correctly, whereas it would crash on load before.

With this patch, the following backends and kernels are selected:

(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [cuda.py:366] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [mla_attention.py:2147] Using FlashAttention prefill for MLA
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [__init__.py:652] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA.

Signed-off-by: Koushik Dutta <koushd@gmail.com>
@mergify mergify Bot added the deepseek (Related to DeepSeek models), v1, and bug (Something isn't working) labels Apr 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces the VLLM_MLA_FORCE_DENSE environment variable to allow forcing dense attention and disabling the sparse attention indexer, which is useful for architectures where DeepGEMM is unsupported. Additionally, it replaces has_deep_gemm() with is_deep_gemm_supported() throughout the codebase to ensure better compatibility checks. Feedback was provided to improve the robustness of weight loading in the DeepSeek-V2 model by checking the model's internal state rather than the environment variable directly.

Comment thread vllm/model_executor/models/deepseek_v2.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Koushik Dutta <koush@koushikdutta.com>
@SaittaGrowing

SaittaGrowing commented Apr 16, 2026


Hi @koush,
I applied this patch, but it doesn't work in my case.

My machine's environment is shown below:

-----------GPU INFO---------------
index, name, compute_cap, driver_version, memory.total [MiB]
0, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
1, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
2, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
3, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
4, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
5, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
6, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB
7, NVIDIA GeForce RTX 5090, 12.0, 580.142, 32607 MiB

-------- LIB VERSION-----------
vllm 0.19.1.dev1+g43a9b1afb.cu130
flashinfer-cubin 0.6.6
flashinfer-jit-cache 0.6.6+cu130
flashinfer-python 0.6.6

--------LAUNCH COMMAND-------
VLLM_MLA_FORCE_DENSE=1 vllm serve zai-org/GLM-5.1-FP8 --tensor-parallel-size=32 --gpu-memory-utilization=0.85 --speculative-config.method=mtp --speculative-config.num_speculative_tokens=3 --tool-call-parser=glm47 --reasoning-parser=glm45 --enable-auto-tool-choice --chat-template-content-format=string --served-model-name=infvllm-eam8cbnjju-glm-5-1-fp8-v1 --host 0.0.0.0 --port 8000 --distributed-executor-backend ray

-------------LOG------------------
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [gpu_model_runner.py:4735] Starting to load model zai-org/GLM-5.1-FP8...
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [__init__.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [cuda.py:334] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
(EngineCore pid=1427834) (RayWorkerWrapper pid=1428293) INFO 04-16 06:58:50 [mla_attention.py:2137] Using FlashAttention prefill for MLA
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] EngineCore failed to start.
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] super().__init__(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self._init_executor()
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 96, in _init_executor
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self._init_workers_ray(placement_group)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 389, in _init_workers_ray
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.collective_rpc("load_model")
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 516, in collective_rpc
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return fn(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2981, in get
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] values, debugger_breakpoint = worker.get_objects(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1012, in get_objects
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise value.as_instanceof_cause()
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=294743,, actor_id=fd40d5728b4f26b26ea43b0111000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f643d44fef0>)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 75, in execute_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise e
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 65, in execute_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return run_method(self, method, args, kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] model = initialize_model(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 57, in initialize_model
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1339, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.model = self.model_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 364, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] old_init(self, **kwargs)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1176, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 652, in make_layers
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] + get_offloader().wrap_modules(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/offloader/base.py", line 90, in wrap_modules
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return list(modules_generator)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 653, in <genexpr>
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] layer_fn(prefix=f"{prefix}.{idx}") for idx in range(start_layer, end_layer)
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1178, in <lambda>
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] lambda prefix: DeepseekV2DecoderLayer(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1050, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.self_attn = attn_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 984, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.mla_attn = MultiHeadLatentAttentionWrapper(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 95, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.mla_attn = MLAAttention(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 332, in __init__
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] self.attn_backend = get_attn_backend(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py", line 92, in get_attn_backend
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] return _cached_get_attn_backend(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/selector.py", line 107, in _cached_get_attn_backend
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] attention_cls = current_platform.get_attn_backend_cls(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/platforms/cuda.py", line 298, in get_attn_backend_cls
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] raise ValueError(
(EngineCore pid=1427834) ERROR 04-16 06:58:50 [core.py:1108] ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=AttentionType.DECODER). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
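
The error reports `use_sparse=True` in the selector config, and every backend is rejected either for lacking sparse support or for the device's compute capability. The toy selector below sketches how such a reasons map can arise. The `Backend` class, the capability sets, and the "non-sparse not supported" reason string are illustrative assumptions, not vLLM's real support matrix or selector code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    supports_sparse: bool
    supports_dense: bool
    majors: frozenset  # compute-capability majors the backend accepts (assumed)


# Loosely modelled on the reasons printed in the error; values are assumptions.
BACKENDS = [
    Backend("FLASHMLA", False, True, frozenset({9, 10})),
    Backend("TRITON_MLA", False, True, frozenset({8, 9, 10, 12})),
    Backend("FLASHMLA_SPARSE", True, False, frozenset({9, 10})),
]


def rejection_reasons(backend, use_sparse, major):
    """Collect every reason this backend cannot serve the request."""
    reasons = []
    if use_sparse and not backend.supports_sparse:
        reasons.append("sparse not supported")
    if not use_sparse and not backend.supports_dense:
        reasons.append("non-sparse not supported")
    if major not in backend.majors:
        reasons.append("compute capability not supported")
    return reasons


def select_backend(use_sparse, major):
    """Return the first backend with no rejection reasons, else raise."""
    rejected = {b.name: rejection_reasons(b, use_sparse, major) for b in BACKENDS}
    valid = [name for name, reasons in rejected.items() if not reasons]
    if not valid:
        raise ValueError(f"No valid attention backend found. Reasons: {rejected}")
    return valid[0]
```

In this sketch, `select_backend(True, 12)` raises with a reasons map shaped like the one in the log, while `select_backend(False, 12)` returns `TRITON_MLA`, the backend the PR description reports for the forced-dense path.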

@ehfd

ehfd commented Apr 18, 2026

Contributor

#38476

@koush How does this PR interact with #38476? Is your solution better or more performant than that one?

@ehfd

ehfd commented Apr 18, 2026

Contributor

ba113ec

The PyTorch fallback for fp8_mqa_logits / fp8_paged_mqa_logits added in
the previous commit for SM80 compatibility uses Python-level nested
loops, which also activates on GB10/SM121 (DGX Spark) since DeepGEMM
requires SM90 or SM100. A user reported ~5 t/s running GLM-5.1 on an
8-node DGX Spark cluster due to this.

Micro-benchmark on GB10 (before this patch):
fp8_paged_mqa_logits_torch (ctx=2048): 10.05 ms/call
fp8_mqa_logits_torch (M=N=2048): 11.46 ms/call

Adds two Triton kernels matching DeepGEMM's signatures:

  • fp8_paged_mqa_logits_triton (decode)
  • fp8_mqa_logits_triton (prefill)

Both dequant FP8 on-the-fly, compute Q@K^T, apply relu + per-head
weights, and sum across heads in a single kernel launch.
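
As a rough reference for the computation described above, here is a pure-Python sketch of the logits formula: dequantize each key, take the per-head dot product with the query, apply relu, weight per head, and sum over heads. The function name, argument layout, and per-key scale are assumptions for illustration. It is not the Triton kernel or the code in `mqa_logits_fallback.py`, and it omits paging and masking.

```python
def mqa_logits_reference(q, k, k_scale, weights):
    """logits[m][n] = sum_h weights[m][h] * relu(dot(q[m][h], k[n] * k_scale[n])).

    q:       M x H x D nested lists of floats (queries per head)
    k:       N x D nested lists (keys shared by all heads, stored quantized)
    k_scale: length-N list of per-key dequantization scales (assumed layout)
    weights: M x H nested lists of per-head weights
    """
    num_queries, num_keys = len(q), len(k)
    logits = [[0.0] * num_keys for _ in range(num_queries)]
    for m in range(num_queries):
        for n in range(num_keys):
            key = [x * k_scale[n] for x in k[n]]  # dequantize on the fly
            total = 0.0
            for h, q_head in enumerate(q[m]):
                score = sum(a * b for a, b in zip(q_head, key))  # q . k
                total += weights[m][h] * max(score, 0.0)  # relu, then head weight
            logits[m][n] = total
    return logits
```

The nested loops here are for readability only; a fused kernel performs the same arithmetic in one launch.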

Benchmark after this patch:
fp8_paged_mqa_logits_triton (ctx=2048): 0.09 ms/call (105x faster)
fp8_mqa_logits_triton (M=N=2048): 0.54 ms/call (21x faster)

At ctx=2048 across 60 sparse layers, this saves ~597 ms/token,
projecting 5 t/s -> ~30+ t/s on DGX Spark.
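
The per-token savings figure can be reproduced from the micro-benchmark numbers quoted above. The sketch below uses only those rounded values, so the speedup ratios it yields may differ slightly from the ones in the commit message, which were presumably computed from unrounded timings. The function names are hypothetical.

```python
def saved_ms_per_token(before_ms, after_ms, sparse_layers):
    """Decode-time savings if the kernel runs once per sparse layer per token."""
    return (before_ms - after_ms) * sparse_layers


def speedup(before_ms, after_ms):
    """Ratio of the old per-call latency to the new one."""
    return before_ms / after_ms


# Values quoted in the commit message (ctx=2048, 60 sparse layers).
decode_saving = saved_ms_per_token(10.05, 0.09, 60)
decode_speedup = speedup(10.05, 0.09)
prefill_speedup = speedup(11.46, 0.54)

print(f"decode saving per token: {decode_saving:.1f} ms")
print(f"decode speedup (rounded inputs): {decode_speedup:.1f}x")
print(f"prefill speedup (rounded inputs): {prefill_speedup:.1f}x")
```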

The PyTorch implementations remain in mqa_logits_fallback.py as
correctness references for tests.

Signed-off-by: srt180 srt180@users.noreply.github.com

Signed-off-by: haosdent haosdent@gmail.com

@ehfd

ehfd commented Apr 18, 2026

Contributor

We need token-throughput figures to determine whether Triton sparse DSA attention or forced dense attention performs better.

@ianlevesque


I ran a very rough benchmark with GLM-5.1 (nvfp4) on 8x DGX Sparks:

| pp | tg | pp t/s | tg t/s | peak tg t/s | ttfr ms | est_ppt ms |
|---|---|---|---|---|---|---|
| 512 | 32 | 540.81 ± 12.70 | 8.59 ± 0.37 | 9.33 ± 0.47 | 796.74 ± 21.58 | 795.45 ± 21.58 |
| 512 | 128 | 554.42 ± 3.15 | 8.26 ± 0.06 | 9.33 ± 0.47 | 859.81 ± 6.77 | 858.52 ± 6.77 |
| 2048 | 32 | 892.66 ± 6.31 | 8.45 ± 0.15 | 9.00 ± 0.00 | 1991.23 ± 53.66 | 1989.94 ± 53.66 |
| 2048 | 128 | 883.56 ± 25.91 | 8.11 ± 0.11 | 9.00 ± 0.00 | 2072.54 ± 20.28 | 2071.24 ± 20.28 |
| 8192 | 32 | 1109.09 ± 1.06 | 8.18 ± 0.13 | 9.00 ± 0.00 | 6514.83 ± 82.47 | 6513.53 ± 82.47 |
| 8192 | 128 | 1106.67 ± 2.64 | 8.07 ± 0.25 | 9.00 ± 0.00 | 6562.78 ± 183.77 | 6561.49 ± 183.77 |

vllm config:

port: 8001
tensor-parallel-size: 8
trust-remote-code: true
gpu-memory-utilization: 0.84
max-model-len: 200704
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true}}'
load-format: fastsafetensors
kv-cache-dtype: auto
enable-prefix-caching: true
enable-chunked-prefill: true
enable-prompt-tokens-details: true
enable-auto-tool-choice: true
tool-call-parser: glm47
reasoning-parser: glm45
mm-encoder-tp-mode: data
served-model-name: lukealonso/GLM-5.1-NVFP4

env vars:

VLLM_ENGINE_ITERATION_TIMEOUT_S=600
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=600
VLLM_LOGGING_LEVEL=INFO
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_MLA_FORCE_DENSE=1
NCCL_CROSS_NIC=1
NCCL_CUMEM_ENABLE=0
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_SOCKET_IFNAME=enp1s0f0np0
NCCL_VERSION=2.28.3-1

vllm serve lukealonso/GLM-5.1-NVFP4 --config /etc/vllm/config.yaml --nnodes 8

I don't really know what I'm doing, which shows: it still crashes when I send parallel requests. It can, however, complete a single full request with 200k of context.

@mergify

mergify Bot commented May 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @koush.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
ianlevesque added a commit to ianlevesque/vllm that referenced this pull request Jun 22, 2026
…elf.model.is_v32

PR vllm-project#39594's second commit (gemini-bot suggestion) changed the indexer-weight
skip from 'envs.VLLM_MLA_FORCE_DENSE' to 'not self.model.is_v32', but this code
is inside DeepseekV2Model.load_weights where self is the model itself (has
self.is_v32), not a wrapper with .model -> AttributeError on the force-dense
path (the only path where the skip activates). Fixes GLM-5.2 force-dense weight load.
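
The failure mode described in this commit message can be reproduced with plain Python classes. `InnerModel` and `Wrapper` below are hypothetical stand-ins for the model class and its outer wrapper; the method names are invented for illustration and do not mirror the real `load_weights` code.

```python
class InnerModel:
    """Stand-in for the class that owns both `is_v32` and the weight-skip check."""

    def __init__(self, is_v32):
        self.is_v32 = is_v32

    def skip_indexer_buggy(self):
        # `self` is already the model, so there is no `.model` attribute here.
        return not self.model.is_v32

    def skip_indexer_fixed(self):
        # Read the flag directly from the model itself.
        return not self.is_v32


class Wrapper:
    """Stand-in for an outer class, where `self.model.is_v32` would be valid."""

    def __init__(self, is_v32):
        self.model = InnerModel(is_v32)


inner = Wrapper(is_v32=False).model
try:
    inner.skip_indexer_buggy()
    buggy_error = None
except AttributeError as exc:
    buggy_error = exc

print(type(buggy_error).__name__)
print(inner.skip_indexer_fixed())
```

The same expression is correct one level up in the wrapper, which is likely why the suggestion looked plausible in review.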
