[bugfix]: support deepseek sparse attention on unsupported targets #39594
koush wants to merge 2 commits into
This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA. Signed-off-by: Koushik Dutta <koushd@gmail.com>
Code Review
This pull request introduces the VLLM_MLA_FORCE_DENSE environment variable to allow forcing dense attention and disabling the sparse attention indexer, which is useful for architectures where DeepGEMM is unsupported. Additionally, it replaces has_deep_gemm() with is_deep_gemm_supported() throughout the codebase to ensure better compatibility checks. Feedback was provided to improve the robustness of weight loading in the DeepSeek-V2 model by checking the model's internal state rather than the environment variable directly.
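The review feedback above (consult the model's own state during weight loading instead of re-reading the environment variable) can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ToyModel`, `uses_sparse_indexer`, `force_dense_requested`, and the `".indexer."` name filter are all hypothetical, and treating `"1"`/`"true"` as enabled is an assumption about how the variable is parsed.

```python
import os


def force_dense_requested(environ=os.environ) -> bool:
    # Assumption: "1" or "true" enables the flag; the real parsing may differ.
    value = environ.get("VLLM_MLA_FORCE_DENSE", "0")
    return value.strip().lower() in ("1", "true")


class ToyModel:
    """Hypothetical stand-in for a model with an optional sparse indexer."""

    def __init__(self, has_indexer_config: bool, environ=os.environ):
        # Resolve the environment variable once, at construction time,
        # and store the outcome as model state.
        self.uses_sparse_indexer = (
            has_indexer_config and not force_dense_requested(environ)
        )

    def load_weights(self, weights):
        loaded = []
        for name, _tensor in weights:
            # Skip indexer weights based on model state, not on the env var.
            if ".indexer." in name and not self.uses_sparse_indexer:
                continue
            loaded.append(name)
        return loaded
```

Resolving the flag once keeps the module layout and the weight loader in agreement even if the environment changes between construction and loading.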
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Koushik Dutta <koush@koushikdutta.com>
Hi @koush, the environment on my machine is shown below:
-----------GPU INFO---------------
-------- LIB VERSION-----------
--------LAUNCH COMMAND-------
-------------LOG------------------
The PyTorch fallback for fp8_mqa_logits / fp8_paged_mqa_logits added in
Micro-benchmark on GB10 (before this patch):
Adds two Triton kernels matching DeepGEMM's signatures:
Both dequant FP8 on-the-fly, compute Q@K^T, apply relu + per-head
Benchmark after this patch:
At ctx=2048 across 60 sparse layers, this saves ~597 ms/token,
The PyTorch implementations remain in mqa_logits_fallback.py as
Signed-off-by: srt180 <srt180@users.noreply.github.com>
Signed-off-by: haosdent <haosdent@gmail.com>
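As a rough illustration of the computation named in the comment above (dequantize the keys, compute Q@K^T, apply relu, then combine per head), here is a pure-Python reference sketch. It is not the Triton kernel and not the contents of mqa_logits_fallback.py. The comment is cut off after "per-head", so combining heads with a weighted sum is an assumption, as are the function name, the argument layout, and the use of one dequantization scale per key row in place of real FP8 storage.

```python
def mqa_logits_reference(q, k_quant, k_scale, head_weights):
    """Reference logits for one shared key set (multi-query layout).

    q:            [M][H][D] query vectors per token and head
    k_quant:      [N][D] quantized keys (plain floats stand in for FP8)
    k_scale:      [N] per-key dequantization scale (assumed layout)
    head_weights: [M][H] per-head weights (assumed combination rule)
    returns:      [M][N] logits
    """
    logits = []
    for q_heads, w_heads in zip(q, head_weights):
        row = []
        for k_row, scale in zip(k_quant, k_scale):
            # Dequantize the key on the fly.
            k = [x * scale for x in k_row]
            total = 0.0
            for q_vec, w in zip(q_heads, w_heads):
                dot = sum(a * b for a, b in zip(q_vec, k))
                # relu on the per-head score, then weighted sum over heads.
                total += w * max(dot, 0.0)
            row.append(total)
        logits.append(row)
    return logits
```

A reference like this is mainly useful for checking a fused kernel's output on small inputs; it says nothing about the performance figures quoted in the comment.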
We need token-throughput figures to tell whether Triton sparse DSA attention or forced dense attention performs better.
I did a very very rough benchmark with GLM-5.1 (nvfp4) on 8x DGX Sparks:
vllm config:
port: 8001
tensor-parallel-size: 8
trust-remote-code: true
gpu-memory-utilization: 0.84
max-model-len: 200704
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true}}'
load-format: fastsafetensors
kv-cache-dtype: auto
enable-prefix-caching: true
enable-chunked-prefill: true
enable-prompt-tokens-details: true
enable-auto-tool-choice: true
tool-call-parser: glm47
reasoning-parser: glm45
mm-encoder-tp-mode: data
served-model-name: lukealonso/GLM-5.1-NVFP4
env vars:
I don't really know what I'm doing, as reflected in it still crashing when sending parallel requests. It can complete one full request with 200k of context, though.
This pull request has merge conflicts that must be resolved before it can be |
…elf.model.is_v32 PR vllm-project#39594's second commit (gemini-bot suggestion) changed the indexer-weight skip from 'envs.VLLM_MLA_FORCE_DENSE' to 'not self.model.is_v32', but this code is inside DeepseekV2Model.load_weights where self is the model itself (has self.is_v32), not a wrapper with .model -> AttributeError on the force-dense path (the only path where the skip activates). Fixes GLM-5.2 force-dense weight load.
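The failure described in this commit message can be reproduced with toy classes. The sketch below is an assumption-laden illustration: `ToyDeepseekV2Model` and `ToyCausalLMWrapper` are hypothetical stand-ins, and only the attribute layout (the inner model owns `is_v32`, the wrapper owns `.model`) is taken from the message.

```python
class ToyDeepseekV2Model:
    """Hypothetical inner model: it owns is_v32 and has no .model attribute."""

    def __init__(self, is_v32: bool):
        self.is_v32 = is_v32

    def should_skip_indexer_buggy(self) -> bool:
        # Written as if self were a wrapper; raises AttributeError here.
        return not self.model.is_v32

    def should_skip_indexer_fixed(self) -> bool:
        # self already is the model, so read the flag directly.
        return not self.is_v32


class ToyCausalLMWrapper:
    """Hypothetical outer wrapper: here self.model.is_v32 is valid."""

    def __init__(self, is_v32: bool):
        self.model = ToyDeepseekV2Model(is_v32)
```

Because the skip only matters when the flag is false, the broken attribute lookup would only surface on the force-dense path, which matches the message's description.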
This patch disables sparse attention via an environment variable to force dense attention computation on architectures that do not support FlashMLA.
Purpose
Blackwell RTX 6000 Pro is sm120, which FlashMLA does not support, so none of the models that use DSA will run. This includes DeepSeek Exp, 3.2, and GLM 5+.
Test Plan
Run:
VLLM_MLA_FORCE_DENSE=1 vllm serve koushd/GLM-5.1-NVFP4 -tp 8
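Once the server from the command above is running, a quick smoke test could look like the sketch below. It assumes the default `vllm serve` port (8000) and the OpenAI-compatible `/v1/completions` route; the helper names, prompt, and timeout are illustrative and not part of the PR.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    # Build a POST request for the OpenAI-compatible completions route.
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8000", model="koushd/GLM-5.1-NVFP4"):
    # Sends one short request; requires the server to be up and reachable.
    req = build_completion_request(base_url, model, "Hello")
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `smoke_test()` returns the generated text if the model loaded and inference works; a connection error or HTTP error indicates the server did not come up.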
Test Result
Model loads and performs inference correctly, whereas it would crash on load before.
With this patch, the following are used for attention computation:
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [cuda.py:366] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [mla_attention.py:2147] Using FlashAttention prefill for MLA
(Worker_TP0 pid=24922) INFO 04-12 01:11:20 [__init__.py:652] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.