[Optimization][DeepSeekV3.2] Precompute the attention_mask_offset for Prefill in the Indexer #7598
ShaneGZhu wants to merge 6 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #7598   +/-  ##
==========================================
  Coverage         ?   71.68%
==========================================
  Files            ?      420
  Lines            ?    57885
  Branches         ?     9077
==========================================
  Hits             ?    41495
  Misses           ?    13561
  Partials         ?     2829
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```python
@enable_compat_on_triton_kernel
```

1. Clean up the unnecessary comments.
2. Change them to English.
PaddlePaddle-bot left a comment
🤖 AI Code Review
2026-04-24 22:33:33
📋 Review Summary

PR overview: Migrates the computation of attention_mask_offset in the DeepSeekV3.2 Indexer Prefill stage from a CPU-side Python loop to a Triton GPU kernel, removing Python-loop overhead from the hot path.

Scope of changes: model_executor/ops/triton_ops, model_executor/layers/attention, model_executor/models/deepseek_v3.py, forward_meta.py

Impact tags: Models, OP
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | indexer_update_attn_mask_offsets.py:67 | The ids_remove_padding parameter is redundant (only .shape[0] is used), which makes the interface semantically misleading |
| 🟡 Suggestion | dsa_attention_backend.py:200 | _update_forward_meta triggers a GPU allocation on every forward pass, an unnecessary cost in pure Decode stages |
| ❓ Question | deepseek_v3.py:678 | An Optional field is sliced directly without a None guard |
Overall assessment

The optimization approach is clear, the Triton kernel implementation is correct, and unit test coverage is thorough (including edge cases and stress tests). Main suggestions: remove the redundant interface parameter, add a forward_mode guard to avoid the useless allocation in the Decode stage, and add a None assertion at the consumption site in deepseek_v3.py.
```python
def update_indexer_attn_mask_offsets(
    ids_remove_padding,
```
🟡 Suggestion: ids_remove_padding is only used to obtain num_tokens; its actual data is never passed to the kernel.

Inside the function, the parameter is used only as num_tokens = ids_remove_padding.shape[0]. Since cu_seqlens_k[-1] likewise equals sum(seq_lens_this_time), it can serve the same purpose without passing in the whole tensor. The current interface can mislead callers into thinking the kernel reads the token ID data.

Suggested change:

```python
num_tokens = int(cu_seqlens_k[-1].item())
```

and remove ids_remove_padding from the function signature, updating the call site in dsa_attention_backend.py accordingly.
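For illustration, a minimal sketch of the trimmed interface. Only the num_tokens derivation and the interleaved output size come from this review thread; the remaining parameter names and the kernel launch are assumptions, not the PR's actual code:

```python
import paddle

def update_indexer_attn_mask_offsets(seq_lens_this_time, seq_lens_decoder, cu_seqlens_k):
    # cu_seqlens_k[-1] == sum(seq_lens_this_time), so the token count no
    # longer requires passing the full ids_remove_padding tensor.
    num_tokens = int(cu_seqlens_k[-1].item())
    attn_mask_offsets = paddle.zeros((num_tokens * 2,), dtype="int32")
    ...  # launch the Triton kernel to fill attn_mask_offsets, as before
    return attn_mask_offsets
```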
```python
    result = result.view(num_blocks, block_size, 1, -1)
    return result
```
```python
def _update_forward_meta(self, forward_meta: ForwardMeta):
```
🟡 Suggestion: _update_forward_meta is called on every forward pass (including pure Decode stages), incurring an unnecessary GPU memory allocation.

For a pure Decode batch, seq_lens_encoder is all zeros and every block in the kernel returns immediately, yet the paddle.zeros((num_tokens * 2), ...) allocation still happens. In high-throughput Decode scenarios this is a redundant GPU allocation triggered at every step.

Suggested fix, checking forward_mode to skip early:

```python
from fastdeploy.model_executor.forward_meta import ForwardMode

def _update_forward_meta(self, forward_meta: ForwardMeta):
    # Only needed during Prefill / Mixed stages
    if forward_meta.forward_mode == ForwardMode.DECODE:
        return
    forward_meta.indexer_attn_mask_offsets = update_indexer_attn_mask_offsets(...)
```
```python
# indexer_attn_mask_offsets is pre-computed by the Triton kernel
# update_indexer_attn_mask_offsets in dsa_attention_backend and stored in forward_meta.
ks = forward_meta.indexer_attn_mask_offsets[::2].contiguous()
```
❓ Question: forward_meta.indexer_attn_mask_offsets is typed Optional[paddle.Tensor], but it is sliced here without a None check.

If init_attention_metadata is never called (for example, a unit test constructs ForwardMeta directly, or Triton is unavailable so _update_forward_meta raises ImportError and is skipped), the field stays None and the [::2] slice raises TypeError.

Could an assertion be added to guarantee the precondition?

```python
assert forward_meta.indexer_attn_mask_offsets is not None, \
    "indexer_attn_mask_offsets must be precomputed by DSAAttentionBackend"
ks = forward_meta.indexer_attn_mask_offsets[::2].contiguous()
```
Motivation

In DeepSeekV3.2's Indexer Prefill stage, the attention_mask_offset (i.e., the causal attention window [ks, ke) for each token) was previously computed via Python loops on CPU — iterating batch by batch and token by token. This becomes a bottleneck on the critical forward path when handling large batch sizes or long sequences.

This PR precomputes attention_mask_offset using a Triton kernel on GPU during init_attention_metadata, eliminating the CPU-side Python loop overhead and improving Prefill throughput for the Indexer attention backend.
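For intuition, a hedged NumPy sketch of the per-token CPU loop this PR replaces. The interleaved [ks0, ke0, ks1, ke1, ...] output layout matches the offsets[::2] consumer slicing shown in the review above; the exact window arithmetic and the argument names are assumptions, not the PR's code:

```python
import numpy as np

def ref_attn_mask_offsets_sketch(seq_lens_this_time, seq_lens_decoder):
    """Per-token CPU loop that the Triton kernel replaces (illustrative only)."""
    offsets = []
    for bid, q_len in enumerate(seq_lens_this_time):
        cached = seq_lens_decoder[bid]  # tokens already present in the KV cache
        for t in range(q_len):          # the batch-by-batch, token-by-token loop
            ks = 0                      # window start: beginning of the sequence
            ke = cached + t + 1         # causal end: each token sees itself + its past
            offsets.extend([ks, ke])    # interleaved layout: [ks0, ke0, ks1, ke1, ...]
    return np.asarray(offsets, dtype=np.int32)

# Two prefill requests with 3 and 2 new tokens and an empty KV cache:
# -> [0 1 0 2 0 3 0 1 0 2]
print(ref_attn_mask_offsets_sketch([3, 2], [0, 0]))
```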
Modifications

fastdeploy/model_executor/ops/triton_ops/indexer_update_attn_mask_offsets.py (new file)
Implements the Triton kernel update_indexer_attn_mask_offsets, which batch-computes the causal attention window [ks, ke) for all prefill tokens in a single GPU kernel launch. Also provides a Python reference implementation ref_update_attn_mask_offsets for correctness verification.

fastdeploy/model_executor/layers/attention/dsa_attention_backend.py
Adds a _update_forward_meta method that calls update_indexer_attn_mask_offsets to precompute attention_mask_offset and stores the result into forward_meta.indexer_attn_mask_offsets.

fastdeploy/model_executor/forward_meta.py
Adds an indexer_attn_mask_offsets field to ForwardMeta to carry the precomputed offsets.
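For reference, a minimal sketch of how the new ForwardMeta field might be declared; the field name and its Optional[paddle.Tensor] type come from the PR and review, while the surrounding dataclass layout is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

import paddle

@dataclass
class ForwardMeta:
    ...  # existing metadata fields elided
    # Interleaved [ks0, ke0, ks1, ke1, ...] attention-window offsets, filled in
    # by the DSA attention backend during init_attention_metadata; stays None
    # if that step never runs (hence the consumer-side assert suggested above).
    indexer_attn_mask_offsets: Optional[paddle.Tensor] = None
```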
Usage or Command

No API changes. The optimization is transparent to users. To verify correctness, run the unit tests for the new kernel.
Accuracy Tests

The new Triton kernel is validated against the Python reference implementation ref_update_attn_mask_offsets in the unit tests, covering edge cases and stress scenarios.

No changes to model forward logic or kernel math — accuracy of model outputs is unaffected.
Checklist

- Add at least one tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code: run pre-commit before commit.
- If the PR targets a release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.