[Optimization] Support logprob overlap in speculative decoding #7600
Deleter-D wants to merge 6 commits into PaddlePaddle:develop
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | develop | #7600 | +/- |
|---|---|---|---|
| Coverage | ? | 71.68% | |
| Files | ? | 419 | |
| Lines | ? | 57849 | |
| Branches | ? | 9077 | |
| Hits | ? | 41470 | |
| Misses | ? | 13551 | |
| Partials | ? | 2828 | |
PaddlePaddle-bot left a comment
🤖 AI Code Review | 2026-04-24 17:44:55
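📋 Review Summary
PR overview: Merges `speculate_get_target_logits` and accepted-token extraction into a single CUDA kernel, and computes `cu_batch_token_offset` on the GPU so that logprob computation overlaps with token acceptance at the kernel level, improving speculative decoding performance.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/model_executor/layers/sample/
Impact tags: OP, Speculative Decoding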
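Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | logprobs.py:216 | `gather_logprobs` receives the full `token_ids`/`raw_logprobs`, including zero-filled padding, so the logprobs result is wrong |
| 🟡 Suggestion | speculate_logprob_utils.cu:238 | The write to `token_ids` has no bounds assertion at the C++ level |
| 🟡 Suggestion | speculate_logprob_utils.cu:282 | No CUDA error check after the two kernel launches |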
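Core Bug Details
`gather_logprobs` uses padded data, which produces wrong logprobs (logprobs.py, lines 190-216):
`output_logits` and `token_ids` are allocated with `real_bsz * max_draft_token_num_plus_1` rows, but the kernel only writes `accept_num.sum()` valid rows (`accept_num.sum() ≤ real_bsz * max_draft_token_num_plus_1`). In the remaining positions:
- `token_ids` still holds `fill_value=0` (i.e. token_id=0)
- `output_logits` holds uninitialized (`paddle.empty`) data

`gather_logprobs(raw_logprobs, num_logprobs, token_ids=token_ids)` operates on the full buffers, so it extracts a logprob for token_id=0 at every padded row, and the final logprobs result contains invalid data.
Suggested fix: truncate to the valid length (`accept_num.sum()` rows) before calling `gather_logprobs`.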
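The padding problem and the suggested truncation can be illustrated with a small pure-Python sketch. It is not the PR's code: `gather_token_logprobs` is a hypothetical stand-in for the real `gather_logprobs`, the sample values are made up, and the layout (valid rows packed at the front of the buffers) is an assumption based on the description above.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def gather_token_logprobs(logprob_rows, token_ids):
    """Hypothetical stand-in for gather_logprobs: pick each row's logprob at its token id."""
    return [row[tok] for row, tok in zip(logprob_rows, token_ids)]

real_bsz = 2
max_draft_token_num_plus_1 = 3
accept_num = [2, 1]                      # accepted tokens per request

capacity = real_bsz * max_draft_token_num_plus_1
num_valid = sum(accept_num)              # rows the kernel actually writes

# Preallocated buffers: token_ids is zero-filled, logits rows are placeholders
# standing in for uninitialized memory.
token_ids = [0] * capacity
output_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(capacity)]

# Assumed kernel behavior: only the first num_valid rows are written.
token_ids[:num_valid] = [3, 1, 2]
output_logits[:num_valid] = [
    [0.1, 0.2, 0.3, 2.0],
    [0.5, 1.5, 0.0, 0.0],
    [1.0, 1.0, 3.0, 1.0],
]

# Truncate both buffers to the valid length before gathering.
token_ids_valid = token_ids[:num_valid]
raw_logprobs_valid = [log_softmax(r) for r in output_logits[:num_valid]]
logprobs = gather_token_logprobs(raw_logprobs_valid, token_ids_valid)
```

Without the two slices, the gather would return `capacity` entries, and the trailing ones would be logprobs for token_id=0 computed from rows the kernel never wrote.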
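Overall Assessment
The optimization direction of this PR is sound: moving the `accept_num` cumsum from the CPU into a GPU kernel reduces synchronization overhead. However, the Python call layer has a bug: `output_logits`/`token_ids` are over-allocated and then passed to `gather_logprobs` without truncation, so the logprob results contain padding noise. Fixing this before merging is recommended.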
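The cumsum that the PR moves onto the GPU can be sketched on the host as a prefix sum over `accept_num`. This is a plain-Python illustration only: the function name `compute_cu_batch_token_offset` and the convention of a leading 0 with `real_bsz + 1` entries are assumptions, not the confirmed output layout of the kernel.

```python
from itertools import accumulate

def compute_cu_batch_token_offset(accept_num):
    """Prefix sum with a leading 0: entry i is the first packed row of request i."""
    return [0] + list(accumulate(accept_num))

accept_num = [2, 1, 3]                   # accepted tokens per request (made-up values)
cu_batch_token_offset = compute_cu_batch_token_offset(accept_num)

# Request i owns the half-open row range [cu[i], cu[i + 1]) in the packed buffers.
row_ranges = [
    (cu_batch_token_offset[i], cu_batch_token_offset[i + 1])
    for i in range(len(accept_num))
]
```

Under this convention the last entry equals `accept_num.sum()`, which is also the valid length to truncate to before gathering logprobs.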
Motivation
Improve performance of speculative decoding by overlapping logprob computation with token acceptance operations.
Modifications
- Merge `speculate_get_target_logits` into `speculate_get_accept_tokens_and_logits` to enable kernel-level overlap
- Add `compute_cu_batch_offset_kernel` for efficient batch offset calculation
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`,`[Scheduler]`,`[PD Disaggregation]`,`[Executor]`,`[Graph Optimization]`,`[Speculative Decoding]`,`[RL]`,`[Models]`,`[Quantization]`,`[Loader]`,`[OP]`,`[KVCache]`,`[DataProcessor]`,`[BugFix]`,`[Docs]`,`[CI]`,`[Optimization]`,`[Feature]`,`[Benchmark]`,`[Others]`,`[XPU]`,`[HPU]`,`[GCU]`,`[DCU]`,`[Iluvatar]`,`[Metax]`]
- Run `pre-commit` before commit.
- If submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.