[Cherry-Pick][Speculative Decoding][BugFix] overlap compute logprobs for speculative decoding (#7406)#7585
huicongyao wants to merge 2 commits into PaddlePaddle:release/2.6 from
Conversation
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@           Coverage Diff            @@
##       release/2.6    #7585   +/- ##
==============================================
  Coverage        ?   73.66%
  Files           ?      376
  Lines           ?    53405
  Branches        ?     8351
==============================================
  Hits            ?    39340
  Misses          ?    11296
  Partials        ?     2769

Flags with carried forward coverage won't be shown.
/re-run run_tests_with_coverage
1 similar comment
/re-run run_tests_with_coverage
…ive decoding (PaddlePaddle#7406)
* fix shape mismatch while cuda graph closed
* fix
* fix xpu typo
* overlap compute logprobs
* fix
* optimize
* fix
* opt
* fix unitest error and optimize code
/re-run ci_xpu
2 similar comments
/re-run ci_xpu
/re-run ci_xpu
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-04-27 10:29:34
📋 Review Summary
PR overview: adds overlap-compute support for logprobs computation in speculative decoding, fixes the correctness of logprobs on the speculative-decoding path, and corrects several misspelled function names.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/model_executor/layers/sample/, fastdeploy/model_executor/, fastdeploy/spec_decode/mtp.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [Speculative Decoding] [BugFix] [OP]
📝 PR Convention Check
The PR title follows the Cherry-Pick convention and the tags are complete. However, the Motivation, Modifications, and Usage sections of the PR description template contain only placeholder text; please fill them in.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu:34 | Every thread in each block of the CUDA kernel recomputes start_offset, which is redundant work |
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/sampler.py:905 | With the `if top_p_token_mask.any():` guard removed, the full softmax runs even when no request needs top_p, a performance regression |
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/logprobs.py:148 | The docstring of build_output_logprobs describes a 3-element return value, but the actual return type has 2 elements |
| ❓ Question | fastdeploy/model_executor/layers/sample/sampler.py:870 | When real_bsz=0, num_tokens=0 produces empty tensors; the safety of the CUDA kernel and downstream logic needs confirmation |
Overall Assessment
This PR introduces a new CUDA kernel, BuildSamplingParamLogProb, to parallelize logprobs computation on the speculative-decoding path, and fixes several misspelled function names (specualate → speculate); the overall direction is sound. Test coverage is fairly thorough, including golden-value checks and edge cases. Please look into the redundant computation in the CUDA kernel, the performance impact of removing the top_p guard, and the safety of the real_bsz=0 edge case.
// Compute start offset: sum of token_num_per_batch[0..bi-1]
int start_offset = 0;
for (int i = 0; i < bi; i++) {
  start_offset += token_num_per_batch[i];
🟡 Suggestion: start_offset is recomputed in every block, which is redundant work.
In the current implementation, all 256 threads within each block (one block per bi) independently run the same O(bi) accumulation loop to compute start_offset. This is entirely redundant: let the tid==0 thread compute the value and write it to shared memory for the other threads to read, or precompute the prefix sum on the C++ side and pass it into the kernel.
Suggested change using shared memory to avoid the repeated work:
__shared__ int32_t s_start_offset;
if (tid == 0) {
  int off = 0;
  for (int i = 0; i < bi; i++) off += token_num_per_batch[i];
  s_start_offset = off;
}
__syncthreads();
int start_offset = s_start_offset;

probs = top_p_normalize_probs_paddle(probs, real_token_top_p)
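The reviewer's second alternative, precomputing the prefix sum once on the host and passing it into the kernel, can be sketched in plain Python (the name `token_num_per_batch` follows the diff; the helper itself is hypothetical, for illustration only):

```python
from itertools import accumulate

def batch_start_offsets(token_num_per_batch):
    """Exclusive prefix sum: the start offset of each batch's tokens.

    Computing this once on the host replaces the O(bi) loop that every
    thread in every block currently repeats inside the kernel; each
    block bi would then read offsets[bi] directly.
    """
    return [0] + list(accumulate(token_num_per_batch))[:-1]

print(batch_start_offsets([3, 1, 4]))  # [0, 3, 4]
```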
top_p_logprob = paddle.log(probs)

probs = F.softmax(padded_logits, axis=-1)
🟡 Suggestion: with the `if top_p_token_mask.any():` guard removed, the three compute-heavy operations F.softmax, top_p_normalize_probs_paddle, and paddle.log now run unconditionally, even when no request in the batch needs top_p scaling.
The original code had the `if top_p_token_mask.any():` guard and performed better. If the guard was removed deliberately to fix a specific bug, please document the reason in a comment.
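The guard pattern the review wants restored can be illustrated with a minimal plain-Python sketch (paddle's F.softmax and top_p_normalize_probs_paddle are stood in for by hand-rolled log-softmax; the names mirror the diff but the logic is illustrative):

```python
import math

def maybe_top_p_logprobs(logits, top_p_token_mask):
    """Skip the softmax/normalize/log chain entirely when no request
    in the batch needs top_p scaling (the removed `.any()` guard).

    `logits` is a list of rows; `top_p_token_mask` a list of bools.
    """
    if not any(top_p_token_mask):  # guard: nothing needs top_p
        return None  # avoid the full-batch softmax
    out = []
    for row in logits:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([math.log(e / s) for e in exps])  # log-softmax
    return out

assert maybe_top_p_logprobs([[0.0, 0.0]], [False]) is None
```

Restoring the guard trades one cheap mask reduction for skipping three full-vocab tensor ops in the common no-top_p case.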
compute_logprobs_fn: Optional[Callable] = None,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor], Optional[paddle.Tensor]]:
real_bsz: int = 0,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]]:
🟡 Suggestion: the Returns description in the docstring no longer matches the actual return type.
The function signature has been updated to Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]] (2 elements), but the docstring still reads:
Returns:
    tuple: (logprobs_tensors, cu_batch_token_offset, output_logits)
which describes a 3-element tuple. Please update the docstring to match.
batch_token_num = share_inputs["accept_num"][:real_bsz]

# NOTE(huicongyao): temporarily used to provide a max_sized input, remove in the future
num_tokens = real_bsz * (self.num_speculative_tokens + 1)
❓ Question: when real_bsz=0, num_tokens = 0 * (self.num_speculative_tokens + 1) = 0, so paddle.zeros(shape=[0, last_logits.shape[1]], ...) creates an empty tensor.
Please confirm that:
- the CUDA kernel behind build_sampling_params_logprob(..., num_tokens=0) (with token_num_output_cpu=0) performs no out-of-bounds accesses;
- the result of F.log_softmax on an empty padded_logits is handled correctly by the downstream [:real_token_num] slicing.
If the real_bsz=0 case is already guarded upstream by `if token_num_cpu > 0:` (as in mtp.py), consider adding an assertion or a comment at the top of the function to document it.
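A hedged sketch of the edge case the review is asking about (the helper and its names are hypothetical placeholders, not the PR's actual code):

```python
def padded_token_count(real_bsz, num_speculative_tokens):
    """Upper bound on tokens for the padded logprob buffer.

    Returns 0 for an empty batch; callers are expected to skip the
    kernel launch in that case, mirroring the `if token_num_cpu > 0:`
    guard the review mentions for mtp.py.
    """
    assert real_bsz >= 0, "real_bsz must be non-negative"
    return real_bsz * (num_speculative_tokens + 1)

assert padded_token_count(0, 3) == 0  # empty batch -> empty tensors downstream
assert padded_token_count(2, 3) == 8  # 2 requests, 3 draft tokens + 1 target each
```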
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.