[Cherry-Pick][Speculative Decoding][BugFix] overlap compute logprobs for speculative decoding (#7406)#7585
huicongyao wants to merge 2 commits into PaddlePaddle:release/2.6 from
Conversation
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@           Coverage Diff            @@
##       release/2.6    #7585   +/- ##
==============================================
  Coverage        ?   73.66%
  Files           ?      376
  Lines           ?    53405
  Branches        ?     8351
==============================================
  Hits            ?    39340
  Misses          ?    11296
  Partials        ?     2769

Flags with carried forward coverage won't be shown.
/re-run run_tests_with_coverage
1 similar comment
/re-run run_tests_with_coverage
…ive decoding (PaddlePaddle#7406)
* fix shape mismatch while cuda graph closed
* fix
* fix xpu typo
* overlap compute logprobs
* fix
* optimize
* fix
* opt
* fix unitest error and optimize code
/re-run ci_xpu
2 similar comments
/re-run ci_xpu
/re-run ci_xpu
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-04-27 10:29:34
📋 Review Summary
PR overview: adds overlap-compute support for logprobs computation in speculative decoding, fixes the correctness of logprobs on the speculative-decoding path, and corrects several misspelled function names.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/model_executor/layers/sample/, fastdeploy/model_executor/, fastdeploy/spec_decode/mtp.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [Speculative Decoding] [BugFix] [OP]
📝 PR Convention Check
The PR title follows the Cherry-Pick convention and the tags are complete. However, the Motivation, Modifications, and Usage sections of the PR description template contain only placeholder text; please fill them in.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu:34 | Every thread in each block of the CUDA kernel recomputes start_offset, which is redundant work |
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/sampler.py:905 | With the `if top_p_token_mask.any():` guard removed, the full softmax runs even when no request needs top_p, a performance regression |
| 🟡 Suggestion | fastdeploy/model_executor/layers/sample/logprobs.py:148 | The docstring of build_output_logprobs describes a 3-element return value, but the actual return type has 2 elements |
| ❓ Question | fastdeploy/model_executor/layers/sample/sampler.py:870 | When real_bsz=0, num_tokens=0 produces empty tensors; the safety of the CUDA kernel and downstream logic needs confirmation |
Overall Assessment
This PR introduces a new CUDA kernel, BuildSamplingParamLogProb, to parallelize logprobs computation on the speculative-decoding path, and fixes several misspelled function names (specualate → speculate); the overall direction is sound. Test coverage is fairly thorough, including golden-value checks and edge cases. Please look into the redundant computation in the CUDA kernel, the performance impact of removing the top_p guard, and the safety of the real_bsz=0 edge case.
// Compute start offset: sum of token_num_per_batch[0..bi-1]
int start_offset = 0;
for (int i = 0; i < bi; i++) {
  start_offset += token_num_per_batch[i];
🟡 Suggestion: start_offset is recomputed in every block, which is redundant work.
In the current implementation, all 256 threads within each block (one block per bi) independently run the same O(bi) accumulation loop to compute start_offset. This is entirely redundant: let the tid==0 thread compute the value and write it to shared memory for the other threads to read, or precompute the prefix sum on the C++ side and pass it into the kernel.
Suggested change using shared memory to avoid the repeated work:
__shared__ int32_t s_start_offset;
if (tid == 0) {
  int off = 0;
  for (int i = 0; i < bi; i++) off += token_num_per_batch[i];
  s_start_offset = off;
}
__syncthreads();
int start_offset = s_start_offset;

probs = top_p_normalize_probs_paddle(probs, real_token_top_p)
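The reviewer's second alternative, precomputing the prefix sum once on the host and passing it into the kernel, can be sketched in plain Python (the name `token_num_per_batch` follows the diff; the helper itself is hypothetical, for illustration only):

```python
from itertools import accumulate

def batch_start_offsets(token_num_per_batch):
    """Exclusive prefix sum: the start offset of each batch's tokens.

    Computing this once on the host replaces the O(bi) loop that every
    thread in every block currently repeats inside the kernel; each
    block bi would then read offsets[bi] directly.
    """
    return [0] + list(accumulate(token_num_per_batch))[:-1]

print(batch_start_offsets([3, 1, 4]))  # [0, 3, 4]
```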
top_p_logprob = paddle.log(probs)

probs = F.softmax(padded_logits, axis=-1)
🟡 Suggestion: with the `if top_p_token_mask.any():` guard removed, the three compute-heavy operations F.softmax, top_p_normalize_probs_paddle, and paddle.log now run unconditionally, even when no request in the batch needs top_p scaling.
The original code had the `if top_p_token_mask.any():` guard and performed better. If the guard was removed deliberately to fix a specific bug, please document the reason in a comment.
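The guard pattern the review wants restored can be illustrated with a minimal plain-Python sketch (paddle's F.softmax and top_p_normalize_probs_paddle are stood in for by hand-rolled log-softmax; the names mirror the diff but the logic is illustrative):

```python
import math

def maybe_top_p_logprobs(logits, top_p_token_mask):
    """Skip the softmax/normalize/log chain entirely when no request
    in the batch needs top_p scaling (the removed `.any()` guard).

    `logits` is a list of rows; `top_p_token_mask` a list of bools.
    """
    if not any(top_p_token_mask):  # guard: nothing needs top_p
        return None  # avoid the full-batch softmax
    out = []
    for row in logits:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([math.log(e / s) for e in exps])  # log-softmax
    return out

assert maybe_top_p_logprobs([[0.0, 0.0]], [False]) is None
```

Restoring the guard trades one cheap mask reduction for skipping three full-vocab tensor ops in the common no-top_p case.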
compute_logprobs_fn: Optional[Callable] = None,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor], Optional[paddle.Tensor]]:
real_bsz: int = 0,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]]:
🟡 Suggestion: the Returns description in the docstring no longer matches the actual return type.
The function signature has been updated to Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]] (2 elements), but the docstring still reads:
Returns:
    tuple: (logprobs_tensors, cu_batch_token_offset, output_logits)
which describes a 3-element tuple. Please update the docstring to match.
batch_token_num = share_inputs["accept_num"][:real_bsz]

# NOTE(huicongyao): temporarily used to provide a max_sized input, remove in the future
num_tokens = real_bsz * (self.num_speculative_tokens + 1)
❓ Question: when real_bsz=0, num_tokens = 0 * (self.num_speculative_tokens + 1) = 0, so paddle.zeros(shape=[0, last_logits.shape[1]], ...) creates an empty tensor.
Please confirm that:
- the CUDA kernel behind build_sampling_params_logprob(..., num_tokens=0) (with token_num_output_cpu=0) performs no out-of-bounds accesses;
- the result of F.log_softmax on an empty padded_logits is handled correctly by the downstream [:real_token_num] slicing.
If the real_bsz=0 case is already guarded upstream by `if token_num_cpu > 0:` (as in mtp.py), consider adding an assertion or a comment at the top of the function to document it.
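A hedged sketch of the edge case the review is asking about (the helper and its names are hypothetical placeholders, not the PR's actual code):

```python
def padded_token_count(real_bsz, num_speculative_tokens):
    """Upper bound on tokens for the padded logprob buffer.

    Returns 0 for an empty batch; callers are expected to skip the
    kernel launch in that case, mirroring the `if token_num_cpu > 0:`
    guard the review mentions for mtp.py.
    """
    assert real_bsz >= 0, "real_bsz must be non-negative"
    return real_bsz * (num_speculative_tokens + 1)

assert padded_token_count(0, 3) == 0  # empty batch -> empty tensors downstream
assert padded_token_count(2, 3) == 8  # 2 requests, 3 draft tokens + 1 target each
```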
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.