
[Cherry-Pick][Speculative Decoding][BugFix] overlap compute logprobs for speculative decoding (#7406)#7585

Open
huicongyao wants to merge 2 commits into PaddlePaddle:release/2.6 from huicongyao:release/2.6

Conversation

@huicongyao
Contributor

@huicongyao huicongyao commented Apr 23, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If there are no unit tests, explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 23, 2026

Thanks for your contribution!


@codecov-commenter

codecov-commenter commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 96.87500% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@c8a59a3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...astdeploy/model_executor/layers/sample/logprobs.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7585   +/-   ##
==============================================
  Coverage               ?   73.66%           
==============================================
  Files                  ?      376           
  Lines                  ?    53405           
  Branches               ?     8351           
==============================================
  Hits                   ?    39340           
  Misses                 ?    11296           
  Partials               ?     2769           
Flag Coverage Δ
GPU 73.66% <96.87%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.


yongqiangma
yongqiangma previously approved these changes Apr 24, 2026
Collaborator

@yongqiangma yongqiangma left a comment


LGTM


@huicongyao
Contributor Author

/re-run run_tests_with_coverage

1 similar comment

…ive decoding (PaddlePaddle#7406)

* fix shape mismatch while cuda graph closed

* fix

* fix xpu typo

* overlap compute logprobs

* fix

* optimize

* fix

* opt

* fix unitest error and optimize code
@huicongyao
Contributor Author

/re-run ci_xpu

2 similar comments


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-04-27 10:29:34

📋 Review Summary

PR overview: adds overlap-compute support for the logprobs computation in speculative decoding, fixes a correctness issue with logprobs on the speculative decoding path, and corrects several function-name typos.

Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/model_executor/layers/sample/, fastdeploy/model_executor/, fastdeploy/spec_decode/mtp.py, fastdeploy/worker/gpu_model_runner.py, fastdeploy/worker/xpu_model_runner.py

Impact tags: [Speculative Decoding] [BugFix] [OP]

📝 PR Convention Check

The PR title follows the Cherry-Pick convention and the tags are complete. However, the Motivation, Modifications, and Usage sections of the PR description template contain only the placeholder text; please fill them in.

Issues

Level | File | Summary
🟡 Suggestion | custom_ops/gpu_ops/speculate_decoding/build_sampling_params_logprob.cu:34 | All threads in each block of the CUDA kernel recompute start_offset, which is redundant work
🟡 Suggestion | fastdeploy/model_executor/layers/sample/sampler.py:905 | With the if top_p_token_mask.any(): guard removed, the full softmax runs even when no request needs top_p, a performance regression
🟡 Suggestion | fastdeploy/model_executor/layers/sample/logprobs.py:148 | The docstring of build_output_logprobs describes a 3-element return value, but the function actually returns 2 elements
❓ Question | fastdeploy/model_executor/layers/sample/sampler.py:870 | When real_bsz=0, num_tokens=0 produces empty tensors; confirm the CUDA kernel and downstream logic handle this safely

Overall Assessment

This PR introduces a new CUDA kernel, BuildSamplingParamLogProb, to parallelize the logprobs computation on the speculative decoding path, and fixes several function-name typos (specualate → speculate); the overall direction is sound. Test coverage is fairly complete, including golden-value checks and edge cases. Please look into the redundant computation in the CUDA kernel, the performance impact of removing the top_p guard, and the safety of the real_bsz=0 edge case.

// Compute start offset: sum of token_num_per_batch[0..bi-1]
int start_offset = 0;
for (int i = 0; i < bi; i++) {
  start_offset += token_num_per_batch[i];
}


🟡 Suggestion: start_offset is recomputed in every block, which is redundant work.

In the current implementation, all 256 threads within each block (one per bi) run the same O(bi) accumulation loop to compute start_offset. This is entirely redundant: either have the tid == 0 thread compute the value and publish it via shared memory for the other threads to read, or precompute the prefix sum on the C++ side and pass it into the kernel.

Suggested shared-memory version to avoid the repeated computation:

__shared__ int32_t s_start_offset;
if (tid == 0) {
  int off = 0;
  for (int i = 0; i < bi; i++) off += token_num_per_batch[i];
  s_start_offset = off;
}
__syncthreads();
int start_offset = s_start_offset;
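The reviewer's second alternative, precomputing the prefix sum on the host before launching the kernel, can be sketched as follows (a minimal illustrative sketch; only `token_num_per_batch` and the exclusive-offset layout come from the snippet above, the function name is hypothetical):

```python
def exclusive_prefix_sum(token_num_per_batch):
    """Host-side exclusive prefix sum: offsets[bi] is the sum of
    token_num_per_batch[0..bi-1], i.e. the start_offset block bi needs."""
    offsets = [0] * len(token_num_per_batch)
    running = 0
    for i, n in enumerate(token_num_per_batch):
        offsets[i] = running
        running += n
    return offsets
```

Each block bi would then read offsets[bi] directly instead of looping, turning O(bi) work per thread into an O(1) lookup.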

probs = top_p_normalize_probs_paddle(probs, real_token_top_p)
top_p_logprob = paddle.log(probs)

probs = F.softmax(padded_logits, axis=-1)


🟡 Suggestion: the if top_p_token_mask.any(): guard was removed, so even when no request needs top_p scaling, three compute-heavy operations (F.softmax, top_p_normalize_probs_paddle, and paddle.log) now run unconditionally.

The original code was protected by if top_p_token_mask.any(): and performed better. If the guard was removed deliberately to fix a specific bug, please explain why in a comment.
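The guard pattern in question looks roughly like this (a sketch with NumPy stand-ins for the Paddle ops; the real code uses F.softmax, top_p_normalize_probs_paddle, and paddle.log on Paddle tensors, and the function name here is hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax, standing in for F.softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compute_top_p_logprob(logits, top_p_token_mask):
    """Only pay for softmax + log when at least one request uses top_p."""
    if top_p_token_mask.any():
        probs = softmax(logits)
        # ... top_p renormalization of probs would go here ...
        return np.log(probs)
    return None  # no top_p request: skip the heavy ops entirely
```

The any() check is cheap relative to a full-vocab softmax, so short-circuiting here is the performance win the reviewer is pointing at.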

compute_logprobs_fn: Optional[Callable] = None,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor], Optional[paddle.Tensor]]:
real_bsz: int = 0,
) -> Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]]:

🟡 Suggestion: the Returns description in the docstring does not match the function's actual return type.

The function signature has been updated to Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]] (2 elements), but the docstring still reads:

Returns:
    tuple: (logprobs_tensors, cu_batch_token_offset, output_logits)

which describes a 3-element tuple. Please update the docstring to match.
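A corrected Returns section consistent with the 2-element signature might read as follows (an illustrative stub only; the element names are taken from the quoted docstring and signature, not from the actual source):

```python
def build_output_logprobs(*args, **kwargs):
    """Illustrative stub showing the corrected docstring only.

    Returns:
        tuple: (logprobs_tensors, cu_batch_token_offset) -- a 2-element
        tuple matching the updated signature
        Tuple[Optional[LogprobsTensors], Optional[paddle.Tensor]].
    """
    raise NotImplementedError  # stub: only the docstring matters here
```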

batch_token_num = share_inputs["accept_num"][:real_bsz]

# NOTE(huicongyao): temporarily used to provide a max_sized input, remove in the future
num_tokens = real_bsz * (self.num_speculative_tokens + 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问real_bsz=0 时,num_tokens = 0 * (self.num_speculative_tokens + 1) = 0,后续 paddle.zeros(shape=[0, last_logits.shape[1]], ...) 将创建空张量。

请确认:

  1. build_sampling_params_logprob(..., num_tokens=0) 的 CUDA kernel(token_num_output_cpu=0)不会产生越界访问;
  2. F.log_softmax(padded_logits=空张量) 的返回结果在后续 [:real_token_num] 切片逻辑中是否正确处理。

如果 real_bsz=0 的情况已在上游通过 if token_num_cpu > 0: 保护(如 mtp.py),建议在函数头部加断言或注释说明。
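For point 2, zero-row arrays generally flow through elementwise ops, reductions over the non-empty axis, and slicing without error; a quick shape check with NumPy stand-ins (illustrative only, NumPy rather than Paddle, and the variable names mirror the snippet above):

```python
import numpy as np

vocab_size = 8   # stand-in for last_logits.shape[1]
num_tokens = 0   # the real_bsz = 0 edge case

# Equivalent of paddle.zeros(shape=[0, last_logits.shape[1]], ...)
padded_logits = np.zeros((num_tokens, vocab_size))

# A log-softmax over the vocab axis of a zero-row array is itself
# zero-row: the reduction runs over the (non-empty) last axis, so
# every intermediate keeps the empty leading dimension.
log_probs = padded_logits - np.log(
    np.exp(padded_logits).sum(axis=-1, keepdims=True)
)

# Slicing with [:real_token_num] on an empty array stays empty
# and raises no error.
real_token_num = 0
sliced = log_probs[:real_token_num]
```

This only demonstrates the array semantics; the CUDA-kernel side of the question (point 1) still needs to be verified against the kernel's launch configuration.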
