[PD] Refine metrics and trace for pd by juncaipeng · Pull Request #7613 · PaddlePaddle/FastDeploy

juncaipeng · 2026-04-24T09:33:26Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

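On the trace side, add several PD-related events (Prefill requesting Decode resources, the cache transfer check, Decode-side preallocation and first-token receipt, etc.) and extend the event-to-stage mapping.
On the metrics side, add and adjust the request-queue Gauges (introducing the notion of queuing) and update the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing paths, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of failed first-token receipts, etc.).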

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-24T09:33:33Z

Thanks for your contribution!

Copilot

Pull request overview

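This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace instrumentation and Prometheus metrics so that the latency of each Prefill/Decode stage and the queue/resource state can be observed more precisely.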

Changes:

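On the trace side, add several PD-related events (Prefill requesting Decode resources, the cache transfer check, Decode-side preallocation and first-token receipt, etc.) and extend the event-to-stage mapping.
On the metrics side, add and adjust the request-queue Gauges (introducing the notion of queuing) and update the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing paths, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of failed first-token receipts, etc.).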

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

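File	Description
fastdeploy/trace/constants.py	Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP
fastdeploy/output/token_processor.py	Prefill cache transfer trace; adjusts the trace/metrics recording logic for the completion stage
fastdeploy/metrics/metrics.py	Adds the queuing Gauge, declares PD-related metrics, and adjusts some metric descriptions
fastdeploy/engine/sched/resource_manager_v1.py	The v1 resource manager updates metrics by running/waiting/queuing
fastdeploy/engine/common_engine.py	Trace for Prefill requesting Decode resources; trace for Decode preallocation and handling of prefilled requests, plus PD metric counts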
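
The running/waiting/queuing split described for resource_manager_v1.py can be pictured with a minimal sketch. Everything below is hypothetical: the `Gauge` class is a stand-in rather than the real Prometheus client, `QueueSnapshot` and its fields are invented for illustration, and the actual `update_metrics` in FastDeploy may be structured differently. Only the three gauge names come from the PR discussion.

```python
from dataclasses import dataclass, field


class Gauge:
    """Tiny stand-in for a Prometheus gauge (hypothetical, not the real client)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0

    def set(self, value: int) -> None:
        self.value = value


@dataclass
class QueueSnapshot:
    """Hypothetical view of the three request queues at one point in time."""

    running: list = field(default_factory=list)
    waiting: list = field(default_factory=list)
    queuing: list = field(default_factory=list)


num_requests_running = Gauge("num_requests_running")
num_requests_waiting = Gauge("num_requests_waiting")
num_requests_queuing = Gauge("num_requests_queuing")


def update_metrics(snapshot: QueueSnapshot) -> None:
    # Each gauge is overwritten with the current queue length.
    num_requests_running.set(len(snapshot.running))
    num_requests_waiting.set(len(snapshot.waiting))
    num_requests_queuing.set(len(snapshot.queuing))


snapshot = QueueSnapshot(running=["r1", "r2"], waiting=["r3"], queuing=["r4", "r5", "r6"])
update_metrics(snapshot)
print(num_requests_running.value, num_requests_waiting.value, num_requests_queuing.value)
```

Setting a gauge from the queue length, rather than pairing `inc()` and `dec()` calls across code paths, is one way to keep the value from drifting when a path forgets its decrement.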

codecov-commenter · 2026-04-24T11:35:03Z

Codecov Report

❌ Patch coverage is 91.30435% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ee81b57). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/common_engine.py	76.92%	3 Missing ⚠️
fastdeploy/output/token_processor.py	92.30%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7613   +/-   ##
==========================================
  Coverage           ?   71.57%           
==========================================
  Files              ?      396           
  Lines              ?    55448           
  Branches           ?     8675           
==========================================
  Hits               ?    39688           
  Misses             ?    13022           
  Partials           ?     2738

Flag	Coverage Δ
GPU	`71.57% <91.30%> (?)`

PaddlePaddle-bot · 2026-04-28T09:25:23Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-28 22:18:27

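This CI report is generated from the following code (updated every 30 minutes):

PR commit: 6208342
Merge base: ee81b57 (branch: develop)
View full diff
CI details

1 Task overview

1 required task has failed, 2 required tasks are running, and 5 required tasks are waiting; CI has not finished.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Waiting	Skipped
41(0)	41	28	2	3	7	1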

2 Task status summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures should be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	8s	PR process: the PR modifies logging behavior and needs approval from a designated RD	Ask xyxinyang or zyyzghb to approve the PR	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`xpu_8cards_case_test / run_xpu_8cards_cases`	-	Running	-	Job	-
⏸️	`Extracted partial CE model tasks / run_ce_cases`	-	Waiting	-	-	-
⏸️	`Run Base Tests / base_tests`	-	Waiting	-	-	-
⏸️	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Waiting	-	-	-
⏸️	`Run Four Cards Tests / run_4_cards_tests`	-	Waiting	-	-	-
⏸️	`Run Stable Tests / stable_tests`	-	Waiting	-	-	-
✅	The other 2 required tasks passed	-	-	-	-	-

2.2 Optional tasks: 26/31 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Check PR Template`	15s	Job	-
⏳	`Trigger Jenkins for PR`	-	Job	-
⏸️	`CI_HPU`	-	-	-
⏸️	`Run iluvatar Tests / run_iluvatar_cases`	-	-	-
✅	The other 26 optional tasks passed	-	-	-

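3 Failure details (required only)

Approval: PR process (confidence: high)

Approval

Status: ❌ Failed
Error type: PR process
Confidence: high
Root cause summary: the PR modifies logging behavior and needs approval from a designated RD before it can merge
Analyzer: generic analysis (fallback)

Root cause details:
The PR diff adds an llm_logger.info( call, which triggers FastDeploy's approval flow for logging changes. The check_approval.sh script requires any PR that modifies logging behavior (.info/.debug/.error/log_request) to be approved by a designated RD (xyxinyang or zyyzghb). The current state is "1 approval missing", and the script exits with code 6.

Key log:

Detected log modification in diff:
+            llm_logger.info(
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.

Suggested fix:

Ask xyxinyang(zhouchong) or zyyzghb(zhangyongyue) to review and approve this PR

Suggested fix summary: ask xyxinyang or zyyzghb to approve the PR

Link: View log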
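
To illustrate the kind of check described above, here is a minimal Python sketch that scans a unified diff for added lines that look like logging calls. It is not the real check_approval.sh (a shell script whose contents are not shown in this thread); the regular expression, the function name `added_log_lines`, and the sample diff with its file path are assumptions built only from the patterns named in the CI message.

```python
import re

# Patterns taken from the CI message: .info/.debug/.error/log_request.
LOG_CALL = re.compile(r"\.(info|debug|error)\(|log_request\(")


def added_log_lines(diff_text: str) -> list:
    """Return added diff lines that look like logging calls."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; a "+++" prefix marks a file header.
        if line.startswith("+") and not line.startswith("+++"):
            if LOG_CALL.search(line):
                hits.append(line)
    return hits


# Hypothetical diff fragment; only the llm_logger.info( line comes from the CI log.
sample_diff = """\
+++ b/example.py
+            llm_logger.info(
+            total = a + b
-            llm_logger.debug("old")
"""

hits = added_log_lines(sample_diff)
print(hits)
```

Only the added `llm_logger.info(` line is reported; the removed `llm_logger.debug` line and the file header are ignored, which matches the idea that the approval is required for newly introduced logging behavior.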

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-04-28 22:01:43

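📋 Review summary

PR overview: adds trace event instrumentation and metrics collection for the PD disaggregation scenario
Scope of changes: fastdeploy/trace/, fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/output/, docs/
Affected tags: [PD Disaggregation] [Engine] [DataProcessor] [Docs]

📝 PR convention check

The title tag [PD] is not an official tag and should be changed to [PD Disaggregation]. The PR description has no content under ## Motivation, the ## Usage or Command and ## Accuracy Tests sections are empty, and no Checklist item is ticked.

Suggested title (can be copied directly):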

[PD Disaggregation] Refine metrics and trace for PD disaggregation
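
The tag check behind this suggestion can be sketched as follows. This is a hypothetical helper, not the bot's actual implementation: `leading_tags` and `unofficial_tags` are invented names, and the tag set is copied from the checklist in the PR template. Because the template also allows new tags with clear semantics, an "unofficial" result is a prompt for review rather than a hard error.

```python
import re

# Tag list copied from the PR template checklist.
OFFICIAL_TAGS = {
    "FDConfig", "APIServer", "Engine", "Scheduler", "PD Disaggregation",
    "Executor", "Graph Optimization", "Speculative Decoding", "RL", "Models",
    "Quantization", "Loader", "OP", "KVCache", "DataProcessor", "BugFix",
    "Docs", "CI", "Optimization", "Feature", "Benchmark", "Others",
    "XPU", "HPU", "GCU", "DCU", "Iluvatar", "Metax",
}


def leading_tags(title: str) -> list:
    """Return the bracketed tags at the start of a PR title, in order."""
    tags = []
    rest = title.lstrip()
    while True:
        match = re.match(r"\[([^\]]+)\]\s*", rest)
        if match is None:
            break
        tags.append(match.group(1))
        rest = rest[match.end():]
    return tags


def unofficial_tags(title: str) -> list:
    """Return leading tags that are not in the template's tag list."""
    return [tag for tag in leading_tags(title) if tag not in OFFICIAL_TAGS]


print(unofficial_tags("[PD] Refine metrics and trace for pd"))
print(unofficial_tags("[PD Disaggregation] Refine metrics and trace for PD disaggregation"))
```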

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):

## Motivation
Add observability support for the PD disaggregation (Prefill-Decode Disaggregation) scenario, making it easier to locate request-level latency bottlenecks.

## Modifications
- **trace/constants.py**: add Prefill-instance events (`ASK_DECODE_RESOURCE_START/END`, `CHECK_CACHE_TRANSFER_START/END`, `PREFILL_INFERENCE_END`) and Decode-instance events (`DECODE_PROCESS_PREALLOCATE_REQUEST_START/END`, `DECODE_PROCESS_PREFILLED_REQUEST_START/END`, `DECODE_INFERENCE_END`), and extend the `LOGGING_EVENT_TO_STAGE_MAP` mapping.
- **metrics/metrics.py**: add `num_requests_queuing` (number of requests in the local scheduling queue), `decode_preallocated_req_num` (Gauge of preallocated requests on the D side), `reschedule_req_num` (Counter of reschedules), and `failed_recv_first_token_req_num` (Counter of failed first-token receipts); adjust the description of `num_requests_waiting`.
- **engine/sched/resource_manager_v1.py**: `update_metrics` distinguishes the running, waiting, and queuing queues and collects each separately.
- **engine/common_engine.py**: add trace points on the key paths where the P side requests D resources and the D side handles preallocated and prefilled requests, and update the `reschedule_req_num` / `decode_preallocated_req_num` / `failed_recv_first_token_req_num` counts.
- **output/token_processor.py**: `_record_completion_metrics` emits a different inference_end trace for prefill, decode, and mixed according to `splitwise_role`; `_recycle_resources` adds `CHECK_CACHE_TRANSFER_START/END` trace points on the P side.
- **docs/**: update the Chinese and English metrics.md, adding a description of the trace events and a sequence diagram of the PD request lifecycle.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

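Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/trace/constants.py:50`	Misspelled enum member `DECODE_PROCESS_PREALLOCAT_REQUEST_END` (missing the letter E); inconsistent with the documentation
🔴 Bug	`fastdeploy/output/token_processor.py:1099`	`num_requests_running.dec(1)` was removed with no equivalent replacement, so the gauge keeps accumulating

Overall assessment

Adding observability for the PD disaggregation scenario is the right direction, and the trace events cover the key P/D cooperation paths. However, there are two clear bugs, the enum misspelling and the metrics gauge leak, which must be fixed before merging.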

PaddlePaddle-bot · 2026-04-28T14:04:36Z

+    CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
+    CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
+    PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"
+


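🔴 Bug: the enum member is misspelled. DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E; it should be DECODE_PROCESS_PREALLOCATE_REQUEST_END.

The documentation (docs/online_serving/metrics.md) uses the correct spelling, but the code does not match it, so the event name in the stage mapping cannot be matched to the documentation.

Suggested fix:

DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"

The reference in LOGGING_EVENT_TO_STAGE_MAP and the LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END call in common_engine.py must be fixed at the same time.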
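
A small consistency check could catch this class of typo before review. The sketch below is an assumption-laden illustration: the enum lists only the event names quoted in this thread (the real `LoggingEventName` likely has more members and may be declared differently), `TypoEvents` is a demo enum that reproduces the misspelling, and `starts_without_end` is a hypothetical helper that reports every `*_START` member lacking a matching `*_END`.

```python
from enum import Enum


class LoggingEventName(Enum):
    # Event names as quoted in the review; the real enum may contain more.
    ASK_DECODE_RESOURCE_START = "ASK_DECODE_RESOURCE_START"
    ASK_DECODE_RESOURCE_END = "ASK_DECODE_RESOURCE_END"
    CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
    CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
    PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"
    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"
    DECODE_INFERENCE_END = "DECODE_INFERENCE_END"


class TypoEvents(Enum):
    # Demo enum reproducing the misspelling (hypothetical, for the check only).
    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCAT_REQUEST_END = "DECODE_PROCESS_PREALLOCAT_REQUEST_END"


def starts_without_end(enum_cls) -> list:
    """Return names of *_START members that have no matching *_END member."""
    names = {member.name for member in enum_cls}
    missing = []
    for name in sorted(names):
        if name.endswith("_START"):
            expected_end = name[: -len("_START")] + "_END"
            if expected_end not in names:
                missing.append(name)
    return missing


print(starts_without_end(LoggingEventName))
print(starts_without_end(TypoEvents))
```

Only the START-to-END direction is checked, because some events in the list (`PREFILL_INFERENCE_END`, `DECODE_INFERENCE_END`) are quoted here without a corresponding START.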

PaddlePaddle-bot · 2026-04-28T14:04:36Z

+        if role in ("mixed", "decode"):
+            if metrics.engine_recv_first_token_time:
+                decode_time = current_time - metrics.engine_recv_first_token_time
+                main_process_metrics.request_decode_time.observe(decode_time)


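🔴 Bug: main_process_metrics.num_requests_running.dec(1) was removed and no equivalent call was added elsewhere.

_record_completion_metrics used to decrement num_requests_running by 1 when a request completed. With that call removed, the gauge is only ever incremented, so num_requests_running keeps accumulating and the monitoring data becomes badly distorted.

If the intent is that a request finishing under the prefill role should not decrement the gauge (leaving that to the decode side), skip it explicitly on the prefill path and keep the dec call on the mixed/decode path, for example:

if role in ("mixed", "decode"):
    main_process_metrics.num_requests_running.dec(1)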
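
The leak the reviewer describes can be reproduced with a toy model. The `Gauge` class below is a stand-in written for this example, not the Prometheus client, and `simulate` is a hypothetical function that starts a number of requests and optionally applies the role-gated decrement from the suggestion above.

```python
class Gauge:
    """Minimal stand-in for a Prometheus gauge (hypothetical)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount

    def dec(self, amount: int = 1) -> None:
        self.value -= amount


def simulate(num_requests: int, role: str, dec_on_completion: bool) -> int:
    """Run requests to completion and return the final gauge value."""
    num_requests_running = Gauge()
    for _ in range(num_requests):
        num_requests_running.inc(1)  # the request starts running
        # The request completes here; decrement only when enabled
        # and only for the roles named in the reviewer's suggestion.
        if dec_on_completion and role in ("mixed", "decode"):
            num_requests_running.dec(1)
    return num_requests_running.value


leaked = simulate(3, "decode", dec_on_completion=False)
balanced = simulate(3, "decode", dec_on_completion=True)
print(leaked, balanced)
```

With the decrement missing, the gauge ends at the total number of requests ever started even though none is still running; with the gated decrement it returns to zero.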

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Copilot · 2026-04-28T14:06:02Z

        main_process_metrics.request_success_total.inc()
        main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
        main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])


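_record_completion_metrics is now also called when prefill finishes, and it unconditionally calls request_success_total.inc(), request_inference_time.observe(current_time - metrics.inference_start_time), and request_generation_tokens.observe(...). In the PD disaggregation scenario this makes the P instance also report request success, inference time, and generated token count, which is inconsistent with the metric definitions (last token / successfully processed requests) and can cause double counting when Prometheus aggregates across instances. Consider restricting these request-level completion metrics to role in ("mixed", "decode"); if the prefill side needs them, add separate prefill metrics.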

Suggested change

-        main_process_metrics.request_success_total.inc()
-        main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
-        main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
+        if role in ("mixed", "decode"):
+            main_process_metrics.request_success_total.inc()
+            main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
+            main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
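
The double-counting concern can be made concrete with a small simulation. This is a sketch under stated assumptions: each instance's metrics are modeled as a plain `collections.Counter`, and `record_completion` and `total_success` are hypothetical names, not FastDeploy functions. One request passes through a prefill instance and then a decode instance, and the per-instance counters are summed the way a cross-instance aggregation would.

```python
from collections import Counter


def record_completion(counters: Counter, role: str, guard: bool) -> None:
    """Record request-level completion on a single instance."""
    if guard and role not in ("mixed", "decode"):
        # Prefill instance: leave request-level metrics to the decode side.
        return
    counters["request_success_total"] += 1


def total_success(guard: bool) -> int:
    """Sum request_success_total over the P and D instances for one request."""
    prefill_instance = Counter()
    decode_instance = Counter()
    # The same request finishes prefill on P and then decode on D.
    record_completion(prefill_instance, "prefill", guard)
    record_completion(decode_instance, "decode", guard)
    return (
        prefill_instance["request_success_total"]
        + decode_instance["request_success_total"]
    )


unguarded = total_success(guard=False)
guarded = total_success(guard=True)
print(unguarded, guarded)
```

Without the role guard, a single request is counted as two successes after aggregation; with the guard it is counted once.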

EmmonsCurse · 2026-04-29T01:57:57Z

❌ Cherry-pick failed: Conflicts detected when cherry-picking to release/2.6. Please resolve manually.

Copilot AI review requested due to automatic review settings April 24, 2026 09:33

juncaipeng had a problem deploying to Metax_ci April 24, 2026 09:33 — with GitHub Actions Failure

Copilot started reviewing on behalf of juncaipeng April 24, 2026 09:33 View session

Copilot AI reviewed Apr 24, 2026

Comment thread fastdeploy/engine/common_engine.py

Comment thread fastdeploy/metrics/metrics.py

Comment thread fastdeploy/metrics/metrics.py

Comment thread fastdeploy/output/token_processor.py

Comment thread fastdeploy/trace/constants.py Outdated

Jiang-Jia-Jun previously approved these changes Apr 24, 2026

juncaipeng dismissed Jiang-Jia-Jun’s stale review via 6c1cd61 April 24, 2026 09:47

juncaipeng force-pushed the pd_metrics_1 branch from f744868 to 6c1cd61 Compare April 24, 2026 09:47

juncaipeng had a problem deploying to Metax_ci April 24, 2026 09:47 — with GitHub Actions Failure

Refine metrics and trace for pd

6208342

Copilot AI review requested due to automatic review settings April 28, 2026 13:59

juncaipeng force-pushed the pd_metrics_1 branch from 6c1cd61 to 6208342 Compare April 28, 2026 13:59

juncaipeng temporarily deployed to Metax_ci April 28, 2026 13:59 — with GitHub Actions Inactive

Copilot started reviewing on behalf of juncaipeng April 28, 2026 13:59 View session

juncaipeng added the cherry-pick: release/2.6 label Apr 28, 2026

PaddlePaddle-bot suggested changes Apr 28, 2026

Copilot AI reviewed Apr 28, 2026

Jiang-Jia-Jun approved these changes Apr 29, 2026

Jiang-Jia-Jun merged commit 45350ff into PaddlePaddle:develop Apr 29, 2026
39 of 43 checks passed

Jiang-Jia-Jun pushed a commit that referenced this pull request Apr 29, 2026

Refine metrics and trace for pd (#7613) (#7661)

32d5f5b
