[PD] Refine metrics and trace for pd by juncaipeng · Pull Request #7613 · PaddlePaddle/FastDeploy

juncaipeng · 2026-04-24T09:33:26Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

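On the trace side, add several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token reception, etc.) and add the corresponding event-to-stage mappings.
On the metrics side, add and adjust the request-queue Gauges (introducing a "queuing" concept) and update the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing flow, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of first-token reception failures, etc.).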

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-24T09:33:33Z

Thanks for your contribution!

Copilot

Pull request overview

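This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace instrumentation and Prometheus metrics so that the latency of each Prefill/Decode stage and the queue and resource state can be observed more precisely.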

Changes:

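On the trace side, add several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token reception, etc.) and add the corresponding event-to-stage mappings.
On the metrics side, add and adjust the request-queue Gauges (introducing a "queuing" concept) and update the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing flow, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of first-token reception failures, etc.).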

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

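File	Description
fastdeploy/trace/constants.py	Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP
fastdeploy/output/token_processor.py	Prefill cache transfer trace; adjusts the trace and metric recording logic for the completion stage
fastdeploy/metrics/metrics.py	Adds the queuing Gauge, declares PD-related metrics, and adjusts some metric descriptions
fastdeploy/engine/sched/resource_manager_v1.py	The v1 resource manager updates metrics by running/waiting/queuing category
fastdeploy/engine/common_engine.py	Trace for Prefill requesting Decode resources; trace and PD metric counts for Decode preallocation and prefilled-request processing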

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-24 17:53:50

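📋 Review Summary

PR overview: Adds trace events and metrics for PD (Prefill-Decode) disaggregated mode, including request-queue Gauges, a preallocation count, and counters for reschedules and first-token reception failures.
Scope of changes: engine/common_engine.py, engine/sched/resource_manager_v1.py, metrics/metrics.py, output/token_processor.py, trace/constants.py
Impact tags: [Engine] [PD Disaggregation]

📝 PR Convention Check

The PR title tag [PD] is not in the official tag list; use [PD Disaggregation] instead.

Suggested title (can be copied as-is):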

[PD Disaggregation] Refine metrics and trace for pd

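Issues

Severity	File	Summary
🔴 Bug	`trace/constants.py:50`	Enum name `DECODE_PROCESS_PREALLOCAT_REQUEST_END` is misspelled (missing the letter `E`), so START/END cannot be paired
🔴 Bug	`trace/constants.py:100`	The mapping table references the misspelled enum name and must be fixed together with it
🔴 Bug	`engine/common_engine.py:2191`	`decode_preallocated_req_num.dec()` runs unconditionally, but `inc()` runs only in the V1 branch; the counts are asymmetric and the Gauge stays negative on the non-V1 path
🟡 Suggestion	`engine/common_engine.py:2158`	References the misspelled enum name; keep it consistent with `constants.py` once that is fixed
❓ Question	`output/token_processor.py:1096`	`_record_completion_metrics` is now also called for the prefill role; will `request_success_total` be double-counted?

Overall Assessment

The new trace points and metrics follow a clear design and noticeably improve observability for PD disaggregation. However, there is one misspelled enum name (which affects trace pairing analysis) and one logic bug where the Gauge counts are asymmetric. Both should be fixed before merging.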

PaddlePaddle-bot · 2026-04-24T09:56:18Z

+    CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
+    CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
+    PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"
+


🔴 Bug Misspelled enum name: DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E and should be DECODE_PROCESS_PREALLOCATE_REQUEST_END.

The name is inconsistent with the corresponding DECODE_PROCESS_PREALLOCATE_REQUEST_START (line 49), so trace tools that rely on START/END pairing for latency analysis cannot match the two events.

Suggested fix:

DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"

The reference in common_engine.py must be updated at the same time:

trace_print(
    LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_END,  # spelling fixed
    ...
)

PaddlePaddle-bot · 2026-04-24T09:56:18Z

+    LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_START: StageName.DECODE,
+    LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END: StageName.DECODE,
+    LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_START: StageName.DECODE,
+    LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_END: StageName.DECODE,


🔴 Bug The mapping table references the misspelled enum name LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END; it should match the corrected enum name.
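
A naming slip like this one can be caught mechanically. The following is a minimal sketch, not code from the PR: the `LoggingEventName` class below is a stand-in holding only the four members quoted in the diff, the string value of the misspelled member is assumed to mirror its name, and `find_unpaired_events` is a hypothetical helper.

```python
from enum import Enum


class LoggingEventName(str, Enum):
    """Stand-in enum with only the four members quoted in the diff (values are assumed)."""

    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    # Misspelled on purpose, reproducing the name flagged by the review.
    DECODE_PROCESS_PREALLOCAT_REQUEST_END = "DECODE_PROCESS_PREALLOCAT_REQUEST_END"
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"


def find_unpaired_events(enum_cls) -> list:
    """Return member names ending in _START or _END whose counterpart is missing."""
    names = {member.name for member in enum_cls}
    unpaired = []
    for name in sorted(names):
        if name.endswith("_START"):
            partner = name[: -len("_START")] + "_END"
        elif name.endswith("_END"):
            partner = name[: -len("_END")] + "_START"
        else:
            continue  # events without a START/END suffix are not checked
        if partner not in names:
            unpaired.append(name)
    return unpaired


print(find_unpaired_events(LoggingEventName))
```

In this stand-in, both the correctly spelled START member and the misspelled END member are reported, while the PREFILLED pair passes. A real enum may contain intentionally unpaired events (the diff shows PREFILL_INFERENCE_END, for example), so such a check would likely need an allow-list.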

PaddlePaddle-bot · 2026-04-24T09:56:18Z

                        break

            for idx in sorted(processed_indices, reverse=True):
+                trace_print(


🟡 Suggestion This code references the misspelled LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END (missing the letter E); keep it consistent with constants.py once that is fixed.

PaddlePaddle-bot · 2026-04-24T09:56:18Z

            else:
                for req_output in ready_request_outputs:
                    request_id = req_output.request_id
+                    main_process_metrics.decode_preallocated_req_num.dec()


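🔴 Bug The dec() and inc() calls on decode_preallocated_req_num are asymmetric:

inc() runs only in the ENABLE_V1_KVCACHE_SCHEDULER branch (_process_allocate_resource_requests, line 2135)

dec() runs unconditionally here (including on the non-V1 path)

When ENABLE_V1_KVCACHE_SCHEDULER=False, inc() is never called but dec() still runs, so the count stays negative and the Gauge loses its meaning.

Suggestion: run dec() inside the same V1 branch condition, or confirm what "preallocated" means on the non-V1 path.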
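
The asymmetry can be reproduced in isolation. This is a minimal sketch under stated assumptions: `Gauge` is a hypothetical stand-in rather than the project's metrics class, and `simulate` with its `guard_dec` flag is invented here to contrast the reviewed behavior with the suggested fix.

```python
class Gauge:
    """Minimal stand-in for a Prometheus-style gauge (hypothetical, not the real metrics class)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self) -> None:
        self.value += 1

    def dec(self) -> None:
        self.value -= 1


def simulate(num_requests: int, use_v1_scheduler: bool, guard_dec: bool) -> int:
    """Return the gauge value after all requests are preallocated and then processed."""
    gauge = Gauge()

    # Preallocation step: inc() runs only on the V1 scheduler branch, as the review describes.
    for _ in range(num_requests):
        if use_v1_scheduler:
            gauge.inc()

    # Prefilled-request step: dec() is either unconditional (as reviewed) or guarded (suggested fix).
    for _ in range(num_requests):
        if use_v1_scheduler or not guard_dec:
            gauge.dec()

    return gauge.value


print(simulate(3, use_v1_scheduler=False, guard_dec=False))  # -3: unconditional dec() without inc()
print(simulate(3, use_v1_scheduler=False, guard_dec=True))   # 0: dec() skipped on the non-V1 path
print(simulate(3, use_v1_scheduler=True, guard_dec=True))    # 0: inc() and dec() balance
```

Guarding dec() with the same condition as inc() is only one option; if the non-V1 path also preallocates requests, adding the matching inc() there would be the alternative.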

PaddlePaddle-bot · 2026-04-24T09:56:18Z

-            main_process_metrics.request_decode_time.observe(decode_time)
-        trace_print(LoggingEventName.INFERENCE_END, task.request_id, getattr(task, "user", ""))
+
+        if role in ("mixed", "decode"):


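❓ Question _record_completion_metrics was previously called only for non-prefill tasks (if not is_prefill); it is now also called for the prefill role.

It records statistics such as main_process_metrics.request_success_total.inc() and main_process_metrics.request_inference_time.observe(...). In PD disaggregated mode, should a task finishing on the Prefill instance count toward request_success_total, or does only completion on the Decode instance mean the whole request succeeded?

If Prefill completion does not equal request completion, this will double-count request_success_total.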
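
The concern can be illustrated with a small sketch. It assumes that success counts from Prefill and Decode instances are summed when viewed together, which the source does not confirm; `record_completion`, `total_successes`, and the `gate_by_role` flag are hypothetical names, and only the role tuple is taken from the `if role in ("mixed", "decode")` line in the diff.

```python
from collections import Counter

SUCCESS_ROLES = ("mixed", "decode")  # roles taken from the `if role in (...)` line in the diff


def record_completion(metrics: Counter, role: str, gate_by_role: bool) -> None:
    """Count a finished task; optionally skip the success counter for prefill instances."""
    if not gate_by_role or role in SUCCESS_ROLES:
        metrics["request_success_total"] += 1


def total_successes(gate_by_role: bool) -> int:
    """One user request in PD mode finishes once on a prefill and once on a decode instance."""
    metrics = Counter()
    for role in ("prefill", "decode"):
        record_completion(metrics, role, gate_by_role)
    return metrics["request_success_total"]


print(total_successes(gate_by_role=False))  # 2: the same request is counted on both instances
print(total_successes(gate_by_role=True))   # 1: only the decode-side completion is counted
```

Whether gating is the right answer depends on how the metric is meant to be read; keeping per-role counts under separate labels would be another way to avoid the ambiguity.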

codecov-commenter · 2026-04-24T11:35:03Z

Codecov Report

❌ Patch coverage is 91.66667% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ee81b57). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/common_engine.py	69.23%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7613   +/-   ##
==========================================
  Coverage           ?   71.68%           
==========================================
  Files              ?      419           
  Lines              ?    57889           
  Branches           ?     9078           
==========================================
  Hits               ?    41497           
  Misses             ?    13566           
  Partials           ?     2826

Flag	Coverage Δ
GPU	`71.68% <91.66%> (?)`

Copilot AI review requested due to automatic review settings April 24, 2026 09:33

juncaipeng had a problem deploying to Metax_ci April 24, 2026 09:33 — with GitHub Actions Failure

Copilot started reviewing on behalf of juncaipeng April 24, 2026 09:33 View session

Copilot AI reviewed Apr 24, 2026

Jiang-Jia-Jun previously approved these changes Apr 24, 2026

Refine metrics and trace for pd

6c1cd61

juncaipeng dismissed Jiang-Jia-Jun’s stale review via 6c1cd61 April 24, 2026 09:47

juncaipeng force-pushed the pd_metrics_1 branch from f744868 to 6c1cd61 Compare April 24, 2026 09:47

juncaipeng had a problem deploying to Metax_ci April 24, 2026 09:47 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Apr 24, 2026

juncaipeng wants to merge 1 commit into PaddlePaddle:develop from juncaipeng:pd_metrics_1

5 participants