
[PD] Refine metrics and trace for pd#7613

Merged
Jiang-Jia-Jun merged 1 commit into PaddlePaddle:develop from juncaipeng:pd_metrics_1
Apr 29, 2026


@juncaipeng juncaipeng commented Apr 24, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

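On the trace side, adds several kinds of PD-related events (Prefill requesting Decode resources, cache-transfer checks, Decode-side preallocation and first-token reception, etc.) and completes the event-to-stage mapping.
On the metrics side, adds and adjusts request-queue Gauges (introducing the notion of queuing) and updates the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing flow, adds PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of failed first-token receptions, etc.).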

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 24, 2026 09:33

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!


Copilot AI left a comment


Pull request overview

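This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace instrumentation and Prometheus metrics so that the latency of each Prefill/Decode stage and the queue and resource state can be observed more precisely.

Changes:

  • On the trace side, adds several kinds of PD-related events (Prefill requesting Decode resources, cache-transfer checks, Decode-side preallocation and first-token reception, etc.) and completes the event-to-stage mapping.
  • On the metrics side, adds and adjusts request-queue Gauges (introducing the notion of queuing) and updates the corresponding collection logic in ResourceManagerV1.
  • In the engine and output-processing flow, adds PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of failed first-token receptions, etc.).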
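The running/waiting/queuing split described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Gauge` class is a tiny stand-in rather than the Prometheus client, `QueueSnapshot` is a hypothetical container, and the meaning given to "waiting" is a guess (the review thread only says that `num_requests_queuing` counts requests in the local scheduling queue). It is not the code in `resource_manager_v1.py`.

```python
from dataclasses import dataclass, field


class Gauge:
    """Tiny stand-in for a Prometheus gauge (hypothetical, illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0

    def set(self, value: int) -> None:
        self.value = value


@dataclass
class QueueSnapshot:
    """Hypothetical view of the three request populations."""

    running: list = field(default_factory=list)  # requests currently executing
    waiting: list = field(default_factory=list)  # assumed: admitted, waiting for resources
    queuing: list = field(default_factory=list)  # still in the local scheduling queue


num_requests_running = Gauge("num_requests_running")
num_requests_waiting = Gauge("num_requests_waiting")
num_requests_queuing = Gauge("num_requests_queuing")


def update_metrics(snapshot: QueueSnapshot) -> None:
    # Setting each gauge from the current queue length (instead of pairing
    # inc/dec calls) re-synchronises the value on every collection pass.
    num_requests_running.set(len(snapshot.running))
    num_requests_waiting.set(len(snapshot.waiting))
    num_requests_queuing.set(len(snapshot.queuing))


snapshot = QueueSnapshot(running=["r1", "r2"], waiting=["r3"], queuing=["r4", "r5", "r6"])
update_metrics(snapshot)
print(num_requests_running.value, num_requests_waiting.value, num_requests_queuing.value)
```

Deriving gauges from queue lengths is one possible design; whether the PR does this or relies on paired increments and decrements is not shown in this thread.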

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

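| File | Description |
|---|---|
| fastdeploy/trace/constants.py | Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP |
| fastdeploy/output/token_processor.py | Prefill cache-transfer trace; adjusts the trace/metric recording logic for the completion stage |
| fastdeploy/metrics/metrics.py | Adds the queuing Gauge and PD-related metric declarations; adjusts some metric descriptions |
| fastdeploy/engine/sched/resource_manager_v1.py | The v1 resource manager updates metrics by the running/waiting/queuing categories |
| fastdeploy/engine/common_engine.py | Trace for Prefill requesting Decode resources; trace and PD metric counts for Decode preallocation and prefilled-request handling |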

Comment thread fastdeploy/engine/common_engine.py
Comment thread fastdeploy/metrics/metrics.py
Comment thread fastdeploy/metrics/metrics.py
Comment thread fastdeploy/output/token_processor.py
Comment thread fastdeploy/trace/constants.py Outdated
Jiang-Jia-Jun previously approved these changes Apr 24, 2026


codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 91.30435% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ee81b57). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/common_engine.py | 76.92% | 3 Missing ⚠️ |
| fastdeploy/output/token_processor.py | 92.30% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7613   +/-   ##
==========================================
  Coverage           ?   71.57%           
==========================================
  Files              ?      396           
  Lines              ?    55448           
  Branches           ?     8675           
==========================================
  Hits               ?    39688           
  Misses             ?    13022           
  Partials           ?     2738           
| Flag | Coverage Δ |
|---|---|
| GPU | 71.57% <91.30%> (?) |



PaddlePaddle-bot commented Apr 28, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-28 22:18:27

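This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

1 required task has failed, 2 required tasks are running, and 5 required tasks are waiting; CI has not finished yet.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 28 | 2 | 3 | 7 | 1 |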

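2 Task status summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR process: the PR modifies logging behavior and needs approval from a designated RD | Ask xyxinyang or zyyzghb to approve the PR | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| ⏸️ | Extracted partial CE model tasks / run_ce_cases | - | Waiting | - | - | - |
| ⏸️ | Run Base Tests / base_tests | - | Waiting | - | - | - |
| ⏸️ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Waiting | - | - | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | Waiting | - | - | - |
| ⏸️ | Run Stable Tests / stable_tests | - | Waiting | - | - | - |
| | The other 2 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 26/31 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Check PR Template | 15s | Job | - |
| | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| | The other 26 optional tasks passed | - | - | - |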

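3 Failure details (required tasks only)

Approval: PR process (confidence: high)

  • Status: ❌ Failed
  • Error type: PR process
  • Confidence: high
  • Root-cause summary: the PR modifies logging behavior and needs approval from a designated RD before it can be merged
  • Analyzer: generic analysis (fallback)

Root-cause details:
The PR's diff adds an llm_logger.info( call, which triggers FastDeploy's approval process for logging changes. The script check_approval.sh requires that any PR modifying logging behavior (.info/.debug/.error/log_request) be approved by a designated RD (xyxinyang or zyyzghb). The current state is "1 approval missing", and the script exits with code 6.

Key log output:

Detected log modification in diff:
+            llm_logger.info(
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.

Suggested fix:

  1. Ask xyxinyang(zhouchong) or zyyzghb(zhangyongyue) to review and approve this PR

Suggested-fix summary: ask xyxinyang or zyyzghb to approve the PR

Link: view log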
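The kind of check described in the root-cause details (flagging added diff lines that touch logging calls) can be approximated as follows. This is only a sketch under stated assumptions: the real check_approval.sh is a shell script whose exact patterns are not shown in this thread, so the regular expression, the function name `added_log_lines`, and the sample diff are all hypothetical.

```python
import re

# Assumed pattern: an added diff line ("+", but not the "+++" file header)
# that contains one of the logging calls named in the CI message.
LOG_CALL = re.compile(r"^\+(?!\+\+).*(?:\.info\(|\.debug\(|\.error\(|log_request\()")


def added_log_lines(diff_text: str) -> list[str]:
    """Return the added lines of a unified diff that contain a logging call."""
    return [line for line in diff_text.splitlines() if LOG_CALL.search(line)]


# Made-up diff, for illustration only.
SAMPLE_DIFF = """\
+++ b/example.py
@@ -1,2 +1,4 @@
 def handle(req):
+            llm_logger.info(
+                "request preallocated"
+            )
-    logger.debug("old message")
"""

hits = added_log_lines(SAMPLE_DIFF)
print(hits)
```

Only the added `llm_logger.info(` line is reported; the removed `logger.debug` line and the `+++` file header are ignored.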


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-04-28 22:01:43

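📋 Review summary

PR overview: adds trace event instrumentation and metrics collection for the PD disaggregation scenario
Scope of changes: fastdeploy/trace/, fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/output/, docs/
Affected tags: [PD Disaggregation] [Engine] [DataProcessor] [Docs]

📝 PR convention check

The title tag [PD] is not an official tag and should be changed to [PD Disaggregation]. The PR description has no content under ## Motivation, the ## Usage or Command and ## Accuracy Tests sections are empty, and no Checklist item is ticked.

Suggested title (can be copied directly):

  • [PD Disaggregation] Refine metrics and trace for PD disaggregation

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):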

## Motivation
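Adds observability support for the PD (Prefill-Decode Disaggregation) scenario, making it easier to locate request-level latency bottlenecks.

## Modifications
- **trace/constants.py**: adds Prefill-instance-specific events (`ASK_DECODE_RESOURCE_START/END`, `CHECK_CACHE_TRANSFER_START/END`, `PREFILL_INFERENCE_END`) and Decode-instance-specific events (`DECODE_PROCESS_PREALLOCATE_REQUEST_START/END`, `DECODE_PROCESS_PREFILLED_REQUEST_START/END`, `DECODE_INFERENCE_END`), and completes the `LOGGING_EVENT_TO_STAGE_MAP` mapping.
- **metrics/metrics.py**: adds `num_requests_queuing` (number of requests in the local scheduling queue), `decode_preallocated_req_num` (Gauge for the number of preallocated requests on the D side), `reschedule_req_num` (Counter of reschedules), and `failed_recv_first_token_req_num` (Counter of failed first-token receptions); adjusts the description of `num_requests_waiting`.
- **engine/sched/resource_manager_v1.py**: `update_metrics` distinguishes the running/waiting/queuing queues and collects each separately.
- **engine/common_engine.py**: adds trace points on the critical paths where the P side requests D resources and the D side handles preallocated/prefilled requests, and updates the `reschedule_req_num` / `decode_preallocated_req_num` / `failed_recv_first_token_req_num` counts.
- **output/token_processor.py**: `_record_completion_metrics` emits a different inference_end trace for prefill/decode/mixed according to `splitwise_role`; `_recycle_resources` adds `CHECK_CACHE_TRANSFER_START/END` trace points on the P side.
- **docs/**: updates the Chinese and English metrics.md, adding Trace event descriptions and a sequence diagram of the PD request lifecycle.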

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

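Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/trace/constants.py:50 | The enum member DECODE_PROCESS_PREALLOCAT_REQUEST_END is misspelled (missing the letter E) and does not match the documentation |
| 🔴 Bug | fastdeploy/output/token_processor.py:1099 | num_requests_running.dec(1) was removed with no equivalent replacement, so the gauge keeps accumulating |

Overall assessment

Adding observability for the PD disaggregation scenario is the right direction, and the trace events cover the key P/D cooperation paths. However, there are two clear bugs, an enum typo and a leaking metrics gauge, which must be fixed before merging.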

CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"


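🔴 Bug: enum member typo. DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E; the correct name is DECODE_PROCESS_PREALLOCATE_REQUEST_END.

The documentation (docs/online_serving/metrics.md) records the correct spelling, but the code does not match it, so event names in the stage mapping cannot be matched against the documentation.

Suggested fix:

DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"

The reference in LOGGING_EVENT_TO_STAGE_MAP and the LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END call in common_engine.py must be fixed at the same time.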
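A typo of this kind could be caught by a small consistency check over the enum and the stage map. The sketch below is an assumption-laden illustration, not the project's test code: the enum holds only a subset of the event names quoted in this review, the stage labels are placeholders, and `find_problems` and its `unpaired_ok` parameter are hypothetical helpers. Events that legitimately have no START/END partner (for example `PREFILL_INFERENCE_END`) would need to be listed in `unpaired_ok`.

```python
from enum import Enum


class LoggingEventName(Enum):
    # Subset of the event names quoted in this review; the real enum is larger.
    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"
    CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
    CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"


class TypoEvents(Enum):
    # Reproduces the misspelling reported above.
    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCAT_REQUEST_END = "DECODE_PROCESS_PREALLOCAT_REQUEST_END"


def find_problems(enum_cls, stage_map, unpaired_ok=frozenset()):
    """List unpaired START/END events and events missing from the stage map."""
    names = {member.name for member in enum_cls}
    problems = []
    for name in sorted(names - set(unpaired_ok)):
        if name.endswith("_START") and name[: -len("_START")] + "_END" not in names:
            problems.append(f"{name} has no matching _END event")
        if name.endswith("_END") and name[: -len("_END")] + "_START" not in names:
            problems.append(f"{name} has no matching _START event")
    for member in enum_cls:
        if member not in stage_map:
            problems.append(f"{member.name} is missing from the stage map")
    return problems


# Stage labels are placeholders, not the project's real stage names.
good_map = {member: "placeholder_stage" for member in LoggingEventName}
typo_map = {member: "placeholder_stage" for member in TypoEvents}

print(find_problems(LoggingEventName, good_map))
print(find_problems(TypoEvents, typo_map))
```

With the corrected spelling the check reports nothing; with the misspelled member it reports both halves of the broken pair.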

if role in ("mixed", "decode"):
    if metrics.engine_recv_first_token_time:
        decode_time = current_time - metrics.engine_recv_first_token_time
        main_process_metrics.request_decode_time.observe(decode_time)

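🔴 Bug: main_process_metrics.num_requests_running.dec(1) was removed, and no equivalent call was added elsewhere.

_record_completion_metrics used to decrement num_requests_running by 1 when a request completed. With that call removed, the gauge only ever increases, so the num_requests_running metric keeps accumulating and the monitoring data becomes badly distorted.

If the intent is that a request completing under the prefill role should not decrement the gauge (because the decode side is responsible for it), skip it explicitly on the prefill path and keep the dec call on the mixed/decode path, for example:

if role in ("mixed", "decode"):
    main_process_metrics.num_requests_running.dec(1)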
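The leak the reviewer describes can be reproduced with a toy gauge. This is a minimal sketch under assumptions: `Gauge` is a stand-in class rather than the Prometheus client, and `run_requests` is a hypothetical helper that models each request as one increment when it is scheduled and, optionally, one role-gated decrement when it completes. Whether the real code maintains the gauge this way or sets it from queue lengths elsewhere is not shown in this thread.

```python
class Gauge:
    """Tiny stand-in for a Prometheus gauge (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount

    def dec(self, amount: int = 1) -> None:
        self.value -= amount


def run_requests(count: int, role: str, decrement_on_completion: bool) -> int:
    """Simulate `count` requests and return the final gauge value."""
    num_requests_running = Gauge()
    for _ in range(count):
        num_requests_running.inc(1)  # request starts running
        # ... inference happens here ...
        if decrement_on_completion and role in ("mixed", "decode"):
            num_requests_running.dec(1)  # request finished
    return num_requests_running.value


print(run_requests(3, "decode", decrement_on_completion=True))
print(run_requests(3, "decode", decrement_on_completion=False))
```

With the decrement in place the gauge returns to zero once all requests finish; without it, the value equals the total number of requests ever started.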


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment on lines 1108 to 1110
main_process_metrics.request_success_total.inc()
main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])

Copilot AI Apr 28, 2026


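_record_completion_metrics is now also called when prefill finishes, and it unconditionally calls request_success_total.inc(), request_inference_time.observe(current_time - metrics.inference_start_time), and request_generation_tokens.observe(...). In the PD disaggregation scenario this makes the P instance also record "request success / inference time / generated token count", which is inconsistent with the metric definitions (last token / successfully processed request) and can lead to double counting when Prometheus aggregates across instances. Consider restricting these request-level completion metrics to role in ("mixed", "decode"); if the prefill side needs them, separate prefill metrics can be added.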

Suggested change
- main_process_metrics.request_success_total.inc()
- main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
- main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
+ if role in ("mixed", "decode"):
+     main_process_metrics.request_success_total.inc()
+     main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
+     main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
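The double-counting concern can be illustrated by summing a success counter across one prefill and one decode instance. The sketch is hypothetical throughout: each instance's metrics are modelled with a plain `collections.Counter`, and `record_completion` and `total_success` are illustrative helpers rather than FastDeploy functions; the `gated` flag switches the suggested role check on and off.

```python
from collections import Counter


def record_completion(instance_metrics: Counter, role: str, gated: bool) -> None:
    """Count a completed request, optionally only on mixed/decode instances."""
    if not gated or role in ("mixed", "decode"):
        instance_metrics["request_success_total"] += 1


def total_success(roles: list, gated: bool) -> int:
    """One request passes through every instance; return the aggregated count."""
    instances = {role: Counter() for role in roles}
    for role in roles:
        record_completion(instances[role], role, gated)
    # Aggregation across instances, as a sum over all instances would do.
    return sum(metrics["request_success_total"] for metrics in instances.values())


print(total_success(["prefill", "decode"], gated=False))
print(total_success(["prefill", "decode"], gated=True))
print(total_success(["mixed"], gated=True))
```

Without the role check a single request is counted once per instance it passes through; with the check it is counted once, on the instance that produces the final token.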

Comment thread docs/online_serving/metrics.md
Comment thread docs/zh/online_serving/metrics.md
Comment thread fastdeploy/engine/common_engine.py
Comment thread fastdeploy/output/token_processor.py
@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 45350ff into PaddlePaddle:develop Apr 29, 2026
39 of 43 checks passed
@EmmonsCurse

❌ Cherry-pick failed: Conflicts detected when cherry-picking to release/2.6. Please resolve manually.
