[PD] Refine metrics and trace for pd#7613

Open
juncaipeng wants to merge 1 commit into PaddlePaddle:develop from juncaipeng:pd_metrics_1

@juncaipeng juncaipeng commented Apr 24, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

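On the trace side, add several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token reception, etc.) and add the corresponding event-to-stage mappings.
On the metrics side, add and adjust request-queue Gauges (introducing a "queuing" state) and update the corresponding collection logic in ResourceManagerV1.
In the engine and output-processing flow, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of first-token reception failures, etc.).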

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 24, 2026 09:33

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

Copilot AI left a comment

Pull request overview

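This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace points and Prometheus metrics so that the latency of each Prefill/Decode stage and the queue and resource state can be observed more precisely.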

Changes:

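  • On the trace side, add several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token reception, etc.) and add the corresponding event-to-stage mappings.
  • On the metrics side, add and adjust request-queue Gauges (introducing a "queuing" state) and update the corresponding collection logic in ResourceManagerV1.
  • In the engine and output-processing flow, add PD-related trace points and several metric counts (number of preallocated requests, number of reschedules, number of first-token reception failures, etc.).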

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

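| File | Description |
| --- | --- |
| fastdeploy/trace/constants.py | Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP |
| fastdeploy/output/token_processor.py | Prefill cache transfer trace; adjusts the trace and metrics recording logic for the completion stage |
| fastdeploy/metrics/metrics.py | Adds the queuing Gauge and PD-related metric declarations; adjusts some metric descriptions |
| fastdeploy/engine/sched/resource_manager_v1.py | The v1 resource manager updates metrics by running/waiting/queuing categories |
| fastdeploy/engine/common_engine.py | Trace for Prefill requesting Decode resources; trace for Decode preallocation and prefilled-request processing, plus PD metric counts |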

Comment thread fastdeploy/engine/common_engine.py
Comment thread fastdeploy/metrics/metrics.py
Comment thread fastdeploy/metrics/metrics.py
Comment thread fastdeploy/output/token_processor.py
Comment thread fastdeploy/trace/constants.py Outdated
Jiang-Jia-Jun previously approved these changes Apr 24, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-24 17:53:50

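📋 Review Summary

PR overview: Adds trace events and metrics for the PD (Prefill-Decode) disaggregated mode, including request-queue Gauges, a preallocation count, and counters for reschedules and first-token reception failures.
Scope of changes: engine/common_engine.py, engine/sched/resource_manager_v1.py, metrics/metrics.py, output/token_processor.py, trace/constants.py
Affected tags: [Engine] [PD Disaggregation]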


📝 PR Convention Check

The PR title tag [PD] is not in the official tag list; use [PD Disaggregation] instead.

Suggested title (can be copied directly):

  • [PD Disaggregation] Refine metrics and trace for pd

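Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | trace/constants.py:50 | The enum name DECODE_PROCESS_PREALLOCAT_REQUEST_END is misspelled (missing the letter E), so START/END cannot be paired |
| 🔴 Bug | trace/constants.py:100 | The mapping table references the misspelled enum name and must be fixed in sync |
| 🔴 Bug | engine/common_engine.py:2191 | decode_preallocated_req_num.dec() runs unconditionally, but inc() runs only in the V1 branch; the counts are asymmetric and the Gauge keeps going negative on non-V1 paths |
| 🟡 Suggestion | engine/common_engine.py:2158 | References the misspelled enum name; must stay consistent with the fix in constants.py |
| ❓ Question | output/token_processor.py:1096 | _record_completion_metrics is now also called for the prefill role; will request_success_total be double-counted? |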

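Overall assessment

The new trace points and metrics are clearly designed and noticeably improve observability for PD disaggregation. However, there is one misspelled enum name (which affects trace pairing analysis) and one logic bug with asymmetric Gauge counting; both should be fixed before merging.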

CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"

🔴 Bug The enum name is misspelled: DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E and should be DECODE_PROCESS_PREALLOCATE_REQUEST_END.

It is inconsistent with the corresponding DECODE_PROCESS_PREALLOCATE_REQUEST_START (line 49), so trace tools that rely on START/END pairing for latency analysis cannot match the two events.

Suggested fix:

DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"

The reference in common_engine.py must be updated as well:

trace_print(
    LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_END,  # corrected spelling
    ...
)
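
A pairing check along these lines could catch this class of typo in a unit test. This is only a sketch: the `LoggingEventName` class below is a hypothetical four-member stand-in for the real enum, and `unpaired_events` is an illustrative helper that does not exist in the PR. Some real events may be intentionally unpaired, so an actual test would likely need an allow-list.

```python
from enum import Enum


class LoggingEventName(Enum):
    # Hypothetical subset of the real enum, including the typo under review.
    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCAT_REQUEST_END = "DECODE_PROCESS_PREALLOCAT_REQUEST_END"
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"


def unpaired_events(enum_cls):
    """Return names of *_START members lacking a matching *_END, and vice versa."""
    names = {member.name for member in enum_cls}
    unpaired = []
    for name in sorted(names):
        if name.endswith("_START"):
            partner = name[: -len("_START")] + "_END"
        elif name.endswith("_END"):
            partner = name[: -len("_END")] + "_START"
        else:
            continue  # events without a START/END suffix are not checked
        if partner not in names:
            unpaired.append(name)
    return unpaired


# Both halves of the misspelled pair are reported; the PREFILLED pair is not.
print(unpaired_events(LoggingEventName))
```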

LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_START: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_START: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_END: StageName.DECODE,

🔴 Bug The mapping table references the misspelled enum name LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END; it must match the corrected enum name.
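
If the enum member is renamed but the map entry is not, Python raises AttributeError when the module is imported, so that half of the fix is hard to miss. The opposite mistake, adding an event without mapping it to a stage, is silent. A coverage check like the sketch below would surface it; the two enums and the map here are hypothetical stand-ins that reuse names from this PR, and `events_missing_stage` is an illustrative helper, not project code.

```python
from enum import Enum


class StageName(Enum):
    # Hypothetical stand-in with a single stage.
    DECODE = "DECODE"


class LoggingEventName(Enum):
    # Hypothetical two-member subset of the real event enum.
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"


# The END event is deliberately left unmapped to show what the check reports.
EVENT_TO_STAGE_MAP = {
    LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_START: StageName.DECODE,
}


def events_missing_stage(event_cls, mapping):
    """Return the names of events that have no entry in the stage mapping."""
    return [event.name for event in event_cls if event not in mapping]


print(events_missing_stage(LoggingEventName, EVENT_TO_STAGE_MAP))
```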

break

for idx in sorted(processed_indices, reverse=True):
trace_print(

🟡 Suggestion This references the misspelled LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END (missing the letter E); it must stay consistent with the corrected name in constants.py.

else:
for req_output in ready_request_outputs:
request_id = req_output.request_id
main_process_metrics.decode_preallocated_req_num.dec()

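🔴 Bug The dec() and inc() calls on decode_preallocated_req_num are asymmetric:

  • inc() runs only in the ENABLE_V1_KVCACHE_SCHEDULER branch (_process_allocate_resource_requests, line 2135)
  • dec() runs unconditionally here (including on the non-V1 path)

When ENABLE_V1_KVCACHE_SCHEDULER=False, inc() is never called while dec() still runs, so the count keeps going negative and the Gauge loses its meaning.

Suggestion: run dec() under the same V1 branch condition, or confirm what "preallocated" means on the non-V1 path.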
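
The asymmetry can be reproduced with a small simulation. Everything below is a hypothetical sketch: `Gauge` is a minimal stand-in rather than the Prometheus client class, and `simulate` with its `guard_dec` flag only models the two code paths described above, not the actual engine code.

```python
class Gauge:
    """Minimal stand-in for a Prometheus-style gauge (illustration only)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


def simulate(num_requests, v1_enabled, guard_dec):
    """Run preallocate-then-finish cycles and return the final gauge value."""
    gauge = Gauge()
    for _ in range(num_requests):
        # Preallocation: inc() happens only on the V1 scheduler path.
        if v1_enabled:
            gauge.inc()
        # Completion: the reviewed code calls dec() unconditionally
        # (guard_dec=False); the suggested fix guards it with the same flag.
        if v1_enabled or not guard_dec:
            gauge.dec()
    return gauge.value


print(simulate(3, v1_enabled=False, guard_dec=False))  # unguarded, non-V1: drifts negative
print(simulate(3, v1_enabled=False, guard_dec=True))   # guarded, non-V1: stays at zero
```

With the guard in place the gauge returns to zero on both paths, which is the invariant a unit test could assert.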

main_process_metrics.request_decode_time.observe(decode_time)
trace_print(LoggingEventName.INFERENCE_END, task.request_id, getattr(task, "user", ""))

if role in ("mixed", "decode"):

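❓ Question _record_completion_metrics used to be called only for non-prefill tasks (if not is_prefill); it is now also called for the prefill role.

It includes statistics such as main_process_metrics.request_success_total.inc() and main_process_metrics.request_inference_time.observe(...). In PD disaggregated mode, should a task completing on the Prefill instance count toward request_success_total, or does only completion on the Decode instance mean the whole request succeeded?

If Prefill completion does not equal request completion, request_success_total will be double-counted here.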
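
The concern can be illustrated with a toy model. It assumes, without confirmation from the PR, that request_success_total from the Prefill and Decode instances is summed when aggregated; `record_completion`, `successes_for_one_pd_request`, and the `count_prefill` flag are hypothetical names used only for this sketch.

```python
from collections import Counter


def record_completion(metrics, role, count_prefill):
    """Increment the success counter for a finished task, depending on role."""
    if role in ("mixed", "decode") or (count_prefill and role == "prefill"):
        metrics["request_success_total"] += 1


def successes_for_one_pd_request(count_prefill):
    """One user request passes through a prefill task and then a decode task."""
    metrics = Counter()  # stands in for the aggregated view across instances
    for role in ("prefill", "decode"):
        record_completion(metrics, role, count_prefill)
    return metrics["request_success_total"]


print(successes_for_one_pd_request(count_prefill=True))   # prefill also counted
print(successes_for_one_pd_request(count_prefill=False))  # only decode counted
```

If per-instance counters are always read separately (for example, filtered by a role label), the double count may not matter in practice; the sketch only covers the summed case.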

@codecov-commenter

Codecov Report

❌ Patch coverage is 91.66667% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ee81b57).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/common_engine.py | 69.23% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7613   +/-   ##
==========================================
  Coverage           ?   71.68%           
==========================================
  Files              ?      419           
  Lines              ?    57889           
  Branches           ?     9078           
==========================================
  Hits               ?    41497           
  Misses             ?    13566           
  Partials           ?     2826           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 71.68% <91.66%> (?) |
