[PD] Refine metrics and trace for pd #7613
juncaipeng wants to merge 1 commit into PaddlePaddle:develop
Pull request overview
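This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace instrumentation and Prometheus metrics so that per-stage Prefill/Decode latency and queue/resource state can be observed more precisely.
Changes:
- Trace: adds several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token receipt, and so on) and completes the event-to-stage mapping.
- Metrics: adds/adjusts request-queue Gauges (introducing a "queuing" concept) and updates the corresponding collection logic in ResourceManagerV1.
- Engine/output processing: adds PD-related trace points and some metric counts (number of preallocated requests, reschedule count, first-token receive failure count, and so on).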
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/trace/constants.py | Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP |
| fastdeploy/output/token_processor.py | Prefill cache transfer trace; adjusts trace/metric recording for the completion stage |
| fastdeploy/metrics/metrics.py | Adds the queuing Gauge and PD metric declarations; adjusts some metric description text |
| fastdeploy/engine/sched/resource_manager_v1.py | v1 resource manager updates metrics by running/waiting/queuing |
| fastdeploy/engine/common_engine.py | Trace for Prefill requesting Decode resources; trace for Decode preallocation and prefilled-request handling, plus PD metric counts |
f744868 to 6c1cd61
PaddlePaddle-bot left a comment
🤖 AI Code Review | 2026-04-24 17:53:50
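📋 Review summary
PR overview: Adds trace events and metrics for PD (Prefill-Decode) disaggregated mode, including request-queue Gauges, a preallocation count, and counters for rescheduling and first-token receive failures.
Scope of changes: engine/common_engine.py, engine/sched/resource_manager_v1.py, metrics/metrics.py, output/token_processor.py, trace/constants.py
Impact tags: [Engine] [PD Disaggregation]
📝 PR convention check
The PR title tag [PD] is not in the official tag list; use [PD Disaggregation] instead.
Suggested title (can be copied directly):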
[PD Disaggregation] Refine metrics and trace for pd
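The title check above can be approximated with a small script. This is only a sketch: `OFFICIAL_TAGS` is copied from the tag list in this PR's checklist, while `leading_tag` and `has_official_tag` are hypothetical helper names, not part of any FastDeploy tooling.

```python
import re

# Tag list copied from the PR template checklist on this page.
OFFICIAL_TAGS = {
    "FDConfig", "APIServer", "Engine", "Scheduler", "PD Disaggregation",
    "Executor", "Graph Optimization", "Speculative Decoding", "RL", "Models",
    "Quantization", "Loader", "OP", "KVCache", "DataProcessor", "BugFix",
    "Docs", "CI", "Optimization", "Feature", "Benchmark", "Others",
    "XPU", "HPU", "GCU", "DCU", "Iluvatar", "Metax",
}


def leading_tag(title):
    """Return the text of the first [Tag] at the start of a PR title, or None."""
    match = re.match(r"\s*\[([^\]]+)\]", title)
    return match.group(1) if match else None


def has_official_tag(title):
    """True when the title starts with a tag from OFFICIAL_TAGS."""
    return leading_tag(title) in OFFICIAL_TAGS


print(has_official_tag("[PD] Refine metrics and trace for pd"))
print(has_official_tag("[PD Disaggregation] Refine metrics and trace for pd"))
```

The first title fails the check because `PD` is not in the list; the suggested title passes.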
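Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | trace/constants.py:50 | Enum name DECODE_PROCESS_PREALLOCAT_REQUEST_END is misspelled (missing the letter E), so START/END cannot be paired |
| 🔴 Bug | trace/constants.py:100 | The mapping table references the misspelled enum name and must be corrected along with it |
| 🔴 Bug | engine/common_engine.py:2191 | decode_preallocated_req_num.dec() runs unconditionally, but inc() runs only on the V1 branch; the count is asymmetric and the Gauge stays negative on non-V1 paths |
| 🟡 Suggestion | engine/common_engine.py:2158 | References the misspelled enum name; must match constants.py once it is corrected |
| ❓ Question | output/token_processor.py:1096 | _record_completion_metrics is now also called for the prefill role; will request_success_total be double-counted? |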
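Overall assessment
The new trace points and metrics follow a clear design and noticeably improve observability for the PD-disaggregated scenario. However, there is one misspelled enum name (which affects START/END pairing in trace analysis) and one logic bug where a Gauge is incremented and decremented asymmetrically. Both should be fixed before merging.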
CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"
🔴 Bug: Misspelled enum name. DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E and should be DECODE_PROCESS_PREALLOCATE_REQUEST_END.
It is inconsistent with the matching DECODE_PROCESS_PREALLOCATE_REQUEST_START (line 49), so trace tools that rely on START/END pairs for latency analysis cannot match the two events.
Suggested fix:
DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"
The reference in common_engine.py must be updated as well:
trace_print(
    LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_END,  # corrected spelling
    ...
)
LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_START: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_START: StageName.DECODE,
LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_END: StageName.DECODE,
🔴 Bug: The mapping table references the misspelled enum name LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END; it must match the corrected enum name.
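A check like the following could catch unpaired START/END names before they reach trace tooling. It is a minimal sketch: `DemoEventName` is a hypothetical stand-in that copies four names from this review rather than the real `LoggingEventName`, and `find_unpaired_events` is a hypothetical helper.

```python
from enum import Enum


class DemoEventName(str, Enum):
    """Hypothetical stand-in for LoggingEventName; not the real FastDeploy enum."""

    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    # Misspelled on purpose to reproduce the issue flagged in this review.
    DECODE_PROCESS_PREALLOCAT_REQUEST_END = "DECODE_PROCESS_PREALLOCAT_REQUEST_END"
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"


def find_unpaired_events(event_enum):
    """Return sorted names of *_START / *_END members that lack their counterpart."""
    names = {member.name for member in event_enum}
    unpaired = []
    for name in names:
        if name.endswith("_START"):
            partner = name[: -len("_START")] + "_END"
        elif name.endswith("_END"):
            partner = name[: -len("_END")] + "_START"
        else:
            continue  # Events without a START/END suffix are not checked.
        if partner not in names:
            unpaired.append(name)
    return sorted(unpaired)


UNPAIRED = find_unpaired_events(DemoEventName)
print(UNPAIRED)
```

With the misspelling in place, both halves of the preallocate pair are reported. Some events in a real enum may be intentionally unpaired, so a check of this kind would likely need an allowlist.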
break

for idx in sorted(processed_indices, reverse=True):
    trace_print(
🟡 Suggestion: This call references the misspelled LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END (missing the letter E); it must match the corrected name in constants.py.
else:
    for req_output in ready_request_outputs:
        request_id = req_output.request_id
        main_process_metrics.decode_preallocated_req_num.dec()
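🔴 Bug: decode_preallocated_req_num.dec() and inc() are not symmetric:
- inc() runs only on the ENABLE_V1_KVCACHE_SCHEDULER branch (_process_allocate_resource_requests, line 2135)
- dec() runs unconditionally here (including the non-V1 branch path)
When ENABLE_V1_KVCACHE_SCHEDULER=False, inc() is never called while dec() still runs, so the count keeps going negative and the Gauge no longer means what its name says.
Suggestion: run dec() inside the V1 branch condition as well, or confirm what "preallocated" means on the non-V1 branch.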
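The asymmetry can be reproduced in isolation. The sketch below uses `DemoGauge`, a hypothetical in-memory stand-in for a Prometheus Gauge, and `run_requests`, a hypothetical simulation of the two code paths; neither is FastDeploy code, and the `guard_dec` flag only models the suggested fix.

```python
class DemoGauge:
    """Hypothetical in-memory gauge; stands in for a Prometheus Gauge."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


def run_requests(gauge, num_requests, v1_scheduler_enabled, guard_dec):
    """Simulate preallocation followed by handling of prefilled requests."""
    for _ in range(num_requests):
        # Preallocation path: inc() only happens on the V1 scheduler branch.
        if v1_scheduler_enabled:
            gauge.inc()
    for _ in range(num_requests):
        # Prefilled-request path: dec() is unconditional unless guard_dec is set.
        if v1_scheduler_enabled or not guard_dec:
            gauge.dec()
    return gauge.value


UNGUARDED = run_requests(DemoGauge(), 3, v1_scheduler_enabled=False, guard_dec=False)
GUARDED = run_requests(DemoGauge(), 3, v1_scheduler_enabled=False, guard_dec=True)
V1_PATH = run_requests(DemoGauge(), 3, v1_scheduler_enabled=True, guard_dec=True)
print(UNGUARDED, GUARDED, V1_PATH)  # -3 0 0
```

With the V1 scheduler disabled, the unguarded variant ends at -3 after three requests, while guarding dec() with the same condition as inc() keeps the gauge at 0 on both paths.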
main_process_metrics.request_decode_time.observe(decode_time)
trace_print(LoggingEventName.INFERENCE_END, task.request_id, getattr(task, "user", ""))

if role in ("mixed", "decode"):
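❓ Question: _record_completion_metrics was previously called only for non-prefill roles (if not is_prefill); it is now also called for the prefill role.
It includes main_process_metrics.request_success_total.inc(), main_process_metrics.request_inference_time.observe(...), and other statistics. In PD-disaggregated mode, should task completion on a Prefill instance be counted in request_success_total, or does only completion on the Decode instance mean the whole request succeeded?
If Prefill completion does not equal request completion, request_success_total will be double-counted here.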
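To make the double-counting concern concrete, here is a minimal sketch. It assumes that in PD-disaggregated mode one request finishes once on a prefill instance and once on a decode instance, and that only the decode (or mixed) completion should count as request success. `record_completion` and the `prefill_stage_done_total` counter are hypothetical names, not FastDeploy APIs.

```python
from collections import Counter


def record_completion(metrics, role, gate_by_role):
    """Record one task completion on an instance with the given role."""
    if not gate_by_role or role in ("mixed", "decode"):
        # Counts as a finished request.
        metrics["request_success_total"] += 1
    else:
        # Hypothetical separate counter for prefill-stage completions.
        metrics["prefill_stage_done_total"] += 1


def simulate_one_pd_request(gate_by_role):
    """One request in PD mode completes on a prefill and then a decode instance."""
    metrics = Counter()
    record_completion(metrics, "prefill", gate_by_role)
    record_completion(metrics, "decode", gate_by_role)
    return metrics


UNGATED = simulate_one_pd_request(gate_by_role=False)
GATED = simulate_one_pd_request(gate_by_role=True)
print(UNGATED["request_success_total"], GATED["request_success_total"])  # 2 1
```

Whether gating by role is the right fix depends on the answer to the question above; if the metric is meant to be per instance rather than per request, the ungated count may be intended.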
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7613 +/- ##
==========================================
Coverage ? 71.68%
==========================================
Files ? 419
Lines ? 57889
Branches ? 9078
==========================================
Hits ? 41497
Misses ? 13566
Partials ? 2826
Motivation
Modifications
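Trace: adds several PD-related events (Prefill requesting Decode resources, cache transfer checks, Decode-side preallocation and first-token receipt, and so on) and completes the event-to-stage mapping.
Metrics: adds/adjusts request-queue Gauges (introducing a "queuing" concept) and updates the corresponding collection logic in ResourceManagerV1.
Engine/output processing: adds PD-related trace points and some metric counts (number of preallocated requests, reschedule count, first-token receive failure count, and so on).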
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.