[PD] Refine metrics and trace for pd #7613
Pull request overview
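This PR targets the PD (Prefill/Decode disaggregation) scenario. It adds and refines trace instrumentation points and Prometheus metrics so that the latency of each Prefill/Decode stage, as well as queue and resource state, can be observed more precisely.
Changes:
- Trace: adds several PD-related events (Prefill requesting Decode resources, cache-transfer checks, Decode-side preallocation and first-token reception, and so on) and extends the event-to-stage mapping.
- Metrics: adds and adjusts the request-queue Gauges (introducing a queuing state) and updates the corresponding collection logic in ResourceManagerV1.
- Engine and output processing: adds PD-related trace points and several metric counts (preallocated request count, reschedule count, first-token receive failures, and so on).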
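As a rough illustration of what paired trace points make possible, the sketch below matches `*_START` and `*_END` events for one request and derives a duration per stage. The event names are the ones listed in this PR; the `TraceEvent` record, the `stage_durations` helper, and the timestamps are hypothetical and are not FastDeploy's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace record: an event name plus a timestamp in seconds."""

    name: str
    timestamp: float


def stage_durations(events: List[TraceEvent]) -> Dict[str, float]:
    """Pair each X_START with the next X_END and return {X: elapsed seconds}."""
    starts: Dict[str, float] = {}
    durations: Dict[str, float] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.name.endswith("_START"):
            starts[event.name[: -len("_START")]] = event.timestamp
        elif event.name.endswith("_END"):
            stage = event.name[: -len("_END")]
            # Events such as PREFILL_INFERENCE_END have no matching START
            # in this list, so they are skipped rather than guessed at.
            if stage in starts:
                durations[stage] = event.timestamp - starts.pop(stage)
    return durations


# Made-up timestamps for a single request on a Prefill instance.
events = [
    TraceEvent("ASK_DECODE_RESOURCE_START", 0.00),
    TraceEvent("ASK_DECODE_RESOURCE_END", 0.25),
    TraceEvent("CHECK_CACHE_TRANSFER_START", 1.00),
    TraceEvent("CHECK_CACHE_TRANSFER_END", 1.50),
    TraceEvent("PREFILL_INFERENCE_END", 1.50),
]
durations = stage_durations(events)
```

Unpaired events are ignored here on purpose; a real consumer would probably want to report them, since a missing END usually points at a failed or rescheduled request.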
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
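| fastdeploy/trace/constants.py | Adds PD trace event enum members and completes EVENT_TO_STAGE_MAP |
| fastdeploy/output/token_processor.py | Adds the Prefill cache-transfer trace; adjusts the completion-stage trace and metric recording logic |
| fastdeploy/metrics/metrics.py | Adds the queuing Gauge and PD-related metric declarations; adjusts some metric descriptions |
| fastdeploy/engine/sched/resource_manager_v1.py | The v1 resource manager updates metrics separately for running/waiting/queuing |
| fastdeploy/engine/common_engine.py | Traces Prefill requesting Decode resources; adds traces for Decode preallocation and prefilled-request handling, plus PD metric counts |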
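The running/waiting/queuing split can be pictured with the minimal sketch below. It assumes the three states map to three separate request collections and that each gauge is set from the collection's length. The `Gauge` class is a stand-in for a Prometheus gauge, and `update_queue_gauges` and its arguments are hypothetical names, not the actual `ResourceManagerV1.update_metrics` implementation.

```python
from typing import Dict, Sequence


class Gauge:
    """Minimal stand-in for a Prometheus Gauge (only `set` is needed here)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = float(value)


def update_queue_gauges(
    gauges: Dict[str, Gauge],
    running: Sequence[str],
    waiting: Sequence[str],
    queuing: Sequence[str],
) -> None:
    """Publish the current size of each request collection.

    Assumption: `queuing` holds requests still in the local scheduling queue,
    while `waiting` holds requests already handed to the resource manager.
    """
    gauges["num_requests_running"].set(len(running))
    gauges["num_requests_waiting"].set(len(waiting))
    gauges["num_requests_queuing"].set(len(queuing))


gauges = {
    name: Gauge(name)
    for name in ("num_requests_running", "num_requests_waiting", "num_requests_queuing")
}
update_queue_gauges(gauges, running=["r1", "r2"], waiting=["r3"], queuing=[])
```

Setting a gauge from the observed collection size, rather than pairing `inc` and `dec` calls across code paths, keeps the value from drifting when one of the paired calls is missed.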
f744868 to 6c1cd61
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7613 +/- ##
==========================================
Coverage ? 71.57%
==========================================
Files ? 396
Lines ? 55448
Branches ? 8675
==========================================
Hits ? 39688
Misses ? 13022
Partials ? 2738
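The CI report is generated from the following code (refreshed every 30 minutes):
1 Task overview: 1 required task has failed, 2 required tasks are running, and 5 required tasks are pending; CI has not finished.
2 Task status summary
2.1 Required tasks: 2/10 passed
2.2 Optional tasks: 26/31 passed
3 Failure details (required only)
Approval: PR process (confidence: high)
Root cause details: Key logs: Suggested fix:
Suggested fix summary: ask xyxinyang or zyyzghb to approve the PR. Link: view logs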
6c1cd61 to 6208342
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-04-28 22:01:43
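📋 Review summary
PR overview: adds trace event instrumentation and metrics collection for the PD disaggregation scenario
Scope of changes: fastdeploy/trace/, fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/output/, docs/
Impact tags: [PD Disaggregation] [Engine] [DataProcessor] [Docs]
📝 PR convention check
The title tag [PD] is not an official tag and should be changed to [PD Disaggregation]. The PR description has no content under ## Motivation, the ## Usage or Command and ## Accuracy Tests sections are empty, and no Checklist item is checked.
Suggested title (can be copied as-is):
[PD Disaggregation] Refine metrics and trace for PD disaggregation
Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):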
## Motivation
Add observability support for the PD (Prefill-Decode Disaggregation) scenario to make request-level latency bottlenecks easier to locate.
## Modifications
- **trace/constants.py**: adds Prefill-instance events (`ASK_DECODE_RESOURCE_START/END`, `CHECK_CACHE_TRANSFER_START/END`, `PREFILL_INFERENCE_END`) and Decode-instance events (`DECODE_PROCESS_PREALLOCATE_REQUEST_START/END`, `DECODE_PROCESS_PREFILLED_REQUEST_START/END`, `DECODE_INFERENCE_END`), and extends the `LOGGING_EVENT_TO_STAGE_MAP` mapping.
- **metrics/metrics.py**: adds `num_requests_queuing` (number of requests in the local scheduling queue), `decode_preallocated_req_num` (Gauge of preallocated requests on the D side), `reschedule_req_num` (Counter of reschedules), and `failed_recv_first_token_req_num` (Counter of first-token receive failures); adjusts the description of `num_requests_waiting`.
- **engine/sched/resource_manager_v1.py**: `update_metrics` distinguishes the running, waiting, and queuing queues and collects each separately.
- **engine/common_engine.py**: adds trace points on the critical paths where the P side requests D resources and the D side handles preallocated/prefilled requests, and updates the `reschedule_req_num` / `decode_preallocated_req_num` / `failed_recv_first_token_req_num` counts.
- **output/token_processor.py**: `_record_completion_metrics` emits a different inference_end trace for prefill, decode, and mixed according to `splitwise_role`; `_recycle_resources` adds `CHECK_CACHE_TRANSFER_START/END` trace points on the P side.
- **docs/**: updates the Chinese and English metrics.md, adding trace event descriptions and a sequence diagram of the PD request lifecycle.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/trace/constants.py:50 | The enum member DECODE_PROCESS_PREALLOCAT_REQUEST_END is misspelled (missing the letter E) and does not match the documentation |
| 🔴 Bug | fastdeploy/output/token_processor.py:1099 | num_requests_running.dec(1) was removed without an equivalent replacement, so the gauge keeps accumulating |
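Overall assessment
The observability additions for PD disaggregation head in the right direction, and the trace events cover the key P/D cooperation paths. However, there are two clear bugs, an enum misspelling and a metrics gauge leak, that must be fixed before merging.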
CHECK_CACHE_TRANSFER_START = "CHECK_CACHE_TRANSFER_START"
CHECK_CACHE_TRANSFER_END = "CHECK_CACHE_TRANSFER_END"
PREFILL_INFERENCE_END = "PREFILL_INFERENCE_END"
🔴 Bug: enum member misspelled. DECODE_PROCESS_PREALLOCAT_REQUEST_END is missing the letter E; the correct name is DECODE_PROCESS_PREALLOCATE_REQUEST_END.
The documentation (docs/online_serving/metrics.md) records the correct spelling, but the code does not match it, so event names in the stage mapping cannot be matched against the documentation.
Suggested fix:
DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"
The reference in LOGGING_EVENT_TO_STAGE_MAP and the LoggingEventName.DECODE_PROCESS_PREALLOCAT_REQUEST_END call in common_engine.py need the same fix.
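A misspelling like this can be caught mechanically by comparing the enum against the event names the documentation lists. The sketch below shows one way to do that. It assumes `LoggingEventName` is a string-valued `Enum`, includes only the Decode-side events named in this review, and uses made-up stage labels; `DOCUMENTED_EVENTS` and `check_event_consistency` are hypothetical helpers, not part of FastDeploy.

```python
from enum import Enum
from typing import Dict, List, Set


class LoggingEventName(str, Enum):
    """Subset of the Decode-side events from this PR, with corrected spelling."""

    DECODE_PROCESS_PREALLOCATE_REQUEST_START = "DECODE_PROCESS_PREALLOCATE_REQUEST_START"
    DECODE_PROCESS_PREALLOCATE_REQUEST_END = "DECODE_PROCESS_PREALLOCATE_REQUEST_END"
    DECODE_PROCESS_PREFILLED_REQUEST_START = "DECODE_PROCESS_PREFILLED_REQUEST_START"
    DECODE_PROCESS_PREFILLED_REQUEST_END = "DECODE_PROCESS_PREFILLED_REQUEST_END"
    DECODE_INFERENCE_END = "DECODE_INFERENCE_END"


# Stage labels are placeholders; the real mapping values are not shown in this review.
LOGGING_EVENT_TO_STAGE_MAP: Dict[LoggingEventName, str] = {
    event: "DECODE" for event in LoggingEventName
}

# Names as they would be copied from the documentation table.
DOCUMENTED_EVENTS: Set[str] = {
    "DECODE_PROCESS_PREALLOCATE_REQUEST_START",
    "DECODE_PROCESS_PREALLOCATE_REQUEST_END",
    "DECODE_PROCESS_PREFILLED_REQUEST_START",
    "DECODE_PROCESS_PREFILLED_REQUEST_END",
    "DECODE_INFERENCE_END",
}


def check_event_consistency(enum_cls, stage_map, documented: Set[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    names = {member.name for member in enum_cls}
    for member in enum_cls:
        if member.name != member.value:
            problems.append(f"name/value mismatch: {member.name}")
        if member not in stage_map:
            problems.append(f"missing from stage map: {member.name}")
    for name in sorted(documented - names):
        problems.append(f"documented but not defined: {name}")
    for name in sorted(names - documented):
        problems.append(f"defined but not documented: {name}")
    return problems


problems = check_event_consistency(
    LoggingEventName, LOGGING_EVENT_TO_STAGE_MAP, DOCUMENTED_EVENTS
)
```

With the misspelled member in place of the corrected one, the same check would report one "documented but not defined" entry and one "defined but not documented" entry, which is the mismatch this comment describes.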
if role in ("mixed", "decode"):
    if metrics.engine_recv_first_token_time:
        decode_time = current_time - metrics.engine_recv_first_token_time
        main_process_metrics.request_decode_time.observe(decode_time)
🔴 Bug: main_process_metrics.num_requests_running.dec(1) was removed, and no equivalent call was added elsewhere.
_record_completion_metrics used to decrement num_requests_running by 1 when a request completed. With that call removed, the gauge is only ever incremented, so num_requests_running keeps accumulating and the monitoring data becomes badly distorted.
If the intent is that a completed request should not be decremented under the prefill role (because the decode side is responsible), skip it explicitly on the prefill path and keep the dec call on the mixed/decode path, for example:
if role in ("mixed", "decode"):
    main_process_metrics.num_requests_running.dec(1)

main_process_metrics.request_success_total.inc()
main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
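_record_completion_metrics is now also called when prefill finishes, and it unconditionally runs request_success_total.inc(), request_inference_time.observe(current_time - metrics.inference_start_time), and request_generation_tokens.observe(...). In the PD disaggregation scenario this makes the P instance also record request success, inference time, and generated-token count, which is inconsistent with the metric definitions (last token / successfully processed request) and can cause double counting when Prometheus aggregates across instances. Consider restricting these request-level completion metrics to role in ("mixed", "decode"); if the prefill side needs them, add separate prefill-specific metrics.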
Suggested change:
- main_process_metrics.request_success_total.inc()
- main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
- main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
+ if role in ("mixed", "decode"):
+     main_process_metrics.request_success_total.inc()
+     main_process_metrics.request_inference_time.observe(current_time - metrics.inference_start_time)
+     main_process_metrics.request_generation_tokens.observe(self.tokens_counter[task.request_id])
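The two review comments on `_record_completion_metrics` amount to the same gating rule, which the minimal sketch below applies in one place. It is not the actual FastDeploy code: `Gauge`, `Counter`, `Histogram`, and `ProcessMetrics` are simplified stand-ins for the Prometheus objects, and `record_completion_metrics` is a hypothetical free function. Whether a Prefill instance should decrement its own running gauge somewhere else is left open, as it is in the review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gauge:
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


@dataclass
class Counter:
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount


@dataclass
class Histogram:
    samples: List[float] = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.samples.append(value)


@dataclass
class ProcessMetrics:
    """Stand-in holding only the four metrics discussed in the review."""

    num_requests_running: Gauge = field(default_factory=Gauge)
    request_success_total: Counter = field(default_factory=Counter)
    request_inference_time: Histogram = field(default_factory=Histogram)
    request_generation_tokens: Histogram = field(default_factory=Histogram)


def record_completion_metrics(
    metrics: ProcessMetrics, role: str, inference_time: float, generated_tokens: int
) -> None:
    """Record request-level completion metrics only where the request truly ends."""
    if role not in ("mixed", "decode"):
        # A prefill instance hands the request to decode; it has not completed.
        return
    metrics.num_requests_running.dec(1)
    metrics.request_success_total.inc()
    metrics.request_inference_time.observe(inference_time)
    metrics.request_generation_tokens.observe(generated_tokens)


metrics = ProcessMetrics()
metrics.num_requests_running.inc()  # one request is in flight
record_completion_metrics(metrics, "prefill", inference_time=0.4, generated_tokens=1)
running_after_prefill = metrics.num_requests_running.value
record_completion_metrics(metrics, "decode", inference_time=1.2, generated_tokens=32)
```

Gating all four updates behind one role check keeps the success count and the running gauge in step, so aggregating across P and D instances counts each request once.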
❌ Cherry-pick failed: Conflicts detected when cherry-picking to |
Motivation
Modifications
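Trace: adds several PD-related events (Prefill requesting Decode resources, cache-transfer checks, Decode-side preallocation and first-token reception, and so on) and extends the event-to-stage mapping.
Metrics: adds and adjusts the request-queue Gauges (introducing a queuing state) and updates the corresponding collection logic in ResourceManagerV1.
Engine and output processing: adds PD-related trace points and several metric counts (preallocated request count, reschedule count, first-token receive failures, and so on).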
Usage or Command
Accuracy Tests
Checklist
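- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.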