[Feature] add traceback to error logs and optimize trace log#7608
xyxinyang wants to merge 1 commit into PaddlePaddle:develop
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## develop #7608 +/- ##
==========================================
Coverage ? 71.70%
==========================================
Files ? 419
Lines ? 57870
Branches ? 9077
==========================================
Hits ? 41493
Misses ? 13548
Partials ? 2829
PaddlePaddle-bot
left a comment
🤖 AI Code Review | 2026-04-24 17:04:46
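📋 Review Summary
PR overview: Adds tracebacks to error logs across 30+ files and optimizes trace.log (new span_id support and 3 new cache events), and removes the deprecated FD_TRACE environment variable.
Scope of changes: engine/, cache_manager/, entrypoints/, trace/, metrics/
Impact tags: Engine KVCache APIServer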
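As a rough illustration of the first change, the sketch below appends `traceback.format_exc()` to an error log message using only the Python standard library. The logger name `demo_traceback` and the helpers `risky_division` and `handle_request` are hypothetical; this is not FastDeploy's actual `log_request_error` implementation.

```python
import io
import logging
import traceback

# Hypothetical logger that writes to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
logger = logging.getLogger("demo_traceback")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False


def risky_division(a, b):
    # Hypothetical failing operation.
    return a / b


def handle_request(req_id):
    try:
        risky_division(1, 0)
    except Exception as e:
        # format_exc() is called inside the except block, while the
        # exception is still being handled, so the stack is available.
        logger.error(f"request {req_id} failed: {e}, {traceback.format_exc()}")


handle_request("req-1")
log_text = buffer.getvalue()
```

The standard library also offers `logger.exception(...)` and `logger.error(..., exc_info=True)`, which attach the same traceback without formatting it by hand.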
Issues
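| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/metrics/trace.py:50 | ImportError is an expected optional-dependency case; adding a traceback produces misleading noise |
| 🟡 Suggestion | fastdeploy/cache_manager/cache_transfer_manager.py:1305 | Adding a full traceback to high-frequency connection-drop errors may fill the disk faster |
| ❓ Question | fastdeploy/cache_manager/prefix_cache_manager.py:630 | CACHE_SWAP_IN is recorded before issue_swap_task; the intended semantics need confirmation |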
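For the optional-dependency suggestion, one possible shape is sketched below: a missing optional module is reported in a single short line without a traceback. The helper `load_optional` and the module name `hypothetical_missing_tracing_lib` are assumptions made for illustration and do not come from `fastdeploy/metrics/trace.py`.

```python
import importlib
import io
import logging

# Hypothetical logger writing to an in-memory buffer for inspection.
buffer = io.StringIO()
logger = logging.getLogger("demo_optional_dep")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False


def load_optional(module_name):
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Expected path for an optional dependency: one short line, no traceback.
        logger.info(f"optional dependency {module_name!r} not available, feature disabled: {e}")
        return None


tracer_lib = load_optional("hypothetical_missing_tracing_lib")  # hypothetical name
json_lib = load_optional("json")  # standard library, always importable
optional_log = buffer.getvalue()
```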
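Overall assessment
The direction of the change is sound and log observability improves noticeably. The main concerns are log bloat from adding tracebacks indiscriminately on high-frequency loop paths (connection-drop errors) and on expected non-error paths (missing optional dependencies), plus the timing semantics of the new trace events.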
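One way to limit the log growth described above is to emit the full traceback only occasionally for a repeating error. The sketch below is an assumption about how that could look, not code from this PR: `TracebackLimiter`, its `interval` parameter, and the `cache_transfer` key are all hypothetical names.

```python
import time
import traceback


class TracebackLimiter:
    """Include a full traceback at most once per `interval` seconds per error key."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable clock, which makes the behavior easy to test
        self._last = {}

    def format(self, key, exc):
        now = self.clock()
        last = self._last.get(key)
        if last is None or now - last >= self.interval:
            self._last[key] = now
            tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
            return f"{key}: {exc}\n{tb}"
        # Within the interval: keep the one-line message, drop the stack.
        return f"{key}: {exc} (traceback suppressed)"


# Simulate three connection drops at t = 0 s, 1 s, and 61 s with a fake clock.
fake_now = [0.0]
limiter = TracebackLimiter(interval=60.0, clock=lambda: fake_now[0])
messages = []
for t in (0.0, 1.0, 61.0):
    fake_now[0] = t
    try:
        raise ConnectionResetError("peer closed connection")
    except ConnectionResetError as e:
        messages.append(limiter.format("cache_transfer", e))
```

In this run the first and third messages carry a traceback and the second does not, so a tight reconnect loop produces one stack per minute instead of one per failure.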
assert len(need_transfer_task_gpu_block_ids) == len(need_transfer_task_cpu_block_ids)
logger.info(f"request_block_ids: req_id {req_id} issue_swap_task transfer_task_id {transfer_task_id}")
# Record CACHE_SWAP_IN trace event (CPU -> GPU)
trace_print(LoggingEventName.CACHE_SWAP_IN, req_id, None)
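❓ Question: The CACHE_SWAP_IN event is recorded before issue_swap_task, so it captures the moment the swap-in request is issued, not the moment the swap-in completes.
If issue_swap_task fails or the swap-in never actually completes, the CACHE_SWAP_IN event still appears in the trace log, which could mislead performance analysis (for example, by suggesting that the swap-in completed successfully). Please confirm the intended semantics: should the event record "swap-in started" or "swap-in completed"?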
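If the answer is that both moments matter, one option is to record separate start and completion events. The sketch below only illustrates that idea: `SwapEvent`, `record`, and `swap_in` are hypothetical stand-ins, and the event names `CACHE_SWAP_IN_START`, `CACHE_SWAP_IN_DONE`, and `CACHE_SWAP_IN_FAILED` do not exist in this PR.

```python
from enum import Enum


class SwapEvent(Enum):
    # Hypothetical stand-in for LoggingEventName; these members are not in the PR.
    CACHE_SWAP_IN_START = "CACHE_SWAP_IN_START"
    CACHE_SWAP_IN_DONE = "CACHE_SWAP_IN_DONE"
    CACHE_SWAP_IN_FAILED = "CACHE_SWAP_IN_FAILED"


events = []


def record(event, req_id):
    # Hypothetical stand-in for trace_print; collects events in memory.
    events.append((event.name, req_id))


def swap_in(req_id, do_swap):
    """Record the start, then either completion or failure of a swap-in."""
    record(SwapEvent.CACHE_SWAP_IN_START, req_id)
    try:
        do_swap()
    except Exception:
        record(SwapEvent.CACHE_SWAP_IN_FAILED, req_id)
        raise
    record(SwapEvent.CACHE_SWAP_IN_DONE, req_id)


def failing_swap():
    raise RuntimeError("transfer failed")


swap_in("req-ok", lambda: None)
try:
    swap_in("req-bad", failing_swap)
except RuntimeError:
    pass
```

With paired events, a trace reader can compute the swap-in duration and can tell a failed transfer apart from a completed one.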
Motivation
This work optimizes FastDeploy's logging system and is planned to land across 4 PRs.
Modifications
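1. Add traceback to error logs
- Add traceback.format_exc() to log_request_error and .error() calls
2. trace.log optimization
- CACHE_HIT - Prefix Cache hit; can explain why a request's TTFT is short (cached blocks are reused, skipping part of Prefill)
- CACHE_MISS - Prefix Cache miss; can explain why a request's TTFT is long (a full Prefill is required)
- CACHE_SWAP_IN - KV Cache swapped in from CPU to GPU; can explain why the RESOURCE_ALLOCATE stage takes longer
3. Cleanup
- FD_TRACE environment variable (envs.py)
Usage or Command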
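To show how the three cache events might be read when diagnosing a request, here is a small sketch. The helpers `classify_prefix_cache` and `explain_events` are hypothetical, and the rule that any matched token counts as a hit is an assumption; the PR does not state the exact criterion.

```python
def classify_prefix_cache(matched_tokens):
    """Pick the event name from the number of prompt tokens found in the prefix cache.

    Assumption: any match at all counts as a hit.
    """
    return "CACHE_HIT" if matched_tokens > 0 else "CACHE_MISS"


def explain_events(event_names):
    """Map the events seen for one request to the explanations given in the PR description."""
    notes = []
    if "CACHE_HIT" in event_names:
        notes.append("TTFT likely shorter: part of Prefill was skipped")
    if "CACHE_MISS" in event_names:
        notes.append("TTFT likely longer: full Prefill was required")
    if "CACHE_SWAP_IN" in event_names:
        notes.append("RESOURCE_ALLOCATE likely longer: KV Cache moved from CPU to GPU")
    return notes


hit_event = classify_prefix_cache(matched_tokens=128)
miss_event = classify_prefix_cache(matched_tokens=0)
notes = explain_events([hit_event, "CACHE_SWAP_IN"])
```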
Accuracy Tests
N/A
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.