Skip to content

[Feature] add traceback to error logs and optimize trace log#7608

Open
xyxinyang wants to merge 1 commit intoPaddlePaddle:developfrom
xyxinyang:dev-log-v2
Open

[Feature] add traceback to error logs and optimize trace log#7608
xyxinyang wants to merge 1 commit intoPaddlePaddle:developfrom
xyxinyang:dev-log-v2

Conversation

@xyxinyang
Copy link
Copy Markdown
Collaborator

@xyxinyang xyxinyang commented Apr 24, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

针对 FastDeploy 的日志系统进行优化,预计分 4 个 pr 完成。

pr 内容 状态
1 新增日志相关参数、错误同时输出到终端 已合入
2 日志通道划分、request.log 级别划分和聚合 已合入
3 worker_process.log、cache_manager.log、paddle 日志收敛和简化 已合入
4 trace.log 日志的规范化和整合 当前 pr

Modifications

1. 错误日志添加 traceback

  • 为所有 log_request_error.error() 调用添加 traceback.format_exc()
  • 涉及 30+ 个文件,确保异常时能看到完整调用栈

2. trace.log 优化

  • 新增 span_id:支持与 OpenTelemetry Span 精确关联(trace_logger.py)
  • 新增 4 个缓存事件(constants.py, prefix_cache_manager.py):
    • CACHE_HIT - Prefix Cache 命中,可解释请求 TTFT 较快的原因(复用缓存跳过部分 Prefill)
    • CACHE_MISS - Prefix Cache 未命中,可解释请求 TTFT 较慢的原因(需完整 Prefill)
    • CACHE_SWAP_IN - KV Cache 从 CPU 换入 GPU,可解释 RESOURCE_ALLOCATE 阶段耗时较长的原因

3. 清理

  • 删除未使用的 FD_TRACE 环境变量(envs.py)

Usage or Command

  # trace.log 现在会记录 span_id 和缓存事件
  # 示例日志输出:
  # [request_id=xxx] [event=CACHE_HIT] [stage=SCHEDULE] [trace_id=abc...] [span_id=def...]

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 55.29412% with 38 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@5e87930). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...model_executor/layers/moe/routing_indices_cache.py 0.00% 6 Missing ⚠️
fastdeploy/worker/input_batch.py 20.00% 4 Missing ⚠️
fastdeploy/cache_manager/prefix_cache_manager.py 62.50% 2 Missing and 1 partial ⚠️
fastdeploy/entrypoints/openai/run_batch.py 25.00% 3 Missing ⚠️
fastdeploy/metrics/trace.py 25.00% 3 Missing ⚠️
fastdeploy/engine/common_engine.py 33.33% 2 Missing ⚠️
fastdeploy/engine/register_manager.py 0.00% 2 Missing ⚠️
fastdeploy/eplb/experts_manager.py 33.33% 2 Missing ⚠️
fastdeploy/inter_communicator/zmq_server.py 71.42% 2 Missing ⚠️
fastdeploy/router/launch.py 0.00% 2 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7608   +/-   ##
==========================================
  Coverage           ?   71.70%           
==========================================
  Files              ?      419           
  Lines              ?    57870           
  Branches           ?     9077           
==========================================
  Hits               ?    41493           
  Misses             ?    13548           
  Partials           ?     2829           
Flag Coverage Δ
GPU 71.70% <55.29%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review | 2026-04-24 17:04:46

📋 Review 摘要

PR 概述:为 30+ 个文件的错误日志添加 traceback,并优化 trace.log(新增 span_id 支持和 3 个缓存事件),同时清理废弃的 FD_TRACE 环境变量。
变更范围engine/cache_manager/entrypoints/trace/metrics/
影响面 TagEngine KVCache APIServer


问题

级别 文件 概述
🟡 建议 fastdeploy/metrics/trace.py:50 ImportError 是预期的可选依赖场景,添加 traceback 会产生误导性噪音
🟡 建议 fastdeploy/cache_manager/cache_transfer_manager.py:1305 高频连接断开错误添加完整 traceback 可能加速磁盘占满
❓ 疑问 fastdeploy/cache_manager/prefix_cache_manager.py:630 CACHE_SWAP_IN 记录时机在 issue_swap_task 之前,语义需确认

总体评价

整体变更方向正确,日志可观测性有显著提升;主要需关注 高频循环路径(连接断开错误)和 预期非错误路径(可选依赖缺失)中无差别添加 traceback 可能引入的日志膨胀问题,以及新增 trace 事件的时序语义。

Comment thread fastdeploy/metrics/trace.py
Comment thread fastdeploy/cache_manager/cache_transfer_manager.py
assert len(need_transfer_task_gpu_block_ids) == len(need_transfer_task_cpu_block_ids)
logger.info(f"request_block_ids: req_id {req_id} issue_swap_task transfer_task_id {transfer_task_id}")
# Record CACHE_SWAP_IN trace event (CPU -> GPU)
trace_print(LoggingEventName.CACHE_SWAP_IN, req_id, None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 CACHE_SWAP_IN 事件在 issue_swap_task 之前记录,记录的是「发起换入请求」的时间点,而非换入完成时间点。

如果 issue_swap_task 失败或换入实际未完成,trace 日志中仍会出现 CACHE_SWAP_IN 事件,可能对性能分析产生误导(例如让人误以为换入成功完成)。请确认这是期望的语义:记录「换入开始」还是「换入完成」?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants