78 changes: 78 additions & 0 deletions docs/online_serving/metrics.md
@@ -46,3 +46,81 @@ After FastDeploy is launched, it supports continuous monitoring of the FastDeplo

- Access URL: `http://localhost:8000/metrics`
- Metric Type: Prometheus format
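The endpoint can be scraped directly (e.g. `curl http://localhost:8000/metrics`). As a minimal sketch, the Prometheus text exposition format is easy to parse by hand; the sample payload below is illustrative, not actual FastDeploy output:

```python
# Minimal parser for the Prometheus text exposition format.
# The sample payload is illustrative; real output carries more metrics.
sample = """\
# HELP fastdeploy:num_requests_running Number of requests currently running
# TYPE fastdeploy:num_requests_running gauge
fastdeploy:num_requests_running 3.0
fastdeploy:num_requests_waiting 1.0
"""

def parse_metrics(text: str) -> dict:
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample))
# {'fastdeploy:num_requests_running': 3.0, 'fastdeploy:num_requests_waiting': 1.0}
```

In production you would scrape this endpoint with a Prometheus server rather than ad-hoc scripts; the parser is only for quick inspection.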

## Trace Events

FastDeploy outputs structured trace events to `trace.log` at key stages of request processing, useful for diagnosing per-request latency bottlenecks. Each trace log entry contains fields such as `timestamp` (milliseconds), `request_id`, `event`, and `stage`.
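A sketch of working with these entries, assuming each `trace.log` line is a JSON object with the documented fields (the exact serialization may differ; the sample lines and values here are hypothetical):

```python
import json

# Hypothetical trace.log lines; each entry carries the documented fields:
# timestamp (milliseconds), request_id, event, stage.
lines = [
    '{"timestamp": 1000, "request_id": "req-1", "event": "PREPROCESSING_START", "stage": "PREPROCESSING"}',
    '{"timestamp": 1012, "request_id": "req-1", "event": "PREPROCESSING_END", "stage": "PREPROCESSING"}',
]

events = [json.loads(line) for line in lines]

# Group events by request_id to follow one request through the pipeline.
by_request = {}
for ev in events:
    by_request.setdefault(ev["request_id"], []).append(ev["event"])

print(by_request)  # {'req-1': ['PREPROCESSING_START', 'PREPROCESSING_END']}
```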

### Common Events (Mixed / All Instances)

| Stage | Event | Description |
| :---: | --- | --- |
| PREPROCESSING | `PREPROCESSING_START` | API Server begins preprocessing the request |
| PREPROCESSING | `PREPROCESSING_END` | Engine receives the request, preprocessing complete |
| SCHEDULE | `REQUEST_SCHEDULE_START` | Request enters the scheduling flow |
| SCHEDULE | `REQUEST_QUEUE_START` | Request enters the scheduling queue |
| SCHEDULE | `REQUEST_QUEUE_END` | Request dequeued from the scheduling queue |
| SCHEDULE | `RESOURCE_ALLOCATE_START` | Resource allocation begins for the request |
| SCHEDULE | `PREPARE_PREFIX_CACHE_START` | Prefix cache block matching begins |
| SCHEDULE | `PREPARE_PREFIX_CACHE_END` | Prefix cache block matching complete |
| SCHEDULE | `RESOURCE_ALLOCATE_END` | Resource allocation complete |
| SCHEDULE | `REQUEST_SCHEDULE_END` | Scheduling flow complete |
| PREFILL | `INFERENCE_START` | Request sent to GPU for inference |
| PREFILL | `FIRST_TOKEN_GENERATED` | First token generated |
| DECODE | `DECODE_START` | Enters Decode phase |
| DECODE | `INFERENCE_END` | Inference complete (all tokens generated) |
| DECODE | `PREEMPTED` | Request preempted |
| DECODE | `RESCHEDULED_INFERENCE_START` | Preempted request rescheduled for execution |
| POSTPROCESSING | `WRITE_CACHE_TO_STORAGE_START` | Begins writing KV Cache to external storage |
| POSTPROCESSING | `WRITE_CACHE_TO_STORAGE_END` | KV Cache written to external storage |
| POSTPROCESSING | `POSTPROCESSING_START` | Post-processing begins |
| POSTPROCESSING | `POSTPROCESSING_END` | Post-processing complete, response sent |

### PD Disaggregation — Prefill (P) Instance Events

| Stage | Event | Description |
| :---: | --- | --- |
| SCHEDULE | `ASK_DECODE_RESOURCE_START` | P begins requesting resources from D (sends ZMQ request) |
| SCHEDULE | `ASK_DECODE_RESOURCE_END` | P receives resource allocation confirmation from D (with dest_block_ids) |
| PREFILL | `PREFILL_INFERENCE_END` | P instance Prefill inference complete |
| POSTPROCESSING | `CHECK_CACHE_TRANSFER_START` | P begins waiting for KV Cache transfer to complete |
| POSTPROCESSING | `CHECK_CACHE_TRANSFER_END` | KV Cache transfer confirmed, ready to send first token to D |

### PD Disaggregation — Decode (D) Instance Events

| Stage | Event | Description |
| :---: | --- | --- |
| DECODE | `DECODE_PROCESS_PREALLOCATE_REQUEST_START` | D begins processing resource allocation request from P |
| DECODE | `DECODE_PROCESS_PREALLOCATE_REQUEST_END` | D completes resource allocation and returns dest_block_ids to P |
| DECODE | `DECODE_PROCESS_PREFILLED_REQUEST_START` | D receives first token from P, begins processing Prefilled request |
| DECODE | `DECODE_PROCESS_PREFILLED_REQUEST_END` | D adds Prefilled request to running queue |
| DECODE | `DECODE_INFERENCE_END` | D instance Decode inference complete |

### Request Lifecycle Sequence

**Mixed mode** (single instance, full inference):
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END → INFERENCE_START
→ FIRST_TOKEN_GENERATED → DECODE_START → INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END
```

**PD Disaggregation — Prefill (P) Instance**:
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ ASK_DECODE_RESOURCE_START → ASK_DECODE_RESOURCE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END
→ INFERENCE_START → PREFILL_INFERENCE_END
→ CHECK_CACHE_TRANSFER_START → CHECK_CACHE_TRANSFER_END → [send first token to D]
```

**PD Disaggregation — Decode (D) Instance**:
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ DECODE_PROCESS_PREALLOCATE_REQUEST_START → DECODE_PROCESS_PREALLOCATE_REQUEST_END
→ [wait for P to complete prefill and transfer KV Cache]
→ DECODE_PROCESS_PREFILLED_REQUEST_START → DECODE_PROCESS_PREFILLED_REQUEST_END
→ INFERENCE_START → DECODE_INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END
```
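Because most events come in `*_START`/`*_END` pairs, per-stage latency can be derived by pairing them; a sketch over hypothetical timestamps (milliseconds, as in `trace.log`):

```python
# Sketch: derive per-stage latency for one request by pairing
# *_START and *_END events from its trace event stream.
def stage_durations(events):
    starts, durations = {}, {}
    for ts, name in events:
        if name.endswith("_START"):
            starts[name[: -len("_START")]] = ts
        elif name.endswith("_END"):
            key = name[: -len("_END")]
            if key in starts:
                durations[key] = ts - starts.pop(key)
    return durations

# Hypothetical mixed-mode event stream for a single request.
mixed = [
    (1000, "PREPROCESSING_START"), (1012, "PREPROCESSING_END"),
    (1012, "REQUEST_QUEUE_START"), (1030, "REQUEST_QUEUE_END"),
    (1030, "RESOURCE_ALLOCATE_START"), (1035, "RESOURCE_ALLOCATE_END"),
]
print(stage_durations(mixed))
# {'PREPROCESSING': 12, 'REQUEST_QUEUE': 18, 'RESOURCE_ALLOCATE': 5}
```

Unpaired events such as `FIRST_TOKEN_GENERATED` or `PREEMPTED` are simply skipped by this pairing scheme and would need separate handling.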
78 changes: 78 additions & 0 deletions docs/zh/online_serving/metrics.md
@@ -46,3 +46,81 @@

- 访问地址:`http://localhost:8000/metrics`
- 指标类型:Prometheus 格式

## Trace 事件

FastDeploy 在请求处理的关键阶段输出结构化 trace 事件到 `trace.log`,用于定位请求级别的延迟瓶颈。每条 trace 日志包含 `timestamp`(毫秒)、`request_id`、`event`、`stage` 等字段。

### 通用事件(Mixed / 所有实例)

| 阶段 | 事件 | 说明 |
| :---: | --- | --- |
| PREPROCESSING | `PREPROCESSING_START` | API Server 开始预处理请求 |
| PREPROCESSING | `PREPROCESSING_END` | Engine 收到请求,预处理完成 |
| SCHEDULE | `REQUEST_SCHEDULE_START` | 请求进入调度流程 |
| SCHEDULE | `REQUEST_QUEUE_START` | 请求进入调度队列等待 |
| SCHEDULE | `REQUEST_QUEUE_END` | 请求从调度队列取出 |
| SCHEDULE | `RESOURCE_ALLOCATE_START` | 开始为请求分配资源 |
| SCHEDULE | `PREPARE_PREFIX_CACHE_START` | 开始匹配前缀缓存块 |
| SCHEDULE | `PREPARE_PREFIX_CACHE_END` | 前缀缓存块匹配完成 |
| SCHEDULE | `RESOURCE_ALLOCATE_END` | 资源分配完成 |
| SCHEDULE | `REQUEST_SCHEDULE_END` | 调度流程结束 |
| PREFILL | `INFERENCE_START` | 请求送入 GPU 执行推理 |
| PREFILL | `FIRST_TOKEN_GENERATED` | 首 token 生成 |
| DECODE | `DECODE_START` | 进入 Decode 阶段 |
| DECODE | `INFERENCE_END` | 推理完成(所有 token 生成完毕) |
| DECODE | `PREEMPTED` | 请求被抢占 |
| DECODE | `RESCHEDULED_INFERENCE_START` | 被抢占的请求重新调度执行 |
| POSTPROCESSING | `WRITE_CACHE_TO_STORAGE_START` | 开始将 KV Cache 写入外部存储 |
| POSTPROCESSING | `WRITE_CACHE_TO_STORAGE_END` | KV Cache 写入外部存储完成 |
| POSTPROCESSING | `POSTPROCESSING_START` | 开始后处理 |
| POSTPROCESSING | `POSTPROCESSING_END` | 后处理完成,响应发送完毕 |

### PD 分离 — Prefill (P) 实例专属事件

| 阶段 | 事件 | 说明 |
| :---: | --- | --- |
| SCHEDULE | `ASK_DECODE_RESOURCE_START` | P 开始向 D 申请资源(发送 ZMQ 请求) |
| SCHEDULE | `ASK_DECODE_RESOURCE_END` | P 收到 D 的资源分配确认(含 dest_block_ids) |
| PREFILL | `PREFILL_INFERENCE_END` | P 实例 Prefill 推理完成 |
| POSTPROCESSING | `CHECK_CACHE_TRANSFER_START` | P 开始等待 KV Cache 传输完成 |
| POSTPROCESSING | `CHECK_CACHE_TRANSFER_END` | KV Cache 传输完成确认,准备发送 first token 到 D |

### PD 分离 — Decode (D) 实例专属事件

| 阶段 | 事件 | 说明 |
| :---: | --- | --- |
| DECODE | `DECODE_PROCESS_PREALLOCATE_REQUEST_START` | D 开始处理 P 发来的资源分配请求 |
| DECODE | `DECODE_PROCESS_PREALLOCATE_REQUEST_END` | D 完成资源分配并返回 dest_block_ids 给 P |
| DECODE | `DECODE_PROCESS_PREFILLED_REQUEST_START` | D 收到 P 的 first token,开始处理 Prefilled 请求 |
| DECODE | `DECODE_PROCESS_PREFILLED_REQUEST_END` | D 将 Prefilled 请求加入 running queue |
| DECODE | `DECODE_INFERENCE_END` | D 实例 Decode 推理完成 |

### 请求生命周期时序图

**Mixed 模式**(单实例完整推理):
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END → INFERENCE_START
→ FIRST_TOKEN_GENERATED → DECODE_START → INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END
```

**PD 分离 — Prefill (P) 实例**:
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ ASK_DECODE_RESOURCE_START → ASK_DECODE_RESOURCE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END
→ INFERENCE_START → PREFILL_INFERENCE_END
→ CHECK_CACHE_TRANSFER_START → CHECK_CACHE_TRANSFER_END → [发送 first token 到 D]
```

**PD 分离 — Decode (D) 实例**:
```
PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ DECODE_PROCESS_PREALLOCATE_REQUEST_START → DECODE_PROCESS_PREALLOCATE_REQUEST_END
→ [等待 P 完成 prefill 并传输 KV Cache]
→ DECODE_PROCESS_PREFILLED_REQUEST_START → DECODE_PROCESS_PREFILLED_REQUEST_END
→ INFERENCE_START → DECODE_INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END
```
27 changes: 27 additions & 0 deletions fastdeploy/engine/common_engine.py
@@ -955,6 +955,9 @@ def _fetch_request():
self.llm_logger.debug(
f"P has allocated resources and then ask D resource for request: {task.request_id}"
)
trace_print(
LoggingEventName.ASK_DECODE_RESOURCE_START, task.request_id, getattr(task, "user", "")
)
task.metrics.ask_decode_resource_start_time = time.time()
while True:
self.split_connector.send_splitwise_tasks([task], task.idx)
@@ -966,6 +969,11 @@
time.sleep(0.05)
else:
task.metrics.ask_decode_resource_finish_time = time.time()
trace_print(
LoggingEventName.ASK_DECODE_RESOURCE_END,
task.request_id,
getattr(task, "user", ""),
)
break
self.llm_logger.debug(f"D has allocated resource for request: {task.request_id}")
else:
@@ -977,13 +985,19 @@
self.llm_logger.debug(
f"P has allocated resources and then ask D resource for req_id: {task.request_id}"
)
trace_print(
LoggingEventName.ASK_DECODE_RESOURCE_START, task.request_id, getattr(task, "user", "")
)
task.metrics.ask_decode_resource_start_time = time.time()
self.split_connector.send_splitwise_tasks([task], task.idx)

for task in tasks:
# ensure block ids have been fetched from D
status, msg = self.split_connector.check_decode_allocated(task)
task.metrics.ask_decode_resource_finish_time = time.time()
trace_print(
LoggingEventName.ASK_DECODE_RESOURCE_END, task.request_id, getattr(task, "user", "")
)
if not status:
error_msg = (
f"PD Error: prefill failed to apply for resource from decode, "
Expand All @@ -1000,6 +1014,7 @@ def _fetch_request():
)
]
)
main_process_metrics.reschedule_req_num.inc()
need_delete_tasks.append(task)
continue
for tmp_task in need_delete_tasks:
@@ -1110,6 +1125,7 @@ def _fetch_request():
f"preallocated request. req:{task.request_id} "
)
self.llm_logger.error(msg)
main_process_metrics.reschedule_req_num.inc()
self.scheduler.put_results(
[
RequestOutput(
@@ -2109,6 +2125,7 @@ def _process_allocate_resource_requests():
processed_indices = []
for idx, task in enumerate(allocate_resource_requests):
is_success = False
trace_print(LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_START, task.request_id, task.user)

if envs.ENABLE_V1_KVCACHE_SCHEDULER:
if self.resource_manager.preallocate_resource_in_d(task):
@@ -2118,6 +2135,7 @@
self.llm_logger.debug(f"D has successfully sent cache infos for task {task.request_id}")
processed_indices.append(idx)
is_success = True
main_process_metrics.decode_preallocated_req_num.inc()
else:
if self.resource_manager.is_resource_sufficient(task.prompt_token_ids_len):
self.llm_logger.debug(f"D Resource available, processing task {task.request_id}")
@@ -2137,6 +2155,11 @@
break

for idx in sorted(processed_indices, reverse=True):

trace_print(
LoggingEventName.DECODE_PROCESS_PREALLOCATE_REQUEST_END,
allocate_resource_requests[idx].request_id,
allocate_resource_requests[idx].user,
)
allocate_resource_requests.pop(idx)

def _process_prefilled_requests():
@@ -2152,6 +2175,7 @@ def _process_prefilled_requests():
continue
req_output.finished = False
ready_request_outputs.append(req_output)
trace_print(LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_START, req_output.request_id, "")
self.llm_logger.debug(f"there are enough resource for prefilled request: {req_output.request_id}")

prefilled_request_ouputs = waiting_request_outputs
@@ -2164,6 +2188,8 @@
else:
for req_output in ready_request_outputs:
request_id = req_output.request_id
main_process_metrics.decode_preallocated_req_num.dec()
trace_print(LoggingEventName.DECODE_PROCESS_PREFILLED_REQUEST_END, request_id, "")
if envs.FD_ENABLE_INTERNAL_ADAPTER and not req_output.outputs.token_ids:
# first token is eos in Prefill, just recycle resource and continue
self.llm_logger.warning(f"{request_id} need not decode after first token")
@@ -2177,6 +2203,7 @@
self.llm_logger.warning(
f"{request_id} prefill failed with msg:{req_output.error_msg}, recycle resource."
)
main_process_metrics.failed_recv_first_token_req_num.inc()
self.resource_manager.pre_recycle_resource(request_id)
if request_id in self.token_processor.tokens_counter:
del self.token_processor.tokens_counter[request_id]
14 changes: 10 additions & 4 deletions fastdeploy/engine/sched/resource_manager_v1.py
@@ -1649,7 +1649,9 @@ def clear_data(self):

def update_metrics(self, verbose=False):
# Update metrics
num_tasks = sum([1 if task else 0 for task in self.tasks_list])
num_requests_running = len(self.running)
num_requests_waiting = len(self.waiting)
num_requests_queuing = max(int(getattr(self, "scheduler_unhandled_request_num", 0) or 0), 0)
blocks_used_by_tasks = set()
for task in self.tasks_list:
if task is not None:
@@ -1658,10 +1660,14 @@
main_process_metrics.available_gpu_block_num.set(self.total_block_number() - len(blocks_used_by_tasks))
main_process_metrics.batch_size.set(self.max_num_seqs - self.available_batch())
main_process_metrics.gpu_cache_usage_perc.set(self.get_gpu_cache_usage_perc())
main_process_metrics.num_requests_running.set(len(self.running))
main_process_metrics.num_requests_waiting.set(num_tasks - len(self.running))
main_process_metrics.num_requests_running.set(num_requests_running)
main_process_metrics.num_requests_waiting.set(num_requests_waiting)
main_process_metrics.num_requests_queuing.set(num_requests_queuing)
if verbose:
llm_logger.info(f"update metrics: running={len(self.running)}, waiting={num_tasks - len(self.running)}")
llm_logger.info(
f"update metrics: running={num_requests_running}, "
f"waiting={num_requests_waiting}, queuing={num_requests_queuing}"
)

def log_status(self):
llm_logger.info(
33 changes: 31 additions & 2 deletions fastdeploy/metrics/metrics.py
@@ -136,6 +136,7 @@ class MetricsManager:

num_requests_running: "Gauge"
num_requests_waiting: "Gauge"
num_requests_queuing: "Gauge"
time_to_first_token: "Histogram"
time_per_output_token: "Histogram"
request_inference_time: "Histogram"
@@ -153,7 +154,6 @@ class MetricsManager:
spec_decode_num_emitted_tokens_total: "Gauge"
spec_decode_draft_single_head_acceptance_rate: "list[Gauge]"

# for YIYAN Adapter
prefix_cache_token_num: "Counter"
prefix_gpu_cache_token_num: "Counter"
prefix_cpu_cache_token_num: "Counter"
@@ -192,6 +192,11 @@ class MetricsManager:
request_prompt_tokens: "Histogram"
request_token_ratio: "Histogram"

# for PD disaggregation
decode_preallocated_req_num: "Gauge"
reschedule_req_num: "Counter"
failed_recv_first_token_req_num: "Counter"

# Define all metric configurations

# Gauge metrics are PID-isolated under multiprocessing and need special handling, so they are defined manually here
@@ -205,7 +210,13 @@
"num_requests_waiting": {
"type": Gauge,
"name": "fastdeploy:num_requests_waiting",
"description": "Number of requests currently waiting",
"description": "Number of requests currently waiting in resource manager",
"kwargs": {},
},
"num_requests_queuing": {
"type": Gauge,
"name": "fastdeploy:num_requests_queuing",
"description": "Number of requests currently queuing in local scheduler",
"kwargs": {},
},
"gpu_cache_usage_perc": {
@@ -298,6 +309,12 @@ class MetricsManager:
"description": "Token-level GPU prefix cache hit rate",
"kwargs": {},
},
"decode_preallocated_req_num": {
"type": Gauge,
"name": "fastdeploy:decode_preallocated_req_num",
"description": "Number of preallocated requests in decode instance",
"kwargs": {},
},
}

METRICS = {
Expand Down Expand Up @@ -459,6 +476,18 @@ class MetricsManager:
],
},
},
"reschedule_req_num": {
"type": Counter,
"name": "fastdeploy:reschedule_req_num",
"description": "Total number of rescheduled requests",
"kwargs": {},
},
"failed_recv_first_token_req_num": {
"type": Counter,
"name": "fastdeploy:failed_recv_first_token_req_num",
"description": "Total number of requests that failed to receive the first token in decode",
"kwargs": {},
},
}

SPECULATIVE_METRICS = {}
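The new counters follow the standard `prometheus_client` pattern. A minimal sketch of how a counter such as `reschedule_req_num` behaves, using an isolated registry rather than FastDeploy's `MetricsManager`:

```python
from prometheus_client import CollectorRegistry, Counter

# Isolated registry for demonstration; FastDeploy wires its metrics
# through MetricsManager instead.
registry = CollectorRegistry()
reschedule_req_num = Counter(
    "fastdeploy:reschedule_req_num",
    "Total number of rescheduled requests",
    registry=registry,
)

reschedule_req_num.inc()  # e.g. on a failed preallocate that gets rescheduled
reschedule_req_num.inc()

# Counters expose a "_total" sample in the Prometheus exposition format.
print(registry.get_sample_value("fastdeploy:reschedule_req_num_total"))  # 2.0
```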