Non-intrusive per-request latency-breakdown tracer + CLI for SGLang. Shows each request's pipeline stages — queue → prefill(TTFT) → decode (per-token ITL) → e2e — both live (in-flight) and as completed history, queried on-demand from a CLI. No SGLang source edits.
SGLang already records per-request stage timestamps in SchedulerReqTimeStats.
This package attaches via SGLang's native plugin system (sglang.srt.plugins
entry_point) and registers three AROUND hooks in the scheduler process:
| Hook target | Purpose |
|---|---|
| `Scheduler.get_next_batch_to_run` | throttled (0.2s) live in-flight snapshot |
| `SchedulerReqTimeStats.set_completion_time` | finalize completed stage breakdown |
| `SchedulerReqTimeStats.set_last_decode_finish_time` | per-token decode span (always on) |
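To illustrate what an AROUND hook does, here is a minimal sketch using plain method wrapping. It is not SGLang's plugin API: `install_around_hook`, `FakeTimeStats` and `record_completion` are hypothetical names, and the real `set_completion_time` signature may differ from the one assumed here.

```python
import functools


def install_around_hook(owner, method_name, around):
    """Replace owner.method_name with a wrapper that delegates to `around`.

    `around(original, *args, **kwargs)` decides when to call the original.
    Returns a callable that restores the original method.
    """
    original = getattr(owner, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        return around(original, *args, **kwargs)

    setattr(owner, method_name, wrapper)

    def uninstall():
        setattr(owner, method_name, original)

    return uninstall


class FakeTimeStats:
    """Stand-in for a stats object; NOT the real SchedulerReqTimeStats."""

    def __init__(self):
        self.completion_time = None

    def set_completion_time(self, ts):  # assumed signature, for illustration
        self.completion_time = ts


finalized = []


def record_completion(original, self, ts):
    result = original(self, ts)             # run the unmodified method first
    finalized.append(self.completion_time)  # then observe what it recorded
    return result


uninstall = install_around_hook(FakeTimeStats, "set_completion_time", record_completion)
stats = FakeTimeStats()
stats.set_completion_time(12.5)
uninstall()
```

Because the wrapper always calls the original and returns its result, the hooked method behaves the same from the caller's point of view; the hook only observes.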
A tiny HTTP sidecar thread per scheduler process exposes the data as JSON;
the CLI discovers all sidecars (via /tmp/sglang_vis/<pid>.json) and renders.
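The discovery step can be sketched as follows. The file layout (`<pid>.json` in a shared directory) comes from the description above, but the JSON fields (`pid`, `port`) and both function names are assumptions made for illustration, not the package's actual format.

```python
import json
import os
import tempfile
from pathlib import Path


def write_discovery_file(directory, pid, port):
    """Sidecar side: advertise this process as <pid>.json (hypothetical fields)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{pid}.json"
    tmp = directory / f"{pid}.json.tmp"
    tmp.write_text(json.dumps({"pid": pid, "port": port}))
    os.replace(tmp, target)  # atomic rename, so readers never see a partial file
    return target


def discover_sidecars(directory):
    """CLI side: read every <pid>.json, skipping unreadable or corrupt files."""
    found = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            found.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue
    return found


# Demo in a throwaway directory instead of /tmp/sglang_vis.
with tempfile.TemporaryDirectory() as demo_dir:
    write_discovery_file(demo_dir, 4242, 30100)
    write_discovery_file(demo_dir, 4243, 30101)
    sidecars = discover_sidecars(demo_dir)
```

A real CLI would then issue an HTTP GET against each discovered port; stale files left behind by dead processes would also need handling, which this sketch omits.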
Performance / transparency: queue/prefill/decode come from already-computed
fields → zero added cost. Per-token decode spans are captured always-on (one
perf_counter + append per token). That tiny cost is measured by the hook
itself and surfaced as the HOOK_OVH(us) column, so the per-request
overhead this tool adds is fully transparent. (To eliminate even that, you can
later gate hook 3 — see config — but by default it is on.)
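The self-measured overhead idea can be sketched like this. `DecodeSpanRecorder` is a hypothetical class, not the package's code, and it takes a second `perf_counter` reading per token purely to time its own bookkeeping, which is an assumption about how such a column could be produced.

```python
import time


class DecodeSpanRecorder:
    """Records one timestamp per decoded token and times its own bookkeeping."""

    def __init__(self, max_tokens=4096):
        self.max_tokens = max_tokens  # memory cap on stored timestamps
        self.token_times = []
        self.overhead_s = 0.0

    def on_token(self):
        start = time.perf_counter()
        if len(self.token_times) < self.max_tokens:
            self.token_times.append(start)
        # Everything between the two clock reads is the cost this hook adds.
        self.overhead_s += time.perf_counter() - start

    def gaps_ms(self):
        """Inter-token gaps in milliseconds."""
        t = self.token_times
        return [(b - a) * 1000.0 for a, b in zip(t, t[1:])]

    def overhead_us(self):
        """Accumulated hook overhead in microseconds."""
        return self.overhead_s * 1e6


recorder = DecodeSpanRecorder()
for _ in range(5):
    recorder.on_token()  # in a real hook this would run once per decode step
```

The measured figure excludes the second clock read itself, so it slightly underestimates the true cost.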
In the SGLang environment:

    pip install -e /path/to/sglang-vis   # or with CLI table colors: pip install -e ".[rich]"

Start SGLang normally (no special server flags needed for stage breakdown):

    python -m sglang.launch_server --model <model> ...
    # optional whitelist: SGLANG_PLUGINS=sglang_vis

Query from the CLI:

    sglang-vis done -f        # append-scroll: one line per completed request
                              # as it finishes (tail -f style, no clearing)
    sglang-vis done -n 30     # one-shot: last 30 completed requests
    sglang-vis detail <rid>   # drill down: per-stage bounds + per-token decode gaps
    sglang-vis live           # one-shot in-flight snapshot
    sglang-vis status         # sidecar status

`done` output (rid-centric, leftmost column = completion time in UTC+8):

    TIME (UTC+8) RID QUEUE PREFILL DECODE (ms: total/tok/p50/p90/p99) E2E(ms) HOOK_OVH(us)
    2026-06-24 22:23:07 a1b2c3d4e5f6 708.0ms[22:23:01.204→22:23:01.912] 48.0ms[→22:23:01.960] 3950.0ms / 300tok / 11.0/15.0/40.0 4706.0 18.4
- QUEUE spans `wait_queue_entry` → `forward_entry` (queueing, scheduling, and waiting all fold into this one span; SGLang does not timestamp them separately in non-disagg mode).
- PREFILL is one parallel span ending at first token.
- DECODE shows total / token count / inter-token p50·p90·p99; full per-token gaps are kept and shown by `detail <rid>`.
- HOOK_OVH(us) is the tool's own measured overhead for that request.
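As a rough illustration of how the DECODE percentiles could be derived from per-token gaps, the sketch below uses the nearest-rank method. Which percentile definition the tool actually uses is not stated here, so treat `percentile` and `summarize_gaps` as hypothetical helpers and the gap values as made-up sample data.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_gaps(gaps_ms):
    """Return (p50, p90, p99) of the inter-token gaps, in milliseconds."""
    return tuple(percentile(gaps_ms, p) for p in (50, 90, 99))


# Made-up inter-token gaps (ms) for one request.
sample_gaps = [10.0, 11.0, 11.0, 12.0, 15.0, 11.0, 10.0, 40.0, 11.0, 12.0]
p50, p90, p99 = summarize_gaps(sample_gaps)
print(f"{p50}/{p90}/{p99}")  # 11.0/15.0/40.0
```

With only a handful of tokens, p99 collapses to the maximum gap, so a single stall dominates that column for short generations.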
Typical flow: run your bench_serving load, then sglang-vis done -f to watch
each request's breakdown scroll past; sglang-vis detail <rid> to inspect one.
| env | default | meaning |
|---|---|---|
| `SV_SIDECAR_BASE_PORT` | `30100` | base port; actual = base + `dp*64 + pp*16 + tp` |
| `SV_DISCOVERY_DIR` | `/tmp/sglang_vis` | per-process discovery files |
| `SV_DONE_RING` | `512` | completed-record ring buffer / detail ring size |
| `SV_SNAPSHOT_INTERVAL` | `0.2` | live-snapshot throttle (seconds) |
| `SV_MAX_ITL_TOKENS` / `SV_MAX_ITL_REQS` | `4096` / `256` | per-token span memory caps |
| `SV_ONLY_RANK0` | `0` | `=1`: only `attn_tp_rank==0` runs a sidecar |
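The following sketch shows how these settings might be read and how the port rule could be applied. It reads the rule as base + dp*64 + pp*16 + tp, which is an interpretation of the table entry; `load_config` and `sidecar_port` are hypothetical helpers rather than the package's own functions.

```python
import os


def load_config(environ=None):
    """Read the SV_* variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "base_port": int(env.get("SV_SIDECAR_BASE_PORT", 30100)),
        "discovery_dir": env.get("SV_DISCOVERY_DIR", "/tmp/sglang_vis"),
        "done_ring": int(env.get("SV_DONE_RING", 512)),
        "snapshot_interval": float(env.get("SV_SNAPSHOT_INTERVAL", 0.2)),
        "max_itl_tokens": int(env.get("SV_MAX_ITL_TOKENS", 4096)),
        "max_itl_reqs": int(env.get("SV_MAX_ITL_REQS", 256)),
        "only_rank0": env.get("SV_ONLY_RANK0", "0") == "1",
    }


def sidecar_port(base_port, dp_rank, pp_rank, tp_rank):
    """Assumed rule: one port per (dp, pp, tp) rank combination."""
    return base_port + dp_rank * 64 + pp_rank * 16 + tp_rank


config = load_config({})  # empty mapping, so every default applies
port = sidecar_port(config["base_port"], dp_rank=1, pp_rank=1, tp_rank=3)
```

Under this reading, ports stay unique as long as the tensor-parallel rank is below 16 and the pipeline-parallel rank is below 4; larger layouts would collide.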
- Requires SGLang with `sglang.srt.plugins` and `observability/req_time_stats.py` (`SchedulerReqTimeStats`). On field/name drift the plugin self-disables with a warning instead of crashing.
- Stage timestamps are `perf_counter` (monotonic); durations are exact, and the wall clock is used only for ordering.
- Concurrency relies on CPython GIL atomicity (ref swap + deque). Free-threaded builds would need a lock.
- Focuses on NULL (non-PD-disaggregated) mode; PD stages can be added later.
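The "ref swap + deque" pattern from the concurrency note can be sketched as below. `SharedState` is a hypothetical class written for illustration; the package's real data structures are not shown in this document.

```python
import threading
from collections import deque


class SharedState:
    """Scheduler thread writes; sidecar thread reads. No explicit lock."""

    def __init__(self, ring_size=512):
        self.live_snapshot = ()               # immutable, replaced wholesale
        self.done = deque(maxlen=ring_size)   # bounded ring of completed records

    def publish_live(self, records):
        # Build the new snapshot first, then rebind the attribute in one step.
        # A reader sees either the old tuple or the new one, never a mix.
        self.live_snapshot = tuple(records)

    def add_done(self, record):
        self.done.append(record)  # oldest entry is dropped once the ring is full

    def read(self):
        snapshot = self.live_snapshot  # grab one reference and keep using it
        # Copy made in a single C-level call under the GIL; a free-threaded
        # build would need a lock around this.
        return snapshot, list(self.done)


state = SharedState(ring_size=512)


def scheduler_loop():
    for i in range(1000):
        state.add_done(i)
        state.publish_live([i, i + 1])


writer = threading.Thread(target=scheduler_loop)
writer.start()
writer.join()
live, done = state.read()
```

The key design point is that the writer never mutates an object the reader may already hold: snapshots are immutable tuples, and the reader copies the ring before using it.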