sglang-vis

Non-intrusive per-request latency-breakdown tracer + CLI for SGLang. Shows each request's pipeline stages — queue → prefill(TTFT) → decode (per-token ITL) → e2e — both live (in-flight) and as completed history, queried on-demand from a CLI. No SGLang source edits.

How it works

SGLang already records per-request stage timestamps in SchedulerReqTimeStats. This package attaches via SGLang's native plugin system (sglang.srt.plugins entry_point) and registers three AROUND hooks in the scheduler process:

| Hook target | Purpose |
| --- | --- |
| `Scheduler.get_next_batch_to_run` | throttled (0.2s) live in-flight snapshot |
| `SchedulerReqTimeStats.set_completion_time` | finalize completed stage breakdown |
| `SchedulerReqTimeStats.set_last_decode_finish_time` | per-token decode span (always on) |
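
To make the AROUND-hook idea concrete, here is a minimal sketch in plain Python. It does not use SGLang's plugin API (how hooks are actually registered is not shown in this README); `install_around`, `Stats`, and `record_completion` are hypothetical names chosen only for illustration.

```python
import functools
import time


def install_around(cls, method_name, around):
    """Replace cls.method_name with a wrapper that routes calls through `around`.

    `around(original, self, *args, **kwargs)` decides when to call the original.
    Hypothetical helper, not part of SGLang.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        return around(original, self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original  # returned so the patch can be undone later


class Stats:
    """Stand-in for a class that records stage timestamps."""

    def __init__(self):
        self.completion_time = None

    def set_completion_time(self):
        self.completion_time = time.perf_counter()
        return self.completion_time


completed = []  # records collected by the hook


def record_completion(original, self, *args, **kwargs):
    result = original(self, *args, **kwargs)  # run the real method first
    completed.append(self.completion_time)    # then observe what it recorded
    return result


install_around(Stats, "set_completion_time", record_completion)

s = Stats()
t = s.set_completion_time()
```

Because the wrapper only reads a field the original method already set, the wrapped class keeps its behavior and return value unchanged.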

A tiny HTTP sidecar thread per scheduler process exposes the data as JSON; the CLI discovers all sidecars (via /tmp/sglang_vis/<pid>.json) and renders.
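
The discovery step could look roughly like the sketch below. The README only gives the path pattern `/tmp/sglang_vis/<pid>.json`; the file contents used here (a JSON object with `pid` and `port`) and the helper names `write_discovery` and `discover_sidecars` are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def write_discovery(directory, pid, port):
    """Write <directory>/<pid>.json atomically (assumed file layout)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{pid}.json"
    tmp = directory / f".{pid}.json.tmp"
    tmp.write_text(json.dumps({"pid": pid, "port": port}))
    os.replace(tmp, target)  # atomic rename so readers never see a partial file
    return target


def discover_sidecars(directory):
    """Return parsed discovery records, skipping unreadable or malformed files."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            records.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue  # a stale or damaged file should not break the CLI
    return records


with tempfile.TemporaryDirectory() as d:
    write_discovery(d, 1234, 30100)
    write_discovery(d, 1235, 30101)
    (Path(d) / "9999.json").write_text("{not json")  # simulated damaged file
    found = discover_sidecars(d)
```

Skipping malformed files rather than raising keeps a crashed scheduler's leftover file from hiding the healthy sidecars.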

Performance / transparency: queue/prefill/decode come from already-computed fields → zero added cost. Per-token decode spans are captured always-on (one perf_counter + append per token). That tiny cost is measured by the hook itself and surfaced as the HOOK_OVH(us) column, so the per-request overhead this tool adds is fully transparent. (To eliminate even that, you can later gate hook 3 — see config — but by default it is on.)
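
As a rough illustration of a hook that measures its own cost, the sketch below records one timestamp per token and accumulates the time spent inside the recording call. `TokenSpanRecorder` is a hypothetical class, and taking a second `perf_counter` reading to time the append is an assumption of this sketch; the plugin's actual bookkeeping may differ.

```python
import time


class TokenSpanRecorder:
    """Collects per-token timestamps and accumulates the recorder's own cost."""

    def __init__(self):
        self.token_times = []  # one perf_counter reading per decoded token
        self.overhead_s = 0.0  # total time spent inside on_token()

    def on_token(self):
        start = time.perf_counter()
        self.token_times.append(start)
        # The second reading bounds the cost of the append itself.
        self.overhead_s += time.perf_counter() - start

    def gaps_ms(self):
        """Inter-token gaps in milliseconds."""
        t = self.token_times
        return [(b - a) * 1e3 for a, b in zip(t, t[1:])]

    def overhead_us(self):
        """Accumulated self-measured overhead in microseconds."""
        return self.overhead_s * 1e6


rec = TokenSpanRecorder()
for _ in range(5):
    rec.on_token()
```

Reporting the accumulated value per request is what lets a column like HOOK_OVH(us) show the tracer's cost next to the latencies it measures.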

Install

In the SGLang environment:

pip install -e /path/to/sglang-vis          # or with CLI table colors: pip install -e ".[rich]"

Run

Start SGLang normally (no special server flags needed for stage breakdown):

python -m sglang.launch_server --model <model> ...
# optional whitelist: SGLANG_PLUGINS=sglang_vis

CLI

sglang-vis done -f           # append-scroll: one line per completed request
                             #   as it finishes (tail -f style, no clearing)
sglang-vis done -n 30        # one-shot: last 30 completed requests
sglang-vis detail <rid>      # drill down: per-stage bounds + per-token decode gaps
sglang-vis live              # one-shot in-flight snapshot
sglang-vis status            # sidecar status

done output (rid-centric, leftmost column = completion time in UTC+8):

TIME (UTC+8)         RID           QUEUE                            PREFILL                DECODE (ms: total/tok/p50/p90/p99)   E2E(ms)  HOOK_OVH(us)
2026-06-24 22:23:07  a1b2c3d4e5f6  708.0ms[22:23:01.204→22:23:01.912]  48.0ms[→22:23:01.960]  3950.0ms / 300tok / 11.0/15.0/40.0   4706.0          18.4

QUEUE spans wait_queue_entry → forward_entry. Queueing, scheduling, and waiting all fold into this one span because SGLang does not timestamp them separately in non-disaggregated mode.
PREFILL is one parallel span ending at first token.
DECODE shows total / token count / inter-token p50·p90·p99; full per-token gaps are kept and shown by detail <rid>.
HOOK_OVH(us) is the tool's own measured overhead for that request.
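
The DECODE summary can be derived from the stored per-token gaps along these lines. This is a sketch using the nearest-rank percentile definition; the README does not say which percentile method the tool uses, and `percentile` and `summarize_decode` are hypothetical names.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list (0 < p <= 100)."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize_decode(gaps_ms):
    """Reduce per-token gaps to total and inter-token p50/p90/p99."""
    ordered = sorted(gaps_ms)
    return {
        "total_ms": sum(gaps_ms),
        "count": len(gaps_ms),  # number of gaps; the token count may differ by one
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }


gaps = [10.0, 11.0, 12.0, 11.0, 40.0, 11.0, 15.0, 10.0, 11.0, 11.0]
summary = summarize_decode(gaps)
```

With these ten sample gaps the single 40 ms stall shows up only in p99, which is the kind of outlier the full per-token view in detail <rid> is meant to locate.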

Typical flow: run your bench_serving load, then sglang-vis done -f to watch each request's breakdown scroll past; sglang-vis detail <rid> to inspect one.

Configuration (env vars, set on the server process)

| env | default | meaning |
| --- | --- | --- |
| `SV_SIDECAR_BASE_PORT` | `30100` | base port; actual = base + dp*64 + pp*16 + tp |
| `SV_DISCOVERY_DIR` | `/tmp/sglang_vis` | per-process discovery files |
| `SV_DONE_RING` | `512` | completed-record ring buffer / detail ring size |
| `SV_SNAPSHOT_INTERVAL` | `0.2` | live-snapshot throttle (seconds) |
| `SV_MAX_ITL_TOKENS` / `SV_MAX_ITL_REQS` | `4096`/`256` | per-token span memory caps |
| `SV_ONLY_RANK0` | `0` | =1: only attn_tp_rank==0 runs a sidecar |
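
A small sketch of how the port setting might be resolved, reading the port formula as base + dp*64 + pp*16 + tp (an assumed reading of the table entry). The helpers `sidecar_port` and `only_rank0` are hypothetical; only the env var names and defaults come from the table.

```python
import os


def sidecar_port(dp_rank, pp_rank, tp_rank, env=None):
    """Sidecar port under the assumed formula base + dp*64 + pp*16 + tp."""
    env = os.environ if env is None else env
    base = int(env.get("SV_SIDECAR_BASE_PORT", "30100"))
    return base + dp_rank * 64 + pp_rank * 16 + tp_rank


def only_rank0(env=None):
    """True when SV_ONLY_RANK0=1 is set on the server process."""
    env = os.environ if env is None else env
    return env.get("SV_ONLY_RANK0", "0") == "1"


default_port = sidecar_port(0, 0, 0, env={})
custom_port = sidecar_port(1, 2, 3, env={"SV_SIDECAR_BASE_PORT": "40000"})
```

Under this reading, ports stay unique only while tp is below 16 and pp is below 4, since those are the strides between adjacent pp and dp ranks.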

Notes / compatibility

Requires SGLang with sglang.srt.plugins and observability/req_time_stats.py (SchedulerReqTimeStats). On field/name drift the plugin self-disables with a warning instead of crashing.
Stage timestamps come from perf_counter (monotonic), so durations are not affected by wall-clock adjustments; the wall clock is used only for ordering.
Concurrency relies on CPython GIL atomicity (ref swap + deque). Free-threaded builds would need a lock.
Focuses on NULL (non-PD-disaggregated) mode; PD stages can be added later.
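
The "ref swap + deque" pattern mentioned above can be sketched as follows. `SharedState` is a hypothetical class, not the plugin's code; the optional lock shows one way a free-threaded build could be accommodated, which is an assumption of this sketch.

```python
import contextlib
import threading
from collections import deque


class SharedState:
    """The scheduler thread writes; the sidecar thread reads."""

    def __init__(self, ring_size=512, use_lock=False):
        self.done = deque(maxlen=ring_size)  # oldest records fall off automatically
        self.live = ()                       # immutable snapshot, replaced wholesale
        # With the GIL a no-op guard suffices; pass use_lock=True otherwise.
        self._guard = threading.Lock() if use_lock else contextlib.nullcontext()

    def publish_live(self, requests):
        snapshot = tuple(requests)  # build the snapshot fully before publishing
        with self._guard:
            self.live = snapshot    # a single reference assignment (the "ref swap")

    def add_done(self, record):
        with self._guard:
            self.done.append(record)

    def read(self):
        with self._guard:
            return self.live, list(self.done)


state = SharedState(ring_size=3)
for i in range(5):
    state.add_done(i)
state.publish_live(["req-a", "req-b"])
live, done = state.read()
```

Readers always see either the previous snapshot or the new one, never a partially built one, because the tuple is complete before the reference is replaced.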
