scitix/sglang-vis

sglang-vis

Non-intrusive per-request latency-breakdown tracer and CLI for SGLang. It shows each request's pipeline stages, queue → prefill (TTFT) → decode (per-token ITL) → e2e, both live (in-flight) and as completed history, queried on demand from a CLI. No SGLang source edits are required.

How it works

SGLang already records per-request stage timestamps in SchedulerReqTimeStats. This package attaches via SGLang's native plugin system (sglang.srt.plugins entry_point) and registers three AROUND hooks in the scheduler process:

Hook target                                         Purpose
Scheduler.get_next_batch_to_run                     throttled (0.2s) live in-flight snapshot
SchedulerReqTimeStats.set_completion_time           finalize completed stage breakdown
SchedulerReqTimeStats.set_last_decode_finish_time   per-token decode span (always on)
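
As a rough illustration of what an AROUND hook with a throttle does, the sketch below wraps a method on a stand-in class and refreshes a snapshot at most once per interval. It does not use SGLang's plugin API: `around`, `FakeScheduler`, and `ThrottledSnapshot` are hypothetical names made up for this example, and the real registration call may look quite different.

```python
import functools
import time


def around(cls, method_name, wrapper):
    """Replace cls.method_name with a function that calls
    wrapper(original, self, *args, **kwargs). Hypothetical helper."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def patched(self, *args, **kwargs):
        return wrapper(original, self, *args, **kwargs)

    setattr(cls, method_name, patched)
    return original


class FakeScheduler:
    """Stand-in for the real scheduler; not SGLang code."""

    def __init__(self):
        self.running = ["req-a", "req-b"]

    def get_next_batch_to_run(self):
        return list(self.running)


class ThrottledSnapshot:
    """AROUND wrapper that copies the in-flight list at most once per interval."""

    def __init__(self, interval_s=0.2, clock=time.perf_counter):
        self.interval_s = interval_s
        self.clock = clock
        self.last_taken = None
        self.snapshot = None
        self.count = 0

    def __call__(self, original, scheduler, *args, **kwargs):
        # Always run the wrapped method first and return its result unchanged.
        result = original(scheduler, *args, **kwargs)
        now = self.clock()
        if self.last_taken is None or now - self.last_taken >= self.interval_s:
            self.last_taken = now
            self.snapshot = {"in_flight": list(scheduler.running)}
            self.count += 1
        return result
```

Installing it would be `around(FakeScheduler, "get_next_batch_to_run", ThrottledSnapshot())`. Between snapshots the only added work is one clock read and one comparison, so the scheduler loop stays cheap.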

A tiny HTTP sidecar thread per scheduler process exposes the data as JSON; the CLI discovers all sidecars (via /tmp/sglang_vis/<pid>.json), queries them, and renders the results.
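
The sidecar-plus-discovery pattern can be sketched with the standard library alone. Everything here is an assumption made for illustration: the discovery file contents (`pid` and `port`), the single JSON endpoint, and the function names `write_discovery`, `start_sidecar`, and `discover` are not taken from the package.

```python
import json
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def make_handler(get_payload):
    """Build a handler class that answers every GET with get_payload() as JSON."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(get_payload()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the host process's stderr quiet

    return Handler


def write_discovery(discovery_dir, pid, port):
    """Write <pid>.json via a temp file so readers never see a partial file."""
    directory = Path(discovery_dir)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"{pid}.json"
    tmp = directory / f"{pid}.json.tmp"
    tmp.write_text(json.dumps({"pid": pid, "port": port}))
    os.replace(tmp, final)
    return final


def start_sidecar(get_payload, discovery_dir, port=0):
    """Serve JSON from a daemon thread and advertise the bound port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(get_payload))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    write_discovery(discovery_dir, os.getpid(), server.server_address[1])
    return server


def discover(discovery_dir):
    """Return the parsed discovery records, skipping unreadable files."""
    found = []
    for path in sorted(Path(discovery_dir).glob("*.json")):
        try:
            found.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue  # file vanished or is not valid JSON
    return found
```

A CLI built this way would call `discover(...)` and then issue one HTTP GET per record. The daemon thread means the sidecar never keeps the server process alive on shutdown.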

Performance / transparency: queue, prefill, and decode durations come from fields SGLang already computes, so they add no cost. Per-token decode spans are always captured (one perf_counter call plus one append per token). The hook measures that small cost itself and reports it in the HOOK_OVH(us) column, so the per-request overhead this tool adds is visible. (Hook 3 is on by default; to eliminate even that cost it could later be gated, see config.)
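
A minimal sketch of per-token capture with self-measured overhead follows. `DecodeSpanRecorder` is a hypothetical class, not the package's code; it takes a second clock reading per token to measure its own cost, and it assumes the token cap simply stops recording once reached.

```python
import time


class DecodeSpanRecorder:
    """Record one monotonic timestamp per decoded token for a single request."""

    def __init__(self, max_tokens=4096):
        self.max_tokens = max_tokens  # assumed meaning of a per-request token cap
        self.stamps = []
        self.overhead_s = 0.0

    def on_token(self):
        start = time.perf_counter()
        if len(self.stamps) < self.max_tokens:
            self.stamps.append(start)
        # Time spent inside this method, accumulated per request.
        self.overhead_s += time.perf_counter() - start

    def gaps_ms(self):
        """Inter-token gaps in milliseconds, in arrival order."""
        return [(b - a) * 1000.0 for a, b in zip(self.stamps, self.stamps[1:])]

    def overhead_us(self):
        """Accumulated recorder overhead in microseconds."""
        return self.overhead_s * 1e6
```

The accumulated figure covers only the work between the two clock reads, so it slightly understates the true cost (the function call itself is not included).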

Install

In the SGLang environment:

pip install -e /path/to/sglang-vis          # or with CLI table colors: pip install -e ".[rich]"

Run

Start SGLang normally (no special server flags needed for stage breakdown):

python -m sglang.launch_server --model <model> ...
# optional whitelist: SGLANG_PLUGINS=sglang_vis

CLI

sglang-vis done -f           # append-scroll: one line per completed request
                             #   as it finishes (tail -f style, no clearing)
sglang-vis done -n 30        # one-shot: last 30 completed requests
sglang-vis detail <rid>      # drill down: per-stage bounds + per-token decode gaps
sglang-vis live              # one-shot in-flight snapshot
sglang-vis status            # sidecar status

done output (rid-centric, leftmost column = completion time in UTC+8):

TIME (UTC+8)         RID           QUEUE                            PREFILL                DECODE (ms: total/tok/p50/p90/p99)   E2E(ms)  HOOK_OVH(us)
2026-06-24 22:23:07  a1b2c3d4e5f6  708.0ms[22:23:01.204→22:23:01.912]  48.0ms[→22:23:01.960]  3950.0ms / 300tok / 11.0/15.0/40.0   4706.0          18.4
  • QUEUE spans wait_queue_entry → forward_entry (queueing, scheduling, and waiting fold into this one span, because SGLang does not timestamp them separately in non-disagg mode).
  • PREFILL is one parallel span ending at first token.
  • DECODE shows total / token count / inter-token p50·p90·p99; full per-token gaps are kept and shown by detail <rid>.
  • HOOK_OVH(us) is the tool's own measured overhead for that request.
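
The DECODE summary can be reproduced from a list of per-token timestamps roughly as follows. This is a sketch under stated assumptions: the nearest-rank percentile method and the function names are choices made for this example, and the tool may compute its percentiles differently.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_decode(token_times_s):
    """Summarize monotonic per-token timestamps (seconds) as millisecond stats."""
    gaps_ms = sorted(
        (b - a) * 1000.0 for a, b in zip(token_times_s, token_times_s[1:])
    )
    if not gaps_ms:
        return None  # fewer than two timestamps: no inter-token gap exists
    return {
        "total_ms": (token_times_s[-1] - token_times_s[0]) * 1000.0,
        "timestamps": len(token_times_s),
        "p50": percentile(gaps_ms, 50),
        "p90": percentile(gaps_ms, 90),
        "p99": percentile(gaps_ms, 99),
    }
```

For timestamps `[0.0, 1.0, 2.0, 3.0, 7.0]` the gaps are 1000, 1000, 1000, and 4000 ms, so p50 is 1000 and both p90 and p99 are 4000. A single slow token dominates the tail percentiles, which is why the full gap list is worth keeping for drill-down.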

Typical flow: run your bench_serving load, then sglang-vis done -f to watch each request's breakdown scroll past; sglang-vis detail <rid> to inspect one.

Configuration (env vars, set on the server process)

env                                   default           meaning
SV_SIDECAR_BASE_PORT                  30100             base port; actual = base + dp*64 + pp*16 + tp
SV_DISCOVERY_DIR                      /tmp/sglang_vis   per-process discovery files
SV_DONE_RING                          512               completed-record ring buffer / detail ring size
SV_SNAPSHOT_INTERVAL                  0.2               live-snapshot throttle (seconds)
SV_MAX_ITL_TOKENS / SV_MAX_ITL_REQS   4096 / 256        per-token span memory caps
SV_ONLY_RANK0                         0                 if 1, only attn_tp_rank==0 runs a sidecar
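
The port rule can be worked through in a few lines, reading the formula as base + dp*64 + pp*16 + tp (an interpretation, since the multiplication signs are not visible in the rendered table). `sidecar_port` is a hypothetical helper, and the rank arguments are assumed to be zero-based integers.

```python
import os


def sidecar_port(dp_rank, pp_rank, tp_rank, env=None):
    """Compute a sidecar port from ranks, assuming base + dp*64 + pp*16 + tp."""
    env = os.environ if env is None else env
    base = int(env.get("SV_SIDECAR_BASE_PORT", "30100"))
    return base + dp_rank * 64 + pp_rank * 16 + tp_rank
```

With the default base, ranks dp=1, pp=2, tp=3 give 30100 + 64 + 32 + 3 = 30199. Under this reading, ports stay distinct only while tp is below 16 and pp is below 4; larger topologies could collide.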

Notes / compatibility

  • Requires SGLang with sglang.srt.plugins and observability/req_time_stats.py (SchedulerReqTimeStats). On field/name drift the plugin self-disables with a warning instead of crashing.
  • Stage timestamps come from perf_counter (monotonic), so durations are not affected by wall-clock adjustments; the wall clock is used only for ordering.
  • Concurrency relies on CPython GIL atomicity (ref swap + deque). Free-threaded builds would need a lock.
  • Focuses on NULL (non-PD-disaggregated) mode; PD stages can be added later.
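
The concurrency note above can be illustrated with a small store that one thread writes and another reads. This is a sketch, not the package's code: `Store` and its methods are hypothetical, and the optional `lock` argument shows what a free-threaded build might pass in.

```python
import contextlib
import threading
from collections import deque


class Store:
    """Single-writer store: live data by reference swap, history in a ring."""

    def __init__(self, ring_size=512, lock=None):
        self.done = deque(maxlen=ring_size)  # oldest records fall off the left
        self.live = ()                       # immutable snapshot, replaced whole
        # Without a lock this relies on CPython's GIL, as the note describes.
        self._guard = lock if lock is not None else contextlib.nullcontext()

    def publish_live(self, requests):
        """Writer side: build a new tuple, then rebind the attribute once."""
        snapshot = tuple(requests)
        with self._guard:
            self.live = snapshot

    def add_done(self, record):
        """Writer side: append one completed record to the bounded ring."""
        with self._guard:
            self.done.append(record)

    def read(self):
        """Reader side: copy both structures for serialization."""
        with self._guard:
            return {"live": list(self.live), "done": list(self.done)}
```

Passing `lock=threading.Lock()` gives the guarded variant; leaving it out keeps the hot path free of lock traffic. Because readers only ever see a fully built tuple, they cannot observe a half-updated live snapshot.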
