This document describes the full analysis pipeline for llm-d-benchmark, covering both in-container and local analysis, the available plot types, and when to use each mode.
Analysis happens at two stages:
- In-container analysis -- Runs automatically inside the harness pod after the benchmark completes. Always happens.
- Local analysis -- Runs on the experimenter's workstation after results are collected. Requires --analyze.
Both stages produce output in the results directory. Local analysis augments (does not replace) the in-container analysis with additional plots that require matplotlib.
The harness entrypoint script (llm-d-benchmark.sh by default) orchestrates in-container analysis after the load generator finishes. The flow is:
- Harness execution -- The load generator (inference-perf, guidellm, vllm-benchmark) runs the workload and writes raw results.
- Metrics processing -- process_metrics.py aggregates raw Prometheus scrapes into summary statistics.
- Harness-native analysis -- For inference-perf, runs inference-perf --analyze to produce its native report format.
- Benchmark report generation -- Converts harness-native results into v0.2 benchmark report YAML/JSON.
The in-container analysis produces:
- Benchmark report v0.1 and v0.2 (YAML + JSON) -- one report per stage file
- For inference-perf multi-turn workloads: additional benchmark reports from *_session_lifecycle_metrics.json files, with results.session_performance populated
- Processed metrics summaries (metrics/processed/)
- Harness-native analysis output (varies by harness)
- per_request_lifecycle_metrics.json (per-request raw data, if supported by the harness)
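To see which of these reports a run actually produced, the report files can be sorted by version and kind from their names alone. The following is a minimal sketch, assuming the filename shapes shown in the results-directory layout of this document. The regular expression and the helper name `classify_reports` are illustrative and are not part of llm-d-benchmark.

```python
import re
from pathlib import Path

# Filename shapes are taken from this document's results-directory layout.
# The regex is an assumption for illustration, not the harness's own parser.
REPORT_RE = re.compile(r"^benchmark_report(?P<v02>_v0\.2)?,_(?P<stem>.+)\.yaml$")


def classify_reports(results_dir):
    """Return a sorted list of (version, kind, filename), one per report file."""
    found = []
    for path in Path(results_dir).iterdir():
        match = REPORT_RE.match(path.name)
        if match is None:
            continue  # not a benchmark report (e.g. per-request raw data)
        version = "v0.2" if match.group("v02") else "v0.1"
        # Session reports are derived from *_session_lifecycle_metrics.json files.
        is_session = "session_lifecycle_metrics" in match.group("stem")
        found.append((version, "session" if is_session else "request", path.name))
    return sorted(found)
```

Calling `classify_reports("<results_dir>/<treatment_1>")` gives a quick inventory before any plotting is attempted. It only reads filenames and does not open the reports.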
When --analyze is passed to llmdbenchmark run, step 11 (analyze_results) runs additional analysis on the local machine after results have been collected from the PVC.

    llmdbenchmark --spec gpu run -l inference-perf -w sanity_random.yaml --analyze

The local analysis pipeline runs the following in sequence:
For inference-perf, invokes inference-perf --analyze <results_dir> if the CLI is available locally. This regenerates the inference-perf analysis output using the local installation, which may be newer than the version in the container image.
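The "if the CLI is available locally" condition can be expressed as a PATH lookup before the subprocess call. This is a rough sketch of that pattern, not the code used by llmdbenchmark. The function names and the injectable `which` parameter are assumptions made for illustration; only the `inference-perf --analyze <results_dir>` invocation comes from the text above.

```python
import shutil
import subprocess


def build_analyze_command(results_dir, which=shutil.which):
    """Return the argv for the local re-analysis, or None if the CLI is absent."""
    executable = which("inference-perf")
    if executable is None:
        return None
    return [executable, "--analyze", str(results_dir)]


def run_local_analysis(results_dir):
    """Run the re-analysis when possible; return True if it was executed."""
    command = build_analyze_command(results_dir)
    if command is None:
        print("inference-perf not found on PATH; skipping local re-analysis")
        return False
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return True
```

Keeping the lookup separate from the execution makes the skip path easy to exercise without having inference-perf installed.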
Module: llmdbenchmark/analysis/per_request_plots.py
Reads per_request_lifecycle_metrics.json and generates distribution plots for individual request metrics. This provides visibility into the full distribution of latencies, not just aggregate statistics.
Generated plots (saved to <results_dir>/analysis/distributions/):
| Plot | Description |
|---|---|
| ttft_histogram.png | Histogram of time to first token with mean/P50/P99 vertical lines |
| tpot_histogram.png | Histogram of time per output token with mean/P50/P99 lines |
| itl_histogram.png | Histogram of inter-token latency with mean/P50/P99 lines |
| e2e_histogram.png | Histogram of end-to-end request latency with mean/P50/P99 lines |
| ttft_cdf.png | Cumulative distribution function of TTFT |
| tpot_cdf.png | CDF of time per output token |
| itl_cdf.png | CDF of inter-token latency |
| e2e_cdf.png | CDF of end-to-end latency |
| ttft_vs_input_length.png | Scatter: TTFT vs input token count |
| e2e_vs_output_length.png | Scatter: E2E latency vs output token count |
| itl_distribution.png | Distribution of ITL values across all tokens in all requests |
Each histogram includes vertical lines marking the mean, P50 (median), and P99 values for quick reference.
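The marker values and the CDF curves are simple to reproduce from a list of per-request latencies. Below is a minimal standard-library sketch, assuming the latencies have already been extracted into a flat list of numbers. The schema of per_request_lifecycle_metrics.json is not described here, so loading is left out. The nearest-rank percentile is one common definition; per_request_plots.py may use a different one, such as linear interpolation.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summary_markers(values):
    """Values for the three vertical lines drawn on each histogram."""
    return {
        "mean": statistics.fmean(values),
        "p50": percentile(values, 50),
        "p99": percentile(values, 99),
    }


def empirical_cdf(values):
    """Return (x, fraction of samples <= x) pairs, sorted by x."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]
```

The pairs returned by `empirical_cdf` can be passed to any plotting library as a step curve. A long right tail shows up as a slow approach to 1.0, which the aggregate mean alone would hide.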
Module: llmdbenchmark/analysis/__init__.py (_run_session_plots)
For inference-perf runs, reads all benchmark_report_v0.2,_*_session_lifecycle_metrics.json.yaml files in the results directory and generates per-stage bar charts. This is skipped automatically for harnesses other than inference-perf.
Generated plots (saved to <results_dir>/analysis/session/):
| Plot | Description |
|---|---|
| session_session_rate_qps.png | Session completion rate per stage (sessions/s) |
| session_session_duration_mean_s.png | Mean session wall-clock duration per stage |
| session_session_duration_p99_s.png | P99 session duration per stage |
| session_events_per_session_mean.png | Mean event (request) count per session per stage |
| session_events_cancelled_per_session_mean.png | Mean cancelled event count per session per stage |
| session_output_tokens_per_session_mean.png | Mean output tokens per session per stage |
| session_failed_sessions.png | Failed session count per stage |
Module: llmdbenchmark/analysis/cross_treatment.py
When multiple treatments were executed (via --experiments), this module reads the v0.2 benchmark report from each treatment's result directory and produces:
- treatment_comparison.csv -- One row per treatment with key metrics (TTFT, TPOT, ITL, E2E latency stats, throughput, request counts, and session metrics when available). See Benchmark Report for the full column list.
- Bar charts -- Side-by-side bars comparing aggregate metrics across treatments.
- Session comparison bar charts -- Side-by-side bars comparing session lifecycle metrics across treatments, with error bars showing min/max across stages. Generated only when at least two treatments have session data.
- Session scatter plots -- Line/scatter plots showing relationships between session rate and session duration, events per session, and output tokens per session across treatments.
- Latency vs throughput curves -- Scatter/line plots showing the latency-throughput trade-off.
- Overlaid CDFs -- CDF plots from per-request data overlaid across treatments for direct comparison.
Output is saved to <results_dir>/cross-treatment-comparison/.
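Two of the outputs above reduce to small data transformations: the one-row-per-treatment CSV and the min/max error bars across stages. The sketch below illustrates both under stated assumptions. The metric names used as columns are hypothetical (the real column list is in the Benchmark Report documentation), and the helper names are not taken from cross_treatment.py.

```python
import csv
import statistics


def stage_error_bars(stage_values):
    """Return (mean, lower_err, upper_err) so the error bar spans min..max across stages."""
    mean = statistics.fmean(stage_values)
    return mean, mean - min(stage_values), max(stage_values) - mean


def write_comparison_csv(rows, out):
    """Write one row per treatment; columns are the union of all metric names.

    rows maps a treatment name to a dict of metric name -> value. Metrics that
    a treatment lacks (for example session metrics) are left blank.
    """
    columns = ["treatment"] + sorted({key for metrics in rows.values() for key in metrics})
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    for treatment, metrics in sorted(rows.items()):
        writer.writerow({"treatment": treatment, **metrics})
```

The `(mean, lower_err, upper_err)` triple matches the asymmetric form accepted by matplotlib's `yerr` argument, where lower and upper errors are given as separate offsets from the bar height.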
Module: llmdbenchmark/analysis/visualize_metrics.py
Generates time series plots from raw Prometheus metric files collected by collect_metrics.sh. See Metrics Collection for the full list of generated plots.
Output is saved to <results_dir>/analysis/metrics/.
| Scenario | Recommendation |
|---|---|
| Quick sanity check | Skip --analyze -- in-container analysis provides benchmark reports and summary stats |
| Investigating latency distributions | Use --analyze -- per-request plots show the full distribution, not just aggregates |
| Comparing multiple treatments | Use --analyze -- cross-treatment comparison generates the summary CSV and comparison charts |
| CI/CD pipeline | Skip --analyze -- rely on benchmark report YAML/JSON for programmatic consumption |
| Presentation-quality plots | Use --analyze -- generates publication-ready PNG plots with matplotlib |
| Container image lacks matplotlib | Use --analyze -- local analysis uses the experimenter's Python environment |
For interactive exploration beyond what --analyze provides, use the Jupyter notebook at docs/analysis/analysis.ipynb. This notebook imports benchmark report files into a Pandas DataFrame for custom plotting and analysis. See docs/analysis/README.md for setup instructions.
After a full analysis run, the results directory contains:

    <results_dir>/
      <treatment_1>/
        benchmark_report,_stage_<N>_lifecycle_metrics.json.yaml                # per-stage request report (v0.1)
        benchmark_report_v0.2,_stage_<N>_lifecycle_metrics.json.yaml           # per-stage request report (v0.2)
        benchmark_report_v0.2,_stage_<N>_session_lifecycle_metrics.json.yaml   # session report (inference-perf multi-turn)
        per_request_lifecycle_metrics.json
        metrics/
          raw/                         # Timestamped Prometheus scrapes
          processed/                   # Aggregated metric summaries
        analysis/
          distributions/               # Per-request histograms and CDFs
          metrics/                     # Prometheus time series plots
          session/                     # Per-stage session lifecycle bar charts (inference-perf only)
      <treatment_2>/
        ...
      cross-treatment-comparison/
        treatment_comparison.csv
        <comparison_plots>.png         # Aggregate metric comparisons
        session_compare_<metric>.png   # Session metric comparisons (when session data present)
        session_scatter_<x>_vs_<y>.png # Session metric scatter plots
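A quick way to confirm that a run produced the expected outputs is to check for the sub-directories listed above. This is a minimal sketch, assuming metrics/ and analysis/ nest under each treatment directory with the sub-folders named in the listing. The helper `missing_outputs` is hypothetical. A missing analysis/session directory is normal for harnesses other than inference-perf.

```python
from pathlib import Path

# Relative paths assumed from the layout above; adjust if your layout differs.
EXPECTED_SUBDIRS = (
    "metrics/raw",
    "metrics/processed",
    "analysis/distributions",
    "analysis/metrics",
    "analysis/session",  # inference-perf only
)


def missing_outputs(treatment_dir):
    """Return the expected sub-directories that are absent under treatment_dir."""
    root = Path(treatment_dir)
    return [rel for rel in EXPECTED_SUBDIRS if not (root / rel).is_dir()]
```

An empty list means every expected directory exists. It does not verify that the directories contain plots or reports.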