This guide walks you through running your first llm-d-benchmark deployment on a local Kind cluster - no GPU required. By the end you will have stood up a simulated inference deployment, run a sanity benchmark workload against it, and torn everything down cleanly.
This is the same scenario our CI runs on every PR (see ci-pr-benchmark.yaml), so if the walkthrough works here it will work the same way in CI.
Use Kind only if you don't have access to a real cluster with GPU / accelerator resources, or if you're doing local development on a laptop.
Kind (Kubernetes in Docker) runs a Kubernetes cluster locally, with each node running as a Docker container. It is ideal for:
- First-time walkthroughs of the framework - you can exercise the full standup -> smoketest -> run -> teardown lifecycle without any cloud account, cluster access, or GPU hardware.
- Iterating on framework code - testing your changes to steps, templates, or scenarios locally in a fast feedback loop.
- Reproducing CI failures - the PR-benchmark workflow uses this exact cicd/kind scenario on a Kind cluster, so a local repro is one ./util/test-scenarios.sh invocation away.

Kind is not a benchmarking target. It runs a simulated inference engine (llm-d-inference-sim) on CPU, so any latency, throughput, or GPU-utilization numbers you collect here are meaningless as performance data. When you have access to a cluster with real accelerators, switch to one of the GPU-backed scenarios under config/specification/examples/gpu.yaml.j2 or config/specification/guides/ and skip steps 1 and 2 of this guide - jump straight to step 3 (Install llmdbenchmark) and use your existing kubeconfig.
- What you will build
- Prerequisites
- 1. Install Kind locally
- 2. Create the Kind cluster
- 3. Install llmdbenchmark
- 4. First deployment: standup + smoketest + run (modelservice)
- 5. Alternate path: standalone deployment
- 6. Tear down
- Troubleshooting
- Next steps
| Item | Value |
|---|---|
| Cluster | Kind (local Docker-in-Docker, CPU-only) |
| Scenario | cicd/kind |
| Model | facebook/opt-125m (small - chosen so the quickstart works on a laptop) |
| Inference engine | llm-d-inference-sim - fake inference, no GPU |
| Deploy methods | modelservice (default) or standalone |
| Harness | inference-perf with the sanity_random.yaml workload profile |
Because the inference engine is simulated, the entire stack runs on a CPU-only machine in a single Kind node. Nothing in this walkthrough requires a GPU, a cluster operator, or a cloud account.
You need these installed before starting:
| Tool | Minimum | Check |
|---|---|---|
| Docker or Podman | any recent version | docker info or podman info |
| Python | 3.11+ | python3 --version |
| git | any | git --version |
| Container runtime resources | 4 CPUs / 8 GiB RAM | docker info \| grep -E "CPUs\|Total Memory" |
Resource note: The cicd/kind scenario deploys ~7 pods on a single Kind node. With the default 2 CPUs that Docker Desktop, Colima, and Podman ship with, the harness pod (and sometimes the gateway) cannot schedule due to Insufficient cpu. Bump your container runtime to 4 CPUs before creating the Kind cluster. See Troubleshooting if you hit this.
Everything else - kubectl, helm, helmfile, kind, skopeo, crane, helm-diff, jq, yq, kustomize - will be installed for you by ./install.sh in step 3, with one exception: kind itself, which we install first below because we want the cluster up before the installer runs.
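The 4-CPU minimum from the table and resource note above can be checked from a script before you create the cluster. The following is a minimal sketch, assuming Docker is your container runtime. MIN_CPUS, meets_cpu_minimum, and check_runtime_cpus are illustrative names, not part of llmdbenchmark. It reads the CPU count through docker info's NCPU format field.

```shell
#!/usr/bin/env bash
# Sketch: compare the container runtime's CPU count to the 4-CPU minimum.
# MIN_CPUS and both function names are illustrative, not part of llmdbenchmark.
MIN_CPUS=4

# Succeeds when the given CPU count meets the minimum.
meets_cpu_minimum() {
  [ "$1" -ge "$MIN_CPUS" ]
}

# Reads the CPU count from Docker and reports whether it is enough.
check_runtime_cpus() {
  local cpus
  cpus="$(docker info --format '{{.NCPU}}')" || return 1
  if meets_cpu_minimum "$cpus"; then
    echo "OK: ${cpus} CPUs available"
  else
    echo "Too few CPUs: ${cpus} (need ${MIN_CPUS})" >&2
    return 1
  fi
}
```

Call check_runtime_cpus before kind create cluster. A non-zero exit status means the runtime should be resized first.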
Kind runs Kubernetes clusters inside Docker containers. Pick the line for your OS:

# Homebrew
brew install kind

# Linux (amd64)
# v0.31.0 is the version CI uses. Pin to it for parity.
curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.31.0/kind-linux-amd64"
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Linux (arm64)
curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.31.0/kind-linux-arm64"
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

kind version
# Expected: kind v0.31.0 ...

If you prefer a different installation path or version manager, see the upstream Kind install docs. Any version v0.20+ should work; v0.31.0 is what CI exercises.
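If you script the Linux install, the two download URLs above differ only in the architecture suffix. Here is a small sketch that derives the suffix from uname -m output. kind_arch and kind_url are hypothetical helper names, and the version pin mirrors the one CI uses.

```shell
# Sketch: choose the Kind release asset for the current Linux machine.
# kind_arch and kind_url are illustrative helper names.
KIND_VERSION="v0.31.0"   # the version CI uses

# Map `uname -m` output to the suffix used in Kind's download URLs.
kind_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the download URL for a given machine type.
kind_url() {
  local arch
  arch="$(kind_arch "$1")" || return 1
  echo "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-${arch}"
}
```

Usage: curl -Lo ./kind "$(kind_url "$(uname -m)")", followed by the same chmod and mv steps as above.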
Create a single-node cluster. The default Kind configuration is enough - we do not need any special port mappings, extra mounts, or registry config for cicd/kind.
kind create cluster --name llmd-quickstart

The first run pulls the Kind node image, which can take a while depending on your network. When it finishes, your kubectl context is automatically pointed at the new cluster. Verify:
kubectl cluster-info --context kind-llmd-quickstart
kubectl get nodes
# Expected: one node in Ready state

That's all the cluster prep you need. The cicd/kind scenario uses affinity.nodeSelector: kubernetes.io/os: linux, and Kubernetes sets that label automatically on every node via the kubelet (it is one of the well-known labels), so no manual labeling step is required.
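If you are scripting the setup, you may prefer to block until the node is Ready rather than checking by eye. The sketch below uses kubectl wait and then lists the nodes that carry the kubernetes.io/os=linux label. wait_for_kind_node is an illustrative name and the 120s timeout is an arbitrary choice.

```shell
# Sketch: block until the Kind node is Ready, then list nodes with the OS label.
# wait_for_kind_node is an illustrative name; the context name follows
# Kind's kind-<cluster-name> convention.
wait_for_kind_node() {
  local context="kind-${1:-llmd-quickstart}"
  kubectl --context "$context" wait --for=condition=Ready nodes --all --timeout=120s &&
    kubectl --context "$context" get nodes -l kubernetes.io/os=linux
}
```

Usage: wait_for_kind_node llmd-quickstart. The second command prints the matching nodes. An empty listing would mean the node selector used by cicd/kind cannot be satisfied.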
Clone the repository and run the installer. It creates .venv/, installs the llmdbenchmark and planner (from llm-d-planner) Python packages, and provisions every system tool the framework calls out to.
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./install.sh
source .venv/bin/activate

We intentionally do not pass -y to install.sh. The -y flag forces the installer to use your system Python instead of creating a virtualenv, which is appropriate on CI runners (they are already isolated containers) but wrong for local development - it would pollute your system site-packages and skip the .venv/ that source .venv/bin/activate on the next line expects. Always run ./install.sh without flags on your laptop.
Verify the CLI is on PATH:
llmdbenchmark --help

You should see the llmdbenchmark help banner with plan, standup, smoketest, run, teardown, and experiment subcommands.
Tip: The installer caches its own "already checked" state in ~/.llmdbench_dependencies_checked. Subsequent ./install.sh runs skip dependencies that have already been verified.
Now we run the full four-phase lifecycle against the Kind cluster we created in step 2. modelservice is the default deploy method - no -t flag needed.
Pick a namespace for your run. Anything unique is fine:
export NS=llmd-quickstart

standup renders the scenario templates, installs the llm-d charts, creates the model PVC, downloads the model, and deploys the prefill + decode pods.

llmdbenchmark --spec cicd/kind standup -p "$NS" --skip-smoketest

What's happening under the hood:
- The cicd/kind specification is rendered into a self-contained plan/ directory under $LLMDBENCH_WORKSPACE (defaults to /tmp/<user>-<timestamp>).
- The modelservice Helm chart is deployed into $NS with CPU-only image overrides.
- A PVC is provisioned and facebook/opt-125m is downloaded into it from HuggingFace.
- Prefill, decode, and gateway pods are rolled out and wait for Ready.
The first run is dominated by the model download and image pulls. Subsequent runs (different namespace) reuse pulled images and are noticeably faster.
Progress banners at the start of each step make it easy to follow along. If a step fails, the banner at the top shows which phase and which step number hit the error, and the pod logs are printed inline.
Once standup succeeds, run the smoketest to verify the inference endpoint actually answers:
llmdbenchmark --spec cicd/kind smoketest -p "$NS"

This sends a handful of real requests through the gateway, validates the responses, and exits 0 on success.
Now run the inference-perf harness with the sanity_random.yaml workload. This is the smallest benchmark profile we ship - perfect for a first run.
llmdbenchmark --spec cicd/kind run -p "$NS" \
-l inference-perf \
  -w sanity_random.yaml

What to expect:
- A harness pod is launched in $NS.
- It fires a burst of requests against the gateway.
- Per-request metrics are collected into a results directory printed at the end of the run.
- The analysis phase generates summary CSVs and plots in that same directory.
The results directory path is printed in the final log line - something like /tmp/<user>-<timestamp>/<phase>/<stack>/results/. You can open the plots with any image viewer or the CSVs with any spreadsheet.
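The three commands in this step can be chained so that a failure in one phase stops the next from running. This is a minimal sketch rather than an official wrapper. run_quickstart is an illustrative name, and every llmdbenchmark invocation is copied from the commands above. Teardown is deliberately left out so you can inspect the deployment afterwards.

```shell
# Sketch: run standup, smoketest, and the sanity benchmark in order,
# stopping at the first failure. run_quickstart is an illustrative name;
# every llmdbenchmark invocation is copied from the steps above.
run_quickstart() {
  local ns="$1"
  llmdbenchmark --spec cicd/kind standup -p "$ns" --skip-smoketest || return 1
  llmdbenchmark --spec cicd/kind smoketest -p "$ns" || return 1
  llmdbenchmark --spec cicd/kind run -p "$ns" -l inference-perf -w sanity_random.yaml
}
```

Usage: run_quickstart "$NS". The function returns non-zero as soon as any phase fails, so it can gate later steps in a script.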
The standalone method skips the llm-d-modelservice chart entirely and deploys a single vLLM pod directly. It's simpler, has fewer moving parts, and is a good second step after the modelservice path succeeds.
Use a different namespace so you don't clash with the modelservice run:
export NS_SA=llmd-quickstart-sa
llmdbenchmark --spec cicd/kind standup -p "$NS_SA" -t standalone --skip-smoketest
llmdbenchmark --spec cicd/kind smoketest -p "$NS_SA" -t standalone
llmdbenchmark --spec cicd/kind run -p "$NS_SA" -t standalone \
  -l inference-perf -w sanity_random.yaml

The -t standalone flag is the only difference from step 4. Every other argument - spec, harness, workload - is identical; only the namespace value changes.
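Because the two paths differ only in the -t flag, a script can select the deploy method with a single parameter. The sketch below is one way to do that. standup_with_method is a hypothetical helper, and it omits -t entirely for modelservice because this guide only documents modelservice as the default, not as an explicit -t value.

```shell
# Sketch: one standup wrapper for both deploy methods.
# standup_with_method is an illustrative name.
standup_with_method() {
  local ns="$1" method="${2:-modelservice}"
  if [ "$method" = "standalone" ]; then
    llmdbenchmark --spec cicd/kind standup -p "$ns" -t standalone --skip-smoketest
  else
    # modelservice is the default, so no -t flag is passed
    llmdbenchmark --spec cicd/kind standup -p "$ns" --skip-smoketest
  fi
}
```

Usage: standup_with_method "$NS" for the modelservice path, or standup_with_method "$NS_SA" standalone for the standalone path.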
Clean up the deployment(s) but leave the Kind cluster itself running:
# Tear down the modelservice namespace
llmdbenchmark --spec cicd/kind teardown -p "$NS"
# Tear down the standalone namespace (if you ran step 5)
llmdbenchmark --spec cicd/kind teardown -p "$NS_SA" -t standalone

Or, if you're done with the cluster entirely, delete it wholesale:

kind delete cluster --name llmd-quickstart

- Docker not running: start Docker Desktop / Colima / Podman and retry.
- Low disk space: Kind needs free space in /tmp and /var/lib/docker. docker system prune -a frees cache space.
- Previous cluster still around: kind get clusters then kind delete cluster --name <name>.
- Insufficient CPU or memory on the Kind node: this is the most common issue on laptops. Run kubectl describe pod -n "$NS" <pod> and look for events like:

  Warning FailedScheduling 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.

  The cicd/kind scenario needs roughly 2.5 CPU across all pods (decode, prefill, EPP, gateway, harness). If your container runtime (Docker Desktop, Colima, Podman) defaults to 2 CPUs, the harness pod won't fit alongside everything else.

  Check your current allocation:

  # Docker Desktop / Colima / Podman - any of these will work:
  docker info 2>/dev/null | grep -E "CPUs|Total Memory"
  podman info 2>/dev/null | grep -E "cpus|memTotal"
  colima status 2>/dev/null
  # Or check what Kubernetes actually sees:
  kubectl describe node | grep -A6 "Allocated resources"

  Fix - increase CPUs to at least 4 (8 GiB RAM recommended):

  # Docker Desktop: Settings > Resources > CPUs: 4, Memory: 8 GiB
  # (no CLI option - must be done through the GUI)
  # Colima
  colima stop && colima start --cpu 4 --memory 8
  # Podman
  podman machine stop && podman machine set --cpus 4 --memory 8192 && podman machine start

  After changing resources, recreate the Kind cluster (the kubelet captures allocatable resources at node boot):

  kind delete cluster --name llmd-quickstart
  kind create cluster --name llmd-quickstart

  Then re-run standup from scratch.
- PVC stuck: kubectl get pvc -n "$NS" - the standard Kind storage class should provision immediately. If it does not, you're probably out of disk; see above.
- Image pull backoff: check kubectl describe pod -n "$NS" <pod> for the failing image and make sure your machine has network access to ghcr.io.
- Node selector mismatch: if kubectl describe pod -n "$NS" <pod> shows 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, print the node's labels with kubectl get node -o jsonpath='{.items[0].metadata.labels}' | jq and cross-check against the scenario's affinity.nodeSelector in config/scenarios/cicd/kind.yaml. On a standard Kind cluster this should always match because kubernetes.io/os=linux is a well-known label the kubelet sets automatically.
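To see how close your namespace is to the "roughly 2.5 CPU" figure from the Insufficient CPU item above, you can total the CPU requests yourself. This is a minimal sketch with illustrative function names. It counts regular containers only (init containers are ignored), and containers without a CPU request contribute nothing.

```shell
# Sketch: total the CPU requests of all containers in a namespace.
# Both function names are illustrative.

# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%.0f\n", cores * 1000 }' ;;
  esac
}

# Sum requests.cpu over every container of every pod in the namespace.
# Containers with no CPU request are absent from the output and count as 0.
total_cpu_requests() {
  local total=0 qty
  for qty in $(kubectl get pods -n "$1" \
      -o jsonpath='{.items[*].spec.containers[*].resources.requests.cpu}'); do
    total=$((total + $(cpu_to_millicores "$qty")))
  done
  echo "${total}m"
}
```

Usage: total_cpu_requests "$NS". Compare the result with the allocatable CPU that kubectl describe node reports for the Kind node.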
If the llmdbenchmark command is not found, you probably exited the shell between steps. Re-activate the venv:

cd llm-d-benchmark
source .venv/bin/activate

If a system tool such as kubectl or helm is missing, ./install.sh installs these for you. If you skipped it or ran from outside the repo, re-run:

./install.sh

(no -y - we want the .venv/ path, not system Python)
The facebook/opt-125m model is public and small. If the download fails, you most likely have:
- No network access from inside Kind pods (corporate proxy, air-gapped laptop): run kubectl logs -n "$NS" job/download-model --tail=50 to see the actual error.
- HuggingFace rate limiting: retry after a short wait, or set a HUGGING_FACE_HUB_TOKEN via -v HUGGING_FACE_HUB_TOKEN=<token>.
- kubectl get pods -n "$NS" - check if the harness pod is Pending. If kubectl describe pod -n "$NS" <harness-pod> shows Insufficient cpu or Insufficient memory, see Pods stuck in Pending above.
- If a previous run failed and left a stale harness pod, clean it up before retrying:

  kubectl delete pod -n "$NS" -l app=llmdbench-harness-launcher --ignore-not-found

- If you edited harness.resources in your scenario to reduce requests, you must re-run plan before run (no standup needed - the cluster infra is unchanged):

  llmdbenchmark --spec cicd/kind plan -p "$NS"
  llmdbenchmark --spec cicd/kind run -p "$NS"
Run with LLMDBENCH_LOG_LEVEL=DEBUG for verbose output:
LLMDBENCH_LOG_LEVEL=DEBUG llmdbenchmark --spec cicd/kind standup -p "$NS"

The workspace directory printed at the top of every run contains all rendered templates, step-by-step logs, and pod manifests. That's usually enough to pinpoint any failure without needing extra flags.
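If the workspace path has scrolled out of your terminal, the most recent one can usually be found again. The sketch below assumes the default /tmp/<user>-<timestamp> location mentioned in step 4 and treats "most recently modified" as a stand-in for "latest run". latest_workspace is a hypothetical helper, and it simply echoes LLMDBENCH_WORKSPACE when that variable is set.

```shell
# Sketch: print the most recently modified workspace directory.
# Assumes the default /tmp/<user>-<timestamp> location; latest_workspace
# is an illustrative name.
latest_workspace() {
  # An explicitly configured workspace wins over the default location.
  if [ -n "${LLMDBENCH_WORKSPACE:-}" ]; then
    echo "$LLMDBENCH_WORKSPACE"
    return 0
  fi
  local base="${1:-/tmp}" user="${2:-$USER}"
  # The trailing slash restricts the glob to directories; -t sorts newest first.
  ls -dt "$base/$user"-*/ 2>/dev/null | head -n 1
}
```

Usage: cd "$(latest_workspace)". The function prints nothing when no matching directory exists.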
You just ran the same lifecycle that CI exercises on every PR. From here, natural next steps are:
- Try a real GPU scenario: see config/specification/examples/gpu.yaml.j2 and run it against a cluster that has GPU nodes.
- Explore the well-lit paths: config/specification/guides/ has scenarios for inference-scheduling, inference-scheduling-wva, multi-model-wva, pd-disaggregation, precise-prefix-cache-aware, tiered-prefix-cache, and wide-ep-lws - each worth a read even if you don't run them.
- Try multi-model with WVA: multi-model-wva deploys two models behind one gateway with a single shared HTTPRoute and a single WVA controller autoscaling each pool independently. Standup: llmdbenchmark --spec examples/multi-model-wva standup -p <namespace>.
- Write a custom scenario: see the Developer Guide, Section 7 - "How to Add a New Scenario".
- Add a new benchmark step: see the Developer Guide, Section 2 - "How to Add a New Step".
- Set up pre-commit so your first PR passes CI on the first try: see Local Development Checks in CONTRIBUTING.md.
If you hit anything that didn't work for you in this guide, please open an issue - that's the fastest way to get the guide improved for the next person.