Usage

How to actually run vLLM v0.19.0 on Windows after installing it. Three modes are covered: (A) Python embedding, (B) an OpenAI-compatible HTTP server via vllm_launcher.py, and (C) the upstream vllm serve CLI.


(A) Python embedding

The simplest way: load a model and call .generate().

import os
# Required env vars on Windows
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["VLLM_HOST_IP"] = "127.0.0.1"

# Make sure CUDA + torch DLLs are findable
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# Stub uvloop (not available on Windows)
import sys
sys.modules.setdefault("uvloop", type(sys)("uvloop"))

from vllm import LLM, SamplingParams

llm = LLM(
    model=r"E:\models\Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="auto",          # or one of: isoquant3, isoquant4,
                                    # planarquant3, planarquant4,
                                    # turboquant25, turboquant35
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,             # cudagraph capture is slow on Win
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(
    ["Explain quantum entanglement to a 10-year-old:"],
    params,
)
print(outputs[0].outputs[0].text)

Multi-GPU note

Single GPU only on Windows. NCCL doesn't ship with PyTorch on Windows and the patch wires up FakeProcessGroup for single-rank operation. For multi-GPU, run separate vLLM instances on different GPUs and load-balance externally.


(B) OpenAI-compatible HTTP server (vllm_launcher.py)

vllm_launcher.py is a Windows-friendly OpenAI-compatible server. It's a single file, ~30 KB, that wraps the Python embedding API (mode A) in FastAPI endpoints.

Quick start

launch.bat

Without arguments, launch.bat shows an interactive model picker that scans models\ next to the script. With arguments:

launch.bat --model E:\models\Qwen3-14B-AWQ-4bit --port 8000

Full options

python vllm_launcher.py ^
    --model E:\models\Qwen3-14B-AWQ-4bit ^
    --port 8000 ^
    --host 127.0.0.1 ^
    --gpu-memory-utilization 0.85 ^
    --max-model-len 2048 ^
    --max-num-seqs 64 ^
    --enforce-eager ^
    --trust-remote-code
| Flag | Default | Notes |
| --- | --- | --- |
| --model | (required) | Path to a HuggingFace-format model directory |
| --port | 8000 | HTTP port |
| --host | 127.0.0.1 | Bind address |
| --gpu-memory-utilization | 0.85 | Fraction of GPU memory to use |
| --max-model-len | 2048 | Maximum context length |
| --max-num-seqs | 64 | Concurrent request limit |
| --max-num-batched-tokens | (auto) | Tokens per forward pass |
| --enforce-eager | False | Skip CUDAGraph capture (recommended on Windows) |
| --gpu-id | 0 | Which GPU to pin to (multi-GPU systems) |
| --enable-prefix-caching | True | Cache common prompt prefixes |
| --task | "generate" | "generate" or "embed" |
| --trust-remote-code | False | Required for some models |

Endpoints

| Method | Path | Notes |
| --- | --- | --- |
| GET | /v1/models | List loaded models (OpenAI-compatible) |
| POST | /v1/chat/completions | Chat completions (streaming + non-streaming) |
| POST | /v1/completions | Legacy text completions |
| GET | /health | Liveness check |
| POST | /shutdown | Graceful shutdown |

Example: chat completion

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-14b",
        "messages": [
            {"role": "user", "content": "Explain CUDA streams in 3 sentences."},
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
)
print(resp.json()["choices"][0]["message"]["content"])

Streaming

Set "stream": true in the request body. The server returns Server-Sent Events compatible with the OpenAI streaming format.
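A minimal client-side sketch for consuming the stream, assuming the server emits standard OpenAI-style `data: {...}` lines; the `parse_sse_line` helper is illustrative, not part of vllm_launcher.py:

```python
import json

def parse_sse_line(line: str):
    """Extract the delta text from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("content")

# Hypothetical usage against a running server:
# resp = requests.post(url, json={**body, "stream": True}, stream=True)
# for raw in resp.iter_lines(decode_unicode=True):
#     text = parse_sse_line(raw)
#     if text:
#         print(text, end="", flush=True)
```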

Tool calling

vllm_launcher.py parses tool calls from model output in two formats:

  • <tool_call>{...}</tool_call> tags (Qwen3 format)
  • Bare JSON objects with "name" and "arguments" keys

The parsed tool calls are returned in the OpenAI tool_calls field.
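The tag-based format can be extracted with a short regex. This is an illustrative sketch of parsing the first format, not the launcher's actual parser:

```python
import json
import re

# Matches <tool_call>{...}</tool_call> spans, including multi-line JSON bodies.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Return a list of {"name": ..., "arguments": ...} dicts from tagged output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed tag bodies
        if "name" in obj:
            calls.append(obj)
    return calls
```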

Multi-GPU multi-server

For multi-GPU systems, run one server per GPU on different ports:

start "vLLM GPU 0" python vllm_launcher.py --model M --gpu-id 0 --port 8000
start "vLLM GPU 1" python vllm_launcher.py --model M --gpu-id 1 --port 8001

Then load-balance with nginx or your own router.
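If nginx is overkill, a tiny client-side round-robin picker is enough. The endpoint list below is an assumption matching the two start commands above:

```python
import itertools

class RoundRobinRouter:
    """Cycle requests across several single-GPU vLLM servers."""

    def __init__(self, base_urls):
        self._cycle = itertools.cycle(base_urls)

    def next_url(self, path="/v1/chat/completions"):
        # Each call returns the next server in turn.
        return next(self._cycle) + path

router = RoundRobinRouter([
    "http://127.0.0.1:8000",  # GPU 0
    "http://127.0.0.1:8001",  # GPU 1
])
# requests.post(router.next_url(), json=payload)
```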


(C) Upstream vllm serve

vLLM v0.19.0 ships an OpenAI-compatible server out of the box via vllm serve. It works on Windows after the patches are applied:

vllm serve E:\models\Qwen3-14B-AWQ-4bit ^
    --dtype float16 ^
    --kv-cache-dtype isoquant3 ^
    --max-model-len 2048 ^
    --gpu-memory-utilization 0.85 ^
    --enforce-eager

The --kv-cache-dtype flag accepts any of the 6 Multi-TurboQuant methods plus the standard auto, fp8, fp8_e4m3, fp8_e5m2.

vllm_launcher.py (mode B) is generally the more reliable choice on Windows because it bypasses the multiprocess engine path. Try vllm serve first; if you hit ZMQ or asyncio errors, fall back to mode B.


Memory tuning

For a 24 GB GPU loading a 14B AWQ-4bit model:

| Setting | Recommended | Why |
| --- | --- | --- |
| gpu_memory_utilization | 0.85 | Leaves headroom for activations and Triton workspace |
| max_model_len | 2048-4096 | Larger contexts eat KV cache linearly |
| max_num_seqs | 64-128 | Higher = more parallelism, more KV cache |
| enforce_eager | True | CUDAGraph capture is slow + uses extra memory |
| kv_cache_dtype | auto or isoquant4 | isoquant4 doubles KV capacity at small quality cost |

For longer contexts, drop max_num_seqs to give each sequence more KV budget. For high concurrency, drop max_model_len.
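The trade-off is easy to estimate: worst-case KV cache size is roughly 2 (K and V) × layers × kv_heads × head_dim × bytes_per_element × max_model_len × max_num_seqs. The model shape below is a hypothetical 14B-class config with GQA, not read from any checkpoint, and the 1-byte figure for isoquant4 is an approximation of "half of fp16":

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim,
                 bytes_per_elem, max_model_len, max_num_seqs):
    """Rough worst-case KV cache footprint in GiB (all sequences at full length)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    total = per_token * max_model_len * max_num_seqs
    return total / (1024 ** 3)

# Hypothetical 14B-class shape: 40 layers, 8 KV heads (GQA), head_dim 128.
fp16 = kv_cache_gib(40, 8, 128, 2, max_model_len=2048, max_num_seqs=64)
iso4 = kv_cache_gib(40, 8, 128, 1, max_model_len=2048, max_num_seqs=64)
print(f"fp16: {fp16:.1f} GiB, isoquant4: ~{iso4:.1f} GiB")
```

This worst case rarely materializes, but it shows why dropping max_num_seqs or max_model_len is the first lever when memory is tight.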

If you see OutOfMemoryError:

  1. Lower gpu_memory_utilization by 0.1
  2. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  3. Switch to a TQ KV cache dtype to halve cache pressure
  4. Drop max_num_seqs and max_model_len

Picking a KV cache dtype

| Method | When to use |
| --- | --- |
| auto (fp16) | Default. Best speed, most memory. |
| isoquant4 | Recommended TQ. Half the memory, near-FP16 quality, no calibration needed. |
| planarquant4 | Same as isoquant4, with a simpler transform. |
| isoquant3 | Aggressive: 3.25 bits, visible quality loss. |
| planarquant3 | Same as isoquant3. |
| turboquant35 | TurboQuant balanced: calibrated outlier handling. |
| turboquant25 | Most aggressive: 2.25 bits, only for offline batch. |

Throughput note: all 6 TQ methods currently run with PyTorch-only encode/decode (no fused Triton kernel). Throughput drops ~30-300× depending on the method. Memory savings are real, throughput cost is the trade-off until the kernels get fused. See turboquant.md.


See also