How to actually run vLLM v0.19.0 on Windows after installing it. Three
modes covered: (A) Python embedding, (B) OpenAI-compatible HTTP
server via vllm_launcher.py, (C) the raw vllm serve upstream
CLI.
The simplest way: load a model and call `.generate()`.

```python
import os

# Required env vars on Windows
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["VLLM_HOST_IP"] = "127.0.0.1"

# Make sure CUDA + torch DLLs are findable
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# Stub uvloop (not available on Windows)
import sys
sys.modules.setdefault("uvloop", type(sys)("uvloop"))

from vllm import LLM, SamplingParams

llm = LLM(
    model=r"E:\models\Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="auto",        # or one of: isoquant3, isoquant4,
                                  # planarquant3, planarquant4,
                                  # turboquant25, turboquant35
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,           # CUDAGraph capture is slow on Windows
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(
    ["Explain quantum entanglement to a 10-year-old:"],
    params,
)
print(outputs[0].outputs[0].text)
```

Single GPU only on Windows. NCCL doesn't ship with PyTorch on Windows,
and the patch wires up FakeProcessGroup for single-rank operation.
For multi-GPU, run separate vLLM instances on different GPUs and
load-balance externally.
vllm_launcher.py is a Windows-friendly OpenAI-compatible server.
It's a single file, ~30 KB, that wraps the embedding API in FastAPI
endpoints.
```bat
launch.bat
```

Without arguments, launch.bat shows an interactive model picker that
scans `models\` next to the script. With arguments:

```bat
launch.bat --model E:\models\Qwen3-14B-AWQ-4bit --port 8000
```

Or invoke the launcher directly with the full flag set:

```bat
python vllm_launcher.py ^
  --model E:\models\Qwen3-14B-AWQ-4bit ^
  --port 8000 ^
  --host 127.0.0.1 ^
  --gpu-memory-utilization 0.85 ^
  --max-model-len 2048 ^
  --max-num-seqs 64 ^
  --enforce-eager ^
  --trust-remote-code
```

| Flag | Default | Notes |
|---|---|---|
| `--model` | (required) | Path to a HuggingFace-format model directory |
| `--port` | 8000 | HTTP port |
| `--host` | 127.0.0.1 | Bind address |
| `--gpu-memory-utilization` | 0.85 | Fraction of GPU memory to use |
| `--max-model-len` | 2048 | Maximum context length |
| `--max-num-seqs` | 64 | Concurrent request limit |
| `--max-num-batched-tokens` | (auto) | Tokens per forward pass |
| `--enforce-eager` | False | Skip CUDAGraph capture (recommended on Windows) |
| `--gpu-id` | 0 | Which GPU to pin to (multi-GPU systems) |
| `--enable-prefix-caching` | True | Cache common prompt prefixes |
| `--task` | "generate" | "generate" or "embed" |
| `--trust-remote-code` | False | Required for some models |
| Method | Path | Notes |
|---|---|---|
| GET | `/v1/models` | List loaded models (OpenAI-compatible) |
| POST | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| POST | `/v1/completions` | Legacy text completions |
| GET | `/health` | Liveness check |
| POST | `/shutdown` | Graceful shutdown |
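If the launcher loads the model before the HTTP server starts answering, `/health` won't respond until loading finishes, so scripts that start the server should wait before sending requests. A small readiness poller, sketched here (the base URL, timeout, and `wait_ready` name are assumptions, not part of the launcher):

```python
import time

import requests

def wait_ready(base="http://127.0.0.1:8000", timeout=120.0):
    """Poll GET /health until the server answers 200, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base}/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet (still loading the model)
        time.sleep(1)
    return False
```

With `timeout=0` the function returns `False` without issuing a request, which makes it safe to call unconditionally in automation.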
```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-14b",
        "messages": [
            {"role": "user", "content": "Explain CUDA streams in 3 sentences."},
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```

Set `"stream": true` in the request body. The server returns
Server-Sent Events compatible with the OpenAI streaming format.
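A minimal streaming client, as a sketch: it assumes the standard OpenAI SSE framing (`data: ` lines, terminated by `data: [DONE]`), and `parse_sse_delta` / `stream_chat` are helper names invented here, not launcher APIs.

```python
import json

import requests

def parse_sse_delta(line: bytes):
    """Extract the content delta from one 'data: ...' SSE line, or None."""
    if not line.startswith(b"data: "):
        return None
    payload = line[len(b"data: "):]
    if payload.strip() == b"[DONE]":
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content")  # first chunk may carry only a role

def stream_chat(prompt, base="http://127.0.0.1:8000"):
    """Print tokens as they arrive from the chat completions endpoint."""
    resp = requests.post(
        f"{base}/v1/chat/completions",
        json={
            "model": "qwen3-14b",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,  # don't buffer the whole response body
    )
    for line in resp.iter_lines():
        piece = parse_sse_delta(line)
        if piece is not None:
            print(piece, end="", flush=True)

if __name__ == "__main__":
    try:
        stream_chat("Explain CUDA streams in 3 sentences.")
    except requests.RequestException:
        print("no server on 127.0.0.1:8000")
```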
vllm_launcher.py parses tool calls from model output in two formats:

- `<tool_call>{...}</tool_call>` tags (Qwen3 format)
- Bare JSON objects with `"name"` and `"arguments"` keys

The parsed tool calls are returned in the OpenAI `tool_calls` field.
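For illustration, a parser covering both shapes could look like the sketch below. This is not the launcher's actual code, just the two formats it describes; `extract_tool_calls` is a name invented here.

```python
import json
import re

def extract_tool_calls(text: str):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    # Format 1: <tool_call>{...}</tool_call> tags (Qwen3 style)
    for m in re.finditer(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        calls.append(json.loads(m.group(1)))
    if calls:
        return calls
    # Format 2: the whole message is a bare JSON object with name/arguments
    try:
        obj = json.loads(text.strip())
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append(obj)
    except json.JSONDecodeError:
        pass  # plain prose, no tool call
    return calls
```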
For multi-GPU systems, run one server per GPU on different ports:

```bat
start "vLLM GPU 0" python vllm_launcher.py --model M --gpu-id 0 --port 8000
start "vLLM GPU 1" python vllm_launcher.py --model M --gpu-id 1 --port 8001
```

Then load-balance with nginx or your own router.
vLLM v0.19.0 ships an OpenAI-compatible server out of the box via
`vllm serve`. It works on Windows after the patches are applied:

```bat
vllm serve E:\models\Qwen3-14B-AWQ-4bit ^
  --dtype float16 ^
  --kv-cache-dtype isoquant3 ^
  --max-model-len 2048 ^
  --gpu-memory-utilization 0.85 ^
  --enforce-eager
```

The `--kv-cache-dtype` flag accepts any of the 6 Multi-TurboQuant
methods plus the standard `auto`, `fp8`, `fp8_e4m3`, `fp8_e5m2`.
vllm_launcher.py (mode B) is generally the more reliable choice on
Windows because it bypasses the multiprocess engine path. Try vllm serve first; if you hit ZMQ or asyncio errors, fall back to mode B.
For a 24 GB GPU loading a 14B AWQ-4bit model:
| Setting | Recommended | Why |
|---|---|---|
| `gpu_memory_utilization` | 0.85 | Leaves headroom for activations and Triton workspace |
| `max_model_len` | 2048-4096 | Larger contexts eat KV cache linearly |
| `max_num_seqs` | 64-128 | Higher = more parallelism, but more KV cache |
| `enforce_eager` | True | CUDAGraph capture is slow and uses extra memory |
| `kv_cache_dtype` | `auto` or `isoquant4` | iso4 doubles KV capacity at small quality cost |
For longer contexts, drop max_num_seqs to give each sequence more KV
budget. For high concurrency, drop max_model_len.
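The linear trade-off is easy to put numbers on. A back-of-the-envelope KV budget, as a sketch: the layer/head/dim figures below are illustrative GQA values for a ~14B-class model, not read from any real checkpoint; take the actual values from your model's config.json.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V vector per layer per KV head; fp16 = 2 bytes/element
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)
# Worst case: every slot full -> max_model_len * max_num_seqs tokens cached
budget_gib = per_token * 2048 * 64 / 2**30
print(per_token, round(budget_gib, 1))  # 163840 bytes/token, 20.0 GiB worst case
```

Under these assumptions, 64 sequences all at full 2048-token length would want 20 GiB of cache, far more than what a 24 GB card has left after the model weights, which is exactly why lowering `max_num_seqs`, `max_model_len`, or the cache dtype width relieves pressure.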
If you see `OutOfMemoryError`:

- Lower `gpu_memory_utilization` by 0.1
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- Switch to a TQ KV cache dtype to halve cache pressure
- Drop `max_num_seqs` and `max_model_len`
| Method | When to use |
|---|---|
| `auto` (fp16) | Default. Best speed, most memory. |
| `isoquant4` | Recommended TQ. Half the memory, near-FP16 quality, no calibration needed. |
| `planarquant4` | Same as iso4, simpler transform. |
| `isoquant3` | Aggressive: 3.25 bits, visible quality loss. |
| `planarquant3` | Same as iso3. |
| `turboquant35` | TurboQuant balanced: calibrated outlier handling. |
| `turboquant25` | Most aggressive: 2.25 bits, only for offline batch. |
Throughput note: all 6 TQ methods currently run with PyTorch-only encode/decode (no fused Triton kernel). Throughput drops ~30-300× depending on the method. Memory savings are real, throughput cost is the trade-off until the kernels get fused. See turboquant.md.
- install.md — install or build first
- troubleshooting.md — common errors
- turboquant.md — how the compression methods work
- benchmarks.md — real numbers