Native Windows builds of vLLM with full CUDA + Triton + Multi-TurboQuant KV cache compression. No WSL, no Docker, no Linux VM.
The current release is vLLM 0.19.0 for Windows (download). It ships:
- Pre-built wheel for Python 3.10 + PyTorch 2.10.0 + CUDA 12.6
- 33-file Windows compatibility patch against upstream vLLM 0.19.0
- New `multi_turboquant_kv.py` integration providing 6 KV cache compression methods with 2× cache capacity
- Custom Windows safetensors reader (model loading 29× faster than upstream on Windows)
- End-to-end test suite that proves each compression method actually affects inference (not a placebo)
| Section | Contents |
|---|---|
| 📦 Latest release | Pre-built wheel — just download and pip install |
| 🚀 Install | One-click installer or wheel install in your own venv |
| 📖 Usage | Python embedding + OpenAI HTTP server |
| 🎯 Multi-TurboQuant | All 6 KV cache compression methods |
| 📊 Benchmarks | Real numbers — load time, KV cache, throughput |
| 🛠️ Build from source | Apply the patch and compile vLLM yourself |
| 🏗️ Architecture | How the integration hangs together |
| 🐛 Troubleshooting | Common errors and fixes |
| Release | vLLM | PyTorch | KV compression | Date |
|---|---|---|---|---|
| v0.19.0-win | 0.19.0 | 2.10.0+cu126 | Multi-TurboQuant (6 methods) + fp8 | 2026-04-12 |
| v0.17.1-win | 0.17.1 | 2.10.0+cu126 | TurboQuant (2 recipes) | 2026-03-21 |
| v0.14.2-win | 0.14.2 | 2.9.1+cu126 | fp8 only | 2026-02-28 |
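These releases also ship vLLM's OpenAI-compatible HTTP server alongside the Python embedding API shown below. As a sketch of the server route, here is a minimal stdlib-only client; the base URL, port 8000, and the `/v1/completions` path are vLLM's upstream defaults and are assumptions here, not something this wiki specifies:

```python
import json
import urllib.request


def completion_request(prompt: str, model: str, max_tokens: int = 200) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def complete(prompt: str, model: str, base_url: str = "http://localhost:8000") -> str:
    """POST a completion request to a running vLLM server, return the text."""
    body = json.dumps(completion_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Any OpenAI SDK pointed at the same base URL should work the same way; the raw-`urllib` version is shown only to avoid extra dependencies.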
Minimal Python embedding example:

```python
import os

# Let the CUDA caching allocator grow segments instead of failing on fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Windows: CUDA and torch DLL directories must be registered explicitly
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# uvloop is Linux-only; stub the module so vLLM's import machinery succeeds
import sys; sys.modules.setdefault("uvloop", type(sys)("uvloop"))

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",  # 2× more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)

print(llm.generate(
    ["Explain CUDA streams in 3 sentences:"],
    SamplingParams(temperature=0.7, max_tokens=200),
)[0].outputs[0].text)
```

Multi-TurboQuant is a unified KV cache compression library with six methods:
| Method | Bits | Calibration | Memory savings | Notes |
|---|---|---|---|---|
| `isoquant4` | 4.25 | none | 2× | Recommended default |
| `planarquant4` | 4.25 | none | 2× | Same memory, simpler transform |
| `isoquant3` | 3.25 | none | 2× | Aggressive |
| `planarquant3` | 3.25 | none | 2× | Aggressive |
| `turboquant35` | 3.25 | runtime | 2× | Calibrated outliers |
| `turboquant25` | 2.25 | runtime | 2× | Most aggressive |
The cache is stored as packed uint8 instead of fp16, doubling KV
capacity at the same `gpu_memory_utilization`. The attention backend
decodes only the active blocks on each forward pass and runs the standard
Triton kernel on the decoded fp16 result.
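The packed-uint8 storage can be illustrated with a toy 4-bit packer (pure Python, not the library's actual kernel): two 4-bit codes share one byte, so the code portion of the cache occupies a quarter of the bytes an fp16 cache would need for the same values.

```python
def pack4(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    assert len(codes) % 2 == 0 and all(0 <= c <= 15 for c in codes)
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack4(packed: bytes) -> list[int]:
    """Inverse of pack4."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


codes = [3, 15, 0, 7, 9, 12, 1, 8]
packed = pack4(codes)
assert unpack4(packed) == codes
assert len(packed) == len(codes) // 2  # half a byte per 4-bit code
# fp16 needs 2 bytes per value, so the raw codes are 4× smaller; the wiki
# reports 2× end-to-end capacity, with the difference presumably going to
# scales, metadata, and block-allocator granularity.
```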
Trade-off: TurboQuant throughput drops roughly 30-300× because encode/decode runs in plain PyTorch (no fused Triton kernel yet). The memory savings are real; throughput is the cost. Best for offline, long-context, or batch workloads. See Multi-TurboQuant for the full picture.
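To see why halved KV bytes matter for long-context workloads, a back-of-envelope capacity calculation helps. The model dimensions below are illustrative placeholders, not measurements from this build:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical 40-layer model with 8 KV heads of dim 128
fp16_bpt = kv_bytes_per_token(40, 8, 128, 2)  # 163840 bytes/token at fp16
packed_bpt = fp16_bpt // 2                    # the wiki's 2× capacity claim

assert fp16_bpt == 163840
assert packed_bpt * 2 == fp16_bpt
# In an 8 GiB KV budget that is ~52k tokens at fp16 vs ~105k packed:
# twice the context (or twice the concurrent sequences) in the same VRAM.
```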
vLLM is the most popular open-source LLM serving engine, but it
officially supports only Linux. Building it from upstream on Windows
fails in dozens of places: MSVC-vs-gcc keyword operators, designated
initializers, `__builtin_clz`, variable templates with attributes,
NCCL, Gloo TCP, ZMQ IPC sockets, `fcntl`, fork-vs-spawn process startup,
the Windows commit-charge limit, paging file size, …
This repo collects all the fixes into a single auditable patch and ships it as both source patches and pre-built wheels. The most recent release also adds Multi-TurboQuant KV cache compression as a native integrated feature, not an external monkeypatch.