Native Windows builds of vLLM with full CUDA + Triton + Multi-TurboQuant KV cache compression. No WSL, no Docker, no Linux VM.
The current release is vLLM 0.19.0 for Windows (download). It ships:
- Pre-built wheel for Python 3.10 + PyTorch 2.10.0 + CUDA 12.6
- 33-file Windows compatibility patch against upstream vLLM 0.19.0
- New `multi_turboquant_kv.py` integration providing 6 KV cache compression methods with 2× cache capacity
- Custom Windows safetensors reader (model loading 29× faster than upstream on Windows)
- End-to-end test suite that proves each compression method actually affects inference (not a placebo)
| Section | Contents |
|---|---|
| 📦 Latest release | Pre-built wheel — just download and pip install |
| 🚀 Install | One-click installer or wheel install in your own venv |
| 📖 Usage | Python embedding + OpenAI HTTP server |
| 🎯 Multi-TurboQuant | All 6 KV cache compression methods |
| 📊 Benchmarks | Real numbers — load time, KV cache, throughput |
| 🛠️ Build from source | Apply the patch and compile vLLM yourself |
| 🏗️ Architecture | How the integration hangs together |
| 🐛 Troubleshooting | Common errors and fixes |
| Release | vLLM | PyTorch | KV compression | Date |
|---|---|---|---|---|
| v0.19.0-win | 0.19.0 | 2.10.0+cu126 | Multi-TurboQuant (6 methods) + fp8 | 2026-04-12 |
| v0.17.1-win | 0.17.1 | 2.10.0+cu126 | TurboQuant (2 recipes) | 2026-03-21 |
| v0.14.2-win | 0.14.2 | 2.9.1+cu126 | fp8 only | 2026-02-28 |
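These releases also ship vLLM's OpenAI-compatible HTTP server alongside the Python embedding API shown below. As a sketch of the server route, here is a minimal stdlib-only client; the base URL, port 8000, and the `/v1/completions` path are vLLM's upstream defaults and are assumptions here, not something this wiki specifies:

```python
import json
import urllib.request


def completion_request(prompt: str, model: str, max_tokens: int = 200) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def complete(prompt: str, model: str, base_url: str = "http://localhost:8000") -> str:
    """POST a completion request to a running vLLM server, return the text."""
    body = json.dumps(completion_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Any OpenAI SDK pointed at the same base URL should work the same way; the raw-`urllib` version is shown only to avoid extra dependencies.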
Minimal Python embedding example:

```python
import os

# Let the CUDA caching allocator grow segments instead of failing on fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Windows: CUDA and torch DLL directories must be registered explicitly
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# uvloop is Linux-only; stub the module so vLLM's import machinery succeeds
import sys; sys.modules.setdefault("uvloop", type(sys)("uvloop"))

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",  # 2× more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)

print(llm.generate(
    ["Explain CUDA streams in 3 sentences:"],
    SamplingParams(temperature=0.7, max_tokens=200),
)[0].outputs[0].text)
```

Multi-TurboQuant is a unified KV cache compression library with six methods:
| Method | Bits | Calibration | Memory savings | Notes |
|---|---|---|---|---|
| `isoquant4` | 4.25 | none | 2× | Recommended default |
| `planarquant4` | 4.25 | none | 2× | Same memory, simpler transform |
| `isoquant3` | 3.25 | none | 2× | Aggressive |
| `planarquant3` | 3.25 | none | 2× | Aggressive |
| `turboquant35` | 3.25 | runtime | 2× | Calibrated outliers |
| `turboquant25` | 2.25 | runtime | 2× | Most aggressive |
The cache is stored as packed uint8 instead of fp16, doubling KV
capacity at the same `gpu_memory_utilization`. The attention backend
decodes only the active blocks on each forward pass and runs the standard
Triton kernel on the decoded fp16 result.
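The packed-uint8 storage can be illustrated with a toy 4-bit packer (pure Python, not the library's actual kernel): two 4-bit codes share one byte, so the code portion of the cache occupies a quarter of the bytes an fp16 cache would need for the same values.

```python
def pack4(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    assert len(codes) % 2 == 0 and all(0 <= c <= 15 for c in codes)
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack4(packed: bytes) -> list[int]:
    """Inverse of pack4."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out


codes = [3, 15, 0, 7, 9, 12, 1, 8]
packed = pack4(codes)
assert unpack4(packed) == codes
assert len(packed) == len(codes) // 2  # half a byte per 4-bit code
# fp16 needs 2 bytes per value, so the raw codes are 4× smaller; the wiki
# reports 2× end-to-end capacity, with the difference presumably going to
# scales, metadata, and block-allocator granularity.
```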
Trade-off: TurboQuant throughput drops roughly 30-300× because encode/decode runs in plain PyTorch (no fused Triton kernel yet). The memory savings are real; throughput is the cost. Best for offline, long-context, or batch workloads. See Multi-TurboQuant for the full picture.
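To see why halved KV bytes matter for long-context workloads, a back-of-envelope capacity calculation helps. The model dimensions below are illustrative placeholders, not measurements from this build:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical 40-layer model with 8 KV heads of dim 128
fp16_bpt = kv_bytes_per_token(40, 8, 128, 2)  # 163840 bytes/token at fp16
packed_bpt = fp16_bpt // 2                    # the wiki's 2× capacity claim

assert fp16_bpt == 163840
assert packed_bpt * 2 == fp16_bpt
# In an 8 GiB KV budget that is ~52k tokens at fp16 vs ~105k packed:
# twice the context (or twice the concurrent sequences) in the same VRAM.
```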
vLLM is the most popular open-source LLM serving engine, but it
officially supports only Linux. Building it from upstream on Windows
fails in dozens of places: MSVC-vs-gcc keyword operators, designated
initializers, `__builtin_clz`, variable templates with attributes,
NCCL, Gloo TCP, ZMQ IPC sockets, `fcntl`, fork-vs-spawn process startup,
the Windows commit-charge limit, paging file size, …
This repo collects all the fixes into a single auditable patch and ships it as both source patches and pre-built wheels. The most recent release also adds Multi-TurboQuant KV cache compression as a native integrated feature, not an external monkeypatch.