rookiemann edited this page Apr 12, 2026 · 2 revisions

vllm-windows-build wiki

Native Windows builds of vLLM with full CUDA + Triton + Multi-TurboQuant KV cache compression. No WSL, no Docker, no Linux VM.

The current release is vLLM 0.19.0 for Windows (download). It ships:

  • Pre-built wheel for Python 3.10 + PyTorch 2.10.0 + CUDA 12.6
  • 33-file Windows compatibility patch against upstream vLLM 0.19.0
  • New multi_turboquant_kv.py integration providing 6 KV cache compression methods with 2× cache capacity
  • Custom Windows safetensors reader (model loading 29× faster than upstream on Windows)
  • End-to-end test suite that proves each compression method actually affects inference (not a placebo)

Quick links

📦 Latest release Pre-built wheel — just download and pip install
🚀 Install One-click installer or wheel install in your own venv
📖 Usage Python embedding + OpenAI HTTP server
🎯 Multi-TurboQuant All 6 KV cache compression methods
📊 Benchmarks Real numbers — load time, KV cache, throughput
🛠️ Build from source Apply the patch and compile vLLM yourself
🏗️ Architecture How the integration hangs together
🐛 Troubleshooting Common errors and fixes

Releases

| Release | vLLM | PyTorch | KV compression | Date |
| --- | --- | --- | --- | --- |
| v0.19.0-win | 0.19.0 | 2.10.0+cu126 | Multi-TurboQuant (6 methods) + fp8 | 2026-04-12 |
| v0.17.1-win | 0.17.1 | 2.10.0+cu126 | TurboQuant (2 recipes) | 2026-03-21 |
| v0.14.2-win | 0.14.2 | 2.9.1+cu126 | fp8 only | 2026-02-28 |

At a glance

import os
# Let the CUDA caching allocator grow segments on demand instead of
# pre-reserving large blocks (helps avoid fragmentation-related OOMs).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# Make the CUDA and PyTorch DLLs resolvable before vLLM's native extensions load.
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"C:\path\to\venv\Lib\site-packages\torch\lib")

# uvloop is Linux-only; register an empty stub module so vLLM's import succeeds.
import sys; sys.modules.setdefault("uvloop", type(sys)("uvloop"))
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Qwen3-14B-AWQ-4bit",
    dtype="float16",
    kv_cache_dtype="isoquant4",   # 2× more KV cache, near-FP16 quality
    max_model_len=2048,
    gpu_memory_utilization=0.85,
    enforce_eager=True,
)

print(llm.generate(
    ["Explain CUDA streams in 3 sentences:"],
    SamplingParams(temperature=0.7, max_tokens=200),
)[0].outputs[0].text)
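For intuition about what the KV cache costs per token, here is a back-of-envelope sizing in plain Python. The model dimensions below (num_layers, num_kv_heads, head_dim) are illustrative assumptions for a GQA model, not measured Qwen3-14B values:

```python
# Back-of-envelope KV cache sizing. All model dimensions here are
# assumptions chosen for illustration only.
num_layers = 40        # assumed transformer layer count
num_kv_heads = 8       # assumed grouped-query KV heads
head_dim = 128         # assumed per-head dimension

def kv_bytes_per_token(bytes_per_elem):
    # K and V each store num_kv_heads * head_dim elements per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token(2)    # fp16 cache: 2 bytes per element
packed = kv_bytes_per_token(1)  # packed uint8 cache: 1 byte per element

print(f"fp16:   {fp16 / 1024:.0f} KiB/token")    # 160 KiB with these dims
print(f"packed: {packed / 1024:.0f} KiB/token ({fp16 // packed}x capacity)")
```

Halving bytes per element is what lets the same gpu_memory_utilization budget hold twice the context.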

What is Multi-TurboQuant?

A unified KV cache compression library with six methods:

| Method | Bits | Calibration | Memory savings | Notes |
| --- | --- | --- | --- | --- |
| isoquant4 | 4.25 | none | 2× | Recommended default |
| planarquant4 | 4.25 | none | 2× | Same memory, simpler transform |
| isoquant3 | 3.25 | none | 2× | Aggressive |
| planarquant3 | 3.25 | none | 2× | Aggressive |
| turboquant35 | 3.25 | runtime | 2× | Calibrated outliers |
| turboquant25 | 2.25 | runtime | 2× | Most aggressive |
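One way to read the fractional bit counts is as payload bits plus a shared per-block scale. Assuming an 8-bit scale shared by each 32-value block (an assumption for illustration; the actual block size is not stated here), the arithmetic works out exactly:

```python
# Fractional bits per value if each 32-value block carries one 8-bit scale.
# The 32-value block size is an assumption, not taken from the library.
block_size = 32
scale_bits = 8

for payload_bits in (4, 3, 2):
    total = payload_bits + scale_bits / block_size
    print(f"{payload_bits}-bit payload -> {total:.2f} bits/value")
# 4 -> 4.25, 3 -> 3.25, 2 -> 2.25, matching the table above
```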

The cache is stored as packed uint8 instead of fp16, doubling KV capacity at the same gpu_memory_utilization. The attention backend decodes only active blocks on each forward pass and runs the standard Triton kernel on the decoded fp16 result.
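The packed-storage idea can be sketched in a few lines of NumPy. This is a generic illustration of 4-bit packing (two quantized codes per byte) with a lossless round trip, not Multi-TurboQuant's actual memory layout:

```python
import numpy as np

def pack4(codes):
    # codes: uint8 array of 4-bit values (0..15), even length.
    # Pack pairs of codes into one byte: high nibble, low nibble.
    v = np.asarray(codes, dtype=np.uint8)
    return (v[0::2] << 4) | v[1::2]

def unpack4(packed):
    # Recover the original 4-bit codes from the packed bytes.
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.array([1, 15, 7, 0, 9, 3], dtype=np.uint8)
packed = pack4(codes)
assert packed.nbytes == codes.size // 2        # half the storage
assert np.array_equal(unpack4(packed), codes)  # lossless round trip
```

In a real cache the unpacked codes would then be dequantized to fp16 before the attention kernel runs, which is the decode step the next paragraph's trade-off refers to.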

Trade-off: Multi-TurboQuant throughput drops roughly 30-300× because encode/decode currently runs in eager PyTorch (no fused Triton kernel yet). The memory savings are real; throughput is the cost. Best suited to offline, long-context, or batch workloads. See Multi-TurboQuant for the full picture.

Why this exists

vLLM is the most popular open-source LLM serving engine, but it officially only supports Linux. Building it from upstream on Windows fails in dozens of places: MSVC vs gcc keyword operators, designated initializers, __builtin_clz, variable templates with attributes, NCCL, Gloo TCP, ZMQ IPC sockets, fcntl, fork-vs-spawn, the Windows commit-charge limit, paging file size, …

This repo collects all the fixes into a single auditable patch and ships it as both source patches and pre-built wheels. The most recent release also adds Multi-TurboQuant KV cache compression as a native integrated feature, not an external monkeypatch.
