Troubleshooting
Common errors when building or running vLLM v0.19.0 on Windows.
vLLM's compiled `_C.pyd` extension can't find its CUDA / torch DLLs. The fix is to add the CUDA `bin` and the torch `lib` directories to the Python DLL search path before importing vLLM:
```python
import os
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin")
os.add_dll_directory(r"E:\path\to\venv\Lib\site-packages\torch\lib")
import vllm
```

The test scripts in tests/ already do this. If you're embedding vLLM in your own code, copy the pattern.
Windows uses commit charge (RAM + pagefile) to back any process allocation. If your pagefile is small or set to zero, even a process with plenty of free physical RAM can fail to allocate large buffers because the system commit limit is exhausted.
This shows up most often when loading large model weights — the embedding tensor of a 14B model is ~1.5 GB and needs a contiguous allocation.
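The ~1.5 GB figure is easy to sanity-check. A back-of-the-envelope calculation, assuming a Qwen-style 14B config (vocab size 152064, hidden size 5120, fp16 weights — these numbers are illustrative assumptions, not taken from this build):

```python
# Rough size of a 14B model's embedding tensor (illustrative config).
vocab_size = 152064    # assumed Qwen-like vocabulary size
hidden_size = 5120     # assumed hidden dimension
bytes_per_param = 2    # fp16 / bf16

size_gb = vocab_size * hidden_size * bytes_per_param / 1024**3
print(f"embedding tensor: {size_gb:.2f} GiB")  # ~1.45 GiB, allocated contiguously
```

The whole tensor must be committed as one contiguous block, which is exactly what a tight commit limit rejects.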
Fix 1: enable a system pagefile.
- Win+R → `sysdm.cpl` → Advanced → Performance → Settings → Advanced → Virtual memory → Change
- Uncheck "Automatically manage paging file size for all drives"
- Pick a drive with ≥16 GB free, choose "System managed size"
- Reboot
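To confirm commit pressure is actually the culprit, you can query the commit limit directly via the Win32 `GlobalMemoryStatusEx` API. A sketch using ctypes (the field layout follows the documented `MEMORYSTATUSEX` structure; Windows-only at the call site):

```python
import ctypes
import sys

class MEMORYSTATUSEX(ctypes.Structure):
    # Field layout mirrors the Win32 MEMORYSTATUSEX structure
    # (fixed 32/64-bit widths, so the struct is 64 bytes).
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),   # commit limit (RAM + pagefile)
        ("ullAvailPageFile", ctypes.c_uint64),   # remaining commit headroom
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]

def commit_headroom_gib() -> float:
    status = MEMORYSTATUSEX(dwLength=ctypes.sizeof(MEMORYSTATUSEX))
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
    return status.ullAvailPageFile / 1024**3

if sys.platform == "win32":
    print(f"commit headroom: {commit_headroom_gib():.1f} GiB")
```

If the headroom printed here is smaller than the tensor you're loading, the allocation will fail regardless of free physical RAM.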
Fix 2: this build's custom safetensors reader (in
vllm/model_executor/model_loader/weight_utils.py) bypasses the
problem by using numpy.memmap (file-backed, no commit charge) plus
chunked GPU streaming. It's already enabled — if you're seeing this
error you may have an older patched build. Re-apply
vllm-windows-v3.patch.
PyTorch's caching allocator can't satisfy a contiguous allocation even when the GPU has plenty of free memory. This is fragmentation.
Fix: enable expandable segments before importing vLLM:
```bat
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Or in Python:
```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import vllm  # must be after the env var
```

If you still get OOM, lower `gpu_memory_utilization` (try 0.5 first, then 0.4).
You're running an old PyTorch + new vLLM combo. The custom code uses `torch.unique(t, sorted=True)`, which returns a single tensor on PyTorch 2.10+. Make sure your venv has `torch==2.10.0`.
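If you can't pin the torch version, the call site can be made tolerant of both return shapes. This shim is illustrative, not part of the patch:

```python
def unique_values(result):
    """Normalize the return value of torch.unique.

    Some PyTorch builds hand back a tuple (values, ...) while others
    return the values tensor directly; accept either shape.
    """
    return result[0] if isinstance(result, tuple) else result
```

Usage: `vals = unique_values(torch.unique(t, sorted=True))`.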
safetensors crashed inside vLLM's engine context. This is the same root cause as OSError 1455 — Windows commit limit. Enable a pagefile or use the patched safetensors reader. See above.
Same root cause family as the two errors above. Apply the OSError 1455 fix.
There are stale CMake caches in vllm-source/.deps/ from a previous
build that used a different generator. Clean them:
```bat
rmdir /s /q vllm-source\.deps
del /s /q vllm-source\build
```

Then re-run build.bat.
MSVC doesn't accept or / and / not as keywords by default. The
patch fixes every known instance — make sure vllm-windows-v3.patch
applied cleanly:
```bat
cd vllm-source
git apply --check ..\vllm-windows-v3.patch
```

If the check fails, the patch is partially applied or the source has been modified. Reset:
```bat
cd vllm-source
git checkout v0.19.0
git reset --hard v0.19.0
git apply ..\vllm-windows-v3.patch
```

Your CUDA toolkit doesn't support Blackwell (SM 12.0). Either:
- Lower `TORCH_CUDA_ARCH_LIST` (drop `12.0` from the list)
- Or upgrade to CUDA 12.8+
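Dropping the arch can be scripted in the environment that launches the build; a sketch, where the default arch values shown are illustrative, not this build's actual list:

```python
import os

# Example list; the real value depends on your build configuration.
archs = os.environ.get("TORCH_CUDA_ARCH_LIST", "8.0;8.6;8.9;9.0;12.0")
kept = [a for a in archs.split(";") if a != "12.0"]  # drop Blackwell (SM 12.0)
os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(kept)
print(os.environ["TORCH_CUDA_ARCH_LIST"])
```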
You hit MSVC's nested-block limit on the auto-generated Marlin kernel
selector. The patch converts these from else if chains to flat if
chains, which avoids the limit. If you still see this, the patch isn't
applied — see the previous section.
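The transformation can be illustrated with a toy generator; this is a sketch of the idea, not the patch's actual codegen:

```python
def emit_selector(conditions, flat=True):
    """Emit a C++ dispatch chain for (condition, call) pairs.

    flat=False produces an `else if` chain, which MSVC rejects past its
    nesting limit on very long auto-generated selectors; flat=True emits
    independent `if` blocks that return early, keeping nesting depth 1
    while preserving first-match semantics.
    """
    lines = []
    for i, (cond, call) in enumerate(conditions):
        kw = "if" if flat or i == 0 else "else if"
        lines.append(f"{kw} ({cond}) {{ return {call}; }}")
    return "\n".join(lines)

cases = [(f"bits == {b}", f"kernel_{b}()") for b in (4, 8, 16)]
print(emit_selector(cases, flat=True))
```

Because every flat `if` returns, at most one branch runs, so the behavior matches the nested chain.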
nvcc ran out of memory during template instantiation. This is most
common on the Marlin kernel files. Lower MAX_JOBS:
```bat
set MAX_JOBS=2
```

Or `MAX_JOBS=1` on a 16 GB RAM machine.
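A rough heuristic for choosing MAX_JOBS up front, assuming each nvcc instance can peak at several GB on the Marlin files (the ~8 GB/job figure here is a guess, not a measurement):

```python
import os

ram_gb = 32      # set to your machine's RAM
gb_per_job = 8   # assumed peak nvcc footprint on the heavy Marlin files
jobs = max(1, min(os.cpu_count() or 1, ram_gb // gb_per_job))
os.environ["MAX_JOBS"] = str(jobs)
print(jobs)
```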
The Windows SDK Resource Compiler (rc.exe) isn't on PATH. The build
needs both MSVC and the Windows SDK 10.0.19041 or newer. Open a
"Developer Command Prompt for VS 2022" and run where rc.exe — if
nothing is found, install the Windows SDK component in the VS Installer.
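A quick way to check all the required tools at once from Python (the tool names are the standard MSVC, Windows SDK, and CUDA executables):

```python
import shutil

# cl = MSVC compiler, rc = Windows SDK resource compiler, nvcc = CUDA
for tool in ("cl", "rc", "nvcc"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND - run from a Developer Command Prompt'}")
```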
The patch is partially applied. Run:
```bat
cd vllm-source
git status
```

to see what's modified. If the modifications don't match the patch's expected changes, reset and re-apply:
```bat
cd vllm-source
git checkout .
cd ..
build.bat
```

NCCL doesn't ship with PyTorch on Windows. The patch wires up
FakeProcessGroup so single-GPU operation still works. This warning
is expected.
This is from multi_turboquant. The build uses triton-windows
(separate package, ships only Triton) so Triton itself does work. The
warning is misleading — your build does have Triton kernels.
`libnvrtc.so` is a Linux library; Windows uses `nvrtc64_*.dll`. Harmless.