# Benchmarks
All numbers below are from the same hardware running the same model with
the same settings, varying only `kv_cache_dtype`. Source data is in
`tests/`.
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 3090 (24 GB, SM 8.6) |
| CPU | (16-core, 64 GB RAM) |
| OS | Windows 10 Pro 22H2 |
| Driver | 591.86 |
| CUDA | 12.6 |
| PyTorch | 2.10.0+cu126 |
| vLLM | 0.19.0+cu126 |
## Qwen3-14B-abliterated-AWQ-4bit
| Property | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | 14.8B |
| Quantization | AWQ-4bit (compressed-tensors / Marlin) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 40 |
| KV heads | 8 |
| Head dim | 128 |
| Tokenizer vocab | 151,936 |
| Disk size | 9.4 GB |
| GPU weights | 9.36 GiB |
| Step | Time |
|---|---|
| CMake configure | ~3 min |
| Compile (140 targets, MAX_JOBS=4) | ~28 min |
| Total | ~31 min |
Resulting wheel size: 201 MB.
| Step | Time | Notes |
|---|---|---|
| Custom mmap reader (load weights) | 6.5 s | `numpy.memmap` + chunked GPU streaming |
| Marlin layout repack + Triton compile | ~3 s | one-time per process |
| Total `LLM()` init | ~10 s | |
| Compare: original safetensors mmap | ~189 s | unworkable on systems with a small pagefile |
The Windows-specific safetensors reader is 29× faster than the upstream mmap path on Windows. The original path also fails outright when the pagefile is too small to satisfy fragmented allocations mid-load — see Troubleshooting.
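The chunked-streaming idea behind the custom reader can be sketched as follows. This is an illustrative simplification, not the project's actual reader: the function name and parameters are hypothetical, and the real implementation parses the safetensors header and copies each chunk to the GPU rather than into a host buffer.

```python
import numpy as np

def stream_flat_tensor(path, shape, dtype=np.float16, chunk_bytes=64 << 20):
    """Stream a flat on-disk tensor in fixed-size chunks via numpy.memmap.

    Hypothetical sketch of chunked streaming: bounding each copy to
    chunk_bytes avoids the single huge allocation that can exhaust a
    small Windows pagefile when the whole file is materialised at once.
    """
    mm = np.memmap(path, dtype=dtype, mode="r")
    out = np.empty(mm.size, dtype=dtype)  # destination (GPU buffer in the real reader)
    step = max(1, chunk_bytes // np.dtype(dtype).itemsize)
    for i in range(0, mm.size, step):     # one bounded chunk at a time
        out[i:i + step] = mm[i:i + step]
    return out.reshape(shape)
```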
Configuration: `max_model_len=512`, `gpu_memory_utilization=0.5`,
`max_num_seqs=4`, `block_size=16`, single 24 GB RTX 3090.
| dtype | Bytes/slot | KV tokens | Concurrency @ 512 | Cache vs FP16 |
|---|---|---|---|---|
| `auto` (fp16) | 256 | 16,336 | 31.91× | 1.00× |
| `isoquant3` | 128 | 32,672 | 63.94× | 2.00× |
| `isoquant4` | 128 | 32,672 | 63.94× | 2.00× |
| `planarquant3` | 128 | 32,672 | 63.94× | 2.00× |
| `planarquant4` | 128 | 32,672 | 63.94× | 2.00× |
| `turboquant25` | 128 | 32,672 | 63.94× | 2.00× |
| `turboquant35` | 128 | 32,672 | 63.94× | 2.00× |
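The capacity numbers cross-check against the model spec. Per token, the fp16 cache stores K and V for 40 layers × 8 KV heads × 128 head dims at 2 bytes each; a quick back-of-the-envelope check (vLLM's exact free-memory accounting differs slightly):

```python
# Sanity-check the fp16 row above using the model spec.
layers, kv_heads, head_dim = 40, 8, 128
fp16_bytes = 2  # bytes per element

# K and V, per token, across all layers and KV heads
bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(bytes_per_token)            # 163840, i.e. 160 KiB per token

kv_tokens = 16_336                # from the fp16 row
cache_gib = kv_tokens * bytes_per_token / 2**30
print(round(cache_gib, 2))        # 2.49 GiB of KV cache

# Concurrency at max_model_len=512
print(round(kv_tokens / 512, 2))  # 31.91
```

About 2.49 GiB of cache is consistent with a 12 GB budget (`gpu_memory_utilization=0.5` of 24 GB) minus 9.36 GiB of weights plus activation overhead.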
Each TQ method also produces a unique numerical signature in inference
output (verified by `tests/test_tq_real.py`):

| dtype | Completion (greedy, 20 tokens) |
|---|---|
| `auto` | ' Paris. What is the capital of Italy? The capital of Italy is Rome.' |
| `isoquant3` | ' Paris, and the capital of Canada is Ottawa.' |
| `isoquant4` | ' Paris. The capital of Italy is Rome. The capital of Spain is Madrid.' |
| `planarquant3` | ' Paris. The capital of France is Paris. The capital of France is...' |
| `planarquant4` | ' Paris. What is the capital of Italy? The answer is Rome.' |
| `turboquant25` | ' Paris. The capital of France is Paris. So, the capital of France...' |
| `turboquant35` | ' Paris. The capital of Italy is Rome. The capital of Spain is Madrid.' |
Single 24 GB RTX 3090, single prompt, 20 tokens, deterministic sampling
(`temperature=0.0`, `seed=0`).
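For reference, the benchmark settings map onto vLLM's engine and sampling arguments roughly as follows. This is a sketch: the model path is an example, and the quantised `kv_cache_dtype` names exist only in this fork (upstream vLLM accepts values such as `"auto"` or `"fp8"`).

```python
# Benchmark settings expressed as vLLM arguments. Only kv_cache_dtype
# varies between rows; everything else is held fixed. The quantised
# dtype names (isoquant*, planarquant*, turboquant*) are fork-specific.
engine_args = dict(
    model=r"E:\models\Qwen3-14B-abliterated-AWQ-4bit",  # example path
    max_model_len=512,
    gpu_memory_utilization=0.5,
    max_num_seqs=4,
    block_size=16,
    kv_cache_dtype="turboquant25",  # or "auto", "isoquant3", ...
)

# Deterministic sampling used for the signature and throughput runs.
sampling_args = dict(temperature=0.0, seed=0, max_tokens=20)

# With vLLM installed and a CUDA GPU available:
#   from vllm import LLM, SamplingParams
#   llm = LLM(**engine_args)
#   out = llm.generate(["The capital of France is"],
#                      SamplingParams(**sampling_args))
```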
| dtype | tok/s | Generation cost |
|---|---|---|
| `auto` (fp16) | ~37 | (baseline) |
| `isoquant3` | ~0.12 | encode + decode in PyTorch, no fused kernel |
| `isoquant4` | ~0.12 | same |
| `planarquant3` | ~0.12 | same |
| `planarquant4` | ~0.12 | same |
| `turboquant25` | ~1.0 | turbo path is more vectorised |
| `turboquant35` | ~1.0 | same |
The compression methods are correct but not fast: encode and decode currently run as vectorised PyTorch ops without a fused Triton kernel. They give you real memory savings (the cache is half the size on disk and in VRAM) but cost roughly 30-300× in generation throughput.
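In spirit, the vectorised path is a per-block scale-and-round expressed in plain tensor ops. A minimal numpy sketch of the general technique, not the fork's actual codecs (the isoquant/planarquant/turboquant formats differ in layout and bit allocation):

```python
import numpy as np

def encode_4bit(x: np.ndarray, block: int = 32):
    """Symmetric 4-bit blockwise quantisation: one fp16 scale per block,
    two 4-bit codes packed per byte. Illustrative only."""
    x = x.reshape(-1, block).astype(np.float32)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0      # int4 range [-7, 7]
    scale[scale == 0] = 1.0                                 # avoid divide-by-zero
    q = (np.clip(np.round(x / scale), -7, 7) + 7).astype(np.uint8)  # shift to [0, 14]
    packed = ((q[:, ::2] << 4) | q[:, 1::2]).astype(np.uint8)       # 2 codes/byte
    return packed, scale.astype(np.float16)

def decode_4bit(packed: np.ndarray, scale: np.ndarray, block: int = 32):
    """Unpack the nibbles and rescale back to fp16."""
    q = np.empty((packed.shape[0], block), dtype=np.uint8)
    q[:, ::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return ((q.astype(np.float32) - 7.0) * scale).astype(np.float16).ravel()
```

Because both steps are ordinary elementwise tensor ops, they round-trip correctly but launch many small kernels per decode; a fused kernel would do the unpack, rescale, and attention read in one pass.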
This makes them suitable for:
- Offline / batch inference where memory is the bottleneck
- Long-context workloads where the KV cache dominates
- Validation that the methods work end-to-end
They're not yet suitable for:
- Latency-sensitive online serving
- High-QPS workloads
A fused Triton kernel for encode and decode would close most of this gap. The wiring is already in place; that optimization is the next milestone.
```bat
set VLLM_MODEL_PATH=E:\models\Qwen3-14B-abliterated-AWQ-4bit
set VLLM_PYTHON=E:\vllm-windows-build\venv\Scripts\python.exe
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

REM Smoke test (~30 sec)
%VLLM_PYTHON% tests\test_v19.py

REM TQ correctness sweep (~15 min)
%VLLM_PYTHON% tests\test_tq_real.py

REM Full benchmark (~70 min)
%VLLM_PYTHON% tests\test_tq_thorough.py
```