Common questions and misconceptions, updated 2026-04-17. For the broader 2026 landscape, see LANDSCAPE_2026.md. For concrete integrations, see INTEGRATIONS.md.
No. TurboQuant compresses the KV cache at inference time. It does not touch model weights. It stacks on top of AWQ / GPTQ / GGUF / NVFP4 / MXFP4.
A typical 2026 production stack:
Weights: INT4 (AWQ / GPTQ) or NVFP4 (Blackwell) → ~4× parameter memory savings
×
KV cache: TurboQuant 3.5-bit → ~4.9× KV memory savings
You multiply the savings. A Llama-3.1-8B serving rig that used to need 40 GB (16 GB weights + 16 GB KV + overhead) at 128K context drops to roughly 8 GB.
See Mustafa Genc's Towards AI walkthrough (Apr 15, 2026) for a detailed breakdown of the Q4_K_M × TurboQuant stack.
The paper shows three regimes:
| Bits | LongBench vs FP16 | Use when |
|---|---|---|
| 3.5 (default) | Identical (50.06 vs 50.06) | You want zero perceptible quality loss |
| 2.5 | Marginal degradation (49.44) | You need maximum memory savings and can tolerate <1 pt |
| 4.5 / 5.5 | Indistinguishable from 3.5 | Rarely worth it — diminishing returns |
"3.5-bit" is a mode name from the paper, not a literal bit count. In the default
"3.5-bit" configuration, TurboQuant splits channels into 32 outlier channels at 4
MSE bits and 96 regular channels at 3 MSE bits, plus 1 QJL residual bit per coordinate
— weighted-average bpv is (32×4 + 96×3)/128 + 1 = 4.25 total, about 3.25 effective
after accounting for the small norm overhead (see BENCHMARKS.md).
The "3.5-bit" label is the paper's chosen name; the exact per-channel budget is controlled
by MixedPrecisionConfig in src/cache.py via b_mse / b_outlier —
see IMPLEMENTATION_NOTES.md for the exact outlier handling.
The default in TurboQuantCache(mixed_precision=True, b_mse=2) is actually closer to
the "2.5-bit" mode (3-bit outliers, 2-bit regular); bump b_mse=3 to get the paper's
"3.5-bit" mode.
No. TurboQuant is data-oblivious — the random rotation and Lloyd-Max codebook
are fixed at initialization based only on the head dimension d and seed. This is the
key differentiator from KVQuant / SmoothQuant / AWQ-style methods, all of which require a
calibration dataset.
The only "online" piece is outlier-channel detection, which reads one batch of K/V vectors to decide which 32 of the 128 channels get extra bits. This happens once per layer/head at model load, not per-request.
| Property | FP8 KV | KIVI (2-bit) | KVQuant | TurboQuant 3.5-bit |
|---|---|---|---|---|
| Bits/value | 8 | 2.25 | ~3.5 | 3.5 |
| Compression vs FP16 | 2× | 7.1× | ~4.5× | 4.9× |
| LongBench Avg (Llama-3.1-8B) | 50.0 | 48.50 | ~49.5 | 50.06 |
| Requires calibration | No | No | Yes | No |
| Unbiased attention scores | No | No | No | Yes (QJL residual) |
| Outlier handling | Per-tensor scale | Per-channel K | Dense-sparse split | Two-rotation split |
| Works on RoPE'd K | Yes | Per-channel helps here | Pre-RoPE preferred | Yes, post-RoPE |
The unbiased inner product estimator (the 1-bit QJL residual stage) is what makes TurboQuant's attention scores mathematically unbiased, not just MSE-minimizing. This is why it preserves needle-in-haystack retrieval at 128K context where the other methods start to miss specific tokens.
Yes. TurboQuant applies after RoPE has been applied to K (i.e., on the same K tensor the attention kernel would see). The random rotation Π is orthogonal and applied to the whole vector, so it doesn't interact with RoPE's per-pair rotation.
That said, for eviction methods (H2O, SnapKV, TriAttention) the story is different — TriAttention (Apr 2026) specifically exploits the pre-RoPE space where Q/K cluster around fixed centers. The methods are complementary: TriAttention decides which tokens survive; TurboQuant compresses each survivor.
- GQA / MQA — yes, trivially. TurboQuant operates per-head independently, so sharing K/V across query heads is transparent.
- MLA (DeepSeek-V3 style) — yes. MLA stores a down-projected latent; TurboQuant compresses the latent vectors the same way it compresses standard K/V vectors.
- LRKV (Fin.AI, Apr 2026) — yes. LRKV stores a shared basis + per-head residual; both are vectors that can be TurboQuant-compressed.
- FlashAttention-2 (H100 mainstream): The reference path doesn't fuse with FA2 directly. Use the vLLM/SGLang plugin, which dequantizes the KV tile inside the attention kernel.
- FlashAttention-3 (H100 optimized): Same story — use the engine plugin.
- FlashAttention-4 (Blackwell native, Mar 2026): FA4 was co-designed with FP4 tensor cores in mind. The natural stack on B200 is FA4 + NVFP4 container + TurboQuant 3.5-bit encoding. See LANDSCAPE_2026.md §NVFP4 / MXFP4 KV.
| Hardware | Status | Notes |
|---|---|---|
| NVIDIA H100 / H200 (Hopper) | First-class | FP8 weights + TurboQuant KV is the standard datacenter combo in 2026 |
| NVIDIA B100 / B200 (Blackwell, SM100) | First-class | Native FP4 → NVFP4 weights + TurboQuant KV |
| NVIDIA RTX PRO 6000 Blackwell 96 GB (SM120) | Working with workarounds | See Allen Kuo (Apr 16) for the current WSL2/flashinfer workaround |
| NVIDIA RTX 5090 (SM120) | Working with workarounds | Same SM120 path as RTX PRO 6000, smaller VRAM |
| NVIDIA RTX 4090 / 4080 (Ada) | Fully supported | AWQ-INT4 + TurboQuant KV is the go-to consumer combo. TriAttention enables single-4090 OpenClaw |
| NVIDIA A100 (Ampere) | Fully supported | FP8 not native; use INT8 weights + TurboQuant KV |
| AMD MI300X / MI325X | Supported via PyTorch | vLLM AMD backend works; ROCm Triton kernels a bit behind CUDA |
| Apple M3 / M4 / M5 | PyTorch / MPS path | No FP4; first M5 Max LLM benchmarks vs RTX PRO 6000 |
| Jetson Orin / Thor | Edge path via Adaptive KV-Quant | Per-token bit allocation beats static at this budget |
Not inherently. But most 2026 long-context deployments combine TurboQuant with an eviction method like H2O, SnapKV, StreamingLLM, PyramidKV, or TriAttention. Those methods already handle attention sinks (usually by pinning the first N tokens in FP16).
Attention-sink preservation for a pure-TurboQuant deployment is listed as a planned benchmark in BENCHMARKS.md. For now, if you're running beyond 64K context, stack TurboQuant with KVPress or TriAttention.
On current hardware (Apr 2026):
| Metric | FP16 baseline | TurboQuant 3.5-bit | TurboQuant 2.5-bit |
|---|---|---|---|
| LongBench Avg (Llama-3.1-8B) | 50.06 | 50.06 | 49.44 |
| Needle @ 128K | 0.997 | 0.997 | 0.997 |
| KV memory | 1.0× | 0.20× | 0.14× |
| H100 attention throughput (estimated, BENCHMARKS.md) | 1.0× | 4.4× | 4.6× |
| RTX 5090 attention throughput | 1.0× | 4.3× | 4.5× |
| On-paper distortion gap vs info-theoretic optimal | — | 2.7× | 2.7× |
The throughput win comes from memory bandwidth: the attention kernel reads 52 bytes per KV vector instead of 256.
Probably, yes. KV cache is a small fraction of memory at short contexts, so the absolute GB saved is small. Stick with FP16 or FP8 KV for chat-bot workloads that stay under 8K.
The crossover point where TurboQuant starts earning its keep:
- Long context (32K+): worth it.
- Reasoning / chain-of-thought (tens of thousands of generated tokens): worth it.
- RAG with long documents: worth it.
- Chat under 4K: marginal.
The quote is from the LMCache blog post (Apr 15, 2026), paraphrasing discussion on X/Twitter. Our read:
- The hype is largely earned. Data-oblivious, provably near-optimal, no calibration, 3.5 bpv at FP16 quality is genuinely new.
- It's not the only 2026 breakthrough. TriAttention (Apr 6), LRKV (Apr 9), Adaptive KV-Quant (Apr 6), and NVFP4 KV (Apr 2) are all major shifts. TurboQuant is the precision story; the others are selection, architecture, and hardware stories.
- The best 2026 stack combines them. See LANDSCAPE_2026.md §Decision guide.
Use the paper's methodology — LongBench + Needle-In-A-Haystack at your target
context length. Cosine similarity on random vectors (what src/demo.py reports) is a
useful sanity check but not a substitute for end-to-end eval.
Concrete steps:
- Pick your target engine (vLLM, SGLang, llama.cpp, KVPress) — see INTEGRATIONS.md.
- Run LongBench with FP16 KV to establish a baseline.
- Switch to
turboquant_3bit(or our plugin's default config) and re-run. - Expected delta on Llama-3.1-8B: < 0.1 pts on LongBench Avg, zero change on needle.
- If you see larger deltas, check:
- Is your outlier-channel detection actually being applied? (It's only applied when
MixedPrecisionConfig.enable=True.) - Is
qjl_score_weight=1.0? (Lower values trade bias for variance.) - Is your head dim a power of 2? (Hadamard rotation pads if not; pad token can skew norms.)
- Is your outlier-channel detection actually being applied? (It's only applied when
Cite the paper, not this implementation:
@inproceedings{zandieh2026turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}If you want to credit this open-source port specifically, you can link to the repo.
Start here (sorted by how layperson-friendly they are):
- LMCache Blog (Apr 15, 2026) — "in laymen's term" explainer, no math. Best starting point.
- Towards AI: Running a 35B Model Locally with TurboQuant (Apr 15, 2026) — hands-on consumer-GPU walkthrough.
- MarkTechPost: NVIDIA KVPress end-to-end guide (Apr 9, 2026) — how to stack KVPress eviction under a TurboQuant-style precision backend.
- ArXivIQ FlashAttention-4 analysis (Mar 13, 2026) — FA4 paper summary relevant to Blackwell quantized KV.
- The original paper (arXiv 2504.19874) — 14 pages, math-heavy but worth it.
- ICLR 2026 poster (Apr 25, 2026 PDT) — meet the authors.