vllm-qdq-plugin

Out-of-tree vLLM plugin that simulates activation quant-dequant (QDQ) before quantized GEMM kernels. Useful for studying the accuracy impact of "real" quantized compute vs weight-only dequant approaches.

How It Works

The plugin registers as a vllm.general_plugins entry point, which vLLM loads automatically in all processes (main + workers). It monkey-patches the low-level op wrappers in vllm._custom_ops to inject QDQ on input activations before the actual kernel call. This means:

- Zero vLLM source modifications
- Works with both the LLM() Python API and vllm serve
- Covers all call sites automatically (dense linear, MoE gate+up, MoE down)
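
The patching idea can be illustrated with a small, self-contained sketch. It is not the plugin's actual code: `patch_op`, `register`, and `fake_ops` are hypothetical names, and `fake_ops` is a stand-in object rather than the real vllm._custom_ops module. The sketch also assumes the activation is passed as a positional argument.

```python
import functools
import os
from types import SimpleNamespace


def patch_op(module, op_name, qdq_fn, arg_index=0):
    """Replace module.<op_name> with a wrapper that runs qdq_fn on one positional argument."""
    original = getattr(module, op_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # Assumption: the activation is passed positionally at arg_index.
        if len(args) > arg_index:
            args = list(args)
            args[arg_index] = qdq_fn(args[arg_index])
        return original(*args, **kwargs)

    setattr(module, op_name, wrapper)
    return original  # keep a handle so the patch can be undone


def register(ops_module, qdq_fn, env=None):
    """Hypothetical entry-point target: patch only when VLLM_QDQ=1."""
    env = os.environ if env is None else env
    if env.get("VLLM_QDQ", "0") != "1":
        return False
    patch_op(ops_module, "gemm", qdq_fn)
    return True


# Stand-in for a module of op wrappers; this is NOT the real vllm._custom_ops.
fake_ops = SimpleNamespace(gemm=lambda a, w: a * w)
register(fake_ops, qdq_fn=lambda x: x + 1, env={"VLLM_QDQ": "1"})
print(fake_ops.gemm(2, 10))  # the first argument goes through qdq_fn before the op runs
```

Because the wrapper replaces the attribute on the module itself, every caller that looks the op up through that module picks up the injected step without further changes.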

Installation

pip install git+https://github.com/yiliu30/vllm-qdq-plugin.git

# Or for development:
git clone https://github.com/yiliu30/vllm-qdq-plugin.git
pip install -e vllm-qdq-plugin/

Usage

# Enable QDQ
VLLM_QDQ=1 python my_script.py

# With vllm serve
VLLM_QDQ=1 vllm serve /path/to/model --tensor-parallel-size 2

# Enable trace logging (prints shape/dtype for each QDQ call)
VLLM_QDQ=1 VLLM_QDQ_TRACE=1 vllm serve /path/to/model

# Force MXFP4 QDQ on Marlin MoE when dtype-based detection is not enough
VLLM_QDQ=1 VLLM_MARLIN_MOE_QDQ_MODE=FORCE_MXFP4 vllm serve /path/to/model

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `VLLM_QDQ` | `0` | Set to `1` to enable QDQ |
| `VLLM_QDQ_TRACE` | `0` | Set to `1` to print trace lines (up to 200) |
| `VLLM_MARLIN_MOE_QDQ_MODE` | `0` | Set to `FORCE_MXFP4` to apply MXFP4 QDQ in `moe_wna16_marlin_gemm` when dtype-based routing is not sufficient. Matching is case-insensitive. |
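
One way to read these variables is sketched below. The `QDQSettings` class and `read_settings` function are hypothetical names, not the plugin's API, and stripping whitespace from the mode value is an assumption. Only the variable names, defaults, and case-insensitive matching come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QDQSettings:
    enabled: bool
    trace: bool
    force_mxfp4_moe: bool


def read_settings(env=None):
    """Parse the three documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    mode = env.get("VLLM_MARLIN_MOE_QDQ_MODE", "0")
    return QDQSettings(
        enabled=env.get("VLLM_QDQ", "0") == "1",
        trace=env.get("VLLM_QDQ_TRACE", "0") == "1",
        # Case-insensitive match, as the table describes.
        force_mxfp4_moe=mode.strip().upper() == "FORCE_MXFP4",
    )


print(read_settings({"VLLM_QDQ": "1", "VLLM_MARLIN_MOE_QDQ_MODE": "force_mxfp4"}))
```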

Support Status

| Dtype | Op | Status | Notes |
| --- | --- | --- | --- |
| MXFP4 (E2M1 + E8M0 scales) | `marlin_gemm` | ✅ Supported | Dense quantized linear (MXFP4 via Marlin) |
| MXFP4 (E2M1 + E8M0 scales) | `moe_wna16_marlin_gemm` | ✅ Supported | MoE quantized linear (MXFP4 via Marlin) |

How QDQ Works

For MXFP4, the QDQ simulates:

1. Quantize: Split activations into groups of 32, scale each group by a shared E8M0 (power-of-2) scale, then round every element to the nearest signed FP4 E2M1 value (magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}).
2. Dequantize: Multiply by the same scale and return the result in the original dtype.

This introduces the same quantization noise that a "real" MXFP4 GEMM would produce on the input side, while keeping the actual computation in bf16 via Marlin's weight-only dequant kernel.
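
The two steps can be written out as a pure-Python reference for a flat list of floats. This is a sketch under stated assumptions, not the plugin's tensor implementation: the scale exponent is chosen as floor(log2(amax)) minus the largest E2M1 exponent (the convention in the OCP Microscaling spec), values above the largest magnitude saturate at 6, and ties between two grid values go to the smaller magnitude for simplicity. The function names are hypothetical.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 32
E2M1_EMAX = 2  # exponent of the largest E2M1 value: 6 = 1.5 * 2**2


def qdq_group(group):
    """Quantize-dequantize one group that shares a single power-of-2 scale."""
    amax = max((abs(v) for v in group), default=0.0)
    if amax == 0.0:
        return [0.0 for _ in group]
    # Assumed scale rule: power of 2 derived from the group's largest magnitude.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in group:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6
        # Nearest grid value; on a tie min() keeps the smaller magnitude.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))  # dequantize and restore the sign
    return out


def qdq_mxfp4(values):
    """Apply qdq_group to consecutive groups of GROUP_SIZE elements."""
    out = []
    for start in range(0, len(values), GROUP_SIZE):
        out.extend(qdq_group(values[start:start + GROUP_SIZE]))
    return out


print(qdq_mxfp4([0.3, 11.0]))
```

Small values that share a group with a large outlier lose the most precision, because the single scale is set by the largest magnitude in the group.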

Adding New Dtypes

1. Create a new QDQ implementation in src/vllm_qdq_plugin/qdq/ (e.g., fp8.py).
2. Add an elif branch in patch.py where the dtype check happens.
3. Keep the QDQ function signature as (x: Tensor, **config) -> Tensor, with the same shape and dtype in and out.
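
While developing a new implementation, the same-shape, same-dtype contract can be checked with a small decorator. This is a hypothetical helper, not part of the plugin. It only relies on the `.shape` and `.dtype` attributes, so the demonstration uses a stand-in object instead of a real tensor to stay runnable without torch.

```python
import functools
from types import SimpleNamespace


def preserves_shape_and_dtype(qdq_fn):
    """Wrap a QDQ function and raise if the output shape or dtype differs from the input."""

    @functools.wraps(qdq_fn)
    def wrapper(x, **config):
        out = qdq_fn(x, **config)
        if tuple(out.shape) != tuple(x.shape):
            raise ValueError(
                f"{qdq_fn.__name__}: shape {tuple(x.shape)} became {tuple(out.shape)}"
            )
        if out.dtype != x.dtype:
            raise TypeError(f"{qdq_fn.__name__}: dtype {x.dtype} became {out.dtype}")
        return out

    return wrapper


@preserves_shape_and_dtype
def qdq_passthrough(x, **config):
    """Placeholder body; a real fp8.py would quantize and dequantize here."""
    return x


# Stand-in with .shape and .dtype; a torch.Tensor exposes the same attributes.
fake_tensor = SimpleNamespace(shape=(4, 64), dtype="bfloat16")
checked = qdq_passthrough(fake_tensor, group_size=32)
```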

License

Apache-2.0

Sage3 Triton Attention Backend (vllm-omni)

This plugin also provides an out-of-tree diffusion attention backend for vllm-omni, using the SageAttention3 standalone Triton kernel.

How It Works

The backend registers through the vllm_omni.general_plugins entry point. When VLLM_SAGE3_TRITON=1, it overrides the SAGE_ATTN diffusion attention backend with the sage3 Triton implementation. When disabled (the default), the original in-tree backend is used unchanged.

- Zero vllm-omni source modifications
- Conditional activation: no effect on normal operation when off
- Falls back to torch SDPA for cross-attention (different Q/K sequence lengths)
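
The selection behavior described above can be summarized as a small decision function. This is a sketch only: `select_attention_impl` and the returned labels are hypothetical, and it assumes DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN is already selected, since that is the backend the override applies to.

```python
import os


def select_attention_impl(q_len, kv_len, env=None):
    """Return a label for the implementation that would handle one attention call."""
    env = os.environ if env is None else env
    if env.get("VLLM_SAGE3_TRITON", "0") != "1":
        return "in_tree_sage_attn"  # override disabled: original backend is untouched
    if q_len != kv_len:
        return "torch_sdpa"  # cross-attention: sequence lengths differ
    return "sage3_triton"


print(select_attention_impl(4096, 4096, env={"VLLM_SAGE3_TRITON": "1"}))
print(select_attention_impl(4096, 77, env={"VLLM_SAGE3_TRITON": "1"}))
```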

Usage

# Enable sage3 Triton attention for diffusion models
VLLM_SAGE3_TRITON=1 \
SAGE3_QUANT_FORMAT=mxfp4 \
DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN \
python examples/offline_inference/text_to_image/text_to_image.py \
  --model /path/to/model ...

# Use original in-tree sage_attn (sageattention v2) — default
DIFFUSION_ATTENTION_BACKEND=SAGE_ATTN python ...

# Use torch SDPA (no sage at all)
DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA python ...

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `VLLM_SAGE3_TRITON` | `0` | Set to `1` to enable sage3 Triton backend override |
| `SAGE3_QUANT_FORMAT` | `mxfp4` | Quantization config for K/V (`mxfp4`, `nvfp4`, `mxfp8_s1`, `mxfp4_s1`) |
| `SAGE3_ACC_DTYPE` | `fp32` | Accumulator dtype (`fp32`, `bf16_both_dot`, `bf16_pv_only`, etc.) |

Notes

- Shared memory requirement: The sage3 fp32 kernel needs ~192KB shared memory per SM. On GPUs with less (e.g., RTX 6000D with 100KB), use SAGE3_ACC_DTYPE=bf16_both_dot or switch to TORCH_SDPA.
- Cross-attention: sage3 requires Q and K to have the same sequence length. Cross-attention calls automatically fall back to torch SDPA.
