Add option of fp4 QMoE for gpt-oss in model builder #2229
Open
tianleiwu wants to merge 5 commits into
The CUDA QMoE expert weights produced by the model builder were not in the layout/encoding the fpA_intB mixed-GEMM kernel consumes, so INT4 exports of Qwen3.5/3.6-MoE generated incoherent output while fp16 was correct.

base.py:
- Add _cutlass_prepacked_blockwise_quantize: quantize each expert weight with ORT's blockwise quantize_matmul_4bits (on the transposed [K, N] weight), keep the SIGNED scales, and offline CUTLASS-prepack via pack_weights_for_cuda_mixed_gemm (force_arch=80, which the kernel expects for all SM >= 80). This is the encoding validated by the com.microsoft QMoE CUDA parity tests. Taking abs() of the scales (as the previous path did) corrupts every block whose anchor element is negative and yields garbage weights.
- make_qmoe_weights: route CUDA QMoE through the new prepacked path and assert block_size is 64 or 128 (the only sizes the CUDA kernel supports).
- Plumb a tri-state weights_prepacked QMoE attribute (default None = omit = kernel's prepacked default; override via extra_options qmoe_weights_prepacked).

qwen.py:
- Exclude the MoE router and shared-expert gate MatMuls from INT4/INT8 quantization; 4-bit rounding of these tiny routing matmuls flips top-k expert selection and injects large error into every MoE layer.
- Fix a node-name collision in the shared expert (rename the gated Mul to .../gate/Mul) that produced a duplicate value name and a ShapeInferenceError.

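To make the signed-scale point concrete, here is a minimal NumPy sketch. It assumes a simplified symmetric 4-bit scheme in which the block scale is the anchor element (the one with the largest magnitude) divided by 7; this is an illustration only and is not claimed to match ORT's quantize_matmul_4bits encoding bit for bit. The helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
import numpy as np

def quantize_block(block):
    """Quantize one block to signed 4-bit codes with a SIGNED scale (simplified scheme)."""
    # The anchor is the element with the largest magnitude; it keeps its sign.
    anchor = block[np.argmax(np.abs(block))]
    scale = anchor / 7.0 if anchor != 0 else np.float32(1.0)
    # Dividing by a negative scale flips the sign of every code in the block.
    q = np.clip(np.rint(block / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale

# A block whose anchor (-0.7) is negative, so the scale is negative too.
block = np.array([-0.7, 0.35, 0.1, -0.2], dtype=np.float32)
q, scale = quantize_block(block)

good = dequantize_block(q, scale)       # signed scale: close to the original block
bad = dequantize_block(q, abs(scale))   # abs(scale): every element has the wrong sign

print("signed scale error:", float(np.max(np.abs(good - block))))
print("abs(scale) error:  ", float(np.max(np.abs(bad - block))))
```

Under this assumed scheme, dropping the sign of the scale negates the whole block, which is consistent with the description above of blocks with a negative anchor turning into garbage.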
- make_qmoe_weights: treat weights_prepacked in {None, 1} as prepacked so an
explicit qmoe_weights_prepacked=1 still produces CUTLASS-prepacked weights
(previously only None did, while make_qmoe_op emitted weights_prepacked=1 on
raw weights).
- Replace the block_size assert with a ValueError (asserts are stripped by
python -O) and apply it to all CUDA QMoE paths.
- Gate the raw (weights_prepacked=0) path and the emitted weights_prepacked
attribute on the CUDA EP.
- Fix the misleading 'ships raw weights' comment and document that the
blockwise scales are SIGNED (abs() reintroduces the garbage-output bug).
- Add test/python/models/test_qmoe_weights.py covering path dispatch, EP
gating, block-size validation, and the signed-scale regression guard.
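The dispatch rules listed above can be sketched as a small pure function. This is a hedged illustration, not the builder's code: `QMoEPlan`, `parse_weights_prepacked`, `plan_qmoe_weights`, and the "generic" layout label for non-CUDA EPs are hypothetical names introduced here; only the `qmoe_weights_prepacked` option, the tri-state values, and the 64/128 block-size rule come from the commit message.

```python
from typing import NamedTuple, Optional

SUPPORTED_CUDA_BLOCK_SIZES = (64, 128)

class QMoEPlan(NamedTuple):
    layout: str                  # "cutlass_prepacked", "raw", or "generic" (assumed label)
    emitted_attr: Optional[int]  # weights_prepacked value on the node; None means omit it

def parse_weights_prepacked(extra_options: dict) -> Optional[int]:
    """Tri-state option: absent -> None, otherwise the integer value (0 or 1)."""
    if "qmoe_weights_prepacked" in extra_options:
        return int(extra_options["qmoe_weights_prepacked"])
    return None

def plan_qmoe_weights(ep: str, weights_prepacked: Optional[int], block_size: int) -> QMoEPlan:
    # Both the raw path and the emitted attribute are gated on the CUDA EP.
    if ep != "cuda":
        return QMoEPlan("generic", None)
    # A ValueError (not an assert) so the check survives `python -O`.
    if block_size not in SUPPORTED_CUDA_BLOCK_SIZES:
        raise ValueError(f"CUDA QMoE requires block_size 64 or 128, got {block_size}.")
    # None and an explicit 1 both mean CUTLASS-prepacked weights.
    if weights_prepacked in (None, 1):
        return QMoEPlan("cutlass_prepacked", weights_prepacked)
    if weights_prepacked == 0:
        return QMoEPlan("raw", 0)
    raise ValueError(f"qmoe_weights_prepacked must be 0 or 1, got {weights_prepacked}.")

print(plan_qmoe_weights("cuda", parse_weights_prepacked({}), 128))
print(plan_qmoe_weights("cuda", parse_weights_prepacked({"qmoe_weights_prepacked": "0"}), 64))
```

Keeping the decision in one place means the weight layout and the emitted attribute cannot disagree, which is the mismatch the first bullet describes (an attribute of 1 emitted on raw weights).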
Pull request overview
This PR extends the Python model builder to support exporting gpt-oss MoE layers using FP4 (MXFP4) QMoE on CUDA, and tightens CUDA INT4 QMoE weight handling by adding explicit “prepacked vs raw” layout control and regression tests around the CUDA quantization/packing path.
Changes:
- Add FP4 (MXFP4) QMoE export path for gpt-oss MoE experts, including FLOAT8E8M0 block-scale initializers and per-expert global scales.
- Update CUDA INT4 QMoE weight export to explicitly support prepacked vs raw weight layouts and to preserve signed blockwise scales for the CUTLASS-prepacked path.
- Add unit tests guarding CUDA QMoE dispatch behavior and scale-sign regression.
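For readers unfamiliar with MXFP4, the following NumPy sketch shows how one block could be encoded with 4-bit E2M1 elements and a shared power-of-two scale stored as an E8M0 byte (biased exponent, bias 127). It follows the general microscaling recipe and is an assumption-laden illustration: the function names are hypothetical, rounding ties are resolved toward the smaller magnitude rather than by any specific rule, and it does not model the per-expert global scales or the packing used by the builder.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element; the fourth bit is the sign.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 = 2**2)
E8M0_BIAS = 127  # E8M0 stores a biased exponent; the scale value is 2**(byte - 127)

def quantize_mxfp4_block(block):
    """Return (4-bit codes, E8M0 scale byte) for one block of floats."""
    max_abs = float(np.max(np.abs(block)))
    shared_exp = 0 if max_abs == 0.0 else int(np.floor(np.log2(max_abs))) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # keep the byte in [0, 254]
    scale = np.float32(2.0 ** shared_exp)
    scaled = np.abs(block) / scale
    # Nearest grid point; magnitudes above 6.0 saturate to 6.0.
    idx = np.argmin(np.abs(scaled[:, None] - E2M1_GRID[None, :]), axis=1)
    sign_bit = (block < 0).astype(np.int64) << 3
    codes = (idx | sign_bit).astype(np.uint8)
    return codes, np.uint8(shared_exp + E8M0_BIAS)

def dequantize_mxfp4_block(codes, e8m0):
    scale = np.float32(2.0 ** (int(e8m0) - E8M0_BIAS))
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * E2M1_GRID[codes & 0x7] * scale

block = np.linspace(-3.0, 3.0, 32).astype(np.float32)  # one 32-element block
codes, e8m0 = quantize_mxfp4_block(block)
recon = dequantize_mxfp4_block(codes, e8m0)
print("scale byte:", int(e8m0), "max abs error:", float(np.max(np.abs(recon - block))))
```

Because the shared scale is a pure power of two, it carries no sign or mantissa, so the sign of each weight lives entirely in the FP4 element.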
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/python/models/test_qmoe_weights.py | Adds regression tests for CUDA QMoE weight quantization/packing dispatch and signed-scale preservation. |
| src/python/py/models/builders/qwen.py | Excludes routing-critical projections from INT4/INT8 quantization and adjusts a node name to avoid collisions. |
| src/python/py/models/builders/gptoss.py | Adds FP4 QMoE option for MoE experts and corrects expert-weight layout handling for QMoE. |
| src/python/py/models/builders/base.py | Introduces FP4 initializer/quantization helpers and adds CUDA-specific QMoE prepacked/raw weight handling and op wiring. |
| src/python/py/models/builder.py | Documents the new use_fp4_moe extra option in CLI help. |
Comment on lines +305 to +313

    use_fp4_moe = extra_options.get("use_fp4_moe", False)
    if use_fp4_moe and extra_options.get("use_8bits_moe", False):
        raise ValueError("use_fp4_moe and use_8bits_moe are mutually exclusive.")
    if use_fp4_moe and self.ep != "cuda":
        raise ValueError(f"use_fp4_moe is only supported on the CUDA EP, got ep='{self.ep}'.")
    # FP4 (MXFP4) is a 4-bit format; INT8 path is disabled when FP4 is requested.
    expert_weight_bits = 4 if use_fp4_moe else (8 if extra_options.get("use_8bits_moe", False) else 4)
    # QMoE quantization type: "int" (default INT4/INT8) or "fp4" (MXFP4).
    moe_quant_type = "fp4" if use_fp4_moe else "int"

Comment on lines +81 to +82

    @pytest.mark.parametrize("weights_prepacked", [None, 0, 1])
    @pytest.mark.parametrize("bad_block", [16, 32, 256])

Comment on lines +87 to +88

    with pytest.raises(ValueError, match="block_size 64 or 128"):
        model.make_qmoe_weights(_W)

    # prepacked) is exactly what we want and the attribute is omitted (None).
    # Override via extra_options["qmoe_weights_prepacked"] (e.g. 0 to ship raw
    # [E, N, K/pack] weights and let the runtime PrePack hook transform them).
    weights_prepacked = int(extra_options["qmoe_weights_prepacked"]) if "qmoe_weights_prepacked" in extra_options else None

Add option to build gpt-oss model with FP4 QMoE.