Add option of fp4 QMoE for gpt-oss in model builder by tianleiwu · Pull Request #2229 · microsoft/onnxruntime-genai

tianleiwu · 2026-06-15T19:13:32Z

Add option to build gpt-oss model with FP4 QMoE.

The CUDA QMoE expert weights produced by the model builder were not in the layout/encoding the fpA_intB mixed-GEMM kernel consumes, so INT4 exports of Qwen3.5/3.6-MoE generated incoherent output while fp16 was correct.

base.py:
- Add _cutlass_prepacked_blockwise_quantize: quantize each expert weight with ORT's blockwise quantize_matmul_4bits (on the transposed [K, N] weight), keep the SIGNED scales, and offline CUTLASS-prepack via pack_weights_for_cuda_mixed_gemm (force_arch=80, which the kernel expects for all SM >= 80). This is the encoding validated by the com.microsoft QMoE CUDA parity tests. Taking abs() of the scales (as the previous path did) corrupts every block whose anchor element is negative and yields garbage weights.
- make_qmoe_weights: route CUDA QMoE through the new prepacked path and assert block_size is 64 or 128 (the only sizes the CUDA kernel supports).
- Plumb a tri-state weights_prepacked QMoE attribute (default None = omit = kernel's prepacked default; override via extra_options qmoe_weights_prepacked).

qwen.py:
- Exclude the MoE router and shared-expert gate MatMuls from INT4/INT8 quantization; 4-bit rounding of these tiny routing matmuls flips top-k expert selection and injects large error into every MoE layer.
- Fix a node-name collision in the shared expert (rename the gated Mul to .../gate/Mul) that produced a duplicate value name and a ShapeInferenceError.
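To illustrate why the sign of the blockwise scale matters, here is a minimal pure-Python sketch. It is not ORT's quantize_matmul_4bits; it assumes a simplified symmetric scheme in which the scale is derived from the signed anchor element (the element with the largest magnitude), and the function names are hypothetical.

```python
# Simplified, hypothetical symmetric 4-bit block quantizer (pure Python).
# Assumption: the scale is derived from the signed "anchor" element (the one
# with the largest magnitude), so the scale carries that element's sign.


def quantize_block_signed(block, qmax=7):
    anchor = max(block, key=abs)
    scale = anchor / qmax if anchor != 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return q, scale


def dequantize_block(q, scale):
    return [qi * scale for qi in q]


block = [-0.7, 0.1, 0.3, -0.2]  # anchor is -0.7, so the scale is negative
q, scale = quantize_block_signed(block)

good = dequantize_block(q, scale)      # signed scale: recovers the block
bad = dequantize_block(q, abs(scale))  # abs(scale): every sign is flipped
```

Under this assumed convention, a block with a positive anchor is unaffected by abs(), which is consistent with the description above that only blocks with a negative anchor are corrupted.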

- make_qmoe_weights: treat weights_prepacked in {None, 1} as prepacked so an explicit qmoe_weights_prepacked=1 still produces CUTLASS-prepacked weights (previously only None did, while make_qmoe_op emitted weights_prepacked=1 on raw weights).
- Replace the block_size assert with a ValueError (asserts are stripped by python -O) and apply it to all CUDA QMoE paths.
- Gate the raw (weights_prepacked=0) path and the emitted weights_prepacked attribute on the CUDA EP.
- Fix the misleading 'ships raw weights' comment and document that the blockwise scales are SIGNED (abs() reintroduces the garbage-output bug).
- Add test/python/models/test_qmoe_weights.py covering path dispatch, EP gating, block-size validation, and the signed-scale regression guard.
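The dispatch rules in this commit can be sketched roughly as follows. This is a hypothetical stand-in, not the builder's make_qmoe_weights: the function name, the returned path labels, the behavior for non-CUDA EPs, and the exact error wording are assumptions for illustration.

```python
# Hypothetical stand-in for the dispatch described above; names and return
# values are illustrative, not the builder's actual API.
SUPPORTED_CUDA_BLOCK_SIZES = (64, 128)


def select_qmoe_weight_path(ep, block_size, weights_prepacked=None):
    if ep != "cuda":
        # The prepacked/raw distinction is gated on the CUDA EP.
        return "default"
    if block_size not in SUPPORTED_CUDA_BLOCK_SIZES:
        # ValueError rather than assert: asserts are stripped by `python -O`.
        raise ValueError(
            f"CUDA QMoE requires block_size 64 or 128, got {block_size}."
        )
    if weights_prepacked in (None, 1):
        # None (attribute omitted) and an explicit 1 both mean prepacked.
        return "cutlass_prepacked"
    if weights_prepacked == 0:
        return "raw"
    raise ValueError(f"weights_prepacked must be None, 0 or 1, got {weights_prepacked}.")
```

For example, `select_qmoe_weight_path("cuda", 128, 1)` and `select_qmoe_weight_path("cuda", 128)` select the same path, which is the mismatch the first bullet fixes.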


Copilot

Pull request overview

This PR extends the Python model builder to support exporting gpt-oss MoE layers using FP4 (MXFP4) QMoE on CUDA, and tightens CUDA INT4 QMoE weight handling by adding explicit “prepacked vs raw” layout control and regression tests around the CUDA quantization/packing path.

Changes:

- Add FP4 (MXFP4) QMoE export path for gpt-oss MoE experts, including FLOAT8E8M0 block-scale initializers and per-expert global scales.
- Update CUDA INT4 QMoE weight export to explicitly support prepacked vs raw weight layouts and to preserve signed blockwise scales for the CUTLASS-prepacked path.
- Add unit tests guarding CUDA QMoE dispatch behavior and scale-sign regression.
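For readers unfamiliar with MXFP4, the following pure-Python sketch shows the general idea of pairing FP4 (E2M1) elements with a power-of-two E8M0 block scale. It follows the usual microscaling convention rather than the helpers added in base.py, which may differ; the function names are hypothetical, rounding is simplified, and the per-expert global scale mentioned above is not modeled.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2    # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
E8M0_BIAS = 127  # E8M0 stores a biased power-of-two exponent in one byte


def quantize_mxfp4_block(block):
    """Return (e8m0_scale_byte, quantized_values) for one block (sketch)."""
    max_abs = max(abs(x) for x in block)
    # Assumption: an all-zero block uses exponent 0 (scale 1.0).
    exp = math.floor(math.log2(max_abs)) - E2M1_EMAX if max_abs > 0 else 0
    exp = max(-E8M0_BIAS, min(E8M0_BIAS, exp))
    scale = 2.0 ** exp
    quantized = []
    for x in block:
        # Simplification: nearest grid point, ties go to the smaller magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        quantized.append(math.copysign(mag, x))
    return exp + E8M0_BIAS, quantized


def dequantize_mxfp4_block(scale_byte, quantized):
    scale = 2.0 ** (scale_byte - E8M0_BIAS)
    return [v * scale for v in quantized]


scale_byte, values = quantize_mxfp4_block([12.0, 1.0, -3.0, 0.0])
restored = dequantize_mxfp4_block(scale_byte, values)
```

In this example the block maximum is 12, so the shared scale is 2 and every element lands exactly on the E2M1 grid; in general the round trip is lossy.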

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| test/python/models/test_qmoe_weights.py | Adds regression tests for CUDA QMoE weight quantization/packing dispatch and signed-scale preservation. |
| src/python/py/models/builders/qwen.py | Excludes routing-critical projections from INT4/INT8 quantization and adjusts a node name to avoid collisions. |
| src/python/py/models/builders/gptoss.py | Adds FP4 QMoE option for MoE experts and corrects expert-weight layout handling for QMoE. |
| src/python/py/models/builders/base.py | Introduces FP4 initializer/quantization helpers and adds CUDA-specific QMoE prepacked/raw weight handling and op wiring. |
| src/python/py/models/builder.py | Documents the new `use_fp4_moe` extra option in CLI help. |

+        use_fp4_moe = extra_options.get("use_fp4_moe", False)
+        if use_fp4_moe and extra_options.get("use_8bits_moe", False):
+            raise ValueError("use_fp4_moe and use_8bits_moe are mutually exclusive.")
+        if use_fp4_moe and self.ep != "cuda":
+            raise ValueError(f"use_fp4_moe is only supported on the CUDA EP, got ep='{self.ep}'.")
+        # FP4 (MXFP4) is a 4-bit format; INT8 path is disabled when FP4 is requested.
+        expert_weight_bits = 4 if use_fp4_moe else (8 if extra_options.get("use_8bits_moe", False) else 4)
+        # QMoE quantization type: "int" (default INT4/INT8) or "fp4" (MXFP4).
+        moe_quant_type = "fp4" if use_fp4_moe else "int"


+@pytest.mark.parametrize("weights_prepacked", [None, 0, 1])
+@pytest.mark.parametrize("bad_block", [16, 32, 256])


+    with pytest.raises(ValueError, match="block_size 64 or 128"):
+        model.make_qmoe_weights(_W)


+        # prepacked) is exactly what we want and the attribute is omitted (None).
+        # Override via extra_options["qmoe_weights_prepacked"] (e.g. 0 to ship raw
+        # [E, N, K/pack] weights and let the runtime PrePack hook transform them).
+        weights_prepacked = int(extra_options["qmoe_weights_prepacked"]) if "qmoe_weights_prepacked" in extra_options else None
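The tri-state handling in the snippet above can be sketched in isolation as follows. The helper names and the attribute dictionary are hypothetical; only the option key qmoe_weights_prepacked, the attribute name weights_prepacked, and the None/0/1 semantics come from this PR.

```python
def parse_weights_prepacked(extra_options):
    """Return None when the option is absent, else its integer value."""
    if "qmoe_weights_prepacked" not in extra_options:
        return None
    # Extra options may arrive as strings (e.g. "0"), hence the int() cast.
    return int(extra_options["qmoe_weights_prepacked"])


def qmoe_attrs(extra_options, ep):
    """Build the optional QMoE attribute dict (illustrative only)."""
    attrs = {}
    value = parse_weights_prepacked(extra_options)
    # Omit the attribute when unset so the kernel default applies, and only
    # emit it for the CUDA EP.
    if ep == "cuda" and value is not None:
        attrs["weights_prepacked"] = value
    return attrs
```

Keeping None distinct from 0 and 1 is what lets the builder omit the attribute entirely instead of always writing an explicit value.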


tianleiwu added 4 commits June 15, 2026 19:11

fix gpt-oss rotary cache

9a6e907

support fp4

b204ceb

Copilot AI review requested due to automatic review settings June 15, 2026 19:13

tianleiwu requested a review from a team as a code owner June 15, 2026 19:13

Copilot started reviewing on behalf of tianleiwu June 15, 2026 19:14

Merge remote-tracking branch 'origin/main' into tlwu/model_builder_gpt_oss_fp4

9cd0a90

Copilot AI reviewed Jun 15, 2026

tianleiwu wants to merge 5 commits into main from tlwu/model_builder_gpt_oss_fp4
