Add option of fp4 QMoE for gpt-oss in model builder #2229

Open
tianleiwu wants to merge 5 commits into main from tlwu/model_builder_gpt_oss_fp4
@tianleiwu

Contributor

Add option to build gpt-oss model with FP4 QMoE.

The CUDA QMoE expert weights produced by the model builder were not in the
layout/encoding the fpA_intB mixed-GEMM kernel consumes, so INT4 exports of
Qwen3.5/3.6-MoE generated incoherent output while fp16 was correct.

base.py:
- Add _cutlass_prepacked_blockwise_quantize: quantize each expert weight with
  ORT's blockwise quantize_matmul_4bits (on the transposed [K, N] weight),
  keep the SIGNED scales, and offline CUTLASS-prepack via
  pack_weights_for_cuda_mixed_gemm (force_arch=80, which the kernel expects for
  all SM >= 80). This is the encoding validated by the com.microsoft QMoE CUDA
  parity tests. Taking abs() of the scales (as the previous path did) corrupts
  every block whose anchor element is negative and yields garbage weights.
- make_qmoe_weights: route CUDA QMoE through the new prepacked path and assert
  block_size is 64 or 128 (the only sizes the CUDA kernel supports).
- Plumb a tri-state weights_prepacked QMoE attribute (default None = omit =
  kernel's prepacked default; override via extra_options qmoe_weights_prepacked).

qwen.py:
- Exclude the MoE router and shared-expert gate MatMuls from INT4/INT8
  quantization; 4-bit rounding of these tiny routing matmuls flips top-k expert
  selection and injects large error into every MoE layer.
- Fix a node-name collision in the shared expert (rename the gated Mul to
  .../gate/Mul) that produced a duplicate value name and a ShapeInferenceError.
- make_qmoe_weights: treat weights_prepacked in {None, 1} as prepacked so an
  explicit qmoe_weights_prepacked=1 still produces CUTLASS-prepacked weights
  (previously only None did, while make_qmoe_op emitted weights_prepacked=1 on
  raw weights).
- Replace the block_size assert with a ValueError (asserts are stripped by
  python -O) and apply it to all CUDA QMoE paths.
- Gate the raw (weights_prepacked=0) path and the emitted weights_prepacked
  attribute on the CUDA EP.
- Fix the misleading 'ships raw weights' comment and document that the
  blockwise scales are SIGNED (abs() reintroduces the garbage-output bug).
- Add test/python/models/test_qmoe_weights.py covering path dispatch, EP
  gating, block-size validation, and the signed-scale regression guard.
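
The dispatch rules listed above can be sketched as a small standalone function. This is an illustration only: select_qmoe_weight_path and its return labels are hypothetical names, not the builder's API, and the error text is assumed (it only needs to contain the phrase the test matches on).

```python
SUPPORTED_CUDA_BLOCK_SIZES = (64, 128)


def select_qmoe_weight_path(ep, block_size, weights_prepacked=None):
    """Pick a QMoE weight layout (hypothetical helper, not the builder's API).

    weights_prepacked is tri-state: None (attribute omitted), 1, or 0.
    """
    if ep != "cuda":
        # Non-CUDA EPs use neither the raw nor the prepacked CUDA path.
        return "default"
    if block_size not in SUPPORTED_CUDA_BLOCK_SIZES:
        # ValueError rather than assert, so the check survives `python -O`.
        raise ValueError(f"CUDA QMoE requires block_size 64 or 128, got {block_size}.")
    if weights_prepacked in (None, 1):
        return "cutlass_prepacked"
    if weights_prepacked == 0:
        return "raw"
    raise ValueError(f"weights_prepacked must be None, 0, or 1, got {weights_prepacked!r}.")


print(select_qmoe_weight_path("cuda", 128))      # cutlass_prepacked
print(select_qmoe_weight_path("cuda", 64, 1))    # cutlass_prepacked
print(select_qmoe_weight_path("cuda", 64, 0))    # raw
```

Treating None and 1 identically keeps the emitted attribute and the actual weight layout in agreement, which is the mismatch the description says the earlier code had.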
Copilot AI review requested due to automatic review settings June 15, 2026 19:13
@tianleiwu tianleiwu requested a review from a team as a code owner June 15, 2026 19:13

Copilot AI left a comment

Contributor


Pull request overview

This PR extends the Python model builder to support exporting gpt-oss MoE layers using FP4 (MXFP4) QMoE on CUDA, and tightens CUDA INT4 QMoE weight handling by adding explicit “prepacked vs raw” layout control and regression tests around the CUDA quantization/packing path.

Changes:

  • Add FP4 (MXFP4) QMoE export path for gpt-oss MoE experts, including FLOAT8E8M0 block-scale initializers and per-expert global scales.
  • Update CUDA INT4 QMoE weight export to explicitly support prepacked vs raw weight layouts and to preserve signed blockwise scales for the CUTLASS-prepacked path.
  • Add unit tests guarding CUDA QMoE dispatch behavior and scale-sign regression.
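
For readers unfamiliar with MXFP4, the sketch below shows the general idea as defined by the OCP Microscaling format: each block shares one power-of-two scale (stored as an E8M0 exponent) and each element is an E2M1 value. The function names are hypothetical, and this is not the PR's implementation; in particular it does not model the per-expert global scales, the bit packing, or the 32-element block size used by the format.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_mxfp4_block(block):
    """Return (shared_exponent, e2m1_elements) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two exponent, clamped to the range E8M0 can encode.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    elems = []
    for x in block:
        # Nearest representable magnitude; ties resolve to the smaller one.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        elems.append(math.copysign(mag, x))
    return shared_exp, elems


def dequantize_mxfp4_block(shared_exp, elems):
    """Rebuild floats from the shared exponent and E2M1 elements."""
    return [e * 2.0 ** shared_exp for e in elems]


exp, elems = quantize_mxfp4_block([12.0, -1.0])
print(exp, elems)                            # 1 [6.0, -0.5]
print(dequantize_mxfp4_block(exp, elems))    # [12.0, -1.0]
```

Because the shared scale is a pure power of two, it carries no sign and no mantissa, which is why it can be stored in an 8-bit exponent-only type such as FLOAT8E8M0.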

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/python/models/test_qmoe_weights.py | Adds regression tests for CUDA QMoE weight quantization/packing dispatch and signed-scale preservation. |
| src/python/py/models/builders/qwen.py | Excludes routing-critical projections from INT4/INT8 quantization and adjusts a node name to avoid collisions. |
| src/python/py/models/builders/gptoss.py | Adds FP4 QMoE option for MoE experts and corrects expert-weight layout handling for QMoE. |
| src/python/py/models/builders/base.py | Introduces FP4 initializer/quantization helpers and adds CUDA-specific QMoE prepacked/raw weight handling and op wiring. |
| src/python/py/models/builder.py | Documents the new use_fp4_moe extra option in CLI help. |

Comment on lines +305 to +313

    use_fp4_moe = extra_options.get("use_fp4_moe", False)
    if use_fp4_moe and extra_options.get("use_8bits_moe", False):
        raise ValueError("use_fp4_moe and use_8bits_moe are mutually exclusive.")
    if use_fp4_moe and self.ep != "cuda":
        raise ValueError(f"use_fp4_moe is only supported on the CUDA EP, got ep='{self.ep}'.")
    # FP4 (MXFP4) is a 4-bit format; INT8 path is disabled when FP4 is requested.
    expert_weight_bits = 4 if use_fp4_moe else (8 if extra_options.get("use_8bits_moe", False) else 4)
    # QMoE quantization type: "int" (default INT4/INT8) or "fp4" (MXFP4).
    moe_quant_type = "fp4" if use_fp4_moe else "int"
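
The option checks quoted above can be exercised in isolation with a standalone function. This is a sketch that mirrors the quoted logic; resolve_moe_quant_options is a hypothetical name, and it assumes the option values are already booleans rather than strings parsed from the command line.

```python
def resolve_moe_quant_options(extra_options, ep):
    """Return (expert_weight_bits, moe_quant_type); hypothetical helper."""
    use_fp4_moe = extra_options.get("use_fp4_moe", False)
    use_8bits_moe = extra_options.get("use_8bits_moe", False)
    if use_fp4_moe and use_8bits_moe:
        raise ValueError("use_fp4_moe and use_8bits_moe are mutually exclusive.")
    if use_fp4_moe and ep != "cuda":
        raise ValueError(f"use_fp4_moe is only supported on the CUDA EP, got ep='{ep}'.")
    # FP4 is a 4-bit format, so 8 bits applies only to the INT8 path.
    expert_weight_bits = 8 if use_8bits_moe else 4
    moe_quant_type = "fp4" if use_fp4_moe else "int"
    return expert_weight_bits, moe_quant_type


print(resolve_moe_quant_options({}, "cuda"))                       # (4, 'int')
print(resolve_moe_quant_options({"use_8bits_moe": True}, "cpu"))   # (8, 'int')
print(resolve_moe_quant_options({"use_fp4_moe": True}, "cuda"))    # (4, 'fp4')
```

Once the mutual-exclusion check has passed, the nested conditional in the quoted snippet reduces to the single test on use_8bits_moe shown here.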
Comment on lines +81 to +82

    @pytest.mark.parametrize("weights_prepacked", [None, 0, 1])
    @pytest.mark.parametrize("bad_block", [16, 32, 256])

Comment on lines +87 to +88

    with pytest.raises(ValueError, match="block_size 64 or 128"):
        model.make_qmoe_weights(_W)

    # prepacked) is exactly what we want and the attribute is omitted (None).
    # Override via extra_options["qmoe_weights_prepacked"] (e.g. 0 to ship raw
    # [E, N, K/pack] weights and let the runtime PrePack hook transform them).
    weights_prepacked = int(extra_options["qmoe_weights_prepacked"]) if "qmoe_weights_prepacked" in extra_options else None