Update model builder for gpt-oss #2234
Open
tianleiwu wants to merge 8 commits into
Pull request overview
This PR updates the Python model builder’s quantization and QMoE expert-weight handling (notably for GPT-OSS/Qwen MoE scenarios), adding clearer configuration semantics and CUDA-specific QMoE weight encodings, plus documentation and tests.
Changes:
- Decouples INT4 quantization base method (`default`/`rtn`/`k_quant`) from orthogonal INT8 bit-placement flags (`last_matmul_weight_int8`, `int8_mixed_layers`, `int8_linear_attn`) while keeping legacy aliases working.
- Adjusts QMoE expert-weight layout handling (including CUDA prepacked vs raw pathways) and updates GPT-OSS MoE weight transposition assumptions.
- Adds documentation for INT4/INT8 quantization configuration and introduces targeted QMoE CUDA-path unit tests.
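The decoupling in the first item can be pictured as resolving one config string into a base method plus independent flags. The following is only a sketch: `resolve_quant_config_sketch` and `LEGACY_ALIASES` are hypothetical names, and the alias-to-flag mapping is inferred from the legacy names mentioned in this PR (`rtn_last`, `k_quant_last`, `k_quant_mixed`, `k_quant_linear`), not taken from the builder source.

```python
# Hypothetical alias table. The compound names appear in the PR description;
# the flags they imply are an assumption based on the names alone.
LEGACY_ALIASES = {
    "rtn_last": ("rtn", {"last_matmul_weight_int8"}),
    "k_quant_last": ("k_quant", {"last_matmul_weight_int8"}),
    "k_quant_mixed": ("k_quant", {"int8_mixed_layers"}),
    "k_quant_linear": ("k_quant", {"int8_linear_attn"}),
}
BASE_METHODS = {"default", "rtn", "k_quant"}
PLACEMENT_FLAGS = ("last_matmul_weight_int8", "int8_mixed_layers", "int8_linear_attn")


def resolve_quant_config_sketch(int4_algo_config, extra_options=None):
    """Split a config string into (base_method, placement_flags)."""
    extra_options = extra_options or {}
    # A legacy compound name expands to a base method plus implied flags;
    # any other value is treated as a base method with no implied flags.
    base, implied = LEGACY_ALIASES.get(int4_algo_config, (int4_algo_config, set()))
    if base not in BASE_METHODS:
        raise ValueError(f"Unknown INT4 base method: {int4_algo_config!r}")
    # A flag is on if the caller set it explicitly or the alias implies it.
    flags = {
        name: bool(extra_options.get(name, False)) or name in implied
        for name in PLACEMENT_FLAGS
    }
    return base, flags


print(resolve_quant_config_sketch("k_quant_last"))
print(resolve_quant_config_sketch("default", {"int8_mixed_layers": True}))
```

Under this reading, a legacy name and the equivalent base method plus flag resolve to the same pair, which is what would let old configurations keep producing the same models.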
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/python/models/test_qmoe_weights.py | Adds unit tests validating CUDA QMoE weight-quantization dispatch, block-size validation, and signed-scale regression behavior. |
| src/python/py/models/quantization.md | New documentation describing INT4 base methods and orthogonal INT8 bit-placement flags (and legacy aliases). |
| src/python/py/models/builders/gptoss.py | Ensures expert-weight tensors are transposed consistently to match downstream MoE/QMoE consumers. |
| src/python/py/models/builders/base.py | Implements the new quant config resolution, DEFAULT+INT8 two-pass logic, and CUDA QMoE prepacked/raw quantization paths. |
| src/python/py/models/builder.py | Adds new boolean extra-options for bit-placement flags and updates CLI help text for quantization configuration. |
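The signed-scale regression mentioned for `test_qmoe_weights.py` can be illustrated with a toy quantizer. This is a minimal sketch, not ONNX Runtime's implementation: `quantize_block` and `dequantize_block` are hypothetical helpers, and the convention `scale = anchor / -8` is an assumption chosen so that the largest-magnitude element maps to code 0. Which anchor sign yields a negative scale depends on the real quantizer's convention.

```python
def quantize_block(block):
    """Symmetric 4-bit quantization of one block; returns (codes, signed scale)."""
    anchor = max(block, key=abs)  # largest-magnitude element, sign preserved
    scale = anchor / -8.0         # assumed convention: the anchor maps to code 0
    if scale == 0.0:
        return [8] * len(block), 0.0
    # Codes are clamped to the unsigned 4-bit range [0, 15].
    codes = [min(15, max(0, round(v / scale) + 8)) for v in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Dequantize as (q - 8) * scale, the form quoted in the PR description."""
    return [(q - 8) * scale for q in codes]


block = [4.0, -2.0, 1.0, 0.5]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)        # signed scale kept
corrupted = dequantize_block(codes, abs(scale))  # sign stripped from the scale

print("scale:    ", scale)
print("restored: ", restored)
print("corrupted:", corrupted)
```

In this sketch every value in a block with a negative scale comes back with its sign flipped once `abs()` is applied, while blocks with a positive scale are untouched. That is why the damage affects only part of the blocks and can pass for ordinary quantization error.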
/azp run Integration Tests
Azure Pipelines successfully started running 1 pipeline(s).
…add QMoE N check, harden test stub
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
/azp run Integration Tests
Azure Pipelines successfully started running 1 pipeline(s).
Update model builder for GPT-OSS (QMoE encoding fix, decoupled INT4/INT8 quantization, QK-norm fusion)
Description
This PR overhauls the Python model builder to correctly export GPT-OSS (and related MoE/hybrid) models. It fixes a CUDA QMoE expert-weight encoding bug that produced garbage INT4 weights, decouples the INT4 quantization method from INT8 bit placement (so any base method can be combined with any INT8 upgrade policy), adds per-channel QMoE quantization, and introduces optional Q/K-norm fusion into `GroupQueryAttention`. It supersedes #2209.
Summary of Changes
QMoE expert-weight quantization (CUDA)
- `_cutlass_prepacked_blockwise_quantize` quantizes each expert `[N, K]` with ONNX Runtime's blockwise `quantize_matmul_4bits`/`8bits` on the transposed `[K, N]` weight, keeps the signed scales, and offline CUTLASS-prepacks via `pack_weights_for_cuda_mixed_gemm(..., force_arch=80)`, the exact encoding validated by the `com.microsoft` QMoE CUDA parity tests.
- `_matmulnbits_blockwise_quantize` emits raw `[N, K/pack]` weights + blockwise scales for the runtime PrePack hook (used when `qmoe_weights_prepacked=0`).
- `_symmetric_per_channel_quantize` / `_cuda_per_channel_quantize` add per-channel QMoE quantization (`qmoe_block_size <= 0`), omitting the `block_size` attribute.
- `make_qmoe_weights` routes CUDA QMoE through the prepacked/raw/per-channel paths, validates `block_size ∈ {32, 64, 128}` and `N % pack == 0` with real exceptions, and removes the deprecated column-wise (TensorRT-LLM) path.
- `weights_prepacked` QMoE attribute (`-1` auto/omit, `1` prepacked, `0` raw), CUDA-only, validated to `{-1, 0, 1}`; override via `extra_options["qmoe_weights_prepacked"]`.
- Expert weights are `[E, N, K]` for all EPs (the old CUDA-only special case swapped `N`/`K` and silently corrupted quantized weights/scales).

Root cause fixed: the previous CUDA path applied `abs()` to the blockwise scales. `quantize_matmul_4bits` folds the block-anchor sign into the scale and the kernel dequantizes as `(q − 8) · scale`, so `abs()` corrupted every block whose anchor element is negative (roughly half the blocks), yielding garbage weights that merely looked like INT4 quality loss.
Decoupled INT4 method vs. INT8 bit placement
- New boolean extra options `last_matmul_weight_int8`, `int8_mixed_layers`, `int8_linear_attn`; `int4_algo_config` reduced to base methods `default`/`rtn`/`k_quant` with help text updated.
- `resolve_int4_quant_config` splits `int4_algo_config` into a base method + placement flags; legacy compound names (`rtn_last`, `k_quant_last`, `k_quant_mixed`, `k_quant_linear`) are kept as aliases.
- `make_bit_placement_config` builds the per-node INT8 `customized_weight_config`; `make_algo_config` now takes a base method + config and raises on unknown methods.
- `to_int4` supports INT8 placement on top of the DEFAULT method via a two-pass scheme (INT4 body first, then an RTN INT8 pass for designated nodes, since DEFAULT 8-bit QOperator output is not consumed correctly by the runtime kernel).
Q/K-norm fusion into GroupQueryAttention
- `is_fused_qk_norm_gqa_supported` and `q_norm_weight`/`k_norm_weight` GQA inputs to pass Q/K-norm weights directly into `GroupQueryAttention` (CUDA/WebGPU). Gated by the `fuse_qk_norm_gqa` extra option (default off).
- `fuse_qk_norm_gqa` extra option.
Robustness / misc
- `enable_skip_layer_norm_strict_mode` is now version-gated: disabled on ONNX Runtime ≥ 1.27 (FP32 accumulation by default, faster) and kept enabled on older supported versions to avoid an accuracy regression (new `onnxruntime_version_at_least` helper).
- Exception chaining (`raise ... from e`) added to QMoE quantization failures.
Tests
- `onnxruntime` wheel.
Testing
- `quantize_matmul_4bits`, `pack_weights_for_cuda_mixed_gemm`) is unavailable.
- `int4_algo_config` compound values (`rtn_last`, `k_quant_last`, `k_quant_mixed`, `k_quant_linear`) still produce identical models via the alias mapping.
Motivation and Context
Replaces #2209. The CUDA QMoE export path was producing weights the `com.microsoft` QMoE kernel could not consume correctly (sign-stripped scales + wrong `N`/`K` orientation), which manifested as severe quality loss on GPT-OSS / Qwen3.5 MoE models. Alongside the fix, the quantization configuration surface is reworked so INT4 rounding method and INT8 placement are independent, which is needed for the mixed-precision recipes these models rely on.
Checklist
- `quantization.md`, `README.md`)
- `int4_algo_config` values preserved as aliases)
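The QMoE option checks listed under "QMoE expert-weight quantization (CUDA)" can be sketched roughly as follows. `check_qmoe_options_sketch` is a hypothetical helper, and reading `pack` as the number of quantized values stored per byte is an assumption. The real builder may check in a different order or raise different exception types.

```python
ALLOWED_BLOCK_SIZES = (32, 64, 128)
ALLOWED_PREPACKED = (-1, 0, 1)  # -1 auto/omit, 0 raw, 1 prepacked


def check_qmoe_options_sketch(n, bits, block_size, weights_prepacked=-1):
    """Validate QMoE quantization options; returns the assumed pack factor."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    pack = 8 // bits  # assumption: quantized values per byte (2 for INT4, 1 for INT8)

    # Per the PR text, block_size <= 0 selects per-channel quantization,
    # so only positive block sizes are checked against the allowed set.
    if block_size > 0 and block_size not in ALLOWED_BLOCK_SIZES:
        raise ValueError(
            f"block_size must be one of {ALLOWED_BLOCK_SIZES}, got {block_size}"
        )
    if n % pack != 0:
        raise ValueError(f"N={n} must be divisible by pack={pack}")
    if weights_prepacked not in ALLOWED_PREPACKED:
        raise ValueError(
            f"weights_prepacked must be one of {ALLOWED_PREPACKED}, got {weights_prepacked}"
        )
    return pack


print(check_qmoe_options_sketch(n=2880, bits=4, block_size=32, weights_prepacked=1))
```

Raising explicit exceptions rather than relying on `assert` keeps the checks active when Python runs with optimizations enabled, which fits the description's mention of "real exceptions".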