Update model builder for gpt-oss by tianleiwu · Pull Request #2234 · microsoft/onnxruntime-genai

tianleiwu · 2026-06-19T17:22:05Z

Update model builder for GPT-OSS (QMoE encoding fix, decoupled INT4/INT8 quantization, QK-norm fusion)

Description

This PR overhauls the Python model builder to correctly export GPT-OSS (and related MoE/hybrid) models. It fixes a CUDA QMoE expert-weight encoding bug that produced garbage INT4 weights, decouples the INT4 quantization method from INT8 bit placement (so any base method can be combined with any INT8 upgrade policy), adds per-channel QMoE quantization, and introduces optional Q/K-norm fusion into GroupQueryAttention. It supersedes #2209.

Summary of Changes

QMoE expert-weight quantization (CUDA)

| File | Change |
| --- | --- |
| src/python/py/models/builders/base.py | New `_cutlass_prepacked_blockwise_quantize` quantizes each expert `[N, K]` with ONNX Runtime's blockwise `quantize_matmul_4bits`/`8bits` on the transposed `[K, N]` weight, keeps the signed scales, and offline CUTLASS-prepacks via `pack_weights_for_cuda_mixed_gemm(..., force_arch=80)`. This is the exact encoding validated by the `com.microsoft` QMoE CUDA parity tests. |
| src/python/py/models/builders/base.py | New `_matmulnbits_blockwise_quantize` emits raw `[N, K/pack]` weights + blockwise scales for the runtime PrePack hook (used when `qmoe_weights_prepacked=0`). |
| src/python/py/models/builders/base.py | New `_symmetric_per_channel_quantize` / `_cuda_per_channel_quantize` add per-channel QMoE quantization (`qmoe_block_size <= 0`), omitting the `block_size` attribute. |
| src/python/py/models/builders/base.py | `make_qmoe_weights` routes CUDA QMoE through the prepacked/raw/per-channel paths, validates `block_size ∈ {32, 64, 128}` and `N % pack == 0` with real exceptions, and removes the deprecated column-wise (TensorRT-LLM) path. |
| src/python/py/models/builders/base.py | New tri-state `weights_prepacked` QMoE attribute (`-1` auto/omit, `1` prepacked, `0` raw), CUDA-only, validated to `{-1, 0, 1}`; override via `extra_options["qmoe_weights_prepacked"]`. |
| src/python/py/models/builders/gptoss.py | Always transpose HF expert weights to output-major `[E, N, K]` for all EPs (the old CUDA-only special case swapped `N`/`K` and silently corrupted quantized weights/scales). |
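
The validation rules in the table above can be sketched roughly as follows. This is an illustration only, not the builder's code: `validate_qmoe_config` and its signature are hypothetical, `pack = 8 // bits` is an assumption about what `pack` means, and raising when `weights_prepacked` is set on a non-CUDA EP is also an assumption (the PR only says the attribute is CUDA-only).

```python
# Hypothetical helper; only the accepted values come from the PR description.
SUPPORTED_BLOCK_SIZES = (32, 64, 128)
PREPACKED_STATES = (-1, 0, 1)  # -1 auto/omit, 1 prepacked, 0 raw


def validate_qmoe_config(ep, bits, n, block_size, extra_options):
    """Return (block_size attribute or None, weights_prepacked) or raise ValueError."""
    if bits not in (4, 8):
        raise ValueError(f"QMoE expects 4 or 8 bits, got {bits}")
    pack = 8 // bits  # assumed: number of quantized values stored per byte

    prepacked = int(extra_options.get("qmoe_weights_prepacked", -1))
    if prepacked not in PREPACKED_STATES:
        raise ValueError(f"qmoe_weights_prepacked must be -1, 0 or 1, got {prepacked}")
    if prepacked != -1 and ep != "cuda":
        # Assumed behavior: reject rather than silently ignore on other EPs.
        raise ValueError("weights_prepacked is only meaningful on the CUDA EP")

    if block_size <= 0:
        block_attr = None  # per-channel: the block_size attribute is omitted
    elif block_size in SUPPORTED_BLOCK_SIZES:
        block_attr = block_size
    else:
        raise ValueError(f"block_size must be one of {SUPPORTED_BLOCK_SIZES}, got {block_size}")

    if n % pack != 0:
        raise ValueError(f"N={n} must be divisible by pack={pack}")
    return block_attr, prepacked
```

For example, `validate_qmoe_config("cuda", 4, 2880, 0, {"qmoe_weights_prepacked": "1"})` would select the per-channel path with prepacked weights, while a block size of 16 raises instead of silently falling back.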

Root cause fixed: the previous CUDA path applied `abs()` to the blockwise scales. `quantize_matmul_4bits` folds the block-anchor sign into the scale and the kernel dequantizes as `(q − 8) · scale`, so `abs()` corrupted every block whose anchor element is negative (~half the blocks), yielding garbage weights that merely looked like INT4 quality loss.
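
To make the failure mode concrete, here is a toy sketch of a signed-scale 4-bit scheme. It is not the ONNX Runtime routine: the `scale = anchor / 8` convention and the helper names are assumptions chosen to match the description above (sign folded into the scale, dequantization as `(q - 8) * scale`).

```python
def quantize_block(block):
    """Toy signed-scale quantizer: the anchor is the element with the largest magnitude."""
    anchor = max(block, key=abs)
    scale = anchor / 8.0 if anchor != 0.0 else 1.0  # sign of the anchor lives in the scale
    codes = [min(15, max(0, round(x / scale) + 8)) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Kernel-style dequantization: (q - 8) * scale."""
    return [(q - 8) * scale for q in codes]


def max_error(block, restored):
    return max(abs(a - b) for a, b in zip(block, restored))


blocks = [
    [0.5, -1.0, 0.25, 0.75],   # anchor is negative, so the scale is negative
    [1.0, -0.5, 0.25, -0.75],  # anchor is positive, so the scale is positive
]
for block in blocks:
    codes, scale = quantize_block(block)
    signed = dequantize_block(codes, scale)        # scales kept as produced
    stripped = dequantize_block(codes, abs(scale))  # the old abs() behavior
    print(f"scale={scale:+.3f} "
          f"signed_err={max_error(block, signed):.3f} "
          f"abs_err={max_error(block, stripped):.3f}")
```

In this toy setup the block with the negative anchor is reconstructed with every element sign-flipped once `abs()` is applied, while the block with the positive anchor is unaffected, which is why the damage only shows up in a subset of blocks.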

Decoupled INT4 method vs. INT8 bit placement

| File | Change |
| --- | --- |
| src/python/py/models/builder.py | New orthogonal flags `last_matmul_weight_int8`, `int8_mixed_layers`, `int8_linear_attn`; `int4_algo_config` reduced to base methods `default`/`rtn`/`k_quant` with help text updated. |
| src/python/py/models/builders/base.py | New `resolve_int4_quant_config` splits `int4_algo_config` into a base method + placement flags; legacy compound names (`rtn_last`, `k_quant_last`, `k_quant_mixed`, `k_quant_linear`) are kept as aliases. |
| src/python/py/models/builders/base.py | New `make_bit_placement_config` builds the per-node INT8 `customized_weight_config`; `make_algo_config` now takes a base method + config and raises on unknown methods. |
| src/python/py/models/builders/base.py | `to_int4` supports INT8 placement on top of the DEFAULT method via a two-pass scheme (INT4 body first, then an RTN INT8 pass for designated nodes, since DEFAULT 8-bit QOperator output is not consumed correctly by the runtime kernel). |
| src/python/py/models/quantization.md | New doc explaining base methods, INT8 placement flags, legacy aliases, accuracy notes (MMLU numbers), and examples. |
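
The split between base method and placement flags might look roughly like the sketch below. The function name `resolve_int4_config_sketch` is hypothetical, and the alias-to-flag mapping is inferred from the names only; the real `resolve_int4_quant_config` may map the legacy values differently (for example, one alias could imply more than one flag).

```python
BASE_METHODS = ("default", "rtn", "k_quant")
PLACEMENT_FLAGS = ("last_matmul_weight_int8", "int8_mixed_layers", "int8_linear_attn")

# Assumed mapping from legacy compound names to (base method, implied flag).
LEGACY_ALIASES = {
    "rtn_last": ("rtn", "last_matmul_weight_int8"),
    "k_quant_last": ("k_quant", "last_matmul_weight_int8"),
    "k_quant_mixed": ("k_quant", "int8_mixed_layers"),
    "k_quant_linear": ("k_quant", "int8_linear_attn"),
}


def resolve_int4_config_sketch(name, extra_options=None):
    """Return (base_method, {flag: bool}) for a new-style or legacy config name."""
    extra_options = extra_options or {}
    flags = {flag: bool(extra_options.get(flag, False)) for flag in PLACEMENT_FLAGS}
    if name in LEGACY_ALIASES:
        name, implied_flag = LEGACY_ALIASES[name]
        flags[implied_flag] = True
    if name not in BASE_METHODS:
        raise ValueError(f"Unknown int4_algo_config: {name!r}")
    return name, flags
```

Under this mapping, `resolve_int4_config_sketch("k_quant_last")` and `resolve_int4_config_sketch("k_quant", {"last_matmul_weight_int8": True})` resolve to the same pair, which is the property the backward-compatibility claim relies on.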

Q/K-norm fusion into GroupQueryAttention

| File | Change |
| --- | --- |
| src/python/py/models/builders/base.py | New `is_fused_qk_norm_gqa_supported` and `q_norm_weight`/`k_norm_weight` GQA inputs to pass Q/K-norm weights directly into `GroupQueryAttention` (CUDA/WebGPU). Gated by the `fuse_qk_norm_gqa` extra option (default off). |
| src/python/py/models/builder.py | Register `fuse_qk_norm_gqa` extra option. |
| src/python/py/models/README.md | Document the QK-norm fusion toggle. |

Robustness / misc

src/python/py/models/builders/base.py: `enable_skip_layer_norm_strict_mode` is now version-gated: disabled on ONNX Runtime ≥ 1.27 (FP32 accumulation by default, faster) and kept enabled on older supported versions to avoid an accuracy regression (new `onnxruntime_version_at_least` helper).
Error chaining (`raise ... from e`) added to QMoE quantization failures.
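
A version gate of this kind could be written along the following lines. This is a sketch under assumptions, not the helper from the PR: `version_at_least` and `strict_mode_enabled` are hypothetical names, the version string is passed in rather than read from the installed package, and pre-release suffixes such as `.dev...` are ignored so a 1.27 dev build counts as 1.27.

```python
import re


def version_at_least(installed, minimum):
    """Compare dotted version strings on their leading major.minor[.patch] numbers."""
    def parse(version):
        match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
        if match is None:
            raise ValueError(f"Unrecognized version string: {version!r}")
        # A missing patch component is treated as 0.
        return tuple(int(part or 0) for part in match.groups())

    return parse(installed) >= parse(minimum)


def strict_mode_enabled(installed_version):
    """Keep strict mode on for runtimes older than 1.27, off from 1.27 onward."""
    return not version_at_least(installed_version, "1.27.0")
```

Comparing integer tuples rather than raw strings matters here: a plain string comparison would order "1.100.0" before "1.27.0".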

Tests

test/python/builder/test_qmoe_weights.py: New unit tests covering CUDA path dispatch (prepacked/raw/per-channel), EP gating, unsupported block-size rejection, per-channel shapes, and a regression guard asserting blockwise scales stay signed. Includes a hardened dependency stub that no longer shadows a real onnxruntime wheel.

Testing

Run the new builder unit tests:
```
python -m pytest -sv test/python/builder/test_qmoe_weights.py
```
CUDA-only shape/sign regression tests auto-skip when the ONNX Runtime CUDA pybind (`quantize_matmul_4bits`, `pack_weights_for_cuda_mixed_gemm`) is unavailable.
End-to-end: build a GPT-OSS QMoE model on CUDA and verify generation quality (the encoding fix restores expected INT4 accuracy vs. the previous garbage output).
Backward compatibility: legacy `int4_algo_config` compound values (`rtn_last`, `k_quant_last`, `k_quant_mixed`, `k_quant_linear`) still produce identical models via the alias mapping.

Motivation and Context

Replaces #2209. The CUDA QMoE export path was producing weights the com.microsoft QMoE kernel could not consume correctly (sign-stripped scales + wrong N/K orientation), which manifested as severe quality loss on GPT-OSS / Qwen3.5 MoE models. Alongside the fix, the quantization configuration surface is reworked so INT4 rounding method and INT8 placement are independent, which is needed for the mixed-precision recipes these models rely on.

Checklist

Tests added/updated
Documentation updated (quantization.md, README.md)
No breaking changes (legacy int4_algo_config values preserved as aliases)
CI passes

Copilot

Pull request overview

This PR updates the Python model builder’s quantization and QMoE expert-weight handling (notably for GPT-OSS/Qwen MoE scenarios), adding clearer configuration semantics and CUDA-specific QMoE weight encodings, plus documentation and tests.

Changes:

Decouples INT4 quantization base method (default/rtn/k_quant) from orthogonal INT8 bit-placement flags (last_matmul_weight_int8, int8_mixed_layers, int8_linear_attn) while keeping legacy aliases working.
Adjusts QMoE expert-weight layout handling (including CUDA prepacked vs raw pathways) and updates GPT-OSS MoE weight transposition assumptions.
Adds documentation for INT4/INT8 quantization configuration and introduces targeted QMoE CUDA-path unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/python/models/test_qmoe_weights.py | Adds unit tests validating CUDA QMoE weight-quantization dispatch, block-size validation, and signed-scale regression behavior. |
| src/python/py/models/quantization.md | New documentation describing INT4 base methods and orthogonal INT8 bit-placement flags (and legacy aliases). |
| src/python/py/models/builders/gptoss.py | Ensures expert-weight tensors are transposed consistently to match downstream MoE/QMoE consumers. |
| src/python/py/models/builders/base.py | Implements the new quant config resolution, DEFAULT+INT8 two-pass logic, and CUDA QMoE prepacked/raw quantization paths. |
| src/python/py/models/builder.py | Adds new boolean extra-options for bit-placement flags and updates CLI help text for quantization configuration. |

tianleiwu · 2026-06-23T07:13:45Z

/azp run Integration Tests

azure-pipelines · 2026-06-23T07:13:56Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

tianleiwu · 2026-06-25T22:23:44Z

/azp run Integration Tests

azure-pipelines · 2026-06-25T22:23:56Z

Azure Pipelines successfully started running 1 pipeline(s).

update model builder for GPT-OSS

5238e7e

tianleiwu requested a review from a team as a code owner June 19, 2026 17:22

Copilot AI review requested due to automatic review settings June 19, 2026 17:22

Copilot started reviewing on behalf of tianleiwu June 19, 2026 17:22

Copilot AI reviewed Jun 19, 2026

kunal-vaishnavi reviewed Jun 19, 2026

Comment thread src/python/py/models/builders/base.py Outdated

kunal-vaishnavi reviewed Jun 19, 2026

Comment thread src/python/py/models/builders/base.py Outdated

tianleiwu mentioned this pull request Jun 20, 2026

[CUDA] GPT-OSS-20B Throughput Optimization microsoft/onnxruntime#29160

Open

tianleiwu added 4 commits June 20, 2026 01:15

support qk norm fusion

6329639

address feedbacks

54699e2

address feedbacks in #2209

2e7f0ae

support per-channel quantization

be79bb5

tianleiwu requested review from Copilot and kunal-vaishnavi June 23, 2026 07:10

Copilot started reviewing on behalf of tianleiwu June 23, 2026 07:10

Copilot AI reviewed Jun 23, 2026

Comment thread src/python/py/models/builders/base.py

Comment thread src/python/py/models/builders/base.py

Comment thread src/python/py/models/builders/base.py

Comment thread test/python/builder/test_qmoe_weights.py

tianleiwu added 2 commits June 23, 2026 00:26

fuse_qk_norm_gqa=0 default

324d2cc

address review feedback: version-gate strict mode, fix k_quant int8, …

d462d62

…add QMoE N check, harden test stub

tianleiwu requested a review from Copilot June 24, 2026 01:48

Copilot started reviewing on behalf of tianleiwu June 24, 2026 01:49

Copilot AI reviewed Jun 24, 2026

Comment thread src/python/py/models/quantization.md Outdated

Comment thread src/python/py/models/README.md Outdated

Comment thread src/python/py/models/README.md Outdated

Comment thread src/python/py/models/builder.py Outdated

Apply suggestions from code review

09a173c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

tianleiwu mentioned this pull request Jun 26, 2026

QMoE per-channel quantization produces garbage results on CUDA EP, correct on CPU EP microsoft/onnxruntime#29138

Open
