Update model builder for gpt-oss #2234

Open: tianleiwu wants to merge 8 commits into main from tlwu/update_model_builder_gpt_oss

@tianleiwu tianleiwu commented Jun 19, 2026

Contributor

Update model builder for GPT-OSS (QMoE encoding fix, decoupled INT4/INT8 quantization, QK-norm fusion)

Description

This PR overhauls the Python model builder to correctly export GPT-OSS (and related MoE/hybrid) models. It fixes a CUDA QMoE expert-weight encoding bug that produced garbage INT4 weights, decouples the INT4 quantization method from INT8 bit placement (so any base method can be combined with any INT8 upgrade policy), adds per-channel QMoE quantization, and introduces optional Q/K-norm fusion into GroupQueryAttention. It supersedes #2209.

Summary of Changes

QMoE expert-weight quantization (CUDA)

| File | Change |
| --- | --- |
| src/python/py/models/builders/base.py | New _cutlass_prepacked_blockwise_quantize quantizes each expert [N, K] with ONNX Runtime's blockwise quantize_matmul_4bits/8bits on the transposed [K, N] weight, keeps the signed scales, and offline CUTLASS-prepacks via pack_weights_for_cuda_mixed_gemm(..., force_arch=80). This is the exact encoding validated by the com.microsoft QMoE CUDA parity tests. |
| src/python/py/models/builders/base.py | New _matmulnbits_blockwise_quantize emits raw [N, K/pack] weights + blockwise scales for the runtime PrePack hook (used when qmoe_weights_prepacked=0). |
| src/python/py/models/builders/base.py | New _symmetric_per_channel_quantize / _cuda_per_channel_quantize add per-channel QMoE quantization (qmoe_block_size <= 0), omitting the block_size attribute. |
| src/python/py/models/builders/base.py | make_qmoe_weights routes CUDA QMoE through the prepacked/raw/per-channel paths, validates block_size ∈ {32, 64, 128} and N % pack == 0 with real exceptions, and removes the deprecated column-wise (TensorRT-LLM) path. |
| src/python/py/models/builders/base.py | New tri-state weights_prepacked QMoE attribute (-1 auto/omit, 1 prepacked, 0 raw), CUDA-only, validated to {-1, 0, 1}; override via extra_options["qmoe_weights_prepacked"]. |
| src/python/py/models/builders/gptoss.py | Always transpose HF expert weights to output-major [E, N, K] for all EPs (the old CUDA-only special case swapped N/K and silently corrupted quantized weights/scales). |
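The option checks listed in the table can be sketched roughly as follows. This is an illustration only, not the code in base.py: the helper name `check_qmoe_options`, its parameters, and the returned mode labels are hypothetical. Only the rules themselves (block size in {32, 64, 128}, `N % pack == 0`, per-channel when the block size is `<= 0`, tri-state `weights_prepacked` in {-1, 0, 1}) come from the description.

```python
SUPPORTED_BLOCK_SIZES = {32, 64, 128}
PREPACKED_STATES = {-1, 0, 1}  # -1 auto/omit, 1 prepacked, 0 raw


def check_qmoe_options(n, bits, block_size, weights_prepacked=-1):
    """Hypothetical helper: validate QMoE options and name the quantization mode."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")

    # Quantized values stored per byte: 2 for INT4, 1 for INT8.
    pack = 8 // bits
    if n % pack != 0:
        raise ValueError(f"N={n} must be divisible by pack={pack}")

    if weights_prepacked not in PREPACKED_STATES:
        raise ValueError(
            f"weights_prepacked must be one of {sorted(PREPACKED_STATES)}, "
            f"got {weights_prepacked}"
        )

    # A non-positive block size selects per-channel quantization,
    # in which case no block_size attribute is emitted.
    if block_size <= 0:
        return "per_channel"

    if block_size not in SUPPORTED_BLOCK_SIZES:
        raise ValueError(
            f"block_size must be one of {sorted(SUPPORTED_BLOCK_SIZES)}, "
            f"got {block_size}"
        )
    return "blockwise"
```

Raising `ValueError` instead of using `assert` keeps the checks active under `python -O`, which is presumably what "real exceptions" refers to.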

Root cause fixed: the previous CUDA path applied abs() to the blockwise scales. quantize_matmul_4bits folds the block-anchor sign into the scale and the kernel dequantizes as (q − 8) · scale, so abs() corrupted every block whose anchor element is negative (~half the blocks), yielding garbage weights that merely looked like INT4 quality loss.
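A small NumPy illustration of why stripping the sign matters. The codes and scales below are made up, and the snippet does not call ONNX Runtime; it only applies the dequantization formula quoted above, (q − 8) · scale, once with signed scales and once with `abs()` applied.

```python
import numpy as np

# Two blocks of unsigned 4-bit codes and their signed blockwise scales.
# These values are illustrative, not taken from a real model.
q = np.array([[0, 8, 15], [0, 8, 15]], dtype=np.uint8)
scales = np.array([[0.5], [-0.25]], dtype=np.float32)


def dequantize(codes, block_scales):
    """Dequantize as (q - 8) * scale, one scale per block (row)."""
    return (codes.astype(np.float32) - 8.0) * block_scales


correct = dequantize(q, scales)
broken = dequantize(q, np.abs(scales))  # scales with the sign stripped

# The block with a positive scale is unaffected; the block with a
# negative scale comes back with every value sign-flipped.
print(correct)
print(broken)
```

Because the magnitudes are still right, the flipped block has a plausible value range, which is consistent with the corruption being mistaken for ordinary quantization loss.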

Decoupled INT4 method vs. INT8 bit placement

| File | Change |
| --- | --- |
| src/python/py/models/builder.py | New orthogonal flags last_matmul_weight_int8, int8_mixed_layers, int8_linear_attn; int4_algo_config reduced to base methods default/rtn/k_quant with help text updated. |
| src/python/py/models/builders/base.py | New resolve_int4_quant_config splits int4_algo_config into a base method + placement flags; legacy compound names (rtn_last, k_quant_last, k_quant_mixed, k_quant_linear) are kept as aliases. |
| src/python/py/models/builders/base.py | New make_bit_placement_config builds the per-node INT8 customized_weight_config; make_algo_config now takes a base method + config and raises on unknown methods. |
| src/python/py/models/builders/base.py | to_int4 supports INT8 placement on top of the DEFAULT method via a two-pass scheme (INT4 body first, then an RTN INT8 pass for designated nodes, since DEFAULT 8-bit QOperator output is not consumed correctly by the runtime kernel). |
| src/python/py/models/quantization.md | New doc explaining base methods, INT8 placement flags, legacy aliases, accuracy notes (MMLU numbers), and examples. |
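The split between base method and placement flags can be pictured with a sketch like the one below. It is not the PR's resolve_int4_quant_config: the function name `resolve_int4_alias` is hypothetical, and which flag each legacy alias implies is a guess from the names alone (for example, `k_quant_mixed` may enable more than one flag in the real code).

```python
BASE_METHODS = {"default", "rtn", "k_quant"}
PLACEMENT_FLAGS = (
    "last_matmul_weight_int8",
    "int8_mixed_layers",
    "int8_linear_attn",
)

# Assumed mapping: legacy compound name -> (base method, implied flag).
LEGACY_ALIASES = {
    "rtn_last": ("rtn", "last_matmul_weight_int8"),
    "k_quant_last": ("k_quant", "last_matmul_weight_int8"),
    "k_quant_mixed": ("k_quant", "int8_mixed_layers"),
    "k_quant_linear": ("k_quant", "int8_linear_attn"),
}


def resolve_int4_alias(int4_algo_config, extra_options=None):
    """Hypothetical helper: return (base_method, placement_flags)."""
    extra_options = extra_options or {}
    # Start from the explicitly requested flags (all off by default).
    flags = {name: bool(extra_options.get(name, False)) for name in PLACEMENT_FLAGS}

    if int4_algo_config in LEGACY_ALIASES:
        base, implied_flag = LEGACY_ALIASES[int4_algo_config]
        flags[implied_flag] = True
    elif int4_algo_config in BASE_METHODS:
        base = int4_algo_config
    else:
        raise ValueError(f"Unknown int4_algo_config: {int4_algo_config!r}")
    return base, flags
```

Under this sketch, `resolve_int4_alias("k_quant_last")` and `resolve_int4_alias("k_quant", {"last_matmul_weight_int8": True})` resolve to the same pair, which is the property the backward-compatibility claim depends on.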

Q/K-norm fusion into GroupQueryAttention

| File | Change |
| --- | --- |
| src/python/py/models/builders/base.py | New is_fused_qk_norm_gqa_supported and q_norm_weight/k_norm_weight GQA inputs to pass Q/K-norm weights directly into GroupQueryAttention (CUDA/WebGPU). Gated by the fuse_qk_norm_gqa extra option (default off). |
| src/python/py/models/builder.py | Register fuse_qk_norm_gqa extra option. |
| src/python/py/models/README.md | Document the QK-norm fusion toggle. |
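As a rough picture of the gating, the sketch below combines the three conditions named in the table: the option is requested, the model has Q/K-norm weights, and the execution provider is CUDA or WebGPU. The helper name `should_fuse_qk_norm`, the EP strings, and the accepted truthy spellings of the option value are assumptions, not the logic of is_fused_qk_norm_gqa_supported.

```python
FUSION_EPS = {"cuda", "webgpu"}  # assumed EP identifiers
TRUTHY = {"1", "true"}  # assumed spellings of an enabled option


def should_fuse_qk_norm(ep, extra_options, has_qk_norm):
    """Hypothetical helper: decide whether to pass Q/K-norm weights to GQA."""
    # The option defaults to off when it is absent.
    raw = extra_options.get("fuse_qk_norm_gqa", "false")
    requested = str(raw).strip().lower() in TRUTHY
    return requested and has_qk_norm and ep in FUSION_EPS
```

For example, `should_fuse_qk_norm("cuda", {}, True)` is `False` because the option is off by default, while passing `{"fuse_qk_norm_gqa": "true"}` turns it on for CUDA and WebGPU only.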

Robustness / misc

  • src/python/py/models/builders/base.py: enable_skip_layer_norm_strict_mode is now version-gated — disabled on ONNX Runtime ≥ 1.27 (FP32 accumulation by default, faster) and kept enabled on older supported versions to avoid an accuracy regression (new onnxruntime_version_at_least helper).
  • Error chaining (raise ... from e) added to QMoE quantization failures.
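A version gate of this kind can be sketched with the standard library alone. The code below is not the PR's onnxruntime_version_at_least helper: `version_at_least` and `use_strict_skip_layer_norm` are hypothetical names, and the comparison deliberately ignores pre-release suffixes, so a `1.27.0.dev` build counts as 1.27.0.

```python
import re


def _numeric_parts(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev20260101"
        parts.append(int(match.group()))
    return tuple(parts)


def _pad(parts, width):
    """Right-pad a version tuple with zeros so tuples compare by position."""
    return parts + (0,) * (width - len(parts))


def version_at_least(installed, required):
    """Hypothetical helper: True if installed >= required, numerically."""
    a, b = _numeric_parts(installed), _numeric_parts(required)
    width = max(len(a), len(b))
    return _pad(a, width) >= _pad(b, width)


def use_strict_skip_layer_norm(ort_version):
    """Strict mode stays on only for runtimes older than 1.27."""
    return not version_at_least(ort_version, "1.27")
```

Comparing integer tuples rather than strings avoids the classic pitfall where "1.100" sorts before "1.27" lexically.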

Tests

  • test/python/builder/test_qmoe_weights.py: New unit tests covering CUDA path dispatch (prepacked/raw/per-channel), EP gating, unsupported block-size rejection, per-channel shapes, and a regression guard asserting blockwise scales stay signed. Includes a hardened dependency stub that no longer shadows a real onnxruntime wheel.

Testing

  • Run the new builder unit tests:
    python -m pytest -sv test/python/builder/test_qmoe_weights.py
    CUDA-only shape/sign regression tests auto-skip when the ONNX Runtime CUDA pybind (quantize_matmul_4bits, pack_weights_for_cuda_mixed_gemm) is unavailable.
  • End-to-end: build a GPT-OSS QMoE model on CUDA and verify generation quality (the encoding fix restores expected INT4 accuracy vs. the previous garbage output).
  • Backward compatibility: legacy int4_algo_config compound values (rtn_last, k_quant_last, k_quant_mixed, k_quant_linear) still produce identical models via the alias mapping.
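The auto-skip behaviour can be approximated as below. This is a sketch under assumptions: it uses `unittest` as a stdlib stand-in for the pytest-based tests in the PR, it assumes both functions are exposed from `onnxruntime.capi._pybind_state`, and the helper name `cuda_quant_pybind_available` and the test class are hypothetical.

```python
import importlib
import unittest

REQUIRED_FUNCTIONS = ("quantize_matmul_4bits", "pack_weights_for_cuda_mixed_gemm")


def cuda_quant_pybind_available(module_name="onnxruntime.capi._pybind_state"):
    """Hypothetical helper: True if the module imports and exposes both functions."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Covers a missing wheel as well as a wheel whose native
        # libraries fail to load on this machine.
        return False
    return all(hasattr(module, name) for name in REQUIRED_FUNCTIONS)


@unittest.skipUnless(
    cuda_quant_pybind_available(),
    "ONNX Runtime CUDA quantization pybind is not available",
)
class CudaQuantPybindTest(unittest.TestCase):
    def test_pybind_present(self):
        # Placeholder body; the real shape/sign checks would go here.
        self.assertTrue(cuda_quant_pybind_available())
```

Probing for the functions by name, rather than only for the package, lets CPU-only wheels skip cleanly instead of failing inside the test body.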

Motivation and Context

Replaces #2209. The CUDA QMoE export path was producing weights the com.microsoft QMoE kernel could not consume correctly (sign-stripped scales + wrong N/K orientation), which manifested as severe quality loss on GPT-OSS / Qwen3.5 MoE models. Alongside the fix, the quantization configuration surface is reworked so INT4 rounding method and INT8 placement are independent, which is needed for the mixed-precision recipes these models rely on.

Checklist

  • Tests added/updated
  • Documentation updated (quantization.md, README.md)
  • No breaking changes (legacy int4_algo_config values preserved as aliases)
  • CI passes

@tianleiwu tianleiwu requested a review from a team as a code owner June 19, 2026 17:22
Copilot AI review requested due to automatic review settings June 19, 2026 17:22

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Python model builder’s quantization and QMoE expert-weight handling (notably for GPT-OSS/Qwen MoE scenarios), adding clearer configuration semantics and CUDA-specific QMoE weight encodings, plus documentation and tests.

Changes:

  • Decouples INT4 quantization base method (default/rtn/k_quant) from orthogonal INT8 bit-placement flags (last_matmul_weight_int8, int8_mixed_layers, int8_linear_attn) while keeping legacy aliases working.
  • Adjusts QMoE expert-weight layout handling (including CUDA prepacked vs raw pathways) and updates GPT-OSS MoE weight transposition assumptions.
  • Adds documentation for INT4/INT8 quantization configuration and introduces targeted QMoE CUDA-path unit tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| test/python/models/test_qmoe_weights.py | Adds unit tests validating CUDA QMoE weight-quantization dispatch, block-size validation, and signed-scale regression behavior. |
| src/python/py/models/quantization.md | New documentation describing INT4 base methods and orthogonal INT8 bit-placement flags (and legacy aliases). |
| src/python/py/models/builders/gptoss.py | Ensures expert-weight tensors are transposed consistently to match downstream MoE/QMoE consumers. |
| src/python/py/models/builders/base.py | Implements the new quant config resolution, DEFAULT+INT8 two-pass logic, and CUDA QMoE prepacked/raw quantization paths. |
| src/python/py/models/builder.py | Adds new boolean extra-options for bit-placement flags and updates CLI help text for quantization configuration. |

Comment thread test/python/models/test_qmoe_weights.py
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py
Comment thread src/python/py/models/builders/base.py Outdated
Comment thread src/python/py/models/builders/base.py Outdated
@tianleiwu

Contributor Author

/azp run Integration Tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread src/python/py/models/builders/base.py
Comment thread src/python/py/models/builders/base.py
Comment thread src/python/py/models/builders/base.py
Comment thread test/python/builder/test_qmoe_weights.py

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread src/python/py/models/quantization.md Outdated
Comment thread src/python/py/models/README.md Outdated
Comment thread src/python/py/models/README.md Outdated
Comment thread src/python/py/models/builder.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@tianleiwu

Contributor Author

/azp run Integration Tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


3 participants