Commit 7cf33d3

Address QMoE review feedback on SM80 prepack docs and checks
1 parent d8c2264 commit 7cf33d3

2 files changed

Lines changed: 9 additions & 4 deletions

docs/contrib_ops/cuda/moe_qmoe.md

Lines changed: 3 additions & 3 deletions
@@ -71,7 +71,7 @@ input tokens → router (top-k softmax) → permute by expert
 | `expert_weight_bits` (QMoE only) | int | 4 | 4 (INT4/MXFP4) or 8 (INT8/FP8). |
 | `block_size` (QMoE only) | int | -1 | Group size for INT4/INT8 group-wise quantization. -1 = per-output-channel. |
 | `quant_type` (QMoE only) | string | `"int"` | `"int"`, `"fp4"`, `"fp8"`, `"wfp4afp8"`. See [§3](#3-quantization-modes). |
-| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in the EP's SM90 (Hopper) layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()` for the runtime arch. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). |
+| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in an alternate EP-selected layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()`. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). |
 
 ### 2.2 Type Constraints
 
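The `weights_prepacked` tri-state in the table row above reduces to a small mapping for the CUDA EP. The following is a minimal sketch of that mapping, not the ONNX Runtime implementation: `IntWeightLayout` and `ResolveCudaIntWeightLayout` are hypothetical names introduced here, and rejecting values outside {-1, 0, 1} is an assumption the commit does not confirm.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical names for illustration only; not taken from the ONNX Runtime sources.
enum class IntWeightLayout {
  kSm80FpAIntB,      // already in the SM80 (Ampere) CUTLASS fpA_intB layout
  kRawNeedsPrePack,  // raw [E, N, K/pack]; PrePack() must run the layout transform
};

// Maps the weights_prepacked attribute to what the CUDA EP expects today,
// following the doc row: -1 and 1 both mean the SM80 layout, 0 means raw.
inline IntWeightLayout ResolveCudaIntWeightLayout(int64_t weights_prepacked) {
  switch (weights_prepacked) {
    case -1:  // EP default layout
    case 1:   // alternate EP-selected layout; equivalent to -1 on CUDA today
      return IntWeightLayout::kSm80FpAIntB;
    case 0:   // raw initializers, transformed in PrePack()
      return IntWeightLayout::kRawNeedsPrePack;
    default:  // assumption: any other value is treated as an error
      throw std::invalid_argument("weights_prepacked must be -1, 0, or 1");
  }
}
```

Collapsing `-1` and `1` onto one enumerator keeps the current equivalence explicit while leaving room to split them if a Hopper-specific layout is added later.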
@@ -1017,8 +1017,8 @@ over-aligned by-value parameters.
 - **In-`PrePack` INT weight layout transform** (`weights_prepacked=0`) is
 currently covered only by a smoke test (`TestQMoEIntPrePackSmoke`), not a
 bit-parity check: the existing offline pre-pack harness hardcodes
-`force_arch=80` and produces incorrect output on SM≥90, so a parity
-comparison against it is omitted until that harness honours the runtime SM.
+`force_arch=80` (the same SM80 layout consumed by the CUDA EP on all GPUs),
+so a separate parity harness for this path is still pending.
 - **Hopper W4A8** (INT4 weight + FP8 activation) is not supported — TRT-LLM gates
 its fast path to SM89 only.
 
onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc

Lines changed: 6 additions & 1 deletion
@@ -67,7 +67,7 @@ QMoE::QMoE(const OpKernelInfo& op_kernel_info) : CudaKernel(op_kernel_info), MoE
 // concrete prepacked layouts selected by -1 and 1 are determined by the
 // execution provider. The CUDA EP maps the tri-state as:
 // -1 (default): already prepacked in the EP's default int weight layout.
-// 1: already prepacked in the EP's SM90 (Hopper) int weight layout.
+// 1: already prepacked in an alternate EP-selected int weight layout.
 // 0: raw [E, N, K/pack] initializers; the PrePack hook lays them out.
 //
 // Important: the CUDA QMoE int4/int8 MoE GEMM always dispatches to the
@@ -77,6 +77,8 @@ QMoE::QMoE(const OpKernelInfo& op_kernel_info) : CudaKernel(op_kernel_info), MoE
 // consumes the SM80/Ampere CUTLASS fpA_intB layout on every GPU. As a result
 // the EP default (-1) is the SM80 layout regardless of the runtime device SM,
 // and SM80-format weights are valid on SM90 (they run via the SM80 kernel).
+// For CUDA today, -1 and 1 are equivalent (both SM80 layout), and 1 is
+// reserved for a possible future Hopper-specific layout.
 // PrePack (weights_prepacked=0) packs for the SM80 layout accordingly.
 const int64_t weights_prepacked_mode =
 op_kernel_info.GetAttrOrDefault<int64_t>("weights_prepacked", static_cast<int64_t>(-1));
@@ -1154,6 +1156,9 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al
 IAllocatorUniquePtr<void>& packed_buf, bool& is_packed) {
 ORT_ENFORCE(expert_weight_bits_ == 4 || expert_weight_bits_ == 8,
 "PrePackIntExpertWeights: only 4 and 8 bits are supported, got ", expert_weight_bits_);
+ORT_ENFORCE(sm_ >= 75,
+"PrePackIntExpertWeights: quant_type='int' with weights_prepacked=0 requires SM75+ CUDA hardware, got SM",
+sm_);
 const auto& shape = tensor.Shape();
 ORT_ENFORCE(shape.NumDimensions() == 3,
 "PrePackIntExpertWeights: expected 3-D weight tensor [E, N, K/pack], got ndim=",

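The hunk above adds an SM75+ guard between the existing bit-width and rank checks. A minimal standalone sketch of the same ordering is shown below, assuming nothing beyond what the diff displays. `CheckIntPrePackPreconditions` is a hypothetical helper, not an ONNX Runtime function, and it returns a message string where the kernel fails via `ORT_ENFORCE`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical standalone mirror of the checks at the top of
// QMoE::PrePackIntExpertWeights. Returns an empty string when every
// precondition holds, otherwise a message for the first failed check.
// The real kernel fails via ORT_ENFORCE instead of returning a string.
inline std::string CheckIntPrePackPreconditions(int64_t expert_weight_bits, int sm,
                                                const std::vector<int64_t>& shape) {
  // Only INT4 and INT8 expert weights are supported.
  if (expert_weight_bits != 4 && expert_weight_bits != 8) {
    return "only 4 and 8 bits are supported, got " + std::to_string(expert_weight_bits);
  }
  // Guard added by this commit: the in-PrePack transform needs SM75 or newer.
  if (sm < 75) {
    return "weights_prepacked=0 requires SM75+ CUDA hardware, got SM" + std::to_string(sm);
  }
  // Raw initializers must be 3-D: [E, N, K/pack].
  if (shape.size() != 3) {
    return "expected 3-D weight tensor [E, N, K/pack], got ndim=" +
           std::to_string(shape.size());
  }
  return std::string();
}
```

Checking the SM before inspecting the tensor shape means an unsupported device is reported first, regardless of how the weights are laid out.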