This is used to track the progress of GPT-OSS-20B throughput optimization.

Related PRs:
* olive-recipes
  * [Add GPT-OSS 20B recipes](https://github.com/microsoft/olive-recipes/pull/507) (olive-recipes#507)
* onnxruntime-genai
  * [Update model builder for gpt-oss](https://github.com/microsoft/onnxruntime-genai/pull/2234) (onnxruntime-genai#2234)
* CUDA kernel improvements
  * [[CUDA] QMoE GEMV fast path for batch-1 decode](https://github.com/microsoft/onnxruntime/pull/29038) (#29038)
  * [[CUDA] Optimize FlashDecode split planning for local-window GQA](https://github.com/microsoft/onnxruntime/pull/29161) (#29161)
  * [[CUDA] Enable XQA decode for GroupQueryAttention with attention sink](https://github.com/microsoft/onnxruntime/pull/29162) (#29162)
  * [[CUDA] Default QMoE GEMV fp16 accumulation for fp16 activations](https://github.com/microsoft/onnxruntime/pull/29166) (#29166)
  * [[CUDA] Add sliding-window support to XQA decode](https://github.com/microsoft/onnxruntime/pull/29177) (#29177)
  * [[CUDA]: Split-K2 QMoE SwiGLU GEMV kernel](https://github.com/microsoft/onnxruntime/pull/29167) (#29167) (Experiment)
* Fusion
  * [[CUDA] Enable CUDA GQA QK-Norm and XQA decode](https://github.com/microsoft/onnxruntime/pull/29186) (#29186)
  * [[CUDA] Fuse MoE router bias into MatMulNBits GEMV](https://github.com/microsoft/onnxruntime/pull/29170) (#29170): QMoE router fusion (Experiment)
* Experiments
  * [quantization recipes experiments](https://github.com/tianleiwu/olive-recipes/blob/tlwu/gpt-oss-20b/gpt-oss-20b/gpt_oss_20b_experiments.md)

***BatchSize=1, H200 GPU***

| Quantization | Size (GiB) | Prefill TPS | Decode TPS | MMLU |
|---|---:|---:|---:|---:|
| rtn, mixed, per-channel MoE, int4 lm_head | **10.77** | **29403** | **382.0** | 0.8017 |
| k_quant, mixed, per-channel MoE, int4 lm_head | 10.79 | 29018 | 366.7 | 0.8128 |
| k_quant, mixed, block-64 MoE, int8 lm_head | 11.61 | 24492 | 345.5 | **0.8170** |
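
To make the trade-off between the three quantization recipes in the BatchSize=1, H200 results easier to read, the percentage difference of each recipe against a chosen baseline can be computed directly from the reported numbers. The following is a minimal sketch: the figures are copied from the results above, while the helper name `relative_to` and the dictionary layout are illustrative assumptions, not part of any of the linked repositories.

```python
# Reported BatchSize=1, H200 results:
# (size in GiB, prefill TPS, decode TPS, MMLU), copied from the results table.
RESULTS = {
    "rtn, mixed, per-channel MoE, int4 lm_head": (10.77, 29403, 382.0, 0.8017),
    "k_quant, mixed, per-channel MoE, int4 lm_head": (10.79, 29018, 366.7, 0.8128),
    "k_quant, mixed, block-64 MoE, int8 lm_head": (11.61, 24492, 345.5, 0.8170),
}
METRICS = ("size_gib", "prefill_tps", "decode_tps", "mmlu")


def relative_to(baseline, results=RESULTS):
    """Return the percent change of every recipe versus `baseline`, per metric.

    `relative_to` is a hypothetical helper written for this comparison only.
    """
    base = results[baseline]
    return {
        name: {
            metric: (value / base_value - 1.0) * 100.0
            for metric, value, base_value in zip(METRICS, row, base)
        }
        for name, row in results.items()
    }


if __name__ == "__main__":
    deltas = relative_to("rtn, mixed, per-channel MoE, int4 lm_head")
    for name, delta in deltas.items():
        print(name)
        for metric in METRICS:
            print(f"  {metric:12s} {delta[metric]:+.2f}%")
```

Using the rtn recipe as the baseline puts the size and throughput cost of each k_quant variant next to its MMLU change, which is the comparison the results invite.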