[CUDA] GPT-OSS-20B Throughput Optimization

This issue tracks progress on GPT-OSS-20B throughput optimization.

Related PRs:
* olive-recipes
https://github.com/microsoft/olive-recipes/pull/507
 
* onnxruntime-genai
https://github.com/microsoft/onnxruntime-genai/pull/2234

* CUDA kernel improvements
https://github.com/microsoft/onnxruntime/pull/29038
https://github.com/microsoft/onnxruntime/pull/29161
https://github.com/microsoft/onnxruntime/pull/29162
https://github.com/microsoft/onnxruntime/pull/29166
https://github.com/microsoft/onnxruntime/pull/29177
https://github.com/microsoft/onnxruntime/pull/29167 (Experiment)

* fusion
https://github.com/microsoft/onnxruntime/pull/29186
https://github.com/microsoft/onnxruntime/pull/29170
QMoE router fusion (Experiment)

* Experiments
 [quantization recipes experiments](https://github.com/tianleiwu/olive-recipes/blob/tlwu/gpt-oss-20b/gpt-oss-20b/gpt_oss_20b_experiments.md)

***BatchSize=1, H200 GPU***

| Quantization | Size (GiB) | Prefill TPS | Decode TPS | MMLU |
|---|---:|---:|---:|---:|
| rtn, mixed, per-channel MoE, int4 lm_head | **10.77** | **29403** | **382.0** | 0.8017 |
| k_quant, mixed, per-channel MoE, int4 lm_head | 10.79 | 29018 | 366.7 | 0.8128 |
| k_quant, mixed, block-64 MoE, **int8 lm_head** | 11.61 | 24492 | 345.5 | **0.8170** |
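To make the size/speed/accuracy trade-off in the table easier to read, here is a minimal sketch that expresses each recipe relative to the first row (rtn). The numbers are copied from the table; the helper name `relative_to_baseline` and the choice of the rtn row as the baseline are assumptions made for illustration, not part of the benchmark tooling.

```python
# Rows copied from the table above:
# (recipe, size in GiB, prefill TPS, decode TPS, MMLU)
ROWS = [
    ("rtn, mixed, per-channel MoE, int4 lm_head", 10.77, 29403, 382.0, 0.8017),
    ("k_quant, mixed, per-channel MoE, int4 lm_head", 10.79, 29018, 366.7, 0.8128),
    ("k_quant, mixed, block-64 MoE, int8 lm_head", 11.61, 24492, 345.5, 0.8170),
]


def relative_to_baseline(rows, baseline_index=0):
    """Hypothetical helper: compare every row against one baseline row.

    Throughput is reported as a percentage change, size and MMLU as
    absolute differences.
    """
    _, base_size, base_prefill, base_decode, base_mmlu = rows[baseline_index]
    results = []
    for name, size, prefill, decode, mmlu in rows:
        results.append({
            "recipe": name,
            "size_delta_gib": size - base_size,
            "prefill_pct": (prefill / base_prefill - 1.0) * 100.0,
            "decode_pct": (decode / base_decode - 1.0) * 100.0,
            "mmlu_delta": mmlu - base_mmlu,
        })
    return results


for r in relative_to_baseline(ROWS):
    print(
        f"{r['recipe']}: size {r['size_delta_gib']:+.2f} GiB, "
        f"prefill {r['prefill_pct']:+.1f}%, decode {r['decode_pct']:+.1f}%, "
        f"MMLU {r['mmlu_delta']:+.4f}"
    )
```

Passing a different `baseline_index` compares the recipes against another row, for example the highest-MMLU configuration.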

[CUDA] GPT-OSS-20B Throughput Optimization #29160
