[CUDA] GPT-OSS-20B Throughput Optimization #29160

Description

@tianleiwu

This issue tracks progress on GPT-OSS-20B throughput optimization.

Related PRs:

Batch size = 1, H200 GPU (TPS = tokens per second)

| Quantization | Size (GiB) | Prefill TPS | Decode TPS | MMLU |
| --- | --- | --- | --- | --- |
| rtn, mixed, per-channel MoE, int4 lm_head | 10.77 | 29403 | 382.0 | 0.8017 |
| k_quant, mixed, per-channel MoE, int4 lm_head | 10.79 | 29018 | 366.7 | 0.8128 |
| k_quant, mixed, block-64 MoE, **int8** lm_head | 11.61 | 24492 | 345.5 | 0.8170 |
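
As a reading aid, the rows above can be compared against the first (rtn) configuration to make the size, throughput, and accuracy trade-off explicit. The snippet below is a minimal sketch and not part of any benchmark tooling: the helper name `relative_to_baseline` and the choice of the rtn row as the baseline are assumptions made for illustration. The numbers are copied from the results above.

```python
# Rows copied from the results above:
# (quantization label, size in GiB, prefill TPS, decode TPS, MMLU)
ROWS = [
    ("rtn, mixed, per-channel MoE, int4 lm_head", 10.77, 29403.0, 382.0, 0.8017),
    ("k_quant, mixed, per-channel MoE, int4 lm_head", 10.79, 29018.0, 366.7, 0.8128),
    ("k_quant, mixed, block-64 MoE, int8 lm_head", 11.61, 24492.0, 345.5, 0.8170),
]


def relative_to_baseline(rows):
    """Hypothetical helper: express each row relative to the first row."""
    _, base_size, base_prefill, base_decode, base_mmlu = rows[0]
    summary = []
    for label, size, prefill, decode, mmlu in rows:
        summary.append({
            "label": label,
            # Percentage change versus the baseline row.
            "size_pct": 100.0 * (size / base_size - 1.0),
            "prefill_pct": 100.0 * (prefill / base_prefill - 1.0),
            "decode_pct": 100.0 * (decode / base_decode - 1.0),
            # Absolute change in MMLU score.
            "mmlu_delta": mmlu - base_mmlu,
            # Per-token decode latency implied by decode TPS at batch size 1.
            "decode_ms_per_token": 1000.0 / decode,
        })
    return summary


results = relative_to_baseline(ROWS)

for r in results:
    print(
        f"{r['label']}: size {r['size_pct']:+.1f}%, "
        f"prefill {r['prefill_pct']:+.1f}%, decode {r['decode_pct']:+.1f}%, "
        f"MMLU {r['mmlu_delta']:+.4f}, {r['decode_ms_per_token']:.2f} ms/token"
    )
```

Using the smallest configuration as the baseline is only one possible choice; passing the rows in a different order changes the reference point.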
