BenchMoE — SGLang MoE (mori + aiter.fused_moe) per-kernel bench

Per-kernel benchmark of one SGLang MoE layer on AMD MI355X, using the mori expert-parallel dispatcher plus aiter.fused_moe for the local compute.

knob	value
MoE class	`get_moe_impl_class(quant_config=...)` -> `MoriEPMoE`
a2a backend	`mori`
runner backend	`aiter` (informational; MoriEPMoE calls aiter.fused_moe directly)
profile	`torch.profiler` (kineto + roctracer), GPU activity only
launch	`torchrun --standalone --nproc_per_node=$NUM_GPUS`
platform	Compute-DCPT partition, MI355X, 1 node × 8 GPUs
container	`lmsysorg/sglang:v0.5.11-rocm720-mi35x`

Files

.
├── bench_moe.py          # the bench (mori + aiter only)
├── model_configs.json    # MoE shape catalog (moe_models.*)
├── run_all_models.sh     # docker + torchrun sweep across (model × quant × batch)
├── run_all_models.slurm  # SLURM wrapper (Compute-DCPT, 1 node × 8 GPU)
├── build_report.py       # aggregates logs/results_*/*.json → report/REPORT.md + summary.csv
└── logs/
    ├── results_<stamp>/<model>__q<quant>.json      # raw per-(model,quant) result
    ├── traces_<stamp>/<model>__q<quant>/…json      # chrome traces per (model,quant,bs,rank)
    └── sweep_<stamp>.log                            # console log

What gets benchmarked

Each forward of the MoE layer runs the full mori-EP pipeline:

TopK routing -> moe_align_sort -> mori dispatch (a2a)
             -> aiter fused_moe (gate-up GEMM + SwiGLU + down GEMM)
             -> mori combine (a2a)

Kernels observed in the kineto trace are bucketed by substring match (see KERNEL_GROUPS in bench_moe.py):

bucket	matches
`topk_softmax`	`moe_fused_gate`, `topk_softmax`, `biased_grouped_topk`
`moe_align_sort`	`ck_tile::MoeSortingMultiPhaseKernel_*`, `moe_align_block_size`
`dispatch`	mori `EpDispatch*`
`combine`	mori `EpCombine*`
`act_quant`	`dynamic_per_group_scaled_quant`, `_per_token_group_quant_8bit`
`fused_moe_gemm`	aiter CK `kernel_moe_mxgemm`, `ck_moe_stage1/2`, `asm_moe`, `fmoe_*`
`activation`	`silu_and_mul`, `gelu_and_mul`, `act_and_mul`
`moe_sum_reduce`	`moe_sum_reduce`, `moe_sum`
`memcpy_misc`	`hipMemcpy`, `hipMemset`, elementwise copies/fills
`torch_dist_sync`	`ncclkernel*`, `c10d::`
`_unmatched`	should stay ~0; if not, extend `KERNEL_GROUPS` in `bench_moe.py`
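
The substring bucketing described above can be sketched as follows. This is a minimal illustration, not the code in bench_moe.py: `EXAMPLE_GROUPS` and `bucket_for` are hypothetical names, the patterns are a subset of the table, and the "first matching bucket wins" rule is an assumption.

```python
from collections import OrderedDict

# Hypothetical subset of the bucket table; the real KERNEL_GROUPS in
# bench_moe.py may use a different structure, order, or pattern set.
EXAMPLE_GROUPS = OrderedDict([
    ("topk_softmax", ("moe_fused_gate", "topk_softmax", "biased_grouped_topk")),
    ("moe_align_sort", ("MoeSortingMultiPhaseKernel", "moe_align_block_size")),
    ("dispatch", ("EpDispatch",)),
    ("combine", ("EpCombine",)),
    ("activation", ("silu_and_mul", "gelu_and_mul", "act_and_mul")),
    ("moe_sum_reduce", ("moe_sum",)),
])


def bucket_for(kernel_name: str) -> str:
    """Return the first bucket that has a pattern contained in kernel_name."""
    for bucket, patterns in EXAMPLE_GROUPS.items():
        if any(pattern in kernel_name for pattern in patterns):
            return bucket
    # Anything that matches no pattern is reported separately so it is visible.
    return "_unmatched"


# Made-up kernel names, used only to exercise the matcher.
for name in ("void EpDispatchDemoKernel<int>", "silu_and_mul_kernel", "some_new_kernel"):
    print(name, "->", bucket_for(name))
```

With a first-match rule the order of the buckets matters whenever one pattern is a substring of another, which is one reason to keep `_unmatched` close to zero rather than adding very short patterns.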

per_iter numbers are the GPU time of a single MoE forward: the sum of kernel durations in that bucket divided by kineto_num_tests. all_kernels is the sum across every kernel in the profile window. median_ms is the host-side per-forward median latency, measured outside the profiler.
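
As a rough sketch of the per-bucket normalization, assuming the profiled kernels have already been reduced to `(bucket, duration_us)` pairs (the function name, the input shape, and the numbers are illustrative, not taken from bench_moe.py):

```python
from collections import defaultdict


def per_iter_us(events, kineto_num_tests):
    """Sum kernel durations per bucket, then divide by the number of profiled forwards.

    events: iterable of (bucket, duration_us) pairs covering the whole profile window.
    """
    totals = defaultdict(float)
    for bucket, duration_us in events:
        totals[bucket] += duration_us
    return {bucket: total / kineto_num_tests for bucket, total in totals.items()}


# Made-up durations for two profiled forwards.
events = [
    ("dispatch", 30.0),
    ("fused_moe_gemm", 120.0),
    ("dispatch", 34.0),
    ("fused_moe_gemm", 116.0),
]
result = per_iter_us(events, kineto_num_tests=2)
print(result)
```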

mori's EpDispatch*Transfer kernels are long-running on-GPU polling kernels that can overlap with kernels on other streams. The sum of the per-iter columns may therefore exceed median_ms, but each bucket on its own is the true total GPU time for that stage.
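
The effect of overlap can be illustrated with a toy timeline. The sketch below compares the plain sum of kernel durations with the length of the union of their intervals; the intervals are made up and the helper names are hypothetical.

```python
def busy_sum(intervals):
    """Sum of individual kernel durations (what the per-bucket columns add up to)."""
    return sum(end - start for start, end in intervals)


def wall_union(intervals):
    """Length of the union of [start, end) intervals (time with any kernel running)."""
    covered = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Made-up timeline in microseconds.
polling = [(0.0, 100.0)]                 # long-running polling kernel on one stream
compute = [(10.0, 60.0), (60.0, 90.0)]   # kernels on another stream

print("sum of durations:", busy_sum(polling + compute))
print("union of intervals:", wall_union(polling + compute))
```

Here the durations add up to more than the covered wall time because the compute kernels run entirely under the polling kernel.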

Quick start

From the cluster login node (SLURM)

Full sweep (every model × none,fp8,mxfp4 × default batch sizes):

sbatch run_all_models.slurm

Subset:

MODELS=DeepSeek-V3,Kimi-K2 QUANTIZATIONS=none,fp8 \
    BATCH_SIZES=64,256,1024 sbatch run_all_models.slurm

Knobs (env, all optional):

var	default
`NUM_GPUS`	8
`MODELS`	all keys under `moe_models` in `model_configs.json`
`QUANTIZATIONS`	`none,fp8,mxfp4`
`BATCH_SIZES`	`1,2,4,8,16,32,64,128,4096,8192`
`WARMUP / ITERS`	`5 / 20`
`KINETO_NUM_TESTS`	`15`
`MOE_RUNNER_BACKEND`	`auto`
`DEEPEP_MODE`	`normal` if `NUM_GPUS<=8` else `low_latency`
`MORI_ENABLE_SDMA`	`0` (anvil SDMA queue alloc hangs in current container; flip to `1` once fixed upstream)
`DOCKER_IMAGE`	`lmsysorg/sglang:v0.5.11-rocm720-mi35x`

From inside a node with docker + ROCm

./run_all_models.sh
NUM_GPUS=4 MODELS=DeepSeek-V3 BATCH_SIZES=64,256 ./run_all_models.sh
ATTACH=1 ./run_all_models.sh                  # interactive shell

Inside the sglang container directly

torchrun --standalone --nproc_per_node=8 bench_moe.py \
    --config-file model_configs.json \
    --model-name DeepSeek-V3 \
    --quantization fp8 \
    --batch-sizes 64,256,1024 \
    --output /tmp/results.json \
    --trace-dir /tmp/traces/

Output

After each run, three things land under logs/:

logs/results_<stamp>/<model>__q<quant>.json — rank-0 JSON dump, one record per batch size with the kineto kernel breakdown (groups.<stage>.per_iter_us, groups.<stage>.kernels, groups.<stage>.top_names).
logs/traces_<stamp>/<model>__q<quant>/moe_*__rank=<r>.json — chrome trace per rank; drop into chrome://tracing or perfetto.dev.
logs/sweep_<stamp>.log — full console output.
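
A minimal sketch of reading one record and ranking stages by GPU time. Only the `groups.<stage>.per_iter_us` field comes from the description above; the `batch_size` key, the value types, and every number in the synthetic record are assumptions for illustration.

```python
import json

# Synthetic record. Real files may carry more fields and a different layout.
EXAMPLE_JSON = """
{
  "batch_size": 64,
  "groups": {
    "dispatch": {"per_iter_us": 50.0},
    "fused_moe_gemm": {"per_iter_us": 120.0},
    "combine": {"per_iter_us": 30.0}
  }
}
"""


def stage_breakdown(record):
    """Return (stage, per_iter_us, share_of_total) rows, largest stage first."""
    groups = record["groups"]
    total = sum(group["per_iter_us"] for group in groups.values())
    rows = [
        (stage, group["per_iter_us"], group["per_iter_us"] / total if total else 0.0)
        for stage, group in groups.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


record = json.loads(EXAMPLE_JSON)
rows = stage_breakdown(record)
for stage, per_iter, share in rows:
    print(f"{stage:16s} {per_iter:8.1f} us  {share:6.1%}")
```

Because kernels in different buckets can overlap, the share column is a share of summed GPU time, not of wall-clock time.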

Then aggregate everything into a report:

python3 build_report.py

Produces:

report/REPORT.md   per-model markdown tables (one block per (model, quant))
report/summary.csv flat CSV (one row per (model, quant, batch_size))

Caveats

fp8 / mxfp4 dispatch is wired from the quant_config, not env. The bench builds the same QuantizationConfig sglang would, then calls process_weights_after_loading to let Fp8MoEMethod shuffle weights for AITER and switch mori to fp8 dispatch. For mxfp4 the bench additionally pushes {weight_dtype: float4_e2m1fn_x2} into the dispatcher since Mxfp4MoEMethod doesn't propagate it. Setting SGLANG_MORI_DISPATCH_DTYPE by hand has no effect on the dispatch dtype with this wiring.
mori in this image selects the LowLatency Async dispatch path whenever deepep_mode != "normal". The EpDispatchLowLatencyAsync* kernels are on-GPU polling kernels whose wall-clock time grows roughly linearly with batch size, and they use XGMI for intra-node hops only when MORI_ENABLE_SDMA=1. Without SDMA they fall back to ShmemPutMemNbi over ibverbs/RDMA for every peer, including local GPUs; at bs=4096 that is ~20 s per forward. See the next section.
The bench requires world_size > 1 to do anything meaningful. mori dispatch+combine on a single rank degenerates to a no-op.

mori dispatch mode & XGMI (`MORI_ENABLE_SDMA` / `DEEPEP_MODE`)

What mori has

mori ships three dispatch/combine kernel families (sources under /sgl-workspace/mori/src/ops/dispatch_combine/):

kernel family	sglang `EpMode`	intra-node transport	inter-node transport
`IntraNode`	`INTRA_NODE`	XGMI P2P (direct)	n/a
`InterNodeV1[/LL]`	`INTER_NODE`	XGMI P2P (direct)	RDMA
`LowLatencyAsync`	`LOW_LATENCY`	XGMI SDMA iff `MORI_ENABLE_SDMA=1`, else RDMA	RDMA

sglang's mode selector (moriep.py:221-224):

mode = EpMode.INTRA_NODE if world_size <= 8 else EpMode.INTER_NODE
async_mode = deepep_mode.enable_low_latency() or enable_sdma
if async_mode:
    mode = EpMode.LOW_LATENCY      # AsyncLL kernel

i.e. deepep_mode=auto/low_latency OR MORI_ENABLE_SDMA=1 forces the AsyncLL kernel. The AsyncLL kernel's transport decision then lives in low_latency_async.cpp:263:

if (destPe / config.gpuPerNode == myNode && config.enableSdma) {
    // SDMA transfer via XGMI DMA engine (fast)
} else {
    // ShmemPutMemNbi -> ibverbs/RDMA (slow on single node)
}
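
The two decisions quoted above can be combined into a small Python model that reproduces the rows of the kernel-family table. It is a sketch only: it uses plain strings instead of sglang's `EpMode` members, treats `auto` and `low_latency` as the modes for which `enable_low_latency()` is true (as the prose states), and the function names are hypothetical.

```python
LOW_LATENCY_MODES = ("auto", "low_latency")  # assumption taken from the prose above


def select_kernel_family(world_size: int, deepep_mode: str, enable_sdma: bool) -> str:
    """Model of the quoted mode selector, returning the kernel family name."""
    family = "IntraNode" if world_size <= 8 else "InterNodeV1"
    if deepep_mode in LOW_LATENCY_MODES or enable_sdma:
        family = "LowLatencyAsync"
    return family


def low_latency_transport(dest_pe: int, my_node: int, gpu_per_node: int,
                          enable_sdma: bool) -> str:
    """Model of the quoted AsyncLL branch for one destination rank."""
    if dest_pe // gpu_per_node == my_node and enable_sdma:
        return "XGMI SDMA"
    return "RDMA"


cases = [(8, "normal", False), (8, "low_latency", False),
         (8, "normal", True), (16, "normal", False)]
for world_size, mode, sdma in cases:
    print(world_size, mode, sdma, "->", select_kernel_family(world_size, mode, sdma))

# A local peer (rank 3 on node 0, 8 GPUs per node) with and without SDMA.
print(low_latency_transport(3, 0, 8, enable_sdma=True))
print(low_latency_transport(3, 0, 8, enable_sdma=False))
```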

Does mori LowLatency support XGMI?

Yes, but only via SDMA, and SDMA is currently broken in this container.

The LL kernel does take the SDMA branch for intra-node peers when MORI_ENABLE_SDMA=1. That branch drives the XGMI DMA engines, so the traffic stays on the same XGMI links that normal mode uses.
Without SDMA, LL falls back to ShmemPutMemNbi over ibverbs/RDMA for every peer (intra-node included), giving the multi-second polling behavior we observed at large batch sizes.
In the current lmsysorg/sglang:v0.5.11-rocm720-mi35x image, setting MORI_ENABLE_SDMA=1 makes mori's anvil SDMA-queue allocator (hsaKmtCreateQueueExt(... HSA_QUEUE_SDMA_BY_ENG_ID ...) in application/transport/sdma/anvil.cpp) hang during init. Concretely, job 14289 stalled right after RCCL bring-up and never reached MoE setup; the identical config without SDMA (job 14283) ran to completion. So on this hardware/container LL effectively cannot use XGMI.

Resulting policy in the launchers

run_all_models.sh / run_all_models.slurm pick DEEPEP_MODE automatically:

NUM_GPUS <= 8 (single-node EP=8) → DEEPEP_MODE=normal, MORI_ENABLE_SDMA=0. This selects the mori IntraNode kernel, which always uses XGMI P2P directly — the fastest intra-node transport mori has on MI355X and the recommended path here.
NUM_GPUS > 8 (multi-node) → DEEPEP_MODE=low_latency, MORI_ENABLE_SDMA=0. AsyncLL handles the inter-node hops via RDMA. Intra-node hops in LL mode without SDMA also go through RDMA in this build; once the container ships a working anvil SDMA stack, flip MORI_ENABLE_SDMA=1 to recover XGMI on local peers.

Override either environment variable on the command line if you want to test a different combination.
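
The launchers are shell scripts, but the default-plus-override policy can be sketched in Python as below. The function name is hypothetical, and the logic only restates the rules in this section; it is not copied from run_all_models.sh.

```python
import os


def launcher_defaults(num_gpus, env=None):
    """Resolve DEEPEP_MODE and MORI_ENABLE_SDMA the way this section describes.

    Values already present in env win over the computed defaults.
    """
    env = os.environ if env is None else env
    default_mode = "normal" if num_gpus <= 8 else "low_latency"
    return {
        "DEEPEP_MODE": env.get("DEEPEP_MODE", default_mode),
        "MORI_ENABLE_SDMA": env.get("MORI_ENABLE_SDMA", "0"),
    }


print(launcher_defaults(8, env={}))
print(launcher_defaults(16, env={}))
print(launcher_defaults(8, env={"DEEPEP_MODE": "low_latency"}))
```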

Tooling notes

bench_moe.py patches a small bit of sglang state at import time (set_global_server_args_for_scheduler + initialize_moe_config) so it can stand up the MoE layer outside a full serving runtime. It does NOT launch any sglang scheduler / engine.

build_report.py only consumes JSON files under logs/results_*/. You can drop in additional result JSON files (e.g. from older runs) and re-run it to get an updated report.
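
For orientation, here is a minimal sketch of how result files matching that layout could be discovered and keyed by model and quantization. It relies only on the `<model>__q<quant>.json` naming shown under Files; the function names are hypothetical and this is not the code in build_report.py.

```python
import json
import tempfile
from pathlib import Path


def parse_result_name(path):
    """Split '<model>__q<quant>.json' into (model, quant)."""
    stem = Path(path).stem
    model, sep, quant = stem.rpartition("__q")
    if not sep:
        raise ValueError(f"unexpected result file name: {path}")
    return model, quant


def collect_results(logs_dir):
    """Load every logs_dir/results_*/*.json, keyed by (run directory, model, quant).

    The payload is loaded as-is; no schema is assumed here.
    """
    found = {}
    for path in sorted(Path(logs_dir).glob("results_*/*.json")):
        found[(path.parent.name, *parse_result_name(path))] = json.loads(path.read_text())
    return found


# Demo against a throwaway directory with one empty placeholder result.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "results_demo"
    run_dir.mkdir()
    (run_dir / "DeepSeek-V3__qfp8.json").write_text(json.dumps([]))
    demo = collect_results(tmp)

print(demo)
```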
