Per-kernel benchmark of one SGLang MoE layer on AMD MI355X, using the
mori expert-parallel dispatcher plus aiter.fused_moe for the local
compute.
| knob | value |
|---|---|
| MoE class | get_moe_impl_class(quant_config=...) -> MoriEPMoE |
| a2a backend | mori |
| runner backend | aiter (informational; MoriEPMoE calls aiter.fused_moe directly) |
| profile | torch.profiler (kineto + roctracer), CUDA-only |
| launch | torchrun --standalone --nproc_per_node=$NUM_GPUS |
| platform | Compute-DCPT partition, MI355X, 1 node × 8 GPUs |
| container | lmsysorg/sglang:v0.5.11-rocm720-mi35x |

    .
    ├── bench_moe.py            # the bench (mori + aiter only)
    ├── model_configs.json      # MoE shape catalog (moe_models.*)
    ├── run_all_models.sh       # docker + torchrun sweep across (model × quant × batch)
    ├── run_all_models.slurm    # SLURM wrapper (Compute-DCPT, 1 node × 8 GPU)
    ├── build_report.py         # aggregates logs/results_*/*.json → report/REPORT.md + summary.csv
    └── logs/
        ├── results_<stamp>/<model>__q<quant>.json   raw per-(model,quant) result
        ├── traces_<stamp>/<model>__q<quant>/…json   chrome traces per (model,quant,bs,rank)
        └── sweep_<stamp>.log                        console log

Each forward of the MoE layer runs the full mori-EP pipeline:

    TopK routing -> moe_align_sort -> mori dispatch (a2a)
                 -> aiter fused_moe (gate-up GEMM + SwiGLU + down GEMM)
                 -> mori combine (a2a)

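
To make the data flow of that pipeline concrete, here is a toy, pure-Python model of it. It is only a sketch under loud assumptions: experts are scalar multipliers instead of GEMMs, the top-k weights are renormalized to sum to 1 (the real router may not do this), `moe_align_sort` is skipped, and the "a2a" is a dictionary keyed by the rank that owns each expert. None of the function names below come from sglang, mori, or aiter.

```python
from collections import defaultdict


def topk_route(scores, k):
    """Pick the k highest-scoring experts and renormalize their weights (assumption)."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]


def toy_moe_forward(tokens, router_scores, expert_scale, num_ranks, k):
    """Toy EP forward: route -> dispatch -> local expert compute -> combine."""
    experts_per_rank = len(expert_scale) // num_ranks

    # Dispatch: send (token index, expert, weight) to the rank owning that expert.
    inbox = defaultdict(list)
    for t, scores in enumerate(router_scores):
        for expert, weight in topk_route(scores, k):
            inbox[expert // experts_per_rank].append((t, expert, weight))

    # Local compute on each rank, then combine: weighted sum back at the token.
    out = [0.0] * len(tokens)
    for rank in range(num_ranks):
        for t, expert, weight in inbox[rank]:
            out[t] += weight * expert_scale[expert] * tokens[t]
    return out


toy_out = toy_moe_forward(
    tokens=[1.0, 2.0],
    router_scores=[[0.1, 0.3, 0.2, 0.4], [0.5, 0.1, 0.3, 0.1]],
    expert_scale=[1.0, 2.0, 3.0, 4.0],
    num_ranks=2,
    k=2,
)
```

The point of the toy is that dispatch and combine only move tokens between ranks; the result is the same as computing every selected expert locally.
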
Kernels observed in the kineto trace are bucketed by substring match
(see KERNEL_GROUPS in bench_moe.py):
| bucket | matches |
|---|---|
| `topk_softmax` | `moe_fused_gate`, `topk_softmax`, `biased_grouped_topk` |
| `moe_align_sort` | `ck_tile::MoeSortingMultiPhaseKernel_*`, `moe_align_block_size` |
| `dispatch` | mori `EpDispatch*` |
| `combine` | mori `EpCombine*` |
| `act_quant` | `dynamic_per_group_scaled_quant`, `_per_token_group_quant_8bit` |
| `fused_moe_gemm` | aiter CK `kernel_moe_mxgemm`, `ck_moe_stage1/2`, `asm_moe`, `fmoe_*` |
| `activation` | `silu_and_mul`, `gelu_and_mul`, `act_and_mul` |
| `moe_sum_reduce` | `moe_sum_reduce`, `moe_sum` |
| `memcpy_misc` | `hipMemcpy*`, `hipMemset*`, elementwise copies/fills |
| `torch_dist_sync` | `ncclkernel*`, `c10d::` |
| `_unmatched` | should stay ~0; if not, extend `KERNEL_GROUPS` in `bench_moe.py` |

`per_iter` numbers are GPU time per single MoE forward (sum of kernel
durations in that bucket / `kineto_num_tests`). `all_kernels` is the sum
across every kernel in the profile window. `median_ms` is the host-side
per-forward median measured outside the profiler.

mori's `EpDispatch*Transfer` kernels are long-running on-GPU polling ops; they can overlap with other streams. So summing per-iter columns may exceed `median_ms`, but each bucket on its own is a true total GPU time for that stage.
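
The bucketing and the `per_iter` arithmetic described above can be sketched as follows. This is not the code from `bench_moe.py`: the `KERNEL_GROUPS` dictionary here is a small illustrative subset, "first matching bucket wins" is an assumption about tie-breaking, and the kernel names in the sample events are made up.

```python
from collections import defaultdict

# Illustrative subset only; the real table lives in bench_moe.py.
KERNEL_GROUPS = {
    "dispatch": ["EpDispatch"],
    "combine": ["EpCombine"],
    "moe_align_sort": ["MoeSortingMultiPhaseKernel", "moe_align_block_size"],
    "fused_moe_gemm": ["kernel_moe_mxgemm", "ck_moe_stage", "asm_moe", "fmoe_"],
    "moe_sum_reduce": ["moe_sum"],
}


def bucket_for(kernel_name):
    """Return the first bucket whose substring occurs in the kernel name."""
    for bucket, needles in KERNEL_GROUPS.items():
        if any(needle in kernel_name for needle in needles):
            return bucket
    return "_unmatched"


def per_iter_us(events, kineto_num_tests):
    """events: iterable of (kernel_name, duration_us) over the whole profile window."""
    totals = defaultdict(float)
    for name, duration_us in events:
        totals[bucket_for(name)] += duration_us
        totals["all_kernels"] += duration_us
    # Divide by the number of profiled forwards to get time per single forward.
    return {bucket: total / kineto_num_tests for bucket, total in totals.items()}


sample_events = [  # hypothetical kernel names and durations
    ("EpDispatchIntraNodeKernel", 300.0),
    ("ck_moe_stage1", 600.0),
    ("some_unknown_kernel", 30.0),
]
sample_per_iter = per_iter_us(sample_events, kineto_num_tests=3)
```

A non-zero `_unmatched` entry in such a summary is the signal to extend the substring table.
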
Full sweep (every model × none,fp8,mxfp4 × default batch sizes):

    sbatch run_all_models.slurm

Subset:

    MODELS=DeepSeek-V3,Kimi-K2 QUANTIZATIONS=none,fp8 \
        BATCH_SIZES=64,256,1024 sbatch run_all_models.slurm

Knobs (env, all optional):

| var | default |
|---|---|
| `NUM_GPUS` | 8 |
| `MODELS` | all keys under `moe_models` in `model_configs.json` |
| `QUANTIZATIONS` | `none,fp8,mxfp4` |
| `BATCH_SIZES` | `1,2,4,8,16,32,64,128,4096,8192` |
| `WARMUP` / `ITERS` | 5 / 20 |
| `KINETO_NUM_TESTS` | 15 |
| `MOE_RUNNER_BACKEND` | `auto` |
| `DEEPEP_MODE` | `normal` if `NUM_GPUS<=8` else `low_latency` |
| `MORI_ENABLE_SDMA` | 0 (anvil SDMA queue alloc hangs in current container; flip to 1 once fixed upstream) |
| `DOCKER_IMAGE` | `lmsysorg/sglang:v0.5.11-rocm720-mi35x` |

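
As a reading aid for the table, the defaulting rules can be mirrored in a few lines of Python. The real logic lives in the shell scripts; `resolve_knobs` and `DEFAULTS` below are hypothetical names, and `MODELS` and `DOCKER_IMAGE` are left out to keep the sketch short. The only non-trivial rule is that `DEEPEP_MODE` depends on `NUM_GPUS` unless it is set explicitly.

```python
DEFAULTS = {
    "NUM_GPUS": "8",
    "QUANTIZATIONS": "none,fp8,mxfp4",
    "BATCH_SIZES": "1,2,4,8,16,32,64,128,4096,8192",
    "WARMUP": "5",
    "ITERS": "20",
    "KINETO_NUM_TESTS": "15",
    "MOE_RUNNER_BACKEND": "auto",
    "MORI_ENABLE_SDMA": "0",
}


def resolve_knobs(env):
    """Apply the documented defaults to a mapping of environment variables."""
    knobs = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    num_gpus = int(knobs["NUM_GPUS"])
    # DEEPEP_MODE defaults from the GPU count unless the caller overrides it.
    knobs["DEEPEP_MODE"] = env.get(
        "DEEPEP_MODE", "normal" if num_gpus <= 8 else "low_latency"
    )
    knobs["BATCH_SIZES"] = [int(b) for b in knobs["BATCH_SIZES"].split(",")]
    knobs["QUANTIZATIONS"] = knobs["QUANTIZATIONS"].split(",")
    return knobs


single_node = resolve_knobs({})
multi_node = resolve_knobs({"NUM_GPUS": "16", "BATCH_SIZES": "64,256"})
```

In a script one would pass `os.environ` as `env`; a plain dict is used here so the behavior is easy to check.
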
Without SLURM, call the sweep script directly:

    ./run_all_models.sh
    NUM_GPUS=4 MODELS=DeepSeek-V3 BATCH_SIZES=64,256 ./run_all_models.sh
    ATTACH=1 ./run_all_models.sh   # interactive shell

To run the bench itself for a single (model, quant) configuration:

    torchrun --standalone --nproc_per_node=8 bench_moe.py \
        --config-file model_configs.json \
        --model-name DeepSeek-V3 \
        --quantization fp8 \
        --batch-sizes 64,256,1024 \
        --output /tmp/results.json \
        --trace-dir /tmp/traces/

After each run, three things land under logs/:

- `logs/results_<stamp>/<model>__q<quant>.json`: rank-0 JSON dump, one record per batch size with the kineto kernel breakdown (`groups.<stage>.per_iter_us`, `groups.<stage>.kernels`, `groups.<stage>.top_names`).
- `logs/traces_<stamp>/<model>__q<quant>/moe_*__rank=<r>.json`: chrome trace per rank; drop into chrome://tracing or perfetto.dev.
- `logs/sweep_<stamp>.log`: full console output.

Then aggregate everything into a report:

    python3 build_report.py

Produces:

    report/REPORT.md     per-model markdown tables (one block per (model, quant))
    report/summary.csv   flat CSV (one row per (model, quant, batch_size))

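
For ad-hoc analysis outside `build_report.py`, the per-batch-size records can be flattened into CSV rows along these lines. This is a sketch, not the aggregator's actual code. Only `groups.<stage>.per_iter_us` and `median_ms` are named in this README; the top-level layout (a list of records) and the `batch_size` key are assumptions, as are the helper names and the sample values.

```python
import csv
import io


def flatten(records, model, quant):
    """Turn per-batch-size records into flat rows, one per (model, quant, batch_size)."""
    rows = []
    for rec in records:
        row = {
            "model": model,
            "quant": quant,
            "batch_size": rec["batch_size"],  # assumed key name
            "median_ms": rec["median_ms"],
        }
        for stage, info in rec["groups"].items():
            row[f"{stage}_per_iter_us"] = info["per_iter_us"]
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; stages missing from a row are left empty."""
    fields = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample_records = [  # made-up numbers, for shape only
    {"batch_size": 64, "median_ms": 0.5,
     "groups": {"dispatch": {"per_iter_us": 100.0}, "combine": {"per_iter_us": 80.0}}},
    {"batch_size": 256, "median_ms": 0.9,
     "groups": {"dispatch": {"per_iter_us": 150.0}}},
]
sample_rows = flatten(sample_records, model="DeepSeek-V3", quant="fp8")
sample_csv = to_csv(sample_rows)
```

With a real result file, `records` would come from `json.load` on one of the `logs/results_<stamp>/` files, adjusted to whatever layout that file actually has.
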
- fp8 / mxfp4 dispatch is wired from the quant_config, not env. The bench builds the same `QuantizationConfig` sglang would, then calls `process_weights_after_loading` to let `Fp8MoEMethod` shuffle weights for AITER and switch mori to fp8 dispatch. For mxfp4 the bench additionally pushes `{weight_dtype: float4_e2m1fn_x2}` into the dispatcher since `Mxfp4MoEMethod` doesn't propagate it. Setting `SGLANG_MORI_DISPATCH_DTYPE` by hand has no effect on the dispatch dtype with this wiring.
- mori in this image defaults to the LowLatency Async dispatch path whenever `deepep_mode != "normal"`. The `EpDispatchLowLatencyAsync*` kernels are on-GPU polling kernels whose wall-clock grows roughly linearly with batch size and only use XGMI for intra-node hops when `MORI_ENABLE_SDMA=1`. Without SDMA they fall back to ShmemPutMemNbi over ibverbs/RDMA for every peer, including local GPUs; at bs=4096 that is ~20 s per forward. See the next section.
- The bench requires `world_size > 1` to do anything meaningful. mori dispatch+combine on a single rank degenerates to a no-op.

mori ships three dispatch/combine kernel families (sources under
/sgl-workspace/mori/src/ops/dispatch_combine/):

| kernel family | sglang `EpMode` | intra-node transport | inter-node transport |
|---|---|---|---|
| `IntraNode` | `INTRA_NODE` | XGMI P2P (direct) | n/a |
| `InterNodeV1[/LL]` | `INTER_NODE` | XGMI P2P (direct) | RDMA |
| `LowLatencyAsync` | `LOW_LATENCY` | XGMI SDMA iff `MORI_ENABLE_SDMA=1`, else RDMA | RDMA |

sglang's mode selector (`moriep.py:221-224`):

    mode = EpMode.INTRA_NODE if world_size <= 8 else EpMode.INTER_NODE
    async_mode = deepep_mode.enable_low_latency() or enable_sdma
    if async_mode:
        mode = EpMode.LOW_LATENCY  # AsyncLL kernel

i.e. `deepep_mode=auto/low_latency` OR `MORI_ENABLE_SDMA=1` forces the
AsyncLL kernel. The AsyncLL kernel's transport decision then lives in
`low_latency_async.cpp:263`:

    if (destPe / config.gpuPerNode == myNode && config.enableSdma) {
        // SDMA transfer via XGMI DMA engine (fast)
    } else {
        // ShmemPutMemNbi -> ibverbs/RDMA (slow on single node)
    }

LL mode can use XGMI for intra-node hops, but only via SDMA, and SDMA is currently broken in this container.

- The LL kernel does take the SDMA branch on intra-node peers when `MORI_ENABLE_SDMA=1`, which uses the XGMI DMA engines (the same hardware lane normal mode uses).
- Without SDMA, LL falls back to ShmemPutMemNbi over ibverbs/RDMA for every peer (intra-node included), giving the multi-second polling behavior we observed at large batch sizes.
- In the current `lmsysorg/sglang:v0.5.11-rocm720-mi35x` image, setting `MORI_ENABLE_SDMA=1` makes mori's `anvil` SDMA-queue allocator (`hsaKmtCreateQueueExt(... HSA_QUEUE_SDMA_BY_ENG_ID ...)` in `application/transport/sdma/anvil.cpp`) hang during init. Concretely, job 14289 stalled right after RCCL bring-up and never reached MoE setup; the identical config without SDMA (job 14283) ran to completion. So on this hardware/container LL effectively cannot use XGMI.

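
Putting the selector and the transport branch together, the combinations discussed above can be tabulated with a small helper. It is a paraphrase in Python for reasoning about configurations, not code from sglang or mori: `select_mori_path` is a hypothetical name, and `deepep_mode` is modelled as a plain string where `auto` and `low_latency` both enable the low-latency path, as stated above.

```python
def select_mori_path(world_size, deepep_mode, enable_sdma):
    """Return the kernel family and transports implied by the quoted selector logic."""
    low_latency = deepep_mode in ("auto", "low_latency")
    if low_latency or enable_sdma:
        # AsyncLL kernel: intra-node peers use XGMI only through the SDMA branch.
        kernel = "LowLatencyAsync"
        intra_node = "XGMI SDMA" if enable_sdma else "RDMA"
        inter_node = "RDMA"
    elif world_size <= 8:
        kernel, intra_node, inter_node = "IntraNode", "XGMI P2P", "n/a"
    else:
        kernel, intra_node, inter_node = "InterNodeV1", "XGMI P2P", "RDMA"
    return {"kernel": kernel, "intra_node": intra_node, "inter_node": inter_node}


recommended = select_mori_path(world_size=8, deepep_mode="normal", enable_sdma=False)
slow_single_node = select_mori_path(world_size=8, deepep_mode="low_latency", enable_sdma=False)
```

The second call is the configuration that produced the multi-second forwards: a single node running the AsyncLL kernel with every hop on RDMA.
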
`run_all_models.sh` / `run_all_models.slurm` pick `DEEPEP_MODE`
automatically:

- `NUM_GPUS <= 8` (single-node EP=8) → `DEEPEP_MODE=normal`, `MORI_ENABLE_SDMA=0`. This selects the mori IntraNode kernel, which always uses XGMI P2P directly. It is the fastest intra-node transport mori has on MI355X and the recommended path here.
- `NUM_GPUS > 8` (multi-node) → `DEEPEP_MODE=low_latency`, `MORI_ENABLE_SDMA=0`. AsyncLL handles the inter-node hops via RDMA. Intra-node hops in LL mode without SDMA also go through RDMA in this build; once the container ships a working `anvil` SDMA stack, flip `MORI_ENABLE_SDMA=1` to recover XGMI on local peers.

Override either env on the command line if you want to test a different combination.
bench_moe.py patches a small bit of sglang state at import time
(set_global_server_args_for_scheduler + initialize_moe_config) so it
can stand up the MoE layer outside a full serving runtime. It does NOT
launch any sglang scheduler / engine.
build_report.py only consumes JSON files under logs/results_*/.
You can drop in additional result JSON files (e.g. from older runs) and
re-run it to get an updated report.