zhenhuang12/BenchMoE

BenchMoE — SGLang MoE (mori + aiter.fused_moe) per-kernel bench

Per-kernel benchmark of one SGLang MoE layer on AMD MI355X, using the mori expert-parallel dispatcher plus aiter.fused_moe for the local compute.

| knob | value |
| --- | --- |
| MoE class | get_moe_impl_class(quant_config=...) -> MoriEPMoE |
| a2a backend | mori |
| runner backend | aiter (informational; MoriEPMoE calls aiter.fused_moe directly) |
| profile | torch.profiler (kineto + roctracer), CUDA-only |
| launch | torchrun --standalone --nproc_per_node=$NUM_GPUS |
| platform | Compute-DCPT partition, MI355X, 1 node × 8 GPUs |
| container | lmsysorg/sglang:v0.5.11-rocm720-mi35x |

Files

.
├── bench_moe.py          # the bench (mori + aiter only)
├── model_configs.json    # MoE shape catalog (moe_models.*)
├── run_all_models.sh     # docker + torchrun sweep across (model × quant × batch)
├── run_all_models.slurm  # SLURM wrapper (Compute-DCPT, 1 node × 8 GPU)
├── build_report.py       # aggregates logs/results_*/*.json → report/REPORT.md + summary.csv
└── logs/
    ├── results_<stamp>/<model>__q<quant>.json      raw per-(model,quant) result
    ├── traces_<stamp>/<model>__q<quant>/…json      chrome traces per (model,quant,bs,rank)
    └── sweep_<stamp>.log                            console log

What gets benchmarked

Each forward of the MoE layer runs the full mori-EP pipeline:

TopK routing -> moe_align_sort -> mori dispatch (a2a)
             -> aiter fused_moe (gate-up GEMM + SwiGLU + down GEMM)
             -> mori combine (a2a)

Kernels observed in the kineto trace are bucketed by substring match (see KERNEL_GROUPS in bench_moe.py):

| bucket | matches |
| --- | --- |
| topk_softmax | moe_fused_gate, topk_softmax, biased_grouped_topk |
| moe_align_sort | ck_tile::MoeSortingMultiPhaseKernel_*, moe_align_block_size |
| dispatch | mori EpDispatch* |
| combine | mori EpCombine* |
| act_quant | dynamic_per_group_scaled_quant, _per_token_group_quant_8bit |
| fused_moe_gemm | aiter CK kernel_moe_mxgemm, ck_moe_stage1/2, asm_moe, fmoe_* |
| activation | silu_and_mul, gelu_and_mul, act_and_mul |
| moe_sum_reduce | moe_sum_reduce, moe_sum |
| memcpy_misc | hipMemcpy*, hipMemset*, elementwise copies/fills |
| torch_dist_sync | ncclkernel*, c10d:: |
| _unmatched | should stay ~0; if not, extend KERNEL_GROUPS in bench_moe.py |

per_iter numbers are GPU time per single MoE forward (sum of kernel durations in that bucket / kineto_num_tests). all_kernels is the sum across every kernel in the profile window. median_ms is the host-side per-forward median measured outside the profiler.
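The bucketing and per-iteration arithmetic described above can be sketched as follows. This is a minimal illustration, not the code in bench_moe.py: the KERNEL_GROUPS subset, the helper names, the kernel names, and the durations are all made up for the example.

```python
from collections import defaultdict

# Hypothetical subset of the bucket table; the real KERNEL_GROUPS lives in bench_moe.py.
KERNEL_GROUPS = {
    "topk_softmax": ["moe_fused_gate", "topk_softmax", "biased_grouped_topk"],
    "dispatch": ["EpDispatch"],
    "combine": ["EpCombine"],
    "activation": ["silu_and_mul", "gelu_and_mul", "act_and_mul"],
}


def bucket_for(kernel_name):
    """Return the first bucket with a substring that occurs in kernel_name."""
    for bucket, needles in KERNEL_GROUPS.items():
        if any(needle in kernel_name for needle in needles):
            return bucket
    return "_unmatched"


def per_iter_us(events, num_tests):
    """events: (kernel_name, duration_us) pairs covering the whole profile window."""
    totals = defaultdict(float)
    for name, dur_us in events:
        totals[bucket_for(name)] += dur_us
    out = {bucket: total / num_tests for bucket, total in totals.items()}
    # all_kernels sums every kernel in the window, matched or not.
    out["all_kernels"] = sum(totals.values()) / num_tests
    return out


# Made-up kernel names and durations, summed over 15 profiled forwards.
events = [
    ("mori::EpDispatchDemoKernel", 300.0),
    ("mori::EpCombineDemoKernel", 150.0),
    ("silu_and_mul_kernel", 60.0),
    ("some_new_kernel", 30.0),
]
result = per_iter_us(events, num_tests=15)
print(result)
```

In this sketch a kernel that matches no substring lands in _unmatched, which is the signal to extend the bucket list.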

mori's EpDispatch*Transfer kernels are long-running on-GPU polling kernels that can overlap with kernels on other streams. Summing the per_iter columns can therefore exceed median_ms, but each bucket on its own is still the true total GPU time for that stage.
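To see why summed kernel time can exceed wall-clock time, compare the sum of kernel durations with the length of the union of their time intervals. The intervals below are invented (start, end) pairs in microseconds, and the helper names are illustrative only.

```python
def summed_duration(intervals):
    """Sum of individual durations, as a per-bucket total would count them."""
    return sum(end - start for start, end in intervals)


def covered_duration(intervals):
    """Length of the union of the intervals, i.e. time with at least one kernel running."""
    covered = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged span.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged span.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# A polling kernel on one stream overlapping a GEMM on another, then a combine.
intervals = [(0.0, 100.0), (40.0, 90.0), (100.0, 130.0)]
summed = summed_duration(intervals)
covered = covered_duration(intervals)
print(summed, covered)
```

Here the summed durations come to 180 us while only 130 us of wall-clock time is covered, which is the same effect that lets the per-bucket columns add up to more than median_ms.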

Quick start

From the cluster login node (SLURM)

Full sweep (every model × none,fp8,mxfp4 × default batch sizes):

sbatch run_all_models.slurm

Subset:

MODELS=DeepSeek-V3,Kimi-K2 QUANTIZATIONS=none,fp8 \
    BATCH_SIZES=64,256,1024 sbatch run_all_models.slurm

Knobs (env, all optional):

| var | default |
| --- | --- |
| NUM_GPUS | 8 |
| MODELS | all keys under moe_models in model_configs.json |
| QUANTIZATIONS | none,fp8,mxfp4 |
| BATCH_SIZES | 1,2,4,8,16,32,64,128,4096,8192 |
| WARMUP / ITERS | 5 / 20 |
| KINETO_NUM_TESTS | 15 |
| MOE_RUNNER_BACKEND | auto |
| DEEPEP_MODE | normal if NUM_GPUS<=8 else low_latency |
| MORI_ENABLE_SDMA | 0 (anvil SDMA queue alloc hangs in current container; flip to 1 once fixed upstream) |
| DOCKER_IMAGE | lmsysorg/sglang:v0.5.11-rocm720-mi35x |
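As a rough illustration of how these knobs could expand into a sweep, the sketch below builds one torchrun command per (model, quant) pair, using the bench_moe.py flags and output paths shown elsewhere in this README. The real launcher is a shell script that also wraps docker, so treat build_commands and split_csv as hypothetical helpers and the one-command-per-pair layout as an assumption.

```python
import shlex

DEFAULT_QUANTS = "none,fp8,mxfp4"
DEFAULT_BATCH_SIZES = "1,2,4,8,16,32,64,128,4096,8192"


def split_csv(value):
    """Split a comma-separated knob into a list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_commands(env, model_configs, stamp):
    """Return one argv list per (model, quant); batch sizes are passed through as a list."""
    num_gpus = int(env.get("NUM_GPUS", "8"))
    if env.get("MODELS"):
        models = split_csv(env["MODELS"])
    else:
        # Default: every key under moe_models in the config catalog.
        models = list(model_configs["moe_models"])
    quants = split_csv(env.get("QUANTIZATIONS", DEFAULT_QUANTS))
    batch_sizes = env.get("BATCH_SIZES", DEFAULT_BATCH_SIZES)

    commands = []
    for model in models:
        for quant in quants:
            tag = f"{model}__q{quant}"
            commands.append([
                "torchrun", "--standalone", f"--nproc_per_node={num_gpus}",
                "bench_moe.py",
                "--config-file", "model_configs.json",
                "--model-name", model,
                "--quantization", quant,
                "--batch-sizes", batch_sizes,
                "--output", f"logs/results_{stamp}/{tag}.json",
                "--trace-dir", f"logs/traces_{stamp}/{tag}/",
            ])
    return commands


# Stand-in catalog; the real model_configs.json carries MoE shapes per model.
configs = {"moe_models": {"DeepSeek-V3": {}, "Kimi-K2": {}}}
cmds = build_commands({"QUANTIZATIONS": "none,fp8", "BATCH_SIZES": "64,256"}, configs, "demo")
for cmd in cmds:
    print(shlex.join(cmd))
```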

From inside a node with docker + ROCm

./run_all_models.sh
NUM_GPUS=4 MODELS=DeepSeek-V3 BATCH_SIZES=64,256 ./run_all_models.sh
ATTACH=1 ./run_all_models.sh                  # interactive shell

Inside the sglang container directly

torchrun --standalone --nproc_per_node=8 bench_moe.py \
    --config-file model_configs.json \
    --model-name DeepSeek-V3 \
    --quantization fp8 \
    --batch-sizes 64,256,1024 \
    --output /tmp/results.json \
    --trace-dir /tmp/traces/

Output

After each run, three things land under logs/:

  • logs/results_<stamp>/<model>__q<quant>.json — rank-0 JSON dump, one record per batch size with the kineto kernel breakdown (groups.<stage>.per_iter_us, groups.<stage>.kernels, groups.<stage>.top_names).
  • logs/traces_<stamp>/<model>__q<quant>/moe_*__rank=<r>.json — chrome trace per rank; drop into chrome://tracing or perfetto.dev.
  • logs/sweep_<stamp>.log — full console output.
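The chrome traces can also be inspected programmatically. The sketch below assumes the usual chrome trace layout written by torch.profiler (a top-level traceEvents list in which GPU kernels appear as complete "X" events with cat set to "kernel" and dur in microseconds); the trace content and kernel names here are synthetic.

```python
import json


def kernel_events(trace):
    """Yield (name, duration_us) for complete ('X') events in the kernel category."""
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            yield ev["name"], float(ev["dur"])


# Synthetic stand-in for one per-rank trace file; load a real one with json.load(open(path)).
trace = json.loads("""{"traceEvents": [
  {"ph": "X", "cat": "kernel", "name": "EpDispatchDemo", "ts": 0, "dur": 300},
  {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 50},
  {"ph": "X", "cat": "kernel", "name": "EpCombineDemo", "ts": 400, "dur": 150}
]}""")

events = list(kernel_events(trace))
total_us = sum(dur for _, dur in events)
print(events, total_us)
```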

Then aggregate everything into a report:

python3 build_report.py

Produces:

report/REPORT.md   per-model markdown tables (one block per (model, quant))
report/summary.csv flat CSV (one row per (model, quant, batch_size))
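For ad-hoc analysis outside build_report.py, a result file can be flattened into one row per (model, quant, batch_size) along the following lines. The record layout is an assumption: a list of records with batch_size, median_ms, and groups.<stage>.per_iter_us keys. Only the groups.<stage>.per_iter_us path and median_ms are named in this README, so adjust the other keys to the actual JSON. The numbers are synthetic.

```python
import csv
import io
from pathlib import Path


def parse_result_name(path):
    """Split '<model>__q<quant>.json' into (model, quant)."""
    model, quant = Path(path).stem.rsplit("__q", 1)
    return model, quant


def flatten(model, quant, records):
    """One flat dict per batch-size record (assumed keys: batch_size, median_ms, groups)."""
    rows = []
    for rec in records:
        row = {"model": model, "quant": quant,
               "batch_size": rec["batch_size"], "median_ms": rec["median_ms"]}
        for stage, info in rec["groups"].items():
            row[f"{stage}_us"] = info["per_iter_us"]
        rows.append(row)
    return rows


def to_csv(rows):
    """Render rows as CSV text; columns appear in first-seen order."""
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Synthetic records standing in for the contents of one result file.
records = [
    {"batch_size": 64, "median_ms": 0.5,
     "groups": {"dispatch": {"per_iter_us": 120.0}, "combine": {"per_iter_us": 80.0}}},
    {"batch_size": 256, "median_ms": 0.9,
     "groups": {"dispatch": {"per_iter_us": 210.0}, "combine": {"per_iter_us": 140.0}}},
]
model, quant = parse_result_name("logs/results_demo/DeepSeek-V3__qfp8.json")
rows = flatten(model, quant, records)
csv_text = to_csv(rows)
print(csv_text)
```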

Caveats

  • fp8 / mxfp4 dispatch is wired from the quant_config, not env. The bench builds the same QuantizationConfig sglang would, then calls process_weights_after_loading to let Fp8MoEMethod shuffle weights for AITER and switch mori to fp8 dispatch. For mxfp4 the bench additionally pushes {weight_dtype: float4_e2m1fn_x2} into the dispatcher since Mxfp4MoEMethod doesn't propagate it. Setting SGLANG_MORI_DISPATCH_DTYPE by hand has no effect on the dispatch dtype with this wiring.
  • mori in this image defaults to the LowLatency Async dispatch path whenever deepep_mode != "normal". The EpDispatchLowLatencyAsync* kernels are on-GPU polling kernels whose wall-clock time grows roughly linearly with batch size, and they use XGMI for intra-node hops only when MORI_ENABLE_SDMA=1. Without SDMA they fall back to ShmemPutMemNbi over ibverbs/RDMA for every peer, including local GPUs; at bs=4096 that is ~20 s per forward. See the next section.
  • The bench requires world_size > 1 to do anything meaningful. mori dispatch+combine on a single rank degenerates to a no-op.

mori dispatch mode & XGMI (MORI_ENABLE_SDMA / DEEPEP_MODE)

What mori has

mori ships three dispatch/combine kernel families (sources under /sgl-workspace/mori/src/ops/dispatch_combine/):

| kernel family | sglang EpMode | intra-node transport | inter-node transport |
| --- | --- | --- | --- |
| IntraNode | INTRA_NODE | XGMI P2P (direct) | n/a |
| InterNodeV1[/LL] | INTER_NODE | XGMI P2P (direct) | RDMA |
| LowLatencyAsync | LOW_LATENCY | XGMI SDMA iff MORI_ENABLE_SDMA=1, else RDMA | RDMA |

sglang's mode selector (moriep.py:221-224):

mode = EpMode.INTRA_NODE if world_size <= 8 else EpMode.INTER_NODE
async_mode = deepep_mode.enable_low_latency() or enable_sdma
if async_mode:
    mode = EpMode.LOW_LATENCY      # AsyncLL kernel

i.e. deepep_mode=auto/low_latency OR MORI_ENABLE_SDMA=1 forces the AsyncLL kernel. The AsyncLL kernel's transport decision then lives in low_latency_async.cpp:263:

if (destPe / config.gpuPerNode == myNode && config.enableSdma) {
    // SDMA transfer via XGMI DMA engine (fast)
} else {
    // ShmemPutMemNbi -> ibverbs/RDMA (slow on single node)
}
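The two decisions quoted above can be combined into a small pure-Python model, which is handy for reasoning about which transport a given rank pair ends up on. This is a paraphrase with plain booleans, not sglang or mori code; in particular it assumes myNode is the caller's rank divided by gpuPerNode, and the function names and return strings are invented for the sketch.

```python
from enum import Enum


class EpMode(Enum):
    INTRA_NODE = "IntraNode"
    INTER_NODE = "InterNodeV1"
    LOW_LATENCY = "LowLatencyAsync"


def select_mode(world_size, low_latency_enabled, enable_sdma):
    """Mirror of the quoted selector: low latency or SDMA forces the AsyncLL kernel."""
    mode = EpMode.INTRA_NODE if world_size <= 8 else EpMode.INTER_NODE
    if low_latency_enabled or enable_sdma:
        mode = EpMode.LOW_LATENCY
    return mode


def asyncll_transport(dest_pe, my_pe, gpus_per_node, enable_sdma):
    """Transport the AsyncLL kernel would pick for one destination rank."""
    # Assumption: the local node index is my_pe // gpus_per_node.
    same_node = dest_pe // gpus_per_node == my_pe // gpus_per_node
    if same_node and enable_sdma:
        return "xgmi_sdma"
    return "rdma"


# Single node, 8 ranks: normal mode stays on IntraNode, low latency switches to AsyncLL.
print(select_mode(8, low_latency_enabled=False, enable_sdma=False))
print(select_mode(8, low_latency_enabled=True, enable_sdma=False))
# AsyncLL without SDMA sends even a local peer over RDMA.
print(asyncll_transport(dest_pe=3, my_pe=0, gpus_per_node=8, enable_sdma=False))
```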

Does mori LowLatency support XGMI?

Yes, but only via SDMA, and SDMA is currently broken in this container.

  • The LL kernel does take the SDMA branch on intra-node peers when MORI_ENABLE_SDMA=1, which drives the SDMA engines over XGMI (the same interconnect normal mode uses for direct P2P).
  • Without SDMA, LL falls back to ShmemPutMemNbi over ibverbs/RDMA for every peer (intra-node included), giving the multi-second polling behavior we observed at large batch sizes.
  • In the current lmsysorg/sglang:v0.5.11-rocm720-mi35x image, setting MORI_ENABLE_SDMA=1 makes mori's anvil SDMA-queue allocator (hsaKmtCreateQueueExt(... HSA_QUEUE_SDMA_BY_ENG_ID ...) in application/transport/sdma/anvil.cpp) hang during init. Concretely, job 14289 stalled right after RCCL bring-up and never reached MoE setup; the identical config without SDMA (job 14283) ran to completion. So on this hardware/container LL effectively cannot use XGMI.

Resulting policy in the launchers

run_all_models.sh / run_all_models.slurm pick DEEPEP_MODE automatically:

  • NUM_GPUS <= 8 (single node, EP up to 8): DEEPEP_MODE=normal, MORI_ENABLE_SDMA=0. This selects the mori IntraNode kernel, which always uses XGMI P2P directly; it is the fastest intra-node transport mori has on MI355X and the recommended path here.
  • NUM_GPUS > 8 (multi-node): DEEPEP_MODE=low_latency, MORI_ENABLE_SDMA=0. AsyncLL handles the inter-node hops via RDMA. Intra-node hops in LL mode without SDMA also go through RDMA in this build; once the container ships a working anvil SDMA stack, set MORI_ENABLE_SDMA=1 to recover XGMI on local peers.

Override either env on the command line if you want to test a different combination.
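The default-plus-override behavior can be summarized in a few lines. The launchers themselves are shell scripts; resolve_launch_env below is a hypothetical Python restatement of the policy, not code from the repository.

```python
def resolve_launch_env(env):
    """Apply the launcher defaults, letting explicitly set variables win."""
    num_gpus = int(env.get("NUM_GPUS", "8"))
    default_mode = "normal" if num_gpus <= 8 else "low_latency"
    return {
        "NUM_GPUS": str(num_gpus),
        "DEEPEP_MODE": env.get("DEEPEP_MODE", default_mode),
        "MORI_ENABLE_SDMA": env.get("MORI_ENABLE_SDMA", "0"),
    }


single_node = resolve_launch_env({})
multi_node = resolve_launch_env({"NUM_GPUS": "16"})
forced = resolve_launch_env({"NUM_GPUS": "8", "DEEPEP_MODE": "low_latency"})
print(single_node, multi_node, forced, sep="\n")
```

The third call shows an explicit override: eight GPUs would normally select normal mode, but a DEEPEP_MODE set by the caller takes precedence.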

Tooling notes

bench_moe.py patches a small bit of sglang state at import time (set_global_server_args_for_scheduler + initialize_moe_config) so it can stand up the MoE layer outside a full serving runtime. It does NOT launch any sglang scheduler / engine.

build_report.py only consumes JSON files under logs/results_*/. You can drop in additional result jsons (e.g. from older runs) and re-run to get an updated report.
