Add tuned MoE Triton configs for MetaX C500 (MXC500) #313
Add two-stage Triton configs for shapes missing on MXC500:
- H=2048,E=60,N=1408 (Qwen1.5-MoE-A2.7B, Qwen2MoeForCausalLM)
- H=2048,E=64,N=1408 (DeepSeek-V2-Lite, DeepseekV2ForCausalLM)
- H=2048,E=128,N=768 (Qwen3-30B-A3B, Qwen3MoeForCausalLM)

Configs were tuned on the actual vllm_metax fused-MoE Triton kernel (rather than upstream fused_experts), using CUDA events with synchronization. Speedups vs default tile range from 1.17x to 2.20x for M>=64; correctness verified with torch.allclose.

Signed-off-by: LindseyMei <648816901@qq.com>
Code Review
This pull request introduces three new Triton kernel configuration JSON files for Fused MoE on the MXC500 device. The feedback points out several parameter inconsistencies in the H=2048,E=64,N=1408,device_name=MXC500.json configuration. Specifically, it is recommended to reduce GROUP_SIZE_M from 16 to 1 for batch size 1, and to decrease num_warps from 8 to 4 for batch sizes 1, 8, and 64 to ensure consistency with other small batch sizes and avoid low occupancy.
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 16,

"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 16,
"num_warps": 8,

"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
Summary
Add tuned two-stage Triton MoE configs for MetaX C500 (`device_name=MXC500`) covering three popular MoE shapes that currently fall back to the generic default config and emit the `Using default MoE config. Performance might be sub-optimal!` warning.

| Shape | Model | Architecture |
| --- | --- | --- |
| H=2048,E=60,N=1408 | `Qwen/Qwen1.5-MoE-A2.7B` | `Qwen2MoeForCausalLM` |
| H=2048,E=64,N=1408 | `deepseek-ai/DeepSeek-V2-Lite` | `DeepseekV2ForCausalLM` |
| H=2048,E=128,N=768 | `Qwen/Qwen3-30B-A3B` | `Qwen3MoeForCausalLM` |

Problem
`vllm_metax` selects the fused-MoE Triton tile from JSON configs keyed by `(H, E, N, device_name[, dtype])`. When no matching tuned config exists, it falls back to `get_default_config`, which is noticeably slower for prefill / large-batch shapes.

Method
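For context on what the tuned files plug into, the lookup described under Problem can be sketched roughly as below. This is a minimal illustration, not the `vllm_metax` implementation: `config_file_name` and `load_tuned_config` are hypothetical helpers, and the `,dtype=` suffix format is an assumption (only the four-field file name appears in this PR).

```python
import json
from pathlib import Path
from typing import Optional


def config_file_name(h: int, e: int, n: int, device_name: str,
                     dtype: Optional[str] = None) -> str:
    """Build a file name from the key (H, E, N, device_name[, dtype])."""
    # Assumption: the optional dtype is appended as ",dtype=<name>".
    dtype_part = f",dtype={dtype}" if dtype else ""
    return f"H={h},E={e},N={n},device_name={device_name}{dtype_part}.json"


def load_tuned_config(config_dir: Path, h: int, e: int, n: int,
                      device_name: str,
                      dtype: Optional[str] = None) -> Optional[dict]:
    """Return the tuned config dict, or None when no matching file exists."""
    path = Path(config_dir) / config_file_name(h, e, n, device_name, dtype)
    if not path.is_file():
        # No tuned file: the caller would fall back to its default config.
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

With this naming, the DeepSeek-V2-Lite shape resolves to `H=2048,E=64,N=1408,device_name=MXC500.json`, the file named in the review above.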
The upstream `benchmarks/kernels/benchmark_moe.py` tunes the upstream `fused_experts`, but on MACA the runtime path goes through the OOT `vllm_metax.model_executor.layers.fused_moe.fused_moe.fused_experts` (verified to be a different object). Therefore I wrote a small micro-benchmark (`moe_tune.py`) that:

- Applies each candidate tile through `vllm.model_executor.layers.fused_moe.override_config`.
- Times the actual `vllm_metax` `fused_experts()` call with random weights (no model download needed).
- Uses CUDA events with `torch.cuda.synchronize()`, warmup + median over repeated iters.
- Writes configs in the two-stage format (`stage1` / `stage2` + `ACCF32` / `SPLIT_K` / `pipeline` / `scenario`).

Correctness was verified with `torch.allclose(rtol=2e-2, atol=2e-2)` between default-tile and tuned-tile outputs, and config pickup was confirmed with `get_moe_configs()`.

Results (kernel-level, MetaX C500)
Small-M shapes are kept at or near the default tile to avoid decode-latency regressions.
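The measurement and comparison steps described under Method (warmup, median over repeated iterations, tolerance check) can be sketched in plain Python as below. This is a stand-in under stated assumptions, not `moe_tune.py`: it uses a host timer instead of CUDA events, `median_ms` and `allclose` are hypothetical helper names, and the `sync` hook marks where a device synchronization such as `torch.cuda.synchronize` would be passed in. The tolerance test follows the `torch.allclose` rule `|a - b| <= atol + rtol * |b|`.

```python
import statistics
import time
from typing import Callable, Optional, Sequence


def median_ms(fn: Callable[[], object], warmup: int = 5, iters: int = 20,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Median wall-clock time of fn() in milliseconds after warmup runs."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain pending device work before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure the timed work has actually finished
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def allclose(a: Sequence[float], b: Sequence[float],
             rtol: float = 2e-2, atol: float = 2e-2) -> bool:
    """Element-wise check: |a - b| <= atol + rtol * |b| for every pair."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
```

A speedup figure would then be `median_ms(run_default) / median_ms(run_tuned)`, where `run_default` and `run_tuned` are hypothetical closures that invoke the kernel under each tile config.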
Testing environment
- `0.13.1.dev0` / vllm_metax `0.13.0` / mcoplib `0.3.1`
- `2.8.0+metax3.3.0.2`

The tile shapes (BLOCK_SIZE / warps / stages) are per-SM properties and should transfer to a full C500; the grid-level parameters (`GROUP_SIZE_M=1`, `SPLIT_K=1`) are conservative. I would welcome a maintainer running the same `moe_tune.py` on a full C500 to cross-check, but the current data already shows clear, reproducible wins on real MACA hardware.

Notes
- `ACCF32: false` and `SPLIT_K: 1` for low-risk deployment.

Signed-off-by: LindseyMei <648816901@qq.com>