
feat(metal): fused MoE expert dispatch with Q4K kernels for Metal #2048

Open
emanueleDiVizio wants to merge 3 commits into EricLBuehler:master from emanueleDiVizio:feat/metal-moe-dispatch

Conversation


@emanueleDiVizio emanueleDiVizio commented Apr 2, 2026

Summary

This PR introduces a fused Metal execution path for Mixture-of-Experts (MoE) layers with Q4K quantized weights on Apple Silicon.

Instead of the standard pipeline:

route → gather → matmul → activation

this implements:

route → fused gather + quantized matmul + activation (single kernel)

This avoids materializing intermediate buffers and cuts GPU dispatch overhead, significantly reducing per-token latency for MoE models on Metal.

Key changes

Fused MoE kernel (Metal)

  • New indexed_moe.metal kernel performing:
    • token routing by expert index
    • Q4K quantized matmul
    • fused gate + up projection with SiLU activation
  • Uses 2 simdgroups (64 threads) tuned for Apple GPU execution
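To make the fused computation concrete, here is a scalar Rust reference of what the kernel does per expert: select the tokens routed to that expert and compute `silu(x · W_gate) * (x · W_up)` in one pass, without materializing a gathered token buffer. This is an illustrative sketch, not the Metal kernel itself; names like `d_model`/`d_ff` shapes are assumptions, and the real kernel works on Q4K-quantized weights, dequantizing inside the matmul.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Scalar reference for the fused gather + gate/up matmul + SiLU step.
/// `tokens` is [n_tokens][d_model]; `w_gate`/`w_up` are [d_ff][d_model].
/// Tokens not routed to `expert` are skipped, which stands in for the
/// "gather by expert index" the kernel performs in-place.
fn fused_gate_up_swiglu(
    tokens: &[Vec<f32>],
    expert_ids: &[usize],
    expert: usize,
    w_gate: &[Vec<f32>],
    w_up: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    let dot = |x: &[f32], w: &[f32]| -> f32 {
        x.iter().zip(w).map(|(a, b)| a * b).sum()
    };
    tokens
        .iter()
        .zip(expert_ids)
        .filter(|&(_, &e)| e == expert) // routing: keep this expert's tokens
        .map(|(x, _)| {
            w_gate
                .iter()
                .zip(w_up)
                // fused: gate matmul, up matmul, and SiLU in one expression,
                // so no intermediate gate/up buffers are written out
                .map(|(wg, wu)| silu(dot(x, wg)) * dot(x, wu))
                .collect()
        })
        .collect()
}
```

In the actual kernel this loop body is distributed across the 2 simdgroups (64 threads) mentioned above, but the per-token math is the same.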

Runtime integration

  • Automatically enabled when QTensor-backed weights are available on Metal
  • Falls back to standard path when not available
  • Supports batched expert matmuls in decode

Infrastructure

  • Added moe_qtensor() access for direct QTensor buffer usage
  • Eliminates redundant F32 conversions in Metal MoE path
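The selection logic described above (fused path only when QTensor-backed weights exist, standard path otherwise) might look like the following sketch. The `QTensor`, `QuantMethod`, and `moe_qtensor()` names follow the PR text, but the stub types and the `use_fused_path` helper here are hypothetical stand-ins, not the crate's real API.

```rust
struct QTensor; // stand-in for a quantized weight buffer

trait QuantMethod {
    /// Direct access to the backing QTensor, when one exists.
    /// Default: no QTensor backing, so callers take the standard path.
    fn moe_qtensor(&self) -> Option<&QTensor> {
        None
    }
}

struct Q4kLinear {
    weight: QTensor,
}
struct F32Linear; // e.g. an unquantized layer

impl QuantMethod for Q4kLinear {
    fn moe_qtensor(&self) -> Option<&QTensor> {
        Some(&self.weight)
    }
}
impl QuantMethod for F32Linear {} // falls back via the default impl

/// Hypothetical dispatch choice: use the fused Metal kernel only when
/// both the gate and up projections expose QTensor-backed weights.
fn use_fused_path(gate: &dyn QuantMethod, up: &dyn QuantMethod) -> bool {
    gate.moe_qtensor().is_some() && up.moe_qtensor().is_some()
}
```

The `Option` return keeps the fallback cheap: any layer type that doesn't override `moe_qtensor()` automatically routes through the existing unfused path.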

Performance

Reduces per-token latency for MoE models (e.g. Mixtral Q4K) by:

  • removing intermediate gather buffers
  • reducing kernel dispatch count
  • improving memory locality

Dependencies

Depends on:

Add Metal kernels for fused MoE expert dispatch with Q4K quantized
weights. Includes:
- indexed_moe.metal: kernel that gathers tokens by expert assignment
  and performs quantized matmul in a single dispatch
- metal.rs: Rust bindings for the Metal MoE kernels
- moe_qtensor() trait method on QuantMethod for direct QTensor access
- metal_fused_gate_up_swiglu() public API for fused gate+up+SiLU
…etal

Wire the fused MoE kernels into the experts dispatch path:
- When gate and up projections have QTensor backing on Metal, use the
  fused gate+up+SiLU kernel for a single-dispatch MoE forward
- Batch expert matmuls for single-token decode on Metal
- Eliminate redundant F32 dtype conversions in the Metal MoE path
- Use 2 simdgroups (64 threads) for Q4K MoE dispatch
emanueleDiVizio force-pushed the feat/metal-moe-dispatch branch from 72be965 to 8ae3088 on April 2, 2026 at 17:35
