
feat(metal): fused MoE expert dispatch with Q4K kernels for Metal #2048

Open
emanueleDiVizio wants to merge 3 commits into EricLBuehler:master from emanueleDiVizio:feat/metal-moe-dispatch

Conversation


@emanueleDiVizio emanueleDiVizio commented Apr 2, 2026

Summary

This PR introduces a fused Metal execution path for Mixture-of-Experts (MoE) layers with Q4K quantized weights on Apple Silicon.

Instead of the standard pipeline:

route → gather → matmul → activation

this implements:

route → fused gather + quantized matmul + activation (single kernel)

This avoids materializing intermediate buffers and cuts GPU dispatch overhead, significantly reducing per-token latency for MoE models on Metal.

Key changes

Fused MoE kernel (Metal)

  • New indexed_moe.metal kernel performing:
    • token routing by expert index
    • Q4K quantized matmul
    • fused gate + up projection with SiLU activation
  • Uses 2 simdgroups (64 threads) tuned for Apple GPU execution
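To make the fused computation concrete, here is a scalar Rust reference of what the kernel does per expert: select the tokens routed to that expert and compute `silu(x · W_gate) * (x · W_up)` in one pass, without materializing a gathered token buffer. This is an illustrative sketch, not the Metal kernel itself; names like `d_model`/`d_ff` shapes are assumptions, and the real kernel works on Q4K-quantized weights, dequantizing inside the matmul.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Scalar reference for the fused gather + gate/up matmul + SiLU step.
/// `tokens` is [n_tokens][d_model]; `w_gate`/`w_up` are [d_ff][d_model].
/// Tokens not routed to `expert` are skipped, which stands in for the
/// "gather by expert index" the kernel performs in-place.
fn fused_gate_up_swiglu(
    tokens: &[Vec<f32>],
    expert_ids: &[usize],
    expert: usize,
    w_gate: &[Vec<f32>],
    w_up: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    let dot = |x: &[f32], w: &[f32]| -> f32 {
        x.iter().zip(w).map(|(a, b)| a * b).sum()
    };
    tokens
        .iter()
        .zip(expert_ids)
        .filter(|&(_, &e)| e == expert) // routing: keep this expert's tokens
        .map(|(x, _)| {
            w_gate
                .iter()
                .zip(w_up)
                // fused: gate matmul, up matmul, and SiLU in one expression,
                // so no intermediate gate/up buffers are written out
                .map(|(wg, wu)| silu(dot(x, wg)) * dot(x, wu))
                .collect()
        })
        .collect()
}
```

In the actual kernel this loop body is distributed across the 2 simdgroups (64 threads) mentioned above, but the per-token math is the same.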

Runtime integration

  • Automatically enabled when QTensor-backed weights are available on Metal
  • Falls back to standard path when not available
  • Supports batched expert matmuls in decode

Infrastructure

  • Added moe_qtensor() access for direct QTensor buffer usage
  • Eliminates redundant F32 conversions in Metal MoE path
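The selection logic described above (fused path only when QTensor-backed weights exist, standard path otherwise) might look like the following sketch. The `QTensor`, `QuantMethod`, and `moe_qtensor()` names follow the PR text, but the stub types and the `use_fused_path` helper here are hypothetical stand-ins, not the crate's real API.

```rust
struct QTensor; // stand-in for a quantized weight buffer

trait QuantMethod {
    /// Direct access to the backing QTensor, when one exists.
    /// Default: no QTensor backing, so callers take the standard path.
    fn moe_qtensor(&self) -> Option<&QTensor> {
        None
    }
}

struct Q4kLinear {
    weight: QTensor,
}
struct F32Linear; // e.g. an unquantized layer

impl QuantMethod for Q4kLinear {
    fn moe_qtensor(&self) -> Option<&QTensor> {
        Some(&self.weight)
    }
}
impl QuantMethod for F32Linear {} // falls back via the default impl

/// Hypothetical dispatch choice: use the fused Metal kernel only when
/// both the gate and up projections expose QTensor-backed weights.
fn use_fused_path(gate: &dyn QuantMethod, up: &dyn QuantMethod) -> bool {
    gate.moe_qtensor().is_some() && up.moe_qtensor().is_some()
}
```

The `Option` return keeps the fallback cheap: any layer type that doesn't override `moe_qtensor()` automatically routes through the existing unfused path.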

Performance

Reduces per-token latency for MoE models (e.g. Mixtral Q4K) by:

  • removing intermediate gather buffers
  • reducing kernel dispatch count
  • improving memory locality

Dependencies

Depends on:

Add Metal kernels for fused MoE expert dispatch with Q4K quantized
weights. Includes:
- indexed_moe.metal: kernel that gathers tokens by expert assignment
  and performs quantized matmul in a single dispatch
- metal.rs: Rust bindings for the Metal MoE kernels
- moe_qtensor() trait method on QuantMethod for direct QTensor access
- metal_fused_gate_up_swiglu() public API for fused gate+up+SiLU
…etal

Wire the fused MoE kernels into the experts dispatch path:
- When gate and up projections have QTensor backing on Metal, use the
  fused gate+up+SiLU kernel for a single-dispatch MoE forward
- Batch expert matmuls for single-token decode on Metal
- Eliminate redundant F32 dtype conversions in the Metal MoE path
- Use 2 simdgroups (64 threads) for Q4K MoE dispatch
emanueleDiVizio force-pushed the feat/metal-moe-dispatch branch from 72be965 to 8ae3088 on April 2, 2026 at 17:35
