feat(metal): fused MoE expert dispatch with Q4K kernels for Metal #2048
Open
emanueleDiVizio wants to merge 3 commits into EricLBuehler:master from
Conversation
Add Metal kernels for fused MoE expert dispatch with Q4K quantized weights. Includes:
- `indexed_moe.metal`: kernel that gathers tokens by expert assignment and performs quantized matmul in a single dispatch
- `metal.rs`: Rust bindings for the Metal MoE kernels
- `moe_qtensor()` trait method on `QuantMethod` for direct `QTensor` access
- `metal_fused_gate_up_swiglu()` public API for fused gate+up+SiLU
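As a rough illustration of what the indexed dispatch computes (not the actual kernel or the crate's API; function name, shapes, and the use of dequantized `f32` weights are assumptions for a CPU reference), the fused gather + matmul step can be sketched as:

```rust
// Hypothetical CPU reference for the fused indexed-MoE step: for each
// (token, expert) pair produced by the router, gather that token's hidden
// state and multiply it by that expert's weight matrix in one pass.
// The real kernel operates on Q4K-quantized weights; f32 is used here
// purely for clarity.
fn indexed_moe_matmul(
    hidden: &[f32],                 // [n_tokens * d_model], row-major token states
    expert_weights: &[Vec<f32>],    // one [d_model * d_ff] matrix per expert
    assignments: &[(usize, usize)], // (token_idx, expert_idx) pairs from the router
    d_model: usize,
    d_ff: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; assignments.len() * d_ff];
    for (pair, &(tok, exp)) in assignments.iter().enumerate() {
        let x = &hidden[tok * d_model..(tok + 1) * d_model];
        let w = &expert_weights[exp];
        for j in 0..d_ff {
            let mut acc = 0.0f32;
            for i in 0..d_model {
                acc += x[i] * w[i * d_ff + j];
            }
            out[pair * d_ff + j] = acc;
        }
    }
    out
}

fn main() {
    // One expert with a 2x2 identity weight matrix; one routed token.
    let hidden = vec![1.0, 2.0];
    let experts = vec![vec![1.0, 0.0, 0.0, 1.0]];
    let out = indexed_moe_matmul(&hidden, &experts, &[(0, 0)], 2, 2);
    assert_eq!(out, vec![1.0, 2.0]);
}
```

Doing the gather inside the matmul kernel is what lets the GPU version avoid materializing a permuted copy of the token buffer per expert.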
…etal

Wire the fused MoE kernels into the experts dispatch path:
- When gate and up projections have QTensor backing on Metal, use the fused gate+up+SiLU kernel for a single-dispatch MoE forward
- Batch expert matmuls for single-token decode on Metal
- Eliminate redundant F32 dtype conversions in the Metal MoE path
- Use 2 simdgroups (64 threads) for Q4K MoE dispatch
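The math behind the fused gate+up+SiLU step is `out = silu(x·W_gate) ⊙ (x·W_up)`. A minimal CPU sketch (illustrative only; this is not the signature of `metal_fused_gate_up_swiglu`, and the weight layout is an assumption) shows why fusing helps: both projections and the activation are computed per output element, so the gate and up intermediates never need to exist as separate buffers:

```rust
// silu(x) = x * sigmoid(x)
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// Sketch of a fused gate+up+SwiGLU pass over one token's hidden state.
// w_gate and w_up are [d_model * d_ff], row-major; hypothetical layout.
fn fused_gate_up_swiglu(x: &[f32], w_gate: &[f32], w_up: &[f32], d_ff: usize) -> Vec<f32> {
    let d_model = x.len();
    (0..d_ff)
        .map(|j| {
            let mut g = 0.0f32;
            let mut u = 0.0f32;
            for i in 0..d_model {
                // Both matmuls share one read of x[i], one loop, one output write.
                g += x[i] * w_gate[i * d_ff + j];
                u += x[i] * w_up[i * d_ff + j];
            }
            silu(g) * u
        })
        .collect()
}

fn main() {
    // silu(0) = 0, so a zero gate projection zeroes the output regardless of up.
    let out = fused_gate_up_swiglu(&[1.0], &[0.0], &[2.0], 1);
    assert_eq!(out, vec![0.0]);
}
```

On the GPU the same fusion means one dispatch (and one trip through the quantized weights) instead of three dispatches with two intermediate buffers.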
Summary
This PR introduces a fused Metal execution path for Mixture-of-Experts (MoE) layers with Q4K quantized weights on Apple Silicon.
Instead of the standard pipeline, where the token gather, the gate and up matmuls, and the SiLU activation each run as separate GPU dispatches with intermediate buffers materialized between them, this implements a single fused dispatch that gathers tokens by expert assignment and performs the quantized matmul and activation in one kernel.
This eliminates intermediate buffer materialization and reduces GPU dispatch overhead, significantly improving per-token latency for MoE models on Metal.
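A back-of-envelope dispatch count makes the overhead reduction concrete. The per-stage breakdown below is an assumption for illustration (the PR does not enumerate exact dispatch counts): suppose the unfused path issues separate gather, gate-matmul, up-matmul, and SiLU-multiply dispatches per routed expert, while the fused path folds all four into one, with the down projection separate in both cases:

```rust
// Illustrative dispatch counts per token; stage breakdown is assumed.
fn dispatches_unfused(top_k: usize) -> usize {
    // gather + gate matmul + up matmul + silu*up + down matmul, per expert
    top_k * 5
}

fn dispatches_fused(top_k: usize) -> usize {
    // fused gather+gate+up+silu kernel + down matmul, per expert
    top_k * 2
}

fn main() {
    // e.g. top-2 routing: 10 dispatches shrink to 4 under these assumptions.
    assert_eq!(dispatches_unfused(2), 10);
    assert_eq!(dispatches_fused(2), 4);
}
```

Since each dispatch carries fixed command-encoding and synchronization cost, shaving dispatches matters most at batch size 1 (single-token decode), which is exactly the case this PR also batches.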
Key changes
Fused MoE kernel (Metal)
- `indexed_moe.metal` kernel performing the token gather and quantized matmul in a single dispatch

Runtime integration
Infrastructure
- `moe_qtensor()` access for direct `QTensor` buffer usage

Performance
Reduces per-token latency for MoE models (e.g. Mixtral Q4K) by:
- collapsing the gather, gate/up matmuls, and SiLU into a single dispatch
- batching expert matmuls for single-token decode
- eliminating redundant F32 dtype conversions in the Metal MoE path
Dependencies
Depends on: