*Major T/s improvement* Use the Metal qmatmul MM kernels by EricLBuehler · Pull Request #2615 · huggingface/candle

EricLBuehler · 2024-11-14T19:29:07Z

This PR adds the automatic usage of Metal GGML quantized mat-mat kernels instead of always using the mat-vec kernels and upstreams a few related/necessary changes.

Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but prompt performance was insufficient. After this change, prompt performance on the benchmark is about 2.5x faster than MLX and within 10% of llama.cpp, a boost of almost 6x.

With this PR, the MV kernels are used only when the D::Minus2 (second-to-last) dimension of the xs input tensor is equal to 1; in all other cases the MM kernels are used. This mirrors the logic in GGML.
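The selection rule described above can be sketched in plain Rust. This is only an illustration of the stated condition, not the code in this PR: the `QMatMulKernel` enum and the `select_kernel` function are hypothetical names, and the shape is passed as a plain slice of dims rather than a Candle tensor.

```rust
/// Kernel families to choose between (hypothetical names, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum QMatMulKernel {
    /// Matrix-vector kernels: a single input row.
    MatVec,
    /// Matrix-matrix kernels: several input rows at once.
    MatMat,
}

/// Pick a kernel family from the dims of `xs`.
/// The second-to-last dim plays the role of `D::Minus2` in the description above.
/// Returns `None` when `xs` has fewer than two dims.
fn select_kernel(xs_dims: &[usize]) -> Option<QMatMulKernel> {
    if xs_dims.len() < 2 {
        return None;
    }
    let rows = xs_dims[xs_dims.len() - 2];
    if rows == 1 {
        Some(QMatMulKernel::MatVec)
    } else {
        Some(QMatMulKernel::MatMat)
    }
}
```

Assuming a `(batch, seq_len, hidden)` layout, a decoding step such as `select_kernel(&[1, 1, 4096])` picks `MatVec`, while a prompt of 512 tokens, `select_kernel(&[1, 512, 4096])`, picks `MatMat`.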

Besides utilizing the MM kernels, this PR also upstreams some required changes:

Adds GGUF bf16 support (originally)
Updates quantized Metal kernels to support bf16 (originally)
Sync GGML <> Candle Metal kernels (originally)

EricLBuehler · 2024-11-14T21:30:40Z

@LaurentMazare if you could review, that would be great!

More benchmarks with some smaller models can be found here: EricLBuehler/mistral.rs#903 (comment)

lucasjinreal · 2025-01-29T08:58:55Z

Why is this still not closed?

ghost · 2025-04-02T04:17:43Z

Without merging this MR, is candle still slower than llama.cpp/ggml right now? Or has this improvement already been implemented in other code submissions?

EricLBuehler · 2025-04-02T18:33:15Z

@null-define without this, Candle Metal prompt performance is significantly reduced. This is because we aren't using the specialized Matrix-Matrix kernels and are instead calling the Matrix-Vector kernels repeatedly, which is slower.
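To make the distinction concrete, here is a minimal CPU sketch in plain Rust with unquantized `f32` values. It is not the Metal kernel code from this PR, and all function names are hypothetical. It only shows that running a mat-vec once per prompt row and running a single mat-mat over all rows produce the same result; it does not model performance.

```rust
/// y = W x, with W stored as `out_dim` rows of length `in_dim`.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Prompt path as repeated mat-vec: one call per prompt row.
fn matmul_by_rows(w: &[Vec<f32>], xs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    xs.iter().map(|x| matvec(w, x)).collect()
}

/// Prompt path as a single mat-mat product: ys = xs * W^T.
fn matmul(w: &[Vec<f32>], xs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut ys = vec![vec![0.0f32; w.len()]; xs.len()];
    for (o, row) in w.iter().enumerate() {
        // Each weight row is visited once and reused for every prompt row.
        for (t, x) in xs.iter().enumerate() {
            ys[t][o] = row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
        }
    }
    ys
}
```

In this sketch the mat-mat loop touches each weight row once for the whole prompt, whereas the row-by-row version walks the full weight matrix again for every token.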

lucasjinreal · 2025-04-03T05:28:12Z

Wondering why it isn't merged into main. Is Candle no longer well maintained?

greenrazer · 2025-04-22T18:31:20Z

This is an amazing improvement!

After testing across 11 GGUF LLMs, the new code is about 3.7x as fast as the current version (roughly 73% less time per token), exceeding llama.cpp speeds on my M3 Max.

Data

Candle (CPU): 2.58 avg tokens/sec
Candle (Metal): 27.07 avg tokens/sec
MLX: 62.96 avg tokens/sec
Llama.cpp: 82.78 avg tokens/sec
Candle (Metal) + PR 2615: 100.60 avg tokens/sec
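For readers who want to compare these averages, the ratios can be computed with a small helper. This is a sketch with hypothetical function names; the only inputs are the tokens/sec figures listed above.

```rust
/// Throughput ratio: how many times as fast `new_tps` is compared to `old_tps`.
fn speedup(old_tps: f64, new_tps: f64) -> f64 {
    new_tps / old_tps
}

/// Percentage reduction in time per token when going from `old_tps` to `new_tps`.
fn time_reduction_pct(old_tps: f64, new_tps: f64) -> f64 {
    (1.0 - old_tps / new_tps) * 100.0
}
```

With the numbers above, `speedup(27.07, 100.60)` is about 3.7 and `time_reduction_pct(27.07, 100.60)` is about 73, while `speedup(82.78, 100.60)` against llama.cpp is about 1.2.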

Computer Specs

M3 Max
36GB RAM
Mac OS 15.3.2 (24D81)

@LaurentMazare What would it take to get this merged?

lucasjinreal · 2025-04-23T03:15:18Z

@meg-huggingface Please consider merging it.

AlpineVibrations · 2025-04-23T14:54:27Z

Wow, this sounds amazing. We sure could use any speed boost we can get on Metal. This original PR is from almost 5 months ago. Why is there no discussion of the reason it's not merged yet?

AlpineVibrations · 2025-04-23T21:18:54Z

@LaurentMazare would this help with inference speed of Metal for Flux and SD3 image generation?

lucasjinreal · 2025-04-25T15:18:59Z

Has the Candle team abandoned this lib?

AlpineVibrations · 2025-05-25T21:21:49Z

Just checking in again on this hanging PR. Is there anyone out there who can review? Do we need to do it differently or fix something? Thanks.

AlpineVibrations · 2025-06-13T22:24:10Z

@LaurentMazare sorry to bug you but is there someone else we can ping to get this approved or at least some comment on why it's still sitting here for so many months? thanks

lucasjinreal · 2025-06-14T02:57:39Z

I think HuggingFace abandoned the Candle project.

AlpineVibrations · 2025-06-14T23:15:39Z

is that real?

lucasjinreal · 2025-06-15T01:44:26Z

I think it is now mainly community driven, and the core developers are lazy about merging new features, or even supporting them, such as many low-level ONNX ops. I haven't seen any response or support for that.

AlpineVibrations · 2025-06-17T15:16:42Z

Maybe they should add some more admins who have merge authority. It seems like there are many people ready to work.

greenrazer · 2025-07-18T21:31:49Z

LGTM

…#2615)

* Add GGUF BF16 support (huggingface#17)
* Add GGUF bf16 type support
* Add non avx impl for vec_dot_bf16
* Fix from_u32
* Fix loading
* Fix dequant of bf16
* Update kernels for metal bf16 (huggingface#19)
* Update kernels for metal bf16
* Fix typo
* Check if have bfloat
* Sync ggml metal kernels (huggingface#33)
* Metal qmatmul mat-mat product (huggingface#39)
* Test passes
* All tests pass
* Now all the tests really pass
* Try out always using mm
* Mirror llama.cpp metric
* Mirror llama.cpp metric
* Update test
* Update test
* fixed merge error

---------

Co-authored-by: keighbee <kb@huggingface.co>

EricLBuehler and others added 5 commits November 14, 2024 14:13

Add GGUF BF16 support (#17)

053e63a

* Add GGUF bf16 type support * Add non avx impl for vec_dot_bf16 * Fix from_u32 * Fix loading * Fix dequant of bf16

Update kernels for metal bf16 (#19)

9fa0b21

* Update kernels for metal bf16 * Fix typo * Check if have bfloat

Sync ggml metal kernels (#33)

23dacf7

Metal qmatmul mat-mat product (#39)

885bd31

* Test passes * All tests pass * Now all the tests really pass * Try out always using mm * Mirror llama.cpp metric * Mirror llama.cpp metric * Update test

Update test

82fe8ea

Vaibhavs10 requested a review from LaurentMazare November 18, 2024 19:53

This was referenced Nov 22, 2024

Quantized much slower than llama.cpp with same model and settings... #1939

Open

Sync with GGML: add GGML bf16 support #2640

Closed

greenrazer added 2 commits July 17, 2025 14:27

Merge branch 'main' into pr-2615-3

a3f98dd

fixed merge error

4af9162

greenrazer merged commit 1ef1341 into huggingface:main Jul 18, 2025
9 checks passed

greenrazer mentioned this pull request Jul 18, 2025

Apple silicon coreml backend #3025

Closed
