feat: add varlen attention on cpu #777
Conversation
kozistr
left a comment
Aside from one thing, everything else looks good to me!
```rust
let causal_mask = create_causal_mask_batch(seq_len_q, seq_len_k, num_heads, device)?;
attention_scores = attention_scores.add(&causal_mask)?;
```
I was just wondering: it looks like causal_mask and window_mask below are always fp32, while attention_scores could be fp16. I'm not sure if I'm right, but it might fail due to a type mismatch!
Thanks, you are right. I figured that out too.
In this case it's always fp32, but for the Apple Metal backend it could be fp16, AFAIK.
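For reference, a minimal sketch of the fix, assuming candle's `Tensor::dtype` / `Tensor::to_dtype` API (`create_causal_mask_batch` is the helper from this diff); not necessarily how the PR will end up doing it:

```rust
// Sketch: cast the fp32 mask to the scores' dtype before adding, so fp16
// attention_scores (e.g. on the Metal backend) don't hit a dtype mismatch.
let causal_mask = create_causal_mask_batch(seq_len_q, seq_len_k, num_heads, device)?
    .to_dtype(attention_scores.dtype())?;
attention_scores = attention_scores.add(&causal_mask)?;
```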
I am going to mark this PR as draft. I implemented a pretty fast attention primitive here: huggingface/candle#3250. Once that is merged (which I am eagerly waiting for), we can do a simple copy of the function here (without the tests).
What does this PR do?
This PR brings varlen attention to CPU/Metal. It's not softmax-fused / Flash attention, but at least it's unpadded, so we don't OOM.
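To illustrate the idea (a minimal sketch, not this PR's actual code): unpadded varlen attention over packed inputs, assuming candle's `Tensor` API, a hypothetical `naive_varlen_attention` helper, packed `(total_tokens, num_heads, head_dim)` inputs, and a `cu_seqlens` offsets slice. Each sequence is attended to separately, so no `(batch, max_len, …)` padded tensor is ever materialized:

```rust
use candle_core::{Result, Tensor, D};

/// Hypothetical helper: naive unpadded attention over packed sequences.
/// `q`, `k`, `v`: (total_tokens, num_heads, head_dim); `cu_seqlens`: cumulative
/// sequence boundaries, e.g. [0, 3, 10] for two sequences of lengths 3 and 7.
fn naive_varlen_attention(
    q: &Tensor,
    k: &Tensor,
    v: &Tensor,
    cu_seqlens: &[usize],
    softmax_scale: f64,
) -> Result<Tensor> {
    let mut outputs = Vec::with_capacity(cu_seqlens.len().saturating_sub(1));
    for w in cu_seqlens.windows(2) {
        let (start, len) = (w[0], w[1] - w[0]);
        // Slice one sequence out of the packed tokens: (len, heads, dim),
        // then move heads first for batched matmul: (heads, len, dim).
        let qi = q.narrow(0, start, len)?.transpose(0, 1)?.contiguous()?;
        let ki = k.narrow(0, start, len)?.transpose(0, 1)?.contiguous()?;
        let vi = v.narrow(0, start, len)?.transpose(0, 1)?.contiguous()?;
        // Scores are (heads, len, len): not softmax-fused, but bounded by
        // this sequence's own length instead of a padded max length.
        let scores = (qi.matmul(&ki.transpose(1, 2)?)? * softmax_scale)?;
        let probs = candle_nn::ops::softmax(&scores, D::Minus1)?;
        // (heads, len, dim) -> back to the packed layout (len, heads, dim).
        outputs.push(probs.matmul(&vi)?.transpose(0, 1)?);
    }
    // Concatenate along the token axis: (total_tokens, heads, dim).
    Tensor::cat(&outputs, 0)
}
```

The per-sequence scores tensor is only `len × len`, which is what keeps memory bounded; a softmax-fused kernel like the one in huggingface/candle#3250 would avoid materializing even that.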
Fixes # (issue)
Before submitting
insta snapshots?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.