Support dense Qwen3 generation in EMLXAxon by hfiguera · Pull Request #119 · elixir-nx/emlx

hfiguera · 2026-06-20T21:53:10Z

This adds a dense Qwen3 generation path to EMLXAxon.

The existing Qwen3 path is built around MLX 4 bit checkpoints. This PR adds support for loading standard Hugging Face dense safetensors for Qwen3 and running generation through the native EMLX path.

The main pieces are:

load dense Qwen3 safetensors into EMLXAxon.Qwen3.Model.State
support greedy dense generation through EMLXAxon.Qwen3.Generate
support EMLXAxon.TextGeneration.run/4, stream/5, and serving/3
add the native EMLX support needed by the dense Qwen3 path
add validation around shapes, cache sizes, batch size, and generation options
add numerical and behavior tests for the dense path
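For readers unfamiliar with the term, "greedy" generation means picking the highest-scoring token at every step. The sketch below illustrates that loop in plain C++ under stated assumptions: `greedy_next_token`, `greedy_generate`, `LogitsFn`, and the `eos_id` stop rule are hypothetical names for illustration and are not the functions this PR adds.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a model step: maps the tokens so far to one
// logit per vocabulary entry for the next position.
using LogitsFn = std::function<std::vector<float>(const std::vector<int>&)>;

// Returns the index of the largest logit. Ties resolve to the lowest index.
std::size_t greedy_next_token(const std::vector<float>& logits) {
  if (logits.empty()) {
    throw std::invalid_argument("logits must not be empty");
  }
  std::size_t best = 0;
  for (std::size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Appends up to max_new_tokens greedily chosen tokens to the prompt,
// stopping early once the (assumed) end-of-sequence id is produced.
std::vector<int> greedy_generate(const LogitsFn& next_logits,
                                 std::vector<int> tokens,
                                 std::size_t max_new_tokens,
                                 int eos_id) {
  for (std::size_t i = 0; i < max_new_tokens; ++i) {
    const int next = static_cast<int>(greedy_next_token(next_logits(tokens)));
    tokens.push_back(next);
    if (next == eos_id) break;
  }
  return tokens;
}
```

Because no sampling is involved, the same prompt and weights always yield the same output, which is what makes numerical tests against a reference practical.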

The existing quantized path is still covered. The dense path is separate from the MLX 4 bit loader and is intended for standard Hugging Face checkpoints such as Qwen/Qwen3-0.6B.

Tests:

cd emlx && mix test test/emlx/fast_test.exs --include metal
cd emlx_axon && mix test test/emlx/qwen3_dense_loader_test.exs test/emlx/qwen3_generate_test.exs test/emlx/text_generation_test.exs --include metal

Local smoke numbers on an M4 Max with Qwen3 0.6B and 32 generated tokens:

MLX 4 bit path: around 118 tokens/sec p50
dense safetensors path: around 244 tokens/sec p50

These are local numbers, but they helped verify that the dense path performs well in practice, not just that it is functionally correct.
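As a reminder of what a "tokens/sec p50" figure means, the sketch below converts a run's token count and elapsed microseconds into throughput and takes the median over several runs. It is an illustration only: `tokens_per_second` and `p50` are hypothetical helpers, and the averaging of the two middle samples for even counts is an assumption, not necessarily what the benchmark script does.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Throughput of one run, given generated tokens and elapsed microseconds.
double tokens_per_second(std::size_t tokens, double elapsed_us) {
  if (elapsed_us <= 0.0) {
    throw std::invalid_argument("elapsed_us must be positive");
  }
  return static_cast<double>(tokens) / (elapsed_us / 1e6);
}

// Median (p50) of the samples. For an even count, the two middle values
// are averaged (an assumed convention).
double p50(std::vector<double> samples) {
  if (samples.empty()) {
    throw std::invalid_argument("samples must not be empty");
  }
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  const std::size_t mid = n / 2;
  return (n % 2 == 1) ? samples[mid]
                      : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

Reporting the median rather than the mean keeps a single slow warm-up run from skewing the number.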

polvalente

The PR overall looks good. One general architectural change I'd like to propose: it looks to me like all of the native code could be composed in EMLXAxon, in the Elixir layer.

Can we start with that and then optimize with custom C++ if benchmarks show a large difference?

Additionally, I'd like to ensure we don't have performance regressions. Can we run this benchmark both in main and on this branch?

hfiguera · 2026-06-22T20:17:40Z

@polvalente Thanks, that makes sense.

I ran emlx_axon/bench/validate_qwen3.exs as requested, using the same local mlx-community/Qwen3-0.6B-4bit checkpoint on main and on this branch. These runs were on an Apple M4 Max. I did not see a regression in the existing paths. The base and rewrite paths stayed in the same range, and the native path improved:

upstream/main: bb base 10.8 tok/s, bb+rewrite 44.7 tok/s, native 75.4 tok/s
this branch: bb base 10.7 tok/s, bb+rewrite 50.0 tok/s, native 112.1 tok/s

I also looked back through the local history behind this branch, and ran this as a dense Qwen3 matrix instead of relying only on the existing MLX 4bit benchmark.

The dense benchmark uses the standard Hugging Face Qwen/Qwen3-0.6B checkpoint from local safetensors, f16, on the EMLX GPU backend, on the same Apple M4 Max. The first dense path was mostly composed in emlx_axon using Elixir plus existing EMLX pieces, without exported Qwen3 native kernels. That branch records no qwen3_* exports from EMLX and reaches about 69 tok/s on this benchmark.

Adding the native pieces cumulatively gives:

native KV attention: about 100 tok/s
native MLP: about 112 tok/s
native attention block: about 201 tok/s
native full layer: about 219 to 224 tok/s
native all layer greedy path: about 241 tok/s
native chunked decode: about 256 tok/s

The current PR branch, using the direct EMLXAxon.TextGeneration path, measured about 257 tok/s p50 with the same dense model, precision, and GPU settings.
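To make the cumulative matrix easier to read, each stage can be expressed as a ratio against the roughly 69 tok/s composed baseline. The helper below is a small illustrative sketch; `speedups_vs_baseline` is a hypothetical name, and the inputs are the approximate figures quoted above.

```cpp
#include <stdexcept>
#include <vector>

// Returns measured[i] / baseline for every stage, so 1.0 means "no change"
// and 2.0 means "twice the baseline throughput".
std::vector<double> speedups_vs_baseline(double baseline,
                                         const std::vector<double>& measured) {
  if (baseline <= 0.0) {
    throw std::invalid_argument("baseline must be positive");
  }
  std::vector<double> ratios;
  ratios.reserve(measured.size());
  for (double m : measured) {
    ratios.push_back(m / baseline);
  }
  return ratios;
}
```

Calling `speedups_vs_baseline(69.0, {100.0, 112.0, 201.0, 241.0, 256.0})` puts native KV attention at roughly 1.4x and native chunked decode at roughly 3.7x the composed path.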

So I agree the Qwen3 user-facing path belongs in emlx_axon. The reason I kept the native support in this PR is that the composed dense path is measurable but substantially slower, and the cumulative matrix shows where the main gains come from.

Some of the native steps are enabling work rather than standalone features, so I think the clearest way to review them is against the end to end dense generation path. I’m happy to split this into smaller follow-up PRs, as long as each split stays tied to an emlx_axon dense generation benchmark so the performance reason stays visible.

polvalente · 2026-06-22T21:42:46Z

Wow! That's a really substantial gain, so it's basically a no-brainer: yes, let's keep the native accelerators. Would you be up for trying to make this more of a plug-in that emlx_axon compiles?
We could have some core C-level abstractions shared in a .h file, so that emlx_axon can rely on this contract from emlx and provide the same kinds of native constructs. This would also require documenting the internal tensor representation, which we can do as a follow-up.

As a possible iteration plan, we can try to keep the native code in EMLX in the PR, but in a very isolated area, such that it's easier to extract out into an emlx_axon plug-in in the next couple PRs. What do you think?
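To make the shared-header idea concrete, here is one possible shape for such a contract. Everything in it is hypothetical: `hyp_tensor`, `hyp_kernel`, `kHypAbiVersion`, and `KernelRegistry` are invented for illustration, and the real EMLX tensor representation is not described in this thread, so it is left as an opaque type.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

extern "C" {
// Opaque tensor handle: a plug-in never sees the layout (assumption).
typedef struct hyp_tensor hyp_tensor;

// Kernel entry point: returns 0 on success, non-zero on error.
typedef int (*hyp_kernel_fn)(hyp_tensor* const* inputs, std::size_t n_inputs,
                             hyp_tensor** output);

typedef struct hyp_kernel {
  std::uint32_t abi_version;  // must match the host's contract version
  const char* name;           // lookup key, e.g. a model-specific op name
  hyp_kernel_fn run;
} hyp_kernel;

// Trivial example kernel a plug-in might export; it produces no output.
int hyp_noop_kernel(hyp_tensor* const* inputs, std::size_t n_inputs,
                    hyp_tensor** output) {
  (void)inputs;
  (void)n_inputs;
  if (output != nullptr) *output = nullptr;
  return 0;
}
}

constexpr std::uint32_t kHypAbiVersion = 1;

// Host-side registry that rejects kernels built against another contract
// version, incomplete descriptors, and duplicate names.
class KernelRegistry {
 public:
  bool add(const hyp_kernel& k) {
    if (k.abi_version != kHypAbiVersion || k.name == nullptr ||
        k.run == nullptr) {
      return false;
    }
    if (find(k.name) != nullptr) return false;
    kernels_.push_back(k);
    return true;
  }

  // Returns nullptr when no kernel with that name is registered.
  // The pointer is only valid until the next successful add().
  const hyp_kernel* find(const char* name) const {
    for (const auto& k : kernels_) {
      if (std::strcmp(k.name, name) == 0) return &k;
    }
    return nullptr;
  }

 private:
  std::vector<hyp_kernel> kernels_;
};
```

The version field is the part that matters most for a plug-in boundary: it lets the host refuse a model-specific extension compiled against an older header instead of crashing on a layout mismatch.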

polvalente · 2026-06-22T21:50:21Z

@hfiguera I also messaged you at Elixir Slack if you wanna discuss things there!

hfiguera · 2026-06-23T05:43:13Z

@polvalente Yes, that direction makes sense to me.

For this PR, I think the safest step is to keep the native code in emlx for now, but make the boundary much clearer: isolate the Qwen3 native code from the generic EMLX fast path, keep the user facing API in emlx_axon, and avoid spreading model specific code through unrelated native files.

I would prefer not to move this into an emlx_axon native plug-in in this same PR, because that also needs a documented C contract for EMLX tensor refs and probably some packaging and build design. But I agree that this PR can be structured so that extraction is straightforward in follow-up PRs.

A concrete plan could be:

1. Keep the current native accelerators in this PR.
2. Move the Qwen3 native pieces into an isolated C++/header area inside emlx.
3. Keep the Elixir-facing dense loading and generation path in emlx_axon.
4. In a follow-up, document the EMLX tensor/native contract and explore compiling the model-specific native extension from emlx_axon.

That would preserve the performance work while making the ownership boundary clearer.

polvalente · 2026-06-23T14:11:24Z

> @polvalente Yes, that direction makes sense to me.
>
> For this PR, I think the safest step is to keep the native code in emlx for now, but make the boundary much clearer: isolate the Qwen3 native code from the generic EMLX fast path, keep the user facing API in emlx_axon, and avoid spreading model specific code through unrelated native files.
>
> I would prefer not to move this into an emlx_axon native plug-in in this same PR, because that also needs a documented C contract for EMLX tensor refs and probably some packaging and build design. But I agree that this PR can be structured so that extraction is straightforward in follow-up PRs.

I agree with this general plan!

> A concrete plan could be:
>
> Keep the current native accelerators in this PR.
> Move the Qwen3 native pieces into an isolated C++/header area inside emlx.
> Keep the Elixir-facing dense loading and generation path in emlx_axon.

I think these 3 steps are the cutoff for this PR.

> In a follow-up, document the EMLX tensor/native contract and explore compiling the model-specific native extension from emlx_axon.
>
> That would preserve the performance work while making the ownership boundary clearer.

Perfect. I think keeping this native code in EMLX, possibly behind a separate NIF boundary for Qwen3, can be left as is for now. When we have other model-specific accelerators, we can think about extracting common features.

feat: add dense qwen3 generation to EMLXAxon

d730eb2

polvalente reviewed Jun 20, 2026

hfiguera and others added 5 commits June 22, 2026 14:28

Update emlx_axon/lib/emlx_axon/qwen3/generate.ex

49fb914

Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>

Update emlx_axon/lib/emlx_axon/qwen3/generate.ex

c2dec0c

Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>

Update emlx_axon/lib/emlx_axon/qwen3/generate.ex

b6843a9

Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>

Update emlx_axon/lib/emlx_axon/qwen3/generate.ex

e5bfe74

Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>

Address text generation style feedback

a4c287c

hfiguera and others added 6 commits June 22, 2026 16:05

Use microsecond timing for Qwen3 profiling

de701ea

Update emlx_axon/lib/emlx_axon/qwen3/dense_loader.ex

6935cc7

Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>

Simplify Qwen3 dense parameter lookup

597753e

Rename Qwen3 quantized attention path

ef75529

Leave dense Qwen3 state evaluation lazy

b7ddac5

Keep Qwen3 greedy sampler generic

685ec29

hfiguera added 2 commits June 22, 2026 23:43

Isolate Qwen3 native accelerators

92c279a

Inline Qwen3 native error returns

3dba672

hfiguera added 2 commits June 23, 2026 08:18

Move Qwen3 native code under emlx_fast

08f66e7

Move memory include to Qwen3 source

51c2393

hfiguera wants to merge 16 commits into elixir-nx:main from hfiguera:emlx-axon-qwen3-dense-end-to-end
