Support dense Qwen3 generation in EMLXAxon #119

Open: hfiguera wants to merge 16 commits into elixir-nx:main from hfiguera:emlx-axon-qwen3-dense-end-to-end

@hfiguera

Contributor

This adds a dense Qwen3 generation path to EMLXAxon.

The existing Qwen3 path is built around MLX 4-bit checkpoints. This PR adds support for loading standard Hugging Face dense safetensors checkpoints for Qwen3 and running generation through the native EMLX path.

The main pieces are:

  • load dense Qwen3 safetensors into EMLXAxon.Qwen3.Model.State
  • support greedy dense generation through EMLXAxon.Qwen3.Generate
  • support EMLXAxon.TextGeneration.run/4, stream/5, and serving/3
  • add the native EMLX support needed by the dense Qwen3 path
  • add validation around shapes, cache sizes, batch size, and generation options
  • add numerical and behavior tests for the dense path
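For readers unfamiliar with the term, greedy generation picks the highest-logit token at every step. The following is a minimal sketch in plain C++, not the kernel code from this PR: `StepFn`, `greedy_generate`, and `eos_id` are hypothetical names, and the model forward pass is abstracted as a callback that returns next-token logits.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Index of the largest logit; ties resolve to the lowest index.
std::size_t argmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

// Hypothetical model step: maps the tokens so far to next-token logits.
using StepFn = std::function<std::vector<float>(const std::vector<int>&)>;

// Greedy generation: append the argmax token until max_new_tokens is
// reached or the (assumed) end-of-sequence id is produced.
std::vector<int> greedy_generate(const StepFn& step, std::vector<int> tokens,
                                 std::size_t max_new_tokens, int eos_id) {
    for (std::size_t n = 0; n < max_new_tokens; ++n) {
        const int next = static_cast<int>(argmax(step(tokens)));
        tokens.push_back(next);
        if (next == eos_id) break;
    }
    return tokens;
}
```

Because no sampling is involved, the output is deterministic for a given model and prompt, which is what makes numerical tests against a reference practical.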

The existing quantized path is still covered. The dense path is separate from the MLX 4-bit loader and is intended for standard Hugging Face checkpoints such as Qwen/Qwen3-0.6B.

Tests:

  • cd emlx && mix test test/emlx/fast_test.exs --include metal
  • cd emlx_axon && mix test test/emlx/qwen3_dense_loader_test.exs test/emlx/qwen3_generate_test.exs test/emlx/text_generation_test.exs --include metal

Local smoke numbers on an M4 Max with Qwen3 0.6B and 32 generated tokens:

  • MLX 4-bit path: around 118 tokens/sec p50
  • dense safetensors path: around 244 tokens/sec p50

These are local numbers, but they helped verify that the dense path is performant and not just functionally correct.
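The exact measurement harness is not shown here. As an assumption, a p50 figure like the ones above can be obtained by timing several runs of the same generation and taking the median throughput. A small sketch of that arithmetic (the function names are illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Throughput of one run: generated tokens divided by wall-clock seconds.
double tokens_per_second(std::size_t tokens, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(tokens) / seconds;
}

// Median (p50) of the samples; for an even count, the mean of the two
// middle values. Takes the vector by value so the caller's order is kept.
double p50(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    if (samples.size() % 2 == 1) return samples[mid];
    return (samples[mid - 1] + samples[mid]) / 2.0;
}
```

For example, a run that produces 32 tokens in 0.125 s counts as 256 tokens/sec. The median is less sensitive than the mean to a slow first run caused by warm-up.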

@polvalente left a comment

Collaborator

The PR overall looks good. One general architectural change I'd like to propose: it looks to me like all of the native code can be composed in EMLXAxon, in the Elixir layer.

Can we start with that and then optimize with custom C++ if benchmarks support a huge difference?

Additionally, I'd like to ensure we don't have performance regressions. Can we run this benchmark both in main and on this branch?

Comment thread emlx/c_src/emlx_fast.cpp Outdated
Comment thread emlx/c_src/emlx_fast.cpp Outdated
Comment thread emlx/c_src/emlx_fast.cpp Outdated
Comment thread emlx/c_src/emlx_fast.cpp Outdated
Comment thread emlx/c_src/emlx_fast.cpp Outdated
Comment thread emlx_axon/lib/emlx_axon/qwen3/generate.ex
Comment thread emlx_axon/lib/emlx_axon/qwen3/generate.ex
Comment thread emlx_axon/lib/emlx_axon/qwen3/generate.ex Outdated
Comment thread emlx_axon/lib/emlx_axon/qwen3/sampler.ex Outdated
Comment thread emlx_axon/lib/emlx_axon/text_generation.ex Outdated
@hfiguera

Contributor Author

@polvalente Thanks, that makes sense.

I ran emlx_axon/bench/validate_qwen3.exs as requested, using the same local mlx-community/Qwen3-0.6B-4bit checkpoint on main and on this branch. These runs were on an Apple M4 Max. I did not see a regression in the existing paths. The base and rewrite paths stayed in the same range, and the native path improved:

  • upstream/main: bb base 10.8 tok/s, bb+rewrite 44.7 tok/s, native 75.4 tok/s
  • this branch: bb base 10.7 tok/s, bb+rewrite 50.0 tok/s, native 112.1 tok/s
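To make the regression check explicit, the relative change per path can be computed from the figures above. This is only a sketch of the arithmetic; `percent_change` is an illustrative helper, not part of the benchmark script.

```cpp
#include <cassert>

// Relative change of `branch` versus `base`, in percent.
// Positive means the branch is faster, negative means it is slower.
double percent_change(double base, double branch) {
    assert(base > 0.0);
    return (branch - base) / base * 100.0;
}
```

Applied to the numbers above: bb base goes from 10.8 to 10.7 tok/s (about -0.9%, within run-to-run noise), bb+rewrite from 44.7 to 50.0 tok/s (about +11.9%), and native from 75.4 to 112.1 tok/s (about +48.7%).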

I also looked back through the local history behind this branch, and ran this as a dense Qwen3 matrix instead of relying only on the existing MLX 4bit benchmark.

The dense benchmark uses the standard Hugging Face Qwen/Qwen3-0.6B checkpoint from local safetensors, f16, on the EMLX GPU backend, on the same Apple M4 Max. The first dense path was mostly composed in emlx_axon using Elixir plus existing EMLX pieces, without exported Qwen3 native kernels. That branch records no qwen3_* exports from EMLX and reaches about 69 tok/s on this benchmark.

Adding the native pieces cumulatively gives:

  • native KV attention: about 100 tok/s
  • native MLP: about 112 tok/s
  • native attention block: about 201 tok/s
  • native full layer: about 219 to 224 tok/s
  • native all layer greedy path: about 241 tok/s
  • native chunked decode: about 256 tok/s

The current PR branch, using the direct EMLXAxon.TextGeneration path, measured about 257 tok/s p50 with the same dense model, precision, and GPU settings.
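The cumulative matrix can also be read as speedup factors. The sketch below computes, for each stage, the gain over the composed baseline (about 69 tok/s) and over the previous stage. The throughput values are the ones listed above, with the full-layer range taken at its lower bound; the type and function names are illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Stage { std::string name; double tok_per_s; };
struct Gain  { std::string name; double vs_baseline; double vs_previous; };

// For each stage, the speedup relative to the baseline and to the
// stage immediately before it (the baseline precedes the first stage).
std::vector<Gain> cumulative_gains(double baseline,
                                   const std::vector<Stage>& stages) {
    assert(baseline > 0.0);
    std::vector<Gain> gains;
    double previous = baseline;
    for (const Stage& s : stages) {
        gains.push_back({s.name, s.tok_per_s / baseline, s.tok_per_s / previous});
        previous = s.tok_per_s;
    }
    return gains;
}

// Approximate figures from the dense benchmark in this comment.
const std::vector<Stage> kDenseStages = {
    {"native KV attention", 100.0},
    {"native MLP", 112.0},
    {"native attention block", 201.0},
    {"native full layer", 219.0},  // lower bound of the 219 to 224 range
    {"native all layer greedy path", 241.0},
    {"native chunked decode", 256.0},
};
```

With a baseline of 69 tok/s, the largest single step is the native attention block (about 1.8x over the previous stage), and the chunked decode stage ends up at about 3.7x the composed path.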

So I agree the Qwen3 user-facing path belongs in emlx_axon. The reason I kept the native support in this PR is that the composed dense path works but is substantially slower (about 69 tok/s versus about 257 tok/s), and the cumulative matrix shows where the main gains come from.

Some of the native steps are enabling work rather than standalone features, so I think the clearest way to review them is against the end-to-end dense generation path. I’m happy to split this into smaller follow-up PRs, as long as each split stays tied to an emlx_axon dense generation benchmark so the performance reason stays visible.

hfiguera and others added 5 commits June 22, 2026 14:28
Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>
@polvalente

Collaborator

Wow! That's a really substantial gain, so it's basically a no-brainer: yes, let's keep the native accelerators. Would you be up for trying to make this more of a plug-in that emlx_axon compiles?
We could have some core C-level abstractions shared in a .h file, so that emlx_axon can rely on this contract from emlx and provide the same kinds of native constructs. This would also require documenting the internal tensor representation, which we can do as a follow-up.

As a possible iteration plan, we can keep the native code in EMLX in this PR, but in a very isolated area, so that it's easier to extract into an emlx_axon plug-in over the next couple of PRs. What do you think?

@polvalente

Collaborator

@hfiguera I also messaged you on the Elixir Slack if you wanna discuss things there!

@hfiguera

Contributor Author

@polvalente Yes, that direction makes sense to me.

For this PR, I think the safest step is to keep the native code in emlx for now, but make the boundary much clearer: isolate the Qwen3 native code from the generic EMLX fast path, keep the user-facing API in emlx_axon, and avoid spreading model-specific code through unrelated native files.

I would prefer not to move this into an emlx_axon native plug-in in this same PR, because that also needs a documented C contract for EMLX tensor refs and probably some packaging and build design. But I agree that this PR can be structured so that extraction is straightforward in follow-up PRs.

A concrete plan could be:

  1. Keep the current native accelerators in this PR.
  2. Move the Qwen3 native pieces into an isolated C++/header area inside emlx.
  3. Keep the Elixir-facing dense loading and generation path in emlx_axon.
  4. In a follow-up, document the EMLX tensor/native contract and explore compiling the model-specific native extension from emlx_axon.

That would preserve the performance work while making the ownership boundary clearer.

@polvalente

Collaborator

> @polvalente Yes, that direction makes sense to me.
>
> For this PR, I think the safest step is to keep the native code in emlx for now, but make the boundary much clearer: isolate the Qwen3 native code from the generic EMLX fast path, keep the user-facing API in emlx_axon, and avoid spreading model-specific code through unrelated native files.
>
> I would prefer not to move this into an emlx_axon native plug-in in this same PR, because that also needs a documented C contract for EMLX tensor refs and probably some packaging and build design. But I agree that this PR can be structured so that extraction is straightforward in follow-up PRs.

I agree with this general plan!

> A concrete plan could be:
>
>   1. Keep the current native accelerators in this PR.
>   2. Move the Qwen3 native pieces into an isolated C++/header area inside emlx.
>   3. Keep the Elixir-facing dense loading and generation path in emlx_axon.

I think these 3 steps are the cutoff for this PR.

>   4. In a follow-up, document the EMLX tensor/native contract and explore compiling the model-specific native extension from emlx_axon.
>
> That would preserve the performance work while making the ownership boundary clearer.

Perfect. I think keeping this native code, possibly behind a separate NIF boundary for Qwen3, inside EMLX can be left as is. When we have other model-specific accelerators, we can think about extracting common features.
