Support dense Qwen3 generation in EMLXAxon#119
polvalente
left a comment
The PR overall looks good. One general architectural change I'd like to propose: it looks to me like all of the native code can be composed in EMLXAxon, in the Elixir layer.
Can we start with that and then optimize with custom C++ if benchmarks support a huge difference?
Additionally, I'd like to ensure we don't have performance regressions. Can we run this benchmark on both main and this branch?
@polvalente Thanks, that makes sense. I ran
I also looked back through the local history behind this branch, and ran this as a dense Qwen3 matrix instead of relying only on the existing MLX 4bit benchmark. The dense benchmark uses the standard Hugging Face Adding the native pieces cumulatively gives:
The current PR branch, using the direct So I agree the Qwen3 user-facing path belongs in Some of the native steps are enabling work rather than standalone features, so I think the clearest way to review them is against the end to end dense generation path. I’m happy to split this into smaller follow-up PRs, as long as each split stays tied to an
Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>
Wow! That's a really substantial gain, so it's basically a no-brainer in terms of "yeah, let's keep the native accelerators on". Would you be up for trying to make this more of a plug-in that emlx_axon compiles? As a possible iteration plan, we can try to keep the native code in EMLX in this PR, but in a very isolated area, so that it's easier to extract into an emlx_axon plug-in in the next couple of PRs. What do you think?
@hfiguera I also messaged you on the Elixir Slack if you wanna discuss things there!
@polvalente Yes, that direction makes sense to me. For this PR, I think the safest step is to keep the native code in I would prefer not to move this into an A concrete plan could be:
That would preserve the performance work while making the ownership boundary clearer.
I agree with this general plan!
I think these 3 steps are the cutoff for this PR.
Perfect. I think keeping this native code in EMLX, possibly behind a separate NIF boundary for Qwen3, can be left as is, and when we have other model-specific accelerators we can think about extracting common features.
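The boundary discussed in this thread (compose in Elixir first, keep native accelerators isolated and optional) could look roughly like the sketch below. Every name in it (`AccelSketch`, `AccelSketch.Backend`, `AccelSketch.Portable`, `AccelSketch.Native`) is hypothetical and does not come from EMLX or EMLXAxon, and `rms_norm` on plain lists is only a stand-in op chosen to keep the example dependency-free.

```elixir
defmodule AccelSketch.Backend do
  # Hypothetical contract: one callback per op that may be accelerated.
  @callback rms_norm(x :: [float()], weight :: [float()], eps :: float()) :: [float()]
end

defmodule AccelSketch.Portable do
  # Pure Elixir reference implementation, always available.
  @behaviour AccelSketch.Backend

  @impl true
  def rms_norm(x, weight, eps) do
    mean_sq = Enum.reduce(x, 0.0, fn v, acc -> acc + v * v end) / length(x)
    scale = 1.0 / :math.sqrt(mean_sq + eps)

    x
    |> Enum.zip(weight)
    |> Enum.map(fn {v, w} -> v * scale * w end)
  end
end

defmodule AccelSketch do
  # Hypothetical name for an optional native (NIF-backed) module.
  # It is intentionally not defined in this sketch.
  @native AccelSketch.Native

  # Pick the native backend only if it is compiled in and exports the op.
  def backend do
    if Code.ensure_loaded?(@native) and function_exported?(@native, :rms_norm, 3) do
      @native
    else
      AccelSketch.Portable
    end
  end

  def rms_norm(x, weight, eps \\ 1.0e-6), do: backend().rms_norm(x, weight, eps)
end

# With no native module present, the call falls back to the portable path.
IO.inspect(AccelSketch.backend())
IO.inspect(AccelSketch.rms_norm([3.0, 4.0], [1.0, 1.0], 0.0))
```

Callers only ever go through the single dispatch module, so moving the native code into a separate plug-in later would not change them.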
This adds a dense Qwen3 generation path to EMLXAxon.
The existing Qwen3 path is built around MLX 4-bit checkpoints. This PR adds support for loading standard Hugging Face dense safetensors for Qwen3 and running generation through the native EMLX path.
The main pieces are:

- EMLXAxon.Qwen3.Model.State
- EMLXAxon.Qwen3.Generate
- EMLXAxon.TextGeneration.run/4, stream/5, and serving/3

The existing quantized path is still covered. The dense path is separate from the MLX 4-bit loader and is intended for standard Hugging Face checkpoints such as Qwen/Qwen3-0.6B.

Tests:

- cd emlx && mix test test/emlx/fast_test.exs --include metal
- cd emlx_axon && mix test test/emlx/qwen3_dense_loader_test.exs test/emlx/qwen3_generate_test.exs test/emlx/text_generation_test.exs --include metal

Local smoke numbers on an M4 Max with Qwen3 0.6B and 32 generated tokens:
These are local numbers, but they helped verify that the dense path is operational and not just functionally correct.
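Smoke numbers of this kind can be collected with a small timing harness. The sketch below is not the benchmark used in this PR: `BenchSketch` and `fake_generate` are hypothetical, and the stand-in function only sleeps and returns 32 dummy tokens so the example runs without a model.

```elixir
defmodule BenchSketch do
  # Times a generation function and reports mean tokens per second.
  # `generate_fun` is a hypothetical zero-arity function that returns
  # the list of generated tokens.
  def tokens_per_second(generate_fun, opts \\ []) do
    warmup = Keyword.get(opts, :warmup, 1)
    runs = Keyword.get(opts, :runs, 3)

    # Warm-up runs are executed but not measured.
    for _ <- 1..warmup//1, do: generate_fun.()

    rates =
      for _ <- 1..runs//1 do
        # :timer.tc/1 returns {elapsed_microseconds, result}.
        {micros, tokens} = :timer.tc(generate_fun)
        length(tokens) / max(micros, 1) * 1_000_000
      end

    Enum.sum(rates) / length(rates)
  end
end

# Stand-in for a real model call: "generates" 32 tokens after a short delay.
fake_generate = fn ->
  Process.sleep(5)
  Enum.to_list(1..32)
end

rate = BenchSketch.tokens_per_second(fake_generate, warmup: 1, runs: 3)
IO.puts("~#{Float.round(rate, 1)} tokens/s")
```

Running the same harness on main and on the branch, with the same prompt and token count, keeps the two numbers comparable.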