[FEAT] Voxtral Support #3036

Merged
greenrazer merged 22 commits into huggingface:main from jorge-menjivar:jorge-voxtral-3 on Aug 4, 2025

@jorge-menjivar

Contributor
  • Adds support for mistralai/Voxtral-Mini-3B-2507.
  • Should also support mistralai/Voxtral-Small-24B-2507, but this is not tested.

These models currently support only tekken tokenizer files, so I created an optional tekken tokenizer crate (tekken-rs) and added it to the examples workspace. It is enabled with the tekken feature.

fixes #3028

Run example with

cargo run --example voxtral --features tekken,symphonia,cuda,cudnn --release
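
For readers unfamiliar with optional Cargo features, here is a minimal sketch of how code can be gated behind a feature such as tekken. It is an illustration only, not the code in this PR: the function names `tokenizer_backend` and `load_tokenizer` are hypothetical, and no tekken-rs API is called because its interface is not shown in this thread.

```rust
/// Reports which tokenizer backend was compiled in.
/// `cfg!` evaluates at compile time to `true` or `false`.
/// (Hypothetical helper, not part of candle or tekken-rs.)
pub fn tokenizer_backend() -> &'static str {
    if cfg!(feature = "tekken") {
        "tekken"
    } else {
        "none"
    }
}

/// Compiled only when the crate is built with `--features tekken`.
/// A real implementation would call into the tokenizer crate here;
/// this placeholder just echoes the path it was given.
#[cfg(feature = "tekken")]
pub fn load_tokenizer(path: &str) -> Result<String, String> {
    Ok(format!("would load tekken tokenizer from {path}"))
}

/// Fallback compiled when the feature is disabled, so callers get a
/// clear error message instead of a missing-symbol compile failure.
#[cfg(not(feature = "tekken"))]
pub fn load_tokenizer(_path: &str) -> Result<String, String> {
    Err("tokenizer support is disabled; rebuild with `--features tekken`".to_string())
}
```

Built without the feature, `load_tokenizer` returns the error variant. Built with `--features tekken`, the gated version is compiled instead, which keeps the extra dependency out of default builds.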

@jorge-menjivar

Contributor Author

@maximizemaxwell Thank you for your earlier work! Here is the working code if you want to try it.

@maximizemaxwell

Contributor

@greenrazer Could you review this feature implementation?

@greenrazer greenrazer self-assigned this Jul 29, 2025
@benedikt-schaber


Hey, thank you for the implementation. I had a rough one myself but was missing the tekken part among some other flaws. I went through the code and I think I might have some pointers for improvement. I am new to audio transformers, however, so please have some leniency with any mistakes :)

For the example:

  • unused code: the example loads the mel filters when creating the model but never uses them. The reimplementation of pcm_to_filer in audio_processing.rs is also unused
  • audio_utils.rs only contains a resample function, and pcm_decode is its own module, so I think this file could be named more specifically
  • separating transcribe_audio_with_tokens from transcribe_audio_internal seems unnecessary
  • post_process_transcription seems very specific, especially the word combinators. Is it necessary, or should it be generalized? I only ran a small number of tests (around 100 audio files of 30 seconds each) without the post-processing, but I did not encounter the issues it targets.

For the main code:

  • voxtral_llama.rs might be mergeable into llama.rs: it only adds the explicit head_dim option, Voxtral defaults, and "fixes" for dtype consistency and single-sequence support
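
To make the head_dim point concrete, here is a minimal sketch of what an explicit head_dim option usually means in a Llama-style config. It assumes the common convention that the per-head dimension defaults to hidden_size / num_attention_heads when it is not set. The struct `AttnConfig`, its methods, and the numbers in the usage note are hypothetical and are not taken from voxtral_llama.rs or llama.rs.

```rust
/// Hypothetical subset of a Llama-style attention config.
#[derive(Debug, Clone)]
pub struct AttnConfig {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    /// Explicit per-head dimension. `None` means "derive it from
    /// hidden_size / num_attention_heads".
    pub head_dim: Option<usize>,
}

impl AttnConfig {
    /// Effective per-head dimension: the explicit value when present,
    /// otherwise the conventional derived default.
    pub fn head_dim(&self) -> usize {
        self.head_dim
            .unwrap_or(self.hidden_size / self.num_attention_heads)
    }

    /// Output width of the query projection. With an explicit head_dim
    /// this can differ from hidden_size.
    pub fn q_proj_out_dim(&self) -> usize {
        self.num_attention_heads * self.head_dim()
    }
}
```

With illustrative values hidden_size = 3072 and 32 heads, the derived default is 96 per head, while an explicit head_dim of 128 gives a query projection width of 4096. A model whose checkpoint was trained with an explicit value needs the option to load correctly, which is why a single optional field may be enough to share one implementation.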

@greenrazer greenrazer left a comment

Contributor

Thank you for the contribution, great job!

Just fix the clippy errors and we should be good:

cargo clippy --workspace --tests --examples --fix -- -D warnings

Comment thread candle-transformers/src/models/voxtral/model.rs
Comment thread candle-examples/examples/voxtral/pcm_decode.rs Outdated
Comment thread candle-examples/examples/voxtral/audio_utils.rs Outdated
@jorge-menjivar

Contributor Author

Thank you @greenrazer and @benedikt-schaber for the suggestions.

I have made the requested changes.

I have also tested whisper's and snac's examples to make sure they run correctly after moving the pcm_decode and resample functions to the shared module.
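
For context on what a resample step does before audio reaches the model, here is a minimal linear-interpolation sketch. It is not the shared resample function referenced above, whose implementation is not shown in this thread and may use a different algorithm. The name `resample_linear` is hypothetical, and the function assumes mono f32 PCM input.

```rust
/// Linearly resample mono PCM samples from `from_hz` to `to_hz`.
/// Illustrative only: there is no low-pass filter, so downsampling
/// with this function can alias.
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    if from_hz == to_hz {
        return input.to_vec();
    }
    // Number of output samples covering the same duration as the input.
    let out_len = ((input.len() as u64 * to_hz as u64) / from_hz as u64) as usize;
    // Distance, in input samples, between consecutive output samples.
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = (pos.floor() as usize).min(last);
            // Clamp the right neighbour so the final samples hold the last value.
            let next = (idx + 1).min(last);
            let frac = (pos - idx as f64) as f32;
            input[idx] * (1.0 - frac) + input[next] * frac
        })
        .collect()
}
```

Calling `resample_linear(&samples, 44_100, 16_000)` would convert a 44.1 kHz clip to 16 kHz, a rate chosen here only as an example. A production resampler normally applies band-limiting before reducing the sample rate.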

@greenrazer greenrazer left a comment

Contributor

Looks good, Thanks!

@greenrazer greenrazer merged commit 21032cb into huggingface:main Aug 4, 2025
9 checks passed
john-sharratt pushed a commit to john-sharratt/candle that referenced this pull request May 7, 2026
* feat: implement some configs in voxtral

* fix: fixed imports, implement more func

* feat: implemented full version, need fixes

* fix: fixed some compile errors

* feat: add initial examples

* fix: fixed voxtral.rs

* fix: fixed compile errors in examples

* fix: fixed compile errors

* fix: update model integration

* First working example

* Remove unused melfilters code

* Remove unused code

* Reuse whisper's pcm_decode

* Simplify generation function

* Remove unnecessary post-process fun

* Reuse snac's resample

* Apply clippy suggestions

* Remove unused filters

* Improve example

* Update tekken-rs

* Clippy fixes

---------

Co-authored-by: Max <naturale@hufs.ac.kr>
Development

Successfully merging this pull request may close these issues.

Feature Request: Voxtral

4 participants