
fix: wire flash_attention config field through to WhisperContextParameters #298

Open
graysky2 wants to merge 1 commit into peteonrails:main from graysky2:flash_attn_fix

Conversation

graysky2 (Contributor) commented Apr 5, 2026

Description

The flash_attention field in [whisper] config was parsed from config.toml but never passed to WhisperContextParameters before model initialization, causing it to be silently ignored on all backends.
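
For reference, the option lives in the `[whisper]` table of `config.toml`; a minimal sketch (the key name comes from this PR, everything else omitted):

```toml
[whisper]
# The field this PR wires through; defaults to false when omitted.
flash_attention = true
```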

Fix: pass config.flash_attention to ctx_params.flash_attn() in WhisperTranscriber::new(), and add the missing field to WhisperConfig struct and its Default impl.
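
The wiring pattern can be sketched with minimal stand-in types. In the real code the builder is whisper-rs's `WhisperContextParameters` and the call site is `WhisperTranscriber::new()`; the stubs below are illustrative only, to show the one-line forwarding that was missing:

```rust
// Stand-in for the project's WhisperConfig (illustrative, not the real struct).
// derive(Default) gives flash_attention = false, matching the prior behavior.
#[derive(Debug, Default)]
struct WhisperConfig {
    flash_attention: bool,
}

// Stand-in for whisper-rs's WhisperContextParameters builder.
#[derive(Default)]
struct WhisperContextParameters {
    flash_attn: bool,
}

impl WhisperContextParameters {
    // Builder-style setter mirroring the flash_attn() method named in the PR.
    fn flash_attn(&mut self, enable: bool) -> &mut Self {
        self.flash_attn = enable;
        self
    }
}

fn main() {
    // As parsed from config.toml with `flash_attention = true`.
    let config = WhisperConfig { flash_attention: true };

    let mut ctx_params = WhisperContextParameters::default();
    // The one-line fix: forward the parsed config value to the context
    // parameters before model initialization.
    ctx_params.flash_attn(config.flash_attention);

    assert!(ctx_params.flash_attn);
    println!("flash_attn = {}", ctx_params.flash_attn);
}
```

Before the fix, the parsed value was simply never read at this point, so whisper.cpp fell back to its default (flash attention off) on every backend.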

Tested on an RTX 5060 Ti with large-v3-turbo on a 6-minute audio file:

  Backend   flash_attn   encode buffer   wall time (avg of 3 runs)
  -------   ----------   -------------   -------------------------
  CUDA      off          212 MB          7.00 s
  Vulkan    off          220 MB          6.54 s
  CUDA      on            55 MB          6.27 s
  Vulkan    on            55 MB          6.26 s

Wall time drops ~10% on CUDA (~4% on Vulkan). The encode buffer shrinks by ~75% on both backends, consistent with flash attention avoiding materialization of the full attention matrix.


Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Testing

  • I have tested these changes locally
  • I have run cargo test and all tests pass
  • I have run cargo clippy with no warnings
  • I have run cargo fmt

Documentation

  • I have updated documentation as needed
  • No documentation changes are needed

Additional Notes


graysky2 requested a review from peteonrails as a code owner April 5, 2026 14:58