
fix: wire flash_attention config field through to WhisperContextParameters #298

Open
graysky2 wants to merge 1 commit into peteonrails:main from graysky2:flash_attn_fix

Conversation

graysky2 (Contributor) commented Apr 5, 2026

Description

The flash_attention field in [whisper] config was parsed from config.toml but never passed to WhisperContextParameters before model initialization, causing it to be silently ignored on all backends.
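
For reference, the option lives in the `[whisper]` table of `config.toml`; a minimal sketch (the key name comes from this PR, everything else omitted):

```toml
[whisper]
# The field this PR wires through; defaults to false when omitted.
flash_attention = true
```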

Fix: pass config.flash_attention to ctx_params.flash_attn() in WhisperTranscriber::new(), and add the missing field to WhisperConfig struct and its Default impl.
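
The wiring pattern can be sketched with minimal stand-in types. In the real code the builder is whisper-rs's `WhisperContextParameters` and the call site is `WhisperTranscriber::new()`; the stubs below are illustrative only, to show the one-line forwarding that was missing:

```rust
// Stand-in for the project's WhisperConfig (illustrative, not the real struct).
// derive(Default) gives flash_attention = false, matching the prior behavior.
#[derive(Debug, Default)]
struct WhisperConfig {
    flash_attention: bool,
}

// Stand-in for whisper-rs's WhisperContextParameters builder.
#[derive(Default)]
struct WhisperContextParameters {
    flash_attn: bool,
}

impl WhisperContextParameters {
    // Builder-style setter mirroring the flash_attn() method named in the PR.
    fn flash_attn(&mut self, enable: bool) -> &mut Self {
        self.flash_attn = enable;
        self
    }
}

fn main() {
    // As parsed from config.toml with `flash_attention = true`.
    let config = WhisperConfig { flash_attention: true };

    let mut ctx_params = WhisperContextParameters::default();
    // The one-line fix: forward the parsed config value to the context
    // parameters before model initialization.
    ctx_params.flash_attn(config.flash_attention);

    assert!(ctx_params.flash_attn);
    println!("flash_attn = {}", ctx_params.flash_attn);
}
```

Before the fix, the parsed value was simply never read at this point, so whisper.cpp fell back to its default (flash attention off) on every backend.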

Tested on an RTX 5060 Ti with large-v3-turbo on a 6-minute audio file:

  Backend   flash_attn   encode buffer   wall time (avg of 3 runs)
  -------   ----------   -------------   -------------------------
  CUDA      off          212 MB          7.00 s
  Vulkan    off          220 MB          6.54 s
  CUDA      on            55 MB          6.27 s
  Vulkan    on            55 MB          6.26 s

Wall time drops ~10% on CUDA (~4% on Vulkan). The encode buffer shrinks by ~75% on both backends, consistent with flash attention avoiding materialization of the full attention matrix.


Related Issue

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Testing

  • I have tested these changes locally
  • I have run cargo test and all tests pass
  • I have run cargo clippy with no warnings
  • I have run cargo fmt

Documentation

  • I have updated documentation as needed
  • No documentation changes are needed

Additional Notes


graysky2 requested a review from peteonrails as a code owner April 5, 2026 14:58