fix: reject near-silent audio to prevent Whisper hallucinations #1229
Open
egsok wants to merge 1 commit into cjpais:main from
Add RMS energy gate (threshold 0.005) in transcription pipeline to skip audio that passed VAD but is too quiet for meaningful speech. Also add minimum speech duration check (100ms) in audio manager to discard tiny VAD leakage fragments before they get zero-padded to 1.25s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Owner
I think this is probably a good change, I will test it and pull it in. May need to pull in beta testers too to get more feedback on it to ensure no unintended consequences. But I think if there is very low RMS there should be no big problem
Owner
Hmmm.. I need to think more on this. I am not convinced that we can ship this broadly to everyone without consequences
Summary
Problem
When recording a few seconds of "silence", Whisper hallucinates text like "Subtitles by the Amara.org community". Real microphones pick up ambient noise that occasionally exceeds VAD's threshold, causing SmoothedVad to flush its prefill buffer (~8000+ samples of near-silence). The padding logic then inflates this to 1.25 seconds, and Whisper hallucinates on it.
How it works
Primary defense: RMS energy gate (transcription.rs). Computes RMS of the audio buffer and rejects anything below 0.005. Mic self-noise is ~0.0001–0.001, ambient noise that fools VAD is ~0.001–0.005, while whispered speech starts at ~0.01. This catches all near-silent audio regardless of how it got through VAD.

Secondary defense: min duration (audio.rs). Fragments shorter than 100ms (1600 samples at 16kHz) are discarded as VAD leakage, since SmoothedVad's minimum real output is ~8000 samples. A cheap safety net for edge cases.

Test plan
Ctrl+Shift+D): "Audio RMS ... below silence threshold" logged when recording silence
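
For reviewers, the two checks described under "How it works" can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the constant and function names (`SILENCE_RMS_THRESHOLD`, `MIN_SPEECH_SAMPLES`, `rms`, `is_near_silent`, `is_too_short`) are hypothetical, and it assumes mono `f32` samples normalized to [-1.0, 1.0] at 16kHz.

```rust
// Hypothetical names; the values mirror the numbers quoted in this PR.
const SILENCE_RMS_THRESHOLD: f32 = 0.005;
const SAMPLE_RATE_HZ: usize = 16_000;
// 100 ms at 16 kHz = 1600 samples.
const MIN_SPEECH_SAMPLES: usize = SAMPLE_RATE_HZ / 10;

/// Root-mean-square of the buffer; returns 0.0 for an empty buffer.
/// Accumulates in f64 to limit rounding error on long buffers.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    (sum_sq / samples.len() as f64).sqrt() as f32
}

/// Primary gate: audio that passed VAD but is too quiet to be speech.
fn is_near_silent(samples: &[f32]) -> bool {
    rms(samples) < SILENCE_RMS_THRESHOLD
}

/// Secondary gate: fragments shorter than 100 ms are treated as VAD leakage.
/// Assumed to run before any zero-padding is applied.
fn is_too_short(samples: &[f32]) -> bool {
    samples.len() < MIN_SPEECH_SAMPLES
}
```

Under these assumptions, a buffer of 8000 samples at a constant amplitude of 0.001 would be rejected by `is_near_silent`, while the same buffer at 0.02 would pass. The duration check has to run before padding, because a padded fragment is always 1.25 seconds long.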