
fix: reject near-silent audio to prevent Whisper hallucinations #1229

Open
egsok wants to merge 1 commit into cjpais:main from egsok:fix/silence-hallucination


egsok commented Apr 5, 2026

Summary

  • Add RMS energy gate (threshold 0.005) in the transcription pipeline — skips audio that passed VAD but is too quiet for real speech
  • Add minimum speech duration check (100ms / 1600 samples) in audio manager — discards tiny VAD leakage fragments before they get zero-padded to 1.25s

Problem

When recording a few seconds of "silence", Whisper hallucinates text like "Subtitles by the Amara.org community". Real microphones pick up ambient noise that occasionally exceeds VAD's threshold, causing SmoothedVad to flush its prefill buffer (~8000+ samples of near-silence). The padding logic then inflates this to 1.25 seconds, and Whisper hallucinates on it.
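To make the inflation concrete, here is a minimal sketch of a padding step, assuming the padding appends zeros up to a fixed minimum length. The names `MIN_PADDED_SAMPLES` and `pad_with_silence` are hypothetical and not taken from the codebase; the 1.25 s and 16 kHz figures come from this description.

```rust
/// 1.25 s at 16 kHz = 20,000 samples (figures from the PR description).
const MIN_PADDED_SAMPLES: usize = 20_000;

/// Hypothetical stand-in for the padding step: appends zeros until the
/// buffer reaches the minimum length. Longer buffers pass through unchanged.
fn pad_with_silence(mut samples: Vec<f32>) -> Vec<f32> {
    if samples.len() < MIN_PADDED_SAMPLES {
        // Assumption: padding is appended at the end of the buffer.
        samples.resize(MIN_PADDED_SAMPLES, 0.0);
    }
    samples
}
```

Under these assumptions, an 8000-sample prefill flush (0.5 s of near-silence) becomes a 20,000-sample buffer in which every sample is either low-level noise or zero, which is the kind of input Whisper tends to hallucinate on.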


How it works

Primary defense — RMS energy gate (transcription.rs): Computes RMS of the audio buffer and rejects anything below 0.005. Mic self-noise is ~0.0001–0.001, ambient noise that fools VAD is ~0.001–0.005, while whispered speech starts at ~0.01. This catches all near-silent audio regardless of how it got through VAD.
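The gate can be sketched roughly as follows. This is an illustration rather than the exact code in transcription.rs: the names `SILENCE_RMS_THRESHOLD`, `rms`, and `is_near_silent` are hypothetical, and it assumes samples are `f32` values normalized to [-1.0, 1.0].

```rust
/// Hypothetical constant mirroring the 0.005 threshold described above.
const SILENCE_RMS_THRESHOLD: f32 = 0.005;

/// Root-mean-square amplitude of a buffer of normalized samples.
/// Returns 0.0 for an empty buffer, so the gate always rejects it.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    // Accumulate in f64 to limit rounding error on long buffers.
    let sum_sq: f64 = samples
        .iter()
        .map(|&s| f64::from(s) * f64::from(s))
        .sum();
    (sum_sq / samples.len() as f64).sqrt() as f32
}

/// True when the buffer is too quiet to be worth transcribing.
fn is_near_silent(samples: &[f32]) -> bool {
    rms(samples) < SILENCE_RMS_THRESHOLD
}
```

A caller would check `is_near_silent(&audio)` before invoking the model and return an empty transcription when it is true. Because the check looks only at the energy of the final buffer, it does not depend on which VAD path produced the audio.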

Secondary defense — min duration (audio.rs): Fragments shorter than 100ms (1600 samples at 16kHz) are discarded as VAD leakage — SmoothedVad's minimum real output is ~8000 samples. A cheap safety net for edge cases.
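A minimal sketch of the duration check, assuming 16 kHz mono audio as stated above. The constant and function names are hypothetical and do not necessarily match audio.rs.

```rust
const SAMPLE_RATE_HZ: usize = 16_000;
const MIN_SPEECH_MS: usize = 100;
/// 100 ms at 16 kHz = 1600 samples.
const MIN_SPEECH_SAMPLES: usize = SAMPLE_RATE_HZ * MIN_SPEECH_MS / 1000;

/// Returns `None` for fragments shorter than the minimum duration,
/// and hands the buffer back unchanged otherwise.
fn discard_short_fragment(samples: Vec<f32>) -> Option<Vec<f32>> {
    if samples.len() < MIN_SPEECH_SAMPLES {
        None
    } else {
        Some(samples)
    }
}
```

Running this check before the padding step matters: once a fragment has been zero-padded to 1.25 s, its length no longer says anything about how much audio the VAD actually emitted.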

Test plan

  • Record 2–3 seconds of silence → no transcription output, no hallucination
  • Record a short word (~0.3s) → transcribes normally
  • Record normal speech → identical behavior to before
  • Debug mode (Ctrl+Shift+D): "Audio RMS ... below silence threshold" logged when recording silence

Add RMS energy gate (threshold 0.005) in transcription pipeline to skip
audio that passed VAD but is too quiet for meaningful speech. Also add
minimum speech duration check (100ms) in audio manager to discard tiny
VAD leakage fragments before they get zero-padded to 1.25s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cjpais (Owner) commented Apr 5, 2026

I think this is probably a good change. I will test it and pull it in. I may need to pull in beta testers too, to get more feedback and make sure there are no unintended consequences. But if the RMS is very low, I think there should be no big problem.

cjpais (Owner) commented Apr 7, 2026

Hmmm.. I need to think more on this. I am not convinced that we can ship this broadly to everyone without consequences

