Description of the feature request:
Summary
When using Gemini 2.5 Flash Native Audio via the Multimodal Live API in a server-side telephony scenario (Twilio → WebSocket → Gemini Live), short utterances such as
"Yes", "No", "John", or other single-word responses frequently fail to trigger a model response.
Longer utterances (>2 seconds) work reliably, but brief responses that are common in natural phone conversations do not.
Environment
- Model: `gemini-2.5-flash-native-audio-preview-12-2025`
- API: Multimodal Live API (WebSocket)
- Audio Input: PCM 16-bit, 16kHz, mono from Twilio phone calls
- Audio Chunk Size: ~20ms per chunk (640 bytes)
- Use Case: Restaurant phone ordering system
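For reference, the chunk size above follows directly from the stream parameters; a quick sanity check:

```python
# Sanity check of the audio chunking described above.
SAMPLE_RATE_HZ = 16_000   # 16 kHz mono
BYTES_PER_SAMPLE = 2      # PCM 16-bit
CHUNK_MS = 20             # ~20 ms per chunk

chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000
print(chunk_bytes)  # 640
```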
Current Configuration
```python
setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "thinking_config": {
                "thinking_budget": 0
            },
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Kore"
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{"text": "You are a friendly restaurant assistant..."}]
        },
        "tools": [...]
    }
}
```
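For context, this setup message is serialized and sent as the first frame after the WebSocket connects; a minimal sketch (the endpoint URL below is a placeholder, not the documented value):

```python
import json

GEMINI_LIVE_URL = "wss://example.invalid/gemini-live"  # placeholder endpoint

setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
        "generation_config": {"response_modalities": ["AUDIO"]},
    }
}

frame = json.dumps(setup_message)
# await ws.send(frame)  # first message after connecting to GEMINI_LIVE_URL
```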
Reproduction Steps
- Customer calls Twilio phone number
- Audio streams via WebSocket to our server
- Server forwards PCM audio chunks to the Gemini Live API via `realtimeInput.mediaChunks`
- Customer says a short response like:
- "Yes"
- "No"
- "John" (their name)
- "Sure"
- "Okay"
- Expected: Gemini responds with audio
- Actual: No response is generated (most of the time)
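For reference, a minimal sketch of the frame we build in step 3 (field names reflect our current client; the `mimeType` string is an assumption for 16 kHz PCM):

```python
import base64
import json

def make_realtime_input(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM chunk in a realtimeInput.mediaChunks frame
    (mimeType string is our assumption for 16 kHz PCM16 mono)."""
    return json.dumps({
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm_chunk).decode("ascii"),
            }]
        }
    })

# One 20 ms chunk of silence (320 samples * 2 bytes)
frame = make_realtime_input(b"\x00\x00" * 320)
```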
What Works
✅ Longer utterances (>2 seconds) trigger responses reliably
✅ Questions or full sentences work fine
✅ Azure Speech SDK STT recognizes these short utterances perfectly in parallel
Questions / Feature Requests
Since the documentation for VAD and turn detection parameters is not publicly available, I'd like to request:
1. VAD Configuration Parameters
Are there any Voice Activity Detection parameters we can configure? For example:
```python
"speech_config": {
    "vad_sensitivity": "low",       # or "medium", "high"
    "min_speech_duration_ms": 300,
    "speech_end_timeout_ms": 800
}
```
2. Turn Detection Configuration
Can we adjust turn-taking/end-of-utterance detection? Similar to OpenAI's Realtime API:
```python
"turn_detection": {
    "enabled": True,
    "timeout_ms": 1500,
    "silence_threshold_ms": 500
}
```
3. Manual `turnComplete` Signal
Should we send explicit `turnComplete` signals after detecting silence? What's the correct format?
```python
{
    "clientContent": {
        "turnComplete": True
    }
}
```
or
```python
{
    "realtimeInput": {
        "turnComplete": True
    }
}
```
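For illustration, here is roughly how we would wire a manual signal in, assuming the `clientContent` variant and a simple RMS-based silence detector (thresholds are guesses, not documented values):

```python
import json
import struct

SILENCE_RMS = 500        # assumed level threshold for 16-bit PCM samples
SILENCE_MS_TO_END = 800  # assumed end-of-utterance timeout
CHUNK_MS = 20            # matches our ~20 ms chunking

def rms(pcm_chunk: bytes) -> float:
    """Root-mean-square level of a PCM16 little-endian mono chunk."""
    samples = struct.unpack(f"<{len(pcm_chunk) // 2}h", pcm_chunk)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

class Endpointer:
    """Tracks trailing silence; returns a turnComplete frame (format from
    question 3 above, unconfirmed) once enough silence follows speech."""

    def __init__(self) -> None:
        self.silence_ms = 0
        self.heard_speech = False

    def feed(self, chunk: bytes):
        if rms(chunk) >= SILENCE_RMS:
            self.heard_speech = True
            self.silence_ms = 0
            return None
        self.silence_ms += CHUNK_MS
        if self.heard_speech and self.silence_ms >= SILENCE_MS_TO_END:
            self.heard_speech = False
            return json.dumps({"clientContent": {"turnComplete": True}})
        return None

ep = Endpointer()
speech = struct.pack("<320h", *([1000] * 320))  # loud 20 ms chunk
silence = b"\x00" * 640                         # quiet 20 ms chunk
ep.feed(speech)
for _ in range(40):                             # 800 ms of silence
    frame = ep.feed(silence)
print(frame)  # the clientContent frame fires on the final chunk
```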
4. Minimum Audio Length Threshold
Is there a minimum audio duration requirement? Should we pad short utterances with silence?
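If padding turns out to be the answer, a sketch of what we would try (the 500 ms floor is a guess on our part, not a documented requirement):

```python
def pad_to_min_duration(pcm: bytes, min_ms: int = 500,
                        rate_hz: int = 16_000) -> bytes:
    """Pad a short PCM16 mono utterance with trailing silence so it
    reaches min_ms. The 500 ms default is an assumed threshold."""
    min_bytes = rate_hz * 2 * min_ms // 1000
    if len(pcm) < min_bytes:
        pcm += b"\x00" * (min_bytes - len(pcm))
    return pcm

padded = pad_to_min_duration(b"\x01\x00" * 1600)  # 100 ms utterance
print(len(padded))  # 16000 bytes == 500 ms
```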
5. Complete generation_config Schema
Could you publish the complete configuration schema including all experimental parameters for speech_config, especially for server-side telephony use cases?
Workaround Considered
We could use Azure STT for transcription and send text via `clientContent`, but this would lose the benefit of native audio processing:

```python
# Hybrid approach: transcribe with Azure STT, then send text to Gemini
user_text = azure_stt.recognize(audio)
await gemini_live.send_text(user_text)  # send as text instead of audio
```
However, this defeats the purpose of using Native Audio and adds latency.
Impact
This issue significantly impacts conversational AI in telephony scenarios where:
- Short confirmations are common ("Yes", "No", "Okay")
- Customers provide brief information (names, numbers)
- Natural conversation flow requires quick back-and-forth
Additional Context
- The same audio streams work perfectly with Azure Speech STT (no missed short utterances)
- This only occurs with audio input; text input via `clientContent` works for all lengths
- GitHub TypeScript example comments suggest VAD parameters exist but aren't documented
Thank you for considering this feature request! Happy to provide more details or test experimental parameters if needed.
What problem are you trying to solve with this feature?
No response
Any other information you'd like to share?
No response