VAD configuration for short utterances - "Yes", "No", single names not triggering response #142

@fansen


Description of the feature request:

Summary

When using Gemini 2.5 Flash Native Audio via the Multimodal Live API in a server-side telephony scenario (Twilio → WebSocket → Gemini Live), short utterances such as
"Yes", "No", "John", or other single-word responses frequently fail to trigger a model response.

Longer utterances (>2 seconds) work reliably, but brief responses that are common in natural phone conversations do not.

Environment

  • Model: gemini-2.5-flash-native-audio-preview-12-2025
  • API: Multimodal Live API (WebSocket)
  • Audio Input: PCM 16-bit, 16kHz, mono from Twilio phone calls
  • Audio Chunk Size: ~20ms per chunk (640 bytes)
  • Use Case: Restaurant phone ordering system
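The chunk size follows directly from the audio format, which may help anyone reproducing this setup verify their framing:

```python
# PCM 16-bit mono at 16 kHz: 2 bytes per sample, 16000 samples per second
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 20

bytes_per_chunk = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000
print(bytes_per_chunk)  # 640
```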

Current Configuration

setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "thinking_config": {
                "thinking_budget": 0
            },
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Kore"
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{"text": "You are a friendly restaurant assistant..."}]
        },
        "tools": [...]
    }
}

Reproduction Steps

  1. Customer calls Twilio phone number
  2. Audio streams via WebSocket to our server
  3. Server forwards PCM audio chunks to Gemini Live API via realtimeInput.mediaChunks
  4. Customer says a short response like:
    • "Yes"
    • "No"
    • "John" (their name)
    • "Sure"
    • "Okay"
  5. Expected: Gemini responds with audio
  6. Actual: No response is generated (most of the time)
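For reference, this is roughly how we build each forwarded message (a minimal sketch; the `mimeType` string and message shape follow the Live API's `realtimeInput.mediaChunks` format as we understand it, and `build_realtime_input` is our own helper name):

```python
import base64
import json

def build_realtime_input(pcm_chunk: bytes) -> str:
    """Wrap one ~20 ms PCM chunk in a realtimeInput.mediaChunks message.

    The chunk is base64-encoded, as the WebSocket protocol expects
    binary audio as a base64 string inside JSON.
    """
    return json.dumps({
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm_chunk).decode("ascii"),
            }]
        }
    })
```

Each message is then sent over the open WebSocket to the Live API as fast as chunks arrive from Twilio.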

What Works

✅ Longer utterances (>2 seconds) trigger responses reliably
✅ Questions or full sentences work fine
✅ Azure Speech SDK STT recognizes these short utterances perfectly in parallel

Questions / Feature Requests

Since the documentation for VAD and turn detection parameters is not publicly available, I'd like to request:

1. VAD Configuration Parameters

Are there any Voice Activity Detection parameters we can configure? For example:

"speech_config": {
    "vad_sensitivity": "low",  # or "medium", "high"
    "min_speech_duration_ms": 300,
    "speech_end_timeout_ms": 800
}

2. Turn Detection Configuration

Can we adjust turn-taking/end-of-utterance detection? Similar to OpenAI's Realtime API:

"turn_detection": {
    "enabled": True,
    "timeout_ms": 1500,
    "silence_threshold_ms": 500
}

3. Manual turnComplete Signal

Should we send explicit turnComplete signals after detecting silence? What's the correct format?

{
    "clientContent": {
        "turnComplete": True
    }
}

or

{
    "realtimeInput": {
        "turnComplete": True
    }
}
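If manual signaling is supported, we would pair it with client-side silence detection. A minimal sketch of what we have in mind (the RMS threshold and 800 ms hangover are guesses, not documented values):

```python
from array import array

SILENCE_RMS = 500        # hypothetical energy threshold for 16-bit PCM
SILENCE_MS_TO_END = 800  # hypothetical trailing-silence hangover

class EndOfTurnDetector:
    """Track trailing silence after speech; fire once the hangover elapses."""

    def __init__(self) -> None:
        self.heard_speech = False
        self.silent_ms = 0

    def feed(self, pcm_chunk: bytes, chunk_ms: int = 20) -> bool:
        samples = array("h")          # signed 16-bit little-endian samples
        samples.frombytes(pcm_chunk)
        if not samples:
            return False
        rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
        if rms >= SILENCE_RMS:
            self.heard_speech = True  # speech detected; reset the hangover
            self.silent_ms = 0
        elif self.heard_speech:
            self.silent_ms += chunk_ms
        return self.heard_speech and self.silent_ms >= SILENCE_MS_TO_END
```

When `feed()` returns True, the client would send whichever `turnComplete` message format is correct, then reset the detector for the next turn.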

4. Minimum Audio Length Threshold

Is there a minimum audio duration requirement? Should we pad short utterances with silence?
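If padding turns out to help, we would do something like the following (the 500 ms minimum is an arbitrary guess to experiment with, not a documented threshold):

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2

def pad_with_silence(pcm: bytes, min_ms: int = 500) -> bytes:
    """Append trailing zero samples so a 16-bit PCM clip lasts at least min_ms."""
    min_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * min_ms // 1000
    if len(pcm) >= min_bytes:
        return pcm
    return pcm + bytes(min_bytes - len(pcm))  # zero bytes = digital silence
```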

5. Complete generation_config Schema

Could you publish the complete configuration schema including all experimental parameters for speech_config, especially for server-side telephony use cases?

Workaround Considered

We could use Azure STT for transcription and send text via clientContent, but this would lose the benefit of native audio processing:

# Hybrid approach
user_text = azure_stt.recognize(audio)
await gemini_live.send_text(user_text)  # Send as text instead of audio

However, this defeats the purpose of using Native Audio and adds latency.

Impact

This issue significantly impacts conversational AI in telephony scenarios where:

  • Short confirmations are common ("Yes", "No", "Okay")
  • Customers provide brief information (names, numbers)
  • Natural conversation flow requires quick back-and-forth

Additional Context

  • The same audio streams work perfectly with Azure Speech STT (no missed short utterances)
  • This only occurs with audio input; text input via clientContent works for all lengths
  • GitHub TypeScript example comments suggest VAD parameters exist but aren't documented

Thank you for considering this feature request! Happy to provide more details or test experimental parameters if needed.

