VAD configuration for short utterances - "Yes", "No", single names not triggering response #142

@fansen


Description of the feature request:

Summary

When using Gemini 2.5 Flash Native Audio via the Multimodal Live API in a server-side telephony scenario (Twilio → WebSocket → Gemini Live), short utterances such as
"Yes", "No", "John", or other single-word responses frequently fail to trigger a model response.

Longer utterances (>2 seconds) work reliably, but brief responses that are common in natural phone conversations do not.

Environment

  • Model: gemini-2.5-flash-native-audio-preview-12-2025
  • API: Multimodal Live API (WebSocket)
  • Audio Input: PCM 16-bit, 16kHz, mono from Twilio phone calls
  • Audio Chunk Size: ~20ms per chunk (640 bytes)
  • Use Case: Restaurant phone ordering system
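The chunk size follows directly from the audio format, which may help anyone reproducing this setup verify their framing:

```python
# PCM 16-bit mono at 16 kHz: 2 bytes per sample, 16000 samples per second
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 20

bytes_per_chunk = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000
print(bytes_per_chunk)  # 640
```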

Current Configuration

setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "thinking_config": {
                "thinking_budget": 0
            },
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Kore"
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{"text": "You are a friendly restaurant assistant..."}]
        },
        "tools": [...]
    }
}

Reproduction Steps

  1. Customer calls Twilio phone number
  2. Audio streams via WebSocket to our server
  3. Server forwards PCM audio chunks to Gemini Live API via realtimeInput.mediaChunks
  4. Customer says a short response like:
    • "Yes"
    • "No"
    • "John" (their name)
    • "Sure"
    • "Okay"
  5. Expected: Gemini responds with audio
  6. Actual: No response is generated (most of the time)
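For reference, this is roughly how we build each forwarded message (a minimal sketch; the `mimeType` string and message shape follow the Live API's `realtimeInput.mediaChunks` format as we understand it, and `build_realtime_input` is our own helper name):

```python
import base64
import json

def build_realtime_input(pcm_chunk: bytes) -> str:
    """Wrap one ~20 ms PCM chunk in a realtimeInput.mediaChunks message.

    The chunk is base64-encoded, as the WebSocket protocol expects
    binary audio as a base64 string inside JSON.
    """
    return json.dumps({
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm_chunk).decode("ascii"),
            }]
        }
    })
```

Each message is then sent over the open WebSocket to the Live API as fast as chunks arrive from Twilio.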

What Works

✅ Longer utterances (>2 seconds) trigger responses reliably
✅ Questions or full sentences work fine
✅ Azure Speech SDK STT recognizes these short utterances perfectly in parallel

Questions / Feature Requests

Since the documentation for VAD and turn detection parameters is not publicly available, I'd like to request:

1. VAD Configuration Parameters

Are there any Voice Activity Detection parameters we can configure? For example:

"speech_config": {
    "vad_sensitivity": "low",  # or "medium", "high"
    "min_speech_duration_ms": 300,
    "speech_end_timeout_ms": 800
}

2. Turn Detection Configuration

Can we adjust turn-taking/end-of-utterance detection? Similar to OpenAI's Realtime API:

"turn_detection": {
    "enabled": True,
    "timeout_ms": 1500,
    "silence_threshold_ms": 500
}

3. Manual turnComplete Signal

Should we send explicit turnComplete signals after detecting silence? What's the correct format?

{
    "clientContent": {
        "turnComplete": True
    }
}

or

{
    "realtimeInput": {
        "turnComplete": True
    }
}
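If manual signaling is supported, we would pair it with client-side silence detection. A minimal sketch of what we have in mind (the RMS threshold and 800 ms hangover are guesses, not documented values):

```python
from array import array

SILENCE_RMS = 500        # hypothetical energy threshold for 16-bit PCM
SILENCE_MS_TO_END = 800  # hypothetical trailing-silence hangover

class EndOfTurnDetector:
    """Track trailing silence after speech; fire once the hangover elapses."""

    def __init__(self) -> None:
        self.heard_speech = False
        self.silent_ms = 0

    def feed(self, pcm_chunk: bytes, chunk_ms: int = 20) -> bool:
        samples = array("h")          # signed 16-bit little-endian samples
        samples.frombytes(pcm_chunk)
        if not samples:
            return False
        rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
        if rms >= SILENCE_RMS:
            self.heard_speech = True  # speech detected; reset the hangover
            self.silent_ms = 0
        elif self.heard_speech:
            self.silent_ms += chunk_ms
        return self.heard_speech and self.silent_ms >= SILENCE_MS_TO_END
```

When `feed()` returns True, the client would send whichever `turnComplete` message format is correct, then reset the detector for the next turn.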

4. Minimum Audio Length Threshold

Is there a minimum audio duration requirement? Should we pad short utterances with silence?
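If padding turns out to help, we would do something like the following (the 500 ms minimum is an arbitrary guess to experiment with, not a documented threshold):

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2

def pad_with_silence(pcm: bytes, min_ms: int = 500) -> bytes:
    """Append trailing zero samples so a 16-bit PCM clip lasts at least min_ms."""
    min_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * min_ms // 1000
    if len(pcm) >= min_bytes:
        return pcm
    return pcm + bytes(min_bytes - len(pcm))  # zero bytes = digital silence
```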

5. Complete generation_config Schema

Could you publish the complete configuration schema including all experimental parameters for speech_config, especially for server-side telephony use cases?

Workaround Considered

We could use Azure STT for transcription and send text via clientContent, but this would lose the benefit of native audio processing:

# Hybrid approach
user_text = azure_stt.recognize(audio)
await gemini_live.send_text(user_text)  # Send as text instead of audio

However, this defeats the purpose of using Native Audio and adds latency.

Impact

This issue significantly impacts conversational AI in telephony scenarios where:

  • Short confirmations are common ("Yes", "No", "Okay")
  • Customers provide brief information (names, numbers)
  • Natural conversation flow requires quick back-and-forth

Additional Context

  • The same audio streams work perfectly with Azure Speech STT (no missed short utterances)
  • This only occurs with audio input; text input via clientContent works for all lengths
  • GitHub TypeScript example comments suggest VAD parameters exist but aren't documented

Thank you for considering this feature request! Happy to provide more details or test experimental parameters if needed.

