Skip to content

Latest commit

 

History

History
184 lines (130 loc) · 4.31 KB

File metadata and controls

184 lines (130 loc) · 4.31 KB

Voice Mode

GarraIA supports end-to-end voice conversation with speech-to-text and text-to-speech.

Overview

┌─────────────────────────────────────────────────────────────┐
│                    VOICE PIPELINE                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  User Audio → STT → LLM → TTS → Audio Response             │
│                                                              │
│  STT: Whisper (local or API)                               │
│  TTS: Chatterbox, Hibiki, OpenAI TTS                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Setup

Prerequisites

  • FFmpeg installed
  • TTS server (Chatterbox or Hibiki) for TTS
  • Optional: Whisper for local STT

Wizard integration (plan 0126)

When garraia init runs on a machine with an NVIDIA GPU (nvidia-smi detected) and the user opts into voice mode, the wizard pre-fills the configuration block below and prints the install instructions for both servers — but it does not auto-install the Python TTS/STT stacks. Run those commands yourself once, then start the gateway with garraia start (no --with-voice flag needed when voice.enabled is already true in config.yml):

# TTS — Chatterbox Multilingual on :7860
pip install chatterbox-tts
chatterbox-tts serve --host 127.0.0.1 --port 7860

# STT — faster-whisper-server on :9090
pip install faster-whisper-server
fwsh serve --host 127.0.0.1 --port 9090

The wizard writes voice.tts_endpoint=http://127.0.0.1:7860, voice.stt_endpoint=http://127.0.0.1:9090, voice.tts_provider=chatterbox, and voice.language=pt into the emitted config.yml. CPU-only machines skip the voice prompt entirely.

Configuration

voice:
  enabled: true
  tts_endpoint: "http://127.0.0.1:7860"  # Chatterbox/Hibiki
  stt_provider: whisper  # whisper or openai
  language: "pt"  # pt, en, es, fr, de, it, hi

TTS Providers

Chatterbox (Recommended)

Docker-based GPU TTS:

docker run -d --gpus all -p 7860:7860 ghcr.io/garraia/chatterbox:latest

Features:

  • Multilingual (pt, en, es, fr, de, it, hi)
  • GPU accelerated
  • Low latency

Hibiki

Alternative GPU TTS:

docker run -d --gpus all -p 7861:7860 ghcr.io/garraia/hibiki:latest

OpenAI TTS

Cloud-based TTS:

voice:
  enabled: true
  tts_provider: openai
  tts_model: "tts-1-hd"
  tts_voice: "alloy"

STT Providers

Local Whisper

voice:
  stt_provider: whisper
  whisper_model: "base"  # tiny, base, small, medium, large

OpenAI Whisper API

voice:
  stt_provider: openai
  openai_api_key: "sk-..."

Usage

Starting with Voice

garraia start --with-voice

Voice Commands

  • /voz or /voice - Toggle voice mode for current session
  • Voice responses are automatic when enabled

Telegram Voice

Send voice messages and receive voice responses automatically when voice mode is enabled.

API Endpoints

TTS Endpoint

curl -X POST http://127.0.0.1:3888/api/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how can I help you?", "language": "en"}'

Returns audio file (WAV/MP3).

STT Endpoint

curl -X POST http://127.0.0.1:3888/api/stt \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Returns transcribed text.

Health Checks

Voice services are checked at startup:

garraia health

Output includes TTS and STT status.

Troubleshooting

TTS not responding

Check TTS server:

curl http://127.0.0.1:7860/health

Audio quality issues

  • Increase TTS quality setting
  • Check network latency to TTS server
  • Use local TTS (Chatterbox/Hibiki)

STT errors

  • Check FFmpeg installation
  • Verify audio format (16kHz mono recommended)
  • Try different Whisper model