GarraIA supports end-to-end voice conversation with speech-to-text and text-to-speech.
┌─────────────────────────────────────────────────────────────┐
│                       VOICE PIPELINE                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User Audio → STT → LLM → TTS → Audio Response              │
│                                                             │
│  STT: Whisper (local or API)                                │
│  TTS: Chatterbox, Hibiki, OpenAI TTS                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
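
The flow in the diagram can be read as three functions chained together. The sketch below is illustrative only: the type aliases, `run_voice_turn`, and the `fake_*` stages are hypothetical names, not GarraIA's actual interfaces, and the stand-in stages exist only so the example runs without a model or server.

```python
from typing import Callable

# Hypothetical stage signatures; GarraIA's real interfaces may differ.
Transcribe = Callable[[bytes], str]   # STT: audio bytes -> text
Generate = Callable[[str], str]       # LLM: prompt text -> reply text
Synthesize = Callable[[str], bytes]   # TTS: reply text -> audio bytes


def run_voice_turn(audio: bytes, stt: Transcribe, llm: Generate, tts: Synthesize) -> bytes:
    """Run one turn: User Audio -> STT -> LLM -> TTS -> Audio Response."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


# Stand-in stages so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


response = run_voice_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

Keeping each stage behind a plain callable is one way to let Whisper, Chatterbox, Hibiki, or OpenAI be swapped without touching the turn logic.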
- FFmpeg installed
- TTS server (Chatterbox or Hibiki) for local TTS
- Optional: Whisper for local STT
When garraia init runs on a machine with an NVIDIA GPU (nvidia-smi
detected) and the user opts into voice mode, the wizard pre-fills the
configuration block below and prints the install instructions for both
servers, but it does not auto-install the Python TTS/STT stacks.
Run the commands below yourself once, then start the gateway with
garraia start (no --with-voice flag needed when voice.enabled is
already true in config.yml):
# TTS — Chatterbox Multilingual on :7860
pip install chatterbox-tts
chatterbox-tts serve --host 127.0.0.1 --port 7860
# STT — faster-whisper-server on :9090
pip install faster-whisper-server
fwsh serve --host 127.0.0.1 --port 9090

The wizard writes voice.tts_endpoint=http://127.0.0.1:7860,
voice.stt_endpoint=http://127.0.0.1:9090,
voice.tts_provider=chatterbox, and voice.language=pt into the
emitted config.yml. CPU-only machines skip the voice prompt entirely.
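
The dotted names above correspond to nested keys under `voice` in config.yml. As a rough illustration of that mapping, the sketch below expands dotted keys into a nested dictionary. `nest_dotted_keys` is a hypothetical helper written for this example; how the wizard actually builds the file is not documented here.

```python
from typing import Any


def nest_dotted_keys(flat: dict[str, Any]) -> dict[str, Any]:
    """Expand keys like 'voice.language' into nested dictionaries."""
    nested: dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            # Create intermediate sections (e.g. 'voice') on first use.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Values the wizard is described as writing on a GPU machine.
wizard_values = {
    "voice.tts_endpoint": "http://127.0.0.1:7860",
    "voice.stt_endpoint": "http://127.0.0.1:9090",
    "voice.tts_provider": "chatterbox",
    "voice.language": "pt",
}

config = nest_dotted_keys(wizard_values)
```

Here `config["voice"]["language"]` holds `"pt"`, matching the `language` key in the YAML block.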

voice:
  enabled: true
  tts_endpoint: "http://127.0.0.1:7860"  # Chatterbox/Hibiki
  stt_provider: whisper  # whisper or openai
  language: "pt"  # pt, en, es, fr, de, it, hi

Docker-based GPU TTS:
docker run -d --gpus all -p 7860:7860 ghcr.io/garraia/chatterbox:latest

Features:
- Multilingual (pt, en, es, fr, de, it, hi)
- GPU accelerated
- Low latency
Alternative GPU TTS:
docker run -d --gpus all -p 7861:7860 ghcr.io/garraia/hibiki:latest

Cloud-based TTS:
voice:
  enabled: true
  tts_provider: openai
  tts_model: "tts-1-hd"
  tts_voice: "alloy"

voice:
  stt_provider: whisper
  whisper_model: "base"  # tiny, base, small, medium, large

voice:
  stt_provider: openai
  openai_api_key: "sk-..."

garraia start --with-voice

- /voz or /voice - Toggle voice mode for current session
- Voice responses are automatic when enabled

Send voice messages and receive voice responses automatically when voice mode is enabled.
curl -X POST http://127.0.0.1:3888/api/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how can I help you?", "language": "en"}'

Returns audio file (WAV/MP3).
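
The same TTS request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call: the gateway address, path, and JSON fields come from the example above, while `build_tts_request` and `synthesize` are hypothetical helper names, and the timeout is an arbitrary choice.

```python
import json
import urllib.request

GATEWAY = "http://127.0.0.1:3888"  # gateway address used in the curl example


def build_tts_request(text: str, language: str = "en") -> urllib.request.Request:
    """Build the POST /api/tts request without sending it."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{GATEWAY}/api/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, language: str = "en", timeout: float = 30.0) -> bytes:
    """Send the request and return the raw audio bytes (WAV or MP3)."""
    request = build_tts_request(text, language)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the gateway running, `synthesize("Hello, how can I help you?")` should return the audio bytes. The container format is not fixed by this example, so inspect the response before choosing a file extension.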
curl -X POST http://127.0.0.1:3888/api/stt \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Returns transcribed text.
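
A Python equivalent of the STT call might look like the sketch below. It assumes the endpoint accepts the raw WAV bytes as the request body, as the curl example does. `build_stt_request` and `transcribe` are hypothetical names, and the response is returned as decoded text without assuming whether it is plain text or JSON.

```python
import urllib.request

GATEWAY = "http://127.0.0.1:3888"  # gateway address used in the curl example


def build_stt_request(wav_bytes: bytes) -> urllib.request.Request:
    """Build the POST /api/stt request carrying raw WAV audio."""
    return urllib.request.Request(
        f"{GATEWAY}/api/stt",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def transcribe(wav_path: str, timeout: float = 60.0) -> str:
    """Upload a WAV file and return the response body as text."""
    with open(wav_path, "rb") as handle:
        wav_bytes = handle.read()
    request = build_stt_request(wav_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the gateway running, `transcribe("audio.wav")` would return whatever body the endpoint sends back.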
Voice services are checked at startup:
garraia health

Output includes TTS and STT status.
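
For scripted checks, a small probe can complement `garraia health`. The sketch below assumes a health endpoint answers with HTTP 200 when the service is up, which this page does not state. `is_healthy` is a hypothetical helper, not part of GarraIA.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` succeeds with HTTP 200 (assumed convention)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False
```

For example, `is_healthy("http://127.0.0.1:7860/health")` probes the local TTS server address used elsewhere on this page.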
Check TTS server:
curl http://127.0.0.1:7860/health

- Increase TTS quality setting
- Check network latency to TTS server
- Use local TTS (Chatterbox/Hibiki)
- Check FFmpeg installation
- Verify audio format (16kHz mono recommended)
- Try different Whisper model
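
To check the recommended audio format before sending a file to STT, the WAV header can be read with Python's standard `wave` module. This is a sketch: `describe_wav`, `is_recommended_format`, and `make_silent_wav` are hypothetical helpers, and the last one only generates test audio so the example runs on its own.

```python
import io
import wave


def describe_wav(data: bytes) -> tuple[int, int]:
    """Return (sample_rate_hz, channel_count) read from a WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_recommended_format(data: bytes) -> bool:
    """True when the audio is 16 kHz mono, the format recommended above."""
    return describe_wav(data) == (16000, 1)


def make_silent_wav(sample_rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Generate a short silent 16-bit PCM WAV, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate * seconds))
    return buffer.getvalue()


ok = is_recommended_format(make_silent_wav(16000, 1))
not_ok = is_recommended_format(make_silent_wav(44100, 2))
```

A file that fails the check can be converted with FFmpeg, for example `ffmpeg -i input.wav -ar 16000 -ac 1 output.wav`, where `-ar` sets the sample rate and `-ac` the channel count.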