This repository implements a speech-to-speech cascaded pipeline consisting of the following parts:
- Voice Activity Detection (VAD)
- Speech to Text (STT)
- Language Model (LM)
- Text to Speech (TTS)
The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub. The code is designed for easy modification, and we already support device-specific and external library implementations:
VAD
- Silero VAD
STT
- Any Whisper model checkpoint on the Hugging Face Hub through Transformers 🤗, including whisper-large-v3 and distil-large-v3
- Lightning Whisper MLX
- MLX Audio Whisper - Fast Whisper inference on Apple Silicon
- Parakeet TDT - Real-time streaming STT with sub-100ms latency on Apple Silicon (CUDA/CPU via nano-parakeet, no NeMo)
- Paraformer - FunASR
LLM
- Any instruction-following model on the Hugging Face Hub via Transformers 🤗
- mlx-lm
- OpenAI API
TTS
- MeloTTS
- ChatTTS
- Pocket TTS - Streaming TTS with voice cloning from Kyutai Labs
- Kokoro-82M - Fast and high-quality TTS optimized for Apple Silicon
Clone the repository:

```bash
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
```

Install dependencies with uv:

```bash
uv sync
```

The project now uses a single pyproject.toml with platform markers, so macOS and non-macOS dependencies are resolved automatically from one file.
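Platform-conditional dependencies of this kind are expressed with PEP 508 environment markers. As an illustrative sketch only (the package names below are placeholders, not the project's actual dependency list), such a pyproject.toml fragment can look like:

```toml
[project]
name = "speech-to-speech"
dependencies = [
    # Resolved on every platform
    "transformers",
    # macOS-only backend; skipped on other platforms
    "mlx-lm; sys_platform == 'darwin'",
    # Everywhere except macOS
    "some-cuda-backend; sys_platform != 'darwin'",
]
```

Resolvers such as uv evaluate the marker against the current platform, so one lockfile-driven `uv sync` works on both macOS and Linux.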
If you use Melo TTS (default on macOS), run this once after install:

```bash
uv run python -m unidic download
```

Apple Silicon MeloTTS note:
- If MeloTTS fails on MPS with `Output channels > 65536 not supported at the MPS device`, update macOS first.
- We reproduced this on an older macOS release and verified that the same environment worked after updating to macOS 26.3.1.
Note on DeepFilterNet: DeepFilterNet (used for optional audio enhancement in VAD) is currently incompatible with Pocket TTS due to numpy version constraints. DeepFilterNet requires numpy<2, while Pocket TTS requires numpy>=2.
If you want a DeepFilterNet-focused setup with pyproject.toml:

- Edit pyproject.toml: remove the `pocket-tts` dependency line.
- Add `deepfilternet>=0.5.6` and `numpy<2` to `project.dependencies`.
- Re-sync the environment:

```bash
uv sync --refresh
```
To switch back to Pocket TTS, revert those pyproject.toml changes and run uv sync --refresh again.
The pipeline can be run in three ways:
- Server/Client approach: Models run on a server, and audio input/output are streamed from a client using TCP sockets.
- WebSocket approach: Models run on a server, and audio input/output are streamed from a client using WebSockets.
- Local approach: The whole pipeline runs on a single machine; microphone input and audio output are handled locally.
1. Run the pipeline on the server:

```bash
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
```

2. Run the client locally to handle microphone input and receive generated audio:

```bash
python listen_and_play.py --host <IP address of your server>
```
1. Run the pipeline in WebSocket mode:

```bash
python s2s_pipeline.py --mode websocket --ws_host 0.0.0.0 --ws_port 8765
```

2. Connect to the WebSocket server from your client application at `ws://<server-ip>:8765`. The server handles bidirectional audio streaming:
   - Send raw audio bytes to the server (16 kHz, int16, mono)
   - Receive generated audio bytes from the server
For optimal settings on Mac:

```bash
python s2s_pipeline.py --local_mac_optimal_settings
```
You can also specify a particular LLM model:

```bash
python s2s_pipeline.py \
    --local_mac_optimal_settings \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
```
This setting:

- Adds `--device mps` to use MPS for all models.
- Sets Parakeet TDT for STT (fast streaming ASR on Apple Silicon).
- Sets MLX LM for the language model (uses `--lm_model_name` to specify the model).
- Sets MeloTTS for TTS.
- Requires one-time UniDic setup for MeloTTS: `uv run python -m unidic download`

`--tts pocket` and `--tts kokoro` are also valid TTS options on macOS.
To run with Docker on a CUDA machine, first install the NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), then start the pipeline:

```bash
docker compose up
```
Leverage Torch Compile for Whisper together with Pocket TTS for a simple low-latency setup:

```bash
python s2s_pipeline.py \
    --lm_model_name microsoft/Phi-3-mini-4k-instruct \
    --stt_compile_mode reduce-overhead \
    --tts pocket \
    --recv_host 0.0.0.0 \
    --send_host 0.0.0.0
```

The pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.
Two use cases are considered:

- Single-language conversation: Enforce the language setting using the `--language` flag, specifying the target language code (default is `en`).
- Language switching: Set `--language` to `auto`. In this case, Whisper detects the language of each spoken prompt, and the LLM is prompted with "Please reply to my message in ..." to ensure the response is in the detected language.

Please note that you must use STT and LLM checkpoints compatible with the target language(s). For multilingual TTS, use Melo (English, French, Spanish, Chinese, Japanese, and Korean) or ChatTTS.
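The language-switching behavior described above can be sketched as a small helper. This is a hypothetical illustration of the idea; the repository's actual prompt construction may differ:

```python
# Languages the pipeline supports, keyed by Whisper language code.
LANG_NAMES = {"en": "English", "fr": "French", "es": "Spanish",
              "zh": "Chinese", "ja": "Japanese", "ko": "Korean"}

def build_lm_prompt(transcript, detected_lang, language="auto"):
    """Prefix the LLM prompt with a reply-language instruction when
    --language is 'auto' and Whisper reported a supported language."""
    if language == "auto" and detected_lang in LANG_NAMES:
        return f"Please reply to my message in {LANG_NAMES[detected_lang]}. {transcript}"
    return transcript
```

With a fixed `--language` setting, the transcript passes through unchanged, since the checkpoints themselves are expected to handle the single target language.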
For automatic language detection:

```bash
python s2s_pipeline.py \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language auto \
    --llm mlx-lm \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
```

Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language zh \
    --llm mlx-lm \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
```

For automatic language detection (note: `--stt whisper-mlx` overrides the default parakeet-tdt from optimal settings, since Whisper large-v3 has broader language coverage):
```bash
python s2s_pipeline.py \
    --local_mac_optimal_settings \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language auto \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
```

Or for one language in particular, Chinese in this example:

```bash
python s2s_pipeline.py \
    --local_mac_optimal_settings \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language zh \
    --lm_model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
```

Pocket TTS from Kyutai Labs provides streaming TTS with voice cloning capabilities. To use it:
```bash
python s2s_pipeline.py \
    --tts pocket \
    --pocket_tts_voice jean \
    --pocket_tts_device cpu
```

Available voice presets: alba, marius, javert, jean, fantine, cosette, eponine, azelma. You can also use custom voice files or Hugging Face paths.
NOTE: References for all the CLI arguments can be found directly in the arguments classes or by running `python s2s_pipeline.py -h`.
See the `ModuleArguments` class. It allows you to set:

- a common `--device` (if one wants each part to run on the same device)
- `--mode`: `local` or `server`
- the chosen STT implementation
- the chosen LM implementation
- the chosen TTS implementation
- the logging level
See the `VADHandlerArguments` class. Notably:

- `--thresh`: threshold value to trigger voice activity detection.
- `--min_speech_ms`: minimum duration of detected voice activity to be considered speech.
- `--min_silence_ms`: minimum length of silence intervals for segmenting speech, balancing sentence cutting against latency reduction.
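To make the interaction between these three parameters concrete, here is a hypothetical frame-based segmenter (a simplified sketch, not the repository's VAD handler), assuming per-frame speech probabilities such as those produced by Silero VAD:

```python
def segment_speech(probs, frame_ms=32, thresh=0.5,
                   min_speech_ms=500, min_silence_ms=250):
    """Group frame-wise speech probabilities into (start_ms, end_ms) segments.

    A segment closes once min_silence_ms of consecutive silence is seen,
    and is kept only if it lasted at least min_speech_ms.
    """
    segments = []
    start = None   # start time of the current candidate segment
    silence = 0    # consecutive silence accumulated inside a segment
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= thresh:
            if start is None:
                start = t
            silence = 0
        elif start is not None:
            silence += frame_ms
            if silence >= min_silence_ms:
                end = t - silence + frame_ms   # last speech frame's end
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence = None, 0
    if start is not None:                      # stream ended mid-segment
        end = len(probs) * frame_ms - silence
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```

A larger `min_silence_ms` avoids cutting sentences at short pauses but adds latency before the segment is handed to STT, which is exactly the trade-off the flag description mentions.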
`model_name`, `torch_dtype`, and `device` are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech parts. Specify the targeted pipeline part with the corresponding prefix (e.g. `stt`, `lm`, or `tts`; check the implementations' arguments classes for more details). For example:

```bash
--lm_model_name google/gemma-2b-it
```

Other generation parameters of the model's `generate` method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed.
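Conceptually, the `<part>_gen_` prefix maps CLI flags onto keyword arguments for that part's `generate` call. A simplified sketch of such a mapping (illustrative only, not the repository's actual parser):

```python
def collect_gen_kwargs(argv, part):
    """Turn '--<part>_gen_<name> <value>' CLI pairs into a kwargs dict
    suitable for passing to that part's generate() call."""
    prefix = f"--{part}_gen_"
    kwargs = {}
    i = 0
    while i < len(argv):
        if argv[i].startswith(prefix) and i + 1 < len(argv):
            raw = argv[i + 1]
            # Best-effort typing: try int, then float, else keep the string.
            for cast in (int, float):
                try:
                    value = cast(raw)
                    break
                except ValueError:
                    value = raw
            kwargs[argv[i][len(prefix):]] = value
            i += 2
        else:
            i += 1
    return kwargs
```

For instance, `--stt_gen_max_new_tokens 128` would end up as `max_new_tokens=128` on the STT model's `generate` call, while flags for other parts are ignored by that part's collector.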
```bibtex
@misc{Silero VAD,
  author       = {Silero Team},
  title        = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year         = {2021},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit       = {insert_some_commit_here},
  email        = {hello@silero.ai}
}

@misc{gandhi2023distilwhisper,
  title         = {Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
  author        = {Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
  year          = {2023},
  eprint        = {2311.00430},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@misc{lacombe-etal-2024-parler-tts,
  author       = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title        = {Parler-TTS},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```