A high-performance Rust server that wraps Qwen3-TTS and Qwen3-ASR behind OpenAI-compatible endpoints. Any client that speaks the OpenAI audio API can point at this server and get text-to-speech and speech-to-text from Qwen3 models.
Backends:
- libtorch (Linux) — PyTorch C++ runtime, CPU or CUDA GPU
- MLX (macOS Apple Silicon) — Apple Metal GPU acceleration
The release binaries are self-contained — ffmpeg is statically linked for audio format conversion (MP3, Opus, AAC, FLAC encoding and decoding of all common audio input formats). No external ffmpeg installation is required.
Run the installer to download the binary, models, and tokenizers for your platform:
curl -sSf https://raw.githubusercontent.com/second-state/qwen3_audio_api/main/rust/install.sh | bash

The installer detects your OS, CPU, and NVIDIA GPU (if present), then sets up everything in ./qwen3_audio_api/. Once complete, start the server:
cd qwen3_audio_api
TTS_CUSTOMVOICE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
TTS_BASE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-Base \
ASR_MODEL_PATH=./models/Qwen3-ASR-0.6B \
./qwen3-audio-api

Text-to-Speech:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-tts", "input": "Hello world!", "voice": "alloy"}' \
--output speech.mp3

Speech-to-Text:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav -F model=qwen3-asr

| Variable | Default | Description |
|---|---|---|
| `TTS_CUSTOMVOICE_MODEL_PATH` | -- | Path to CustomVoice model directory (enables `voice`/`instructions` parameters) |
| `TTS_BASE_MODEL_PATH` | -- | Path to Base model directory (enables `audio_sample` voice cloning) |
| `ASR_MODEL_PATH` | -- | Path to ASR model directory (enables `/v1/audio/transcriptions`) |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
At least one of TTS_CUSTOMVOICE_MODEL_PATH, TTS_BASE_MODEL_PATH, or ASR_MODEL_PATH must be set.
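As a rough illustration of this rule, the Python sketch below reports which features a given environment would enable. The helper name `enabled_features` is hypothetical and this is not the server's startup logic; only the variable names and their effects come from the table above. Treating an empty value as unset is an assumption.

```python
import os

# Variable names and the features they enable, taken from the table above.
MODEL_VARS = {
    "TTS_CUSTOMVOICE_MODEL_PATH": "speech with voice/instructions",
    "TTS_BASE_MODEL_PATH": "speech with audio_sample voice cloning",
    "ASR_MODEL_PATH": "/v1/audio/transcriptions",
}


def enabled_features(env=None):
    """Return the features enabled by `env` (hypothetical helper).

    Raises ValueError when none of the three model paths is set,
    mirroring the documented requirement. Empty values count as unset,
    which is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    features = [desc for var, desc in MODEL_VARS.items() if env.get(var)]
    if not features:
        raise ValueError("set at least one of: " + ", ".join(MODEL_VARS))
    return features


if __name__ == "__main__":
    example = {"ASR_MODEL_PATH": "./models/Qwen3-ASR-0.6B"}
    print(enabled_features(example))
```

Passing a plain dict instead of `os.environ` makes it easy to try out different combinations without touching the real environment.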
Example — all models loaded:
TTS_CUSTOMVOICE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
TTS_BASE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-Base \
ASR_MODEL_PATH=./models/Qwen3-ASR-0.6B \
./qwen3-audio-api

`POST /v1/audio/speech` generates speech from text. It is compatible with the OpenAI audio speech API.
Request body (JSON):
| Field | Type | Required | Default | Description | Requires model |
|---|---|---|---|---|---|
| `model` | string | yes | -- | Model identifier (accepted for compatibility; the loaded model is always used) | -- |
| `input` | string | yes | -- | Text to synthesize (max 4096 characters) | -- |
| `voice` | string | no | `alloy` | Voice name (see table below) | CustomVoice |
| `response_format` | string | no | `mp3` | `mp3`, `opus`, `aac`, `flac`, `wav`, or `pcm` | -- |
| `speed` | number | no | `1.0` | Playback speed, 0.25 to 4.0 | -- |
| `language` | string | no | `Auto` | Language of the input text (`Auto`, `English`, `Chinese`, `Japanese`, `Korean`, `French`, `German`, `Spanish`, `Italian`, `Portuguese`, `Russian`) | -- |
| `instructions` | string | no | -- | Style/emotion instruction passed to the model | CustomVoice |
| `audio_sample` | string/file | no | -- | Reference audio for voice cloning (file upload via multipart, or base64 string via JSON) | Base |
| `audio_sample_text` | string | no | -- | Transcript of the reference audio; enables in-context learning mode for higher quality cloning | Base |
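To show how these fields fit together, here is a minimal Python sketch that checks a few of the documented limits on the client side and builds the JSON request with the standard library only. The helper name `build_speech_request` and the `base_url` parameter are illustrative assumptions. The server performs its own validation; rejecting empty input and counting characters with `len()` are assumptions of this sketch, not documented server behaviour.

```python
import json
import urllib.request

# Output formats listed in the request table.
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(text, voice="alloy", response_format="mp3",
                         speed=1.0, language="Auto",
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    # Assumption: empty input is rejected and length is counted with len().
    if not text or len(text) > 4096:
        raise ValueError("input must be 1 to 4096 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    payload = {
        "model": "qwen3-tts",  # accepted for compatibility only
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "language": language,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello world!", response_format="wav")
    print(req.get_method(), req.full_url)
    # To send it against a running server:
    # with urllib.request.urlopen(req) as resp:
    #     open("speech.wav", "wb").write(resp.read())
```

Separating request construction from sending keeps the validation testable without a running server.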
Note: The endpoint accepts both JSON and multipart/form-data. Use multipart (`curl -F`) to upload `audio_sample` as a binary file; this avoids base64 encoding. JSON requests can pass `audio_sample` as a base64-encoded string.

When `audio_sample` is provided, the request uses the Base model for voice cloning, and `voice`/`instructions` are ignored. When `audio_sample` is omitted, the request uses the CustomVoice model and requires a valid `voice`. If the required model is not loaded, the server returns HTTP 400.
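The routing rule in this note can be sketched in a few lines of Python. This illustrates the documented behaviour and is not the server's code; `required_model` and `json_body_with_sample` are hypothetical helper names.

```python
import base64


def required_model(fields):
    """Return which model a request needs, per the note above.

    A request with audio_sample is routed to the Base model (voice and
    instructions are ignored); otherwise the CustomVoice model is used.
    """
    return "Base" if fields.get("audio_sample") else "CustomVoice"


def json_body_with_sample(text, sample_bytes, sample_text=None):
    """Build a JSON-ready dict that carries audio_sample as base64."""
    body = {
        "model": "qwen3-tts",
        "input": text,
        "audio_sample": base64.b64encode(sample_bytes).decode("ascii"),
    }
    if sample_text is not None:
        body["audio_sample_text"] = sample_text
    return body


if __name__ == "__main__":
    body = json_body_with_sample("Hello.", b"fake audio bytes", "Hi there.")
    print(required_model(body))                # Base
    print(required_model({"voice": "alloy"}))  # CustomVoice
```

A client could use `required_model` to warn early when the matching model path was not configured, instead of waiting for the HTTP 400.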
Response: The raw audio bytes with the appropriate Content-Type header.
Example — predefined voice (CustomVoice model):
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-tts",
"input": "Hello, welcome to the Qwen text-to-speech API.",
"voice": "alloy",
"language": "English",
"response_format": "wav"
}' \
--output speech.wav

Example — voice cloning (Base model):
curl -X POST http://localhost:8000/v1/audio/speech \
-F model=qwen3-tts \
-F "input=This sentence will be spoken in the cloned voice." \
-F audio_sample=@reference.wav \
-F "audio_sample_text=Transcript of the reference audio." \
-F language=English \
-F response_format=wav \
--output cloned.wav

`POST /v1/audio/transcriptions` transcribes audio to text. It is compatible with the OpenAI audio transcriptions API.
Request body (multipart/form-data):
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | file | yes | -- | The audio file to transcribe (mp3, mp4, mpeg, mpga, m4a, wav, webm) |
| `model` | string | no | `qwen3-asr` | Model identifier (accepted for compatibility; the loaded model is always used) |
| `language` | string | no | -- | Language of the audio (auto-detected if not specified). Supports 30+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, etc. |
| `prompt` | string | no | -- | Optional context hint (not currently used) |
| `response_format` | string | no | `json` | `json` or `text` |
| `temperature` | number | no | `0.0` | Sampling temperature (not currently used) |
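For clients without a multipart library, the form fields above can be encoded by hand. The Python sketch below builds such a request with the standard library only. `build_transcription_request` and `base_url` are hypothetical names, and labelling the file part as `application/octet-stream` is an assumption; the server may or may not inspect that header.

```python
import urllib.request
import uuid


def build_transcription_request(filename, audio_bytes, language=None,
                                response_format="json",
                                base_url="http://localhost:8000"):
    """Build (but do not send) a multipart POST /v1/audio/transcriptions."""
    if response_format not in ("json", "text"):
        raise ValueError("response_format must be 'json' or 'text'")
    boundary = uuid.uuid4().hex
    fields = {"model": "qwen3-asr", "response_format": response_format}
    if language:
        fields["language"] = language

    parts = []
    # One part per plain text field.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    # The file part carries the raw audio bytes.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes + b"\r\n"
    )
    # Closing delimiter.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_transcription_request("audio.wav", b"fake audio bytes",
                                      language="English")
    print(req.get_method(), req.full_url, len(req.data), "bytes")
```

The request can then be sent with `urllib.request.urlopen(req)`; with `response_format="json"` the body is the JSON object described next.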
Response (JSON):
{
"text": "The transcribed text content."
}

Example:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F model=qwen3-asr

Example with language hint:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F model=qwen3-asr \
-F language=English \
-F response_format=text

`GET /v1/models` returns the list of available models:
curl http://localhost:8000/v1/models

`GET /health` returns `{"status": "ok"}` when the server is ready:
curl http://localhost:8000/health

The `voice` field accepts OpenAI voice names (mapped to Qwen3-TTS speakers) or Qwen3-TTS speaker names directly.
OpenAI voice mapping:
| OpenAI voice | Qwen3-TTS speaker |
|---|---|
| `alloy` | `Vivian` |
| `ash` | `Serena` |
| `ballad` | `Uncle_Fu` |
| `coral` | `Dylan` |
| `echo` | `Eric` |
| `fable` | `Ryan` |
| `onyx` | `Aiden` |
| `nova` | `Ono_Anna` |
| `sage` | `Sohee` |
| `shimmer` | `Vivian` |
| `verse` | `Ryan` |
| `marin` | `Serena` |
| `cedar` | `Aiden` |
Qwen3-TTS speakers can also be used directly as the voice value: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee.
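The mapping can be expressed as a small lookup. The Python sketch below copies the table and resolves either kind of name; `resolve_voice` is a hypothetical helper, and both the exact-match (case-sensitive) comparison and the `ValueError` on unknown names are assumptions rather than documented server behaviour.

```python
# OpenAI voice name -> Qwen3-TTS speaker, copied from the table above.
OPENAI_TO_QWEN = {
    "alloy": "Vivian",
    "ash": "Serena",
    "ballad": "Uncle_Fu",
    "coral": "Dylan",
    "echo": "Eric",
    "fable": "Ryan",
    "onyx": "Aiden",
    "nova": "Ono_Anna",
    "sage": "Sohee",
    "shimmer": "Vivian",
    "verse": "Ryan",
    "marin": "Serena",
    "cedar": "Aiden",
}

# Speaker names that can be passed directly as the voice value.
QWEN_SPEAKERS = set(OPENAI_TO_QWEN.values())


def resolve_voice(voice="alloy"):
    """Return the Qwen3-TTS speaker for an OpenAI voice or speaker name."""
    if voice in OPENAI_TO_QWEN:
        return OPENAI_TO_QWEN[voice]
    if voice in QWEN_SPEAKERS:
        return voice
    raise ValueError(f"unknown voice: {voice!r}")


if __name__ == "__main__":
    print(resolve_voice("alloy"))  # Vivian
    print(resolve_voice("Dylan"))  # Dylan
```

Note that several OpenAI names share a speaker: `alloy` and `shimmer` both resolve to `Vivian`, for example.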
| Format | Content-Type |
|---|---|
| `wav` | `audio/wav` |
| `pcm` | `audio/pcm` |
| `mp3` | `audio/mpeg` |
| `opus` | `audio/opus` |
| `aac` | `audio/aac` |
| `flac` | `audio/flac` |
All formats are handled natively by the statically-linked ffmpeg library. No external tools are needed.
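A client that saves responses to disk may want to derive the file extension from the returned header. The following Python sketch does that with the table above; `format_for_content_type` and `output_filename` are hypothetical helpers, and tolerating parameters after the media type (such as `; charset=...`) is a defensive assumption, since the document does not say whether the server sends any.

```python
# Format name -> Content-Type, copied from the table above.
CONTENT_TYPES = {
    "wav": "audio/wav",
    "pcm": "audio/pcm",
    "mp3": "audio/mpeg",
    "opus": "audio/opus",
    "aac": "audio/aac",
    "flac": "audio/flac",
}


def format_for_content_type(header):
    """Map a Content-Type header value back to the format name."""
    # Drop any parameters and normalise case before comparing.
    media_type = header.split(";", 1)[0].strip().lower()
    for fmt, ctype in CONTENT_TYPES.items():
        if ctype == media_type:
            return fmt
    raise ValueError(f"unexpected Content-Type: {header!r}")


def output_filename(stem, header):
    """Build a file name such as 'speech.mp3' from a response header."""
    return f"{stem}.{format_for_content_type(header)}"


if __name__ == "__main__":
    print(output_filename("speech", "audio/mpeg"))  # speech.mp3
```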
Install Python tools for downloading models and generating tokenizer files:
pip install huggingface_hub transformers

Download model weights before starting the server. At least one of the three model paths must be set.
| Model | Parameters | Type | Use case |
|---|---|---|---|
| `Qwen3-TTS-12Hz-0.6B-CustomVoice` | 0.6B | CustomVoice | Built-in voice presets via `voice` parameter |
| `Qwen3-TTS-12Hz-0.6B-Base` | 0.6B | Base | Voice cloning via `audio_sample` parameter |
| `Qwen3-ASR-0.6B` | 0.6B | ASR | Speech-to-text transcription |
mkdir -p models
# CustomVoice TTS
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
--local-dir ./models/Qwen3-TTS-12Hz-0.6B-CustomVoice
# Base TTS (voice cloning)
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base \
--local-dir ./models/Qwen3-TTS-12Hz-0.6B-Base
# ASR
huggingface-cli download Qwen/Qwen3-ASR-0.6B \
--local-dir ./models/Qwen3-ASR-0.6B

The Rust tokenizers crate requires a tokenizer.json file with the full tokenizer configuration. Generate it from the Python tokenizer for each model you downloaded:
python3 -c "
from transformers import AutoTokenizer
for model in ['Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen3-TTS-12Hz-0.6B-Base', 'Qwen3-ASR-0.6B']:
    path = f'models/{model}'
    tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    tok.backend_tokenizer.save(f'{path}/tokenizer.json')
    print(f'Saved {path}/tokenizer.json')
"

The MLX build requires an Apple Silicon Mac, Xcode (full installation, for the Metal shader compiler), and CMake.
Install dependencies:
brew install cmake pkg-config nasm lame opus

Build:
git clone https://github.com/second-state/qwen3_audio_api.git
cd qwen3_audio_api/rust
cargo build --release --no-default-features --features "mlx build-ffmpeg"

Copy mlx.metallib next to the binary:
cp target/release/build/qwen3_tts-*/out/lib/mlx.metallib target/release/
# Binary at: target/release/qwen3-audio-api

For the libtorch (Linux) build, download and extract libtorch for your platform from libtorch-releases:
# Linux x86_64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
# Linux x86_64 (CUDA 12.6)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
# Linux aarch64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
# Linux aarch64 (CUDA 12.6 / Jetson)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz

# Point the build at the extracted libtorch directory
export LIBTORCH="$(pwd)/libtorch"
export LIBTORCH_BYPASS_VERSION_CHECK=1

# Install build dependencies (Debian/Ubuntu)
sudo apt-get install -y cmake pkg-config nasm libclang-dev libmp3lame-dev libopus-dev

# Clone and build
git clone https://github.com/second-state/qwen3_audio_api.git
cd qwen3_audio_api/rust
cargo build --release
# Binary at: target/release/qwen3-audio-api
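
Before launching a freshly built binary, it can help to confirm that each model directory contains the tokenizer.json generated earlier. The Python sketch below is one way to check; `missing_tokenizers` is a hypothetical helper, and the directory names in the usage example simply reuse the paths from the download step.

```python
from pathlib import Path


def missing_tokenizers(model_dirs):
    """Return the model directories that lack a tokenizer.json file."""
    return [
        str(directory)
        for directory in map(Path, model_dirs)
        if not (directory / "tokenizer.json").is_file()
    ]


if __name__ == "__main__":
    dirs = [
        "models/Qwen3-TTS-12Hz-0.6B-CustomVoice",
        "models/Qwen3-TTS-12Hz-0.6B-Base",
        "models/Qwen3-ASR-0.6B",
    ]
    for path in missing_tokenizers(dirs):
        print(f"tokenizer.json not found in {path}")
```

Only the directories for models you actually downloaded need to pass this check.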