Name	Name	Last commit message	Last commit date
parent directory ..
src	src
Cargo.lock	Cargo.lock
Cargo.toml	Cargo.toml
README.md	README.md
build.rs	build.rs
install.sh	install.sh

Qwen3-TTS/ASR OpenAI-Compatible API (Rust)

A high-performance Rust server that wraps Qwen3-TTS and Qwen3-ASR behind OpenAI-compatible endpoints. Any client that speaks the OpenAI audio API can point at this server and get text-to-speech and speech-to-text from Qwen3 models.

Backends:

libtorch (Linux) — PyTorch C++ runtime, CPU or CUDA GPU
MLX (macOS Apple Silicon) — Apple Metal GPU acceleration

The release binaries are self-contained — ffmpeg is statically linked for audio format conversion (MP3, Opus, AAC, FLAC encoding and decoding of all common audio input formats). No external ffmpeg installation is required.

Quick Start

Run the installer to download the binary, models, and tokenizers for your platform:

curl -sSf https://raw.githubusercontent.com/second-state/qwen3_audio_api/main/rust/install.sh | bash

The installer detects your OS, CPU, and NVIDIA GPU (if present), then sets up everything in ./qwen3_audio_api/. Once complete, start the server:

cd qwen3_audio_api
TTS_CUSTOMVOICE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  TTS_BASE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-Base \
  ASR_MODEL_PATH=./models/Qwen3-ASR-0.6B \
  ./qwen3-audio-api

Text-to-Speech:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-tts", "input": "Hello world!", "voice": "alloy"}' \
  --output speech.mp3

Speech-to-Text:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr

Configuration

Variable	Default	Description
`TTS_CUSTOMVOICE_MODEL_PATH`	--	Path to CustomVoice model directory (enables `voice`/`instructions` parameters)
`TTS_BASE_MODEL_PATH`	--	Path to Base model directory (enables `audio_sample` voice cloning)
`ASR_MODEL_PATH`	--	Path to ASR model directory (enables `/v1/audio/transcriptions`)
`HOST`	`0.0.0.0`	Server bind address
`PORT`	`8000`	Server port
`RUST_LOG`	`info`	Log level (`trace`, `debug`, `info`, `warn`, `error`)

At least one of TTS_CUSTOMVOICE_MODEL_PATH, TTS_BASE_MODEL_PATH, or ASR_MODEL_PATH must be set.

Example — all models loaded:

TTS_CUSTOMVOICE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  TTS_BASE_MODEL_PATH=./models/Qwen3-TTS-12Hz-0.6B-Base \
  ASR_MODEL_PATH=./models/Qwen3-ASR-0.6B \
  ./qwen3-audio-api

API reference

`POST /v1/audio/speech`

Generate speech from text. Compatible with the OpenAI audio speech API.

Request body (JSON):

Field	Type	Required	Default	Description	Requires model
`model`	string	yes	--	Model identifier (accepted for compatibility; the loaded model is always used)	--
`input`	string	yes	--	Text to synthesize (max 4096 characters)	--
`voice`	string	no	`alloy`	Voice name (see table below)	CustomVoice
`response_format`	string	no	`mp3`	`mp3`, `opus`, `aac`, `flac`, `wav`, or `pcm`	--
`speed`	number	no	`1.0`	Playback speed, `0.25` to `4.0`	--
`language`	string	no	`Auto`	Language of the input text (`Auto`, `English`, `Chinese`, `Japanese`, `Korean`, `French`, `German`, `Spanish`, `Italian`, `Portuguese`, `Russian`)	--
`instructions`	string	no	--	Style/emotion instruction passed to the model	CustomVoice
`audio_sample`	string/file	no	--	Reference audio for voice cloning (file upload via multipart, or base64 string via JSON)	Base
`audio_sample_text`	string	no	--	Transcript of the reference audio; enables in-context learning mode for higher quality cloning	Base

Note: The endpoint accepts both JSON and multipart/form-data. Use multipart (curl -F) to upload audio_sample as a binary file — this avoids base64 encoding. JSON requests can pass audio_sample as a base64-encoded string.

When audio_sample is provided the request uses the Base model for voice cloning and voice/instructions are ignored. When audio_sample is omitted the request uses the CustomVoice model and requires a valid voice. If the required model is not loaded the server returns HTTP 400.

Response: The raw audio bytes with the appropriate Content-Type header.

Example — predefined voice (CustomVoice model):

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-tts",
    "input": "Hello, welcome to the Qwen text-to-speech API.",
    "voice": "alloy",
    "language": "English",
    "response_format": "wav"
  }' \
  --output speech.wav

Example — voice cloning (Base model):

curl -X POST http://localhost:8000/v1/audio/speech \
  -F model=qwen3-tts \
  -F "input=This sentence will be spoken in the cloned voice." \
  -F audio_sample=@reference.wav \
  -F "audio_sample_text=Transcript of the reference audio." \
  -F language=English \
  -F response_format=wav \
  --output cloned.wav

`POST /v1/audio/transcriptions`

Transcribe audio to text. Compatible with the OpenAI audio transcriptions API.

Request body (multipart/form-data):

Field	Type	Required	Default	Description
`file`	file	yes	--	The audio file to transcribe (mp3, mp4, mpeg, mpga, m4a, wav, webm)
`model`	string	no	`qwen3-asr`	Model identifier (accepted for compatibility; the loaded model is always used)
`language`	string	no	--	Language of the audio (auto-detected if not specified). Supports 30+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, etc.
`prompt`	string	no	--	Optional context hint (not currently used)
`response_format`	string	no	`json`	`json` or `text`
`temperature`	number	no	`0.0`	Sampling temperature (not currently used)

Response (JSON):

{
  "text": "The transcribed text content."
}

Example:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr

Example with language hint:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F language=English \
  -F response_format=text

`GET /v1/models`

Returns the list of available models.

curl http://localhost:8000/v1/models

`GET /health`

Returns {"status": "ok"} when the server is ready.

curl http://localhost:8000/health

Voices

The voice field accepts OpenAI voice names (mapped to Qwen3-TTS speakers) or Qwen3-TTS speaker names directly.

OpenAI voice mapping:

OpenAI voice	Qwen3-TTS speaker
`alloy`	Vivian
`ash`	Serena
`ballad`	Uncle_Fu
`coral`	Dylan
`echo`	Eric
`fable`	Ryan
`onyx`	Aiden
`nova`	Ono_Anna
`sage`	Sohee
`shimmer`	Vivian
`verse`	Ryan
`marin`	Serena
`cedar`	Aiden

Qwen3-TTS speakers can also be used directly as the voice value: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee.

Output formats

Format	Content-Type
`wav`	`audio/wav`
`pcm`	`audio/pcm`
`mp3`	`audio/mpeg`
`opus`	`audio/opus`
`aac`	`audio/aac`
`flac`	`audio/flac`

All formats are handled natively by the statically-linked ffmpeg library. No external tools are needed.

Build from Source

Prerequisites

Install Python tools for downloading models and generating tokenizer files:

pip install huggingface_hub transformers

Download models

Download model weights before starting the server. At least one of the three model paths must be set.

Model	Parameters	Type	Use case
`Qwen3-TTS-12Hz-0.6B-CustomVoice`	0.6B	CustomVoice	Built-in voice presets via `voice` parameter
`Qwen3-TTS-12Hz-0.6B-Base`	0.6B	Base	Voice cloning via `audio_sample` parameter
`Qwen3-ASR-0.6B`	0.6B	ASR	Speech-to-text transcription

mkdir -p models

# CustomVoice TTS
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --local-dir ./models/Qwen3-TTS-12Hz-0.6B-CustomVoice

# Base TTS (voice cloning)
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --local-dir ./models/Qwen3-TTS-12Hz-0.6B-Base

# ASR
huggingface-cli download Qwen/Qwen3-ASR-0.6B \
  --local-dir ./models/Qwen3-ASR-0.6B

Generate tokenizer.json

The Rust tokenizers crate requires a tokenizer.json file with the full tokenizer configuration. Generate it from the Python tokenizer for each model you downloaded:

python3 -c "
from transformers import AutoTokenizer
for model in ['Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen3-TTS-12Hz-0.6B-Base', 'Qwen3-ASR-0.6B']:
    path = f'models/{model}'
    tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    tok.backend_tokenizer.save(f'{path}/tokenizer.json')
    print(f'Saved {path}/tokenizer.json')
"

Build for macOS (MLX)

Requires Apple Silicon Mac, Xcode (full installation for Metal shader compiler), and CMake.

Install dependencies:

brew install cmake pkg-config nasm lame opus

Build:

git clone https://github.com/second-state/qwen3_audio_api.git
cd qwen3_audio_api/rust
cargo build --release --no-default-features --features "mlx build-ffmpeg"

Copy mlx.metallib next to the binary:

cp target/release/build/qwen3_tts-*/out/lib/mlx.metallib target/release/
# Binary at: target/release/qwen3-audio-api

Build for Linux (libtorch)

Download libtorch

Download and extract libtorch for your platform from libtorch-releases:

# Linux x86_64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-2.7.1.tar.gz

# Linux x86_64 (CUDA 12.6)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz

# Linux aarch64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-2.7.1.tar.gz

# Linux aarch64 (CUDA 12.6 / Jetson)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz

Set environment variables

export LIBTORCH=$(pwd)/libtorch
export LIBTORCH_BYPASS_VERSION_CHECK=1

Install dependencies

sudo apt-get install -y cmake pkg-config nasm libclang-dev libmp3lame-dev libopus-dev

Build

git clone https://github.com/second-state/qwen3_audio_api.git
cd qwen3_audio_api/rust
cargo build --release
# Binary at: target/release/qwen3-audio-api

License

Apache-2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

Qwen3-TTS/ASR OpenAI-Compatible API (Rust)

Quick Start

Configuration

API reference

`POST /v1/audio/speech`

`POST /v1/audio/transcriptions`

`GET /v1/models`

`GET /health`

Voices

Output formats

Build from Source

Prerequisites

Download models

Generate tokenizer.json

Build for macOS (MLX)

Build for Linux (libtorch)

Download libtorch

Set environment variables

Install dependencies

Build

License

Uh oh!

FilesExpand file tree

rust

Directory actions

More options

Directory actions

More options

Latest commit

History

rust

Folders and files

parent directory

README.md

Qwen3-TTS/ASR OpenAI-Compatible API (Rust)

Quick Start

Configuration

API reference

POST /v1/audio/speech

POST /v1/audio/transcriptions

GET /v1/models

GET /health

Voices

Output formats

Build from Source

Prerequisites

Download models

Generate tokenizer.json

Build for macOS (MLX)

Build for Linux (libtorch)

Download libtorch

Set environment variables

Install dependencies

Build

License

`POST /v1/audio/speech`

`POST /v1/audio/transcriptions`

`GET /v1/models`

`GET /health`