On-device speech AI runtime for ASR, TTS, VAD, and voice cloning. Python-simple, C++-native, GGUF-powered.
RapidSpeech.cpp runs speech recognition, text-to-speech, VAD, speaker embedding, and voice cloning on-device. It gives Python developers a simple API while keeping the runtime pure C/C++, backed by ggml and a unified GGUF model format. No cloud API, no speech server, no heavyweight Python model stack.
```bash
pip install rapidspeech
```

GPU wheels:

```bash
pip install rapidspeech-metal # macOS / Apple Silicon
pip install rapidspeech-cuda # Linux / NVIDIA
```

```bash
python python-api-examples/tts/tts-offline.py \
--model /path/to/omnivoice-f16.gguf \
--text "Hello, welcome to RapidSpeech." \
--output output.wav
```

```bash
python python-api-examples/asr/asr-offline.py \
--model /path/to/funasr-nano-fp16.gguf \
--audio /path/to/audio.wav
```

```python
import rapidspeech
tts = rapidspeech.tts_synthesizer("/path/to/omnivoice-f16.gguf")
tts.set_params(instruct="male, young adult", language="English", seed=42)
pcm = tts.synthesize("Hello from a native speech engine.")
sample_rate = tts.get_sample_rate()
```

```python
import rapidspeech
asr = rapidspeech.asr_offline("/path/to/funasr-nano-fp16.gguf")
sample_rate = asr.get_model_meta()["audio_sample_rate"]
pcm = ... # 1-D float32 mono PCM at sample_rate
asr.push_audio(pcm)
asr.process()
print(asr.get_text())
```

- Built for the edge: run speech models locally on laptops, servers, browsers, and device-class hardware.
- Python-simple, C++-native: write Python, run a C++/ggml engine underneath.
- One model format: ASR, TTS, VAD, and speaker models use GGUF.
- NumPy in, NumPy out: ASR takes float32 PCM; TTS returns float32 PCM.
- Edge-first backends: CPU, Metal, CUDA, Vulkan, CANN, OpenCL, and WebGPU.
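Since ASR consumes float32 PCM and TTS returns it, a small WAV bridge is often useful. The following is a minimal sketch using only NumPy and the standard-library `wave` module. The helper names `load_wav_as_float32` and `save_float32_as_wav` are hypothetical and not part of RapidSpeech, and scaling samples to the range [-1, 1] is an assumption about the expected amplitude convention. It handles 16-bit PCM WAV files only and does not resample.

```python
import wave

import numpy as np


def load_wav_as_float32(path):
    """Read a 16-bit PCM WAV file and return (mono float32 samples, sample rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    # Assumption: float PCM is expected in [-1, 1], so scale int16 by 32768.
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Downmix interleaved channels to mono by averaging.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples.astype(np.float32), rate


def save_float32_as_wav(path, pcm, rate):
    """Write 1-D float32 PCM in [-1, 1] as a mono 16-bit WAV file."""
    clipped = np.clip(np.asarray(pcm, dtype=np.float32), -1.0, 1.0)
    ints = (clipped * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(ints.tobytes())
```

If the file's sample rate differs from the rate the model reports, the audio would need to be resampled before being passed to the engine; that step is left out here.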
Test environment: Apple M1 Pro, funasr-nano-fp16.gguf, 15s audio.
| Configuration | RTF | Wall Time | Notes |
|---|---|---|---|
| CPU -t 4 | 0.465 | 12.4s | CPU-only inference |
| GPU -t 4 | 0.170 | 5.2s | Metal acceleration |
| GPU -t 4 Q4_K | 0.756 | - | Quantized model: GPU dequant overhead |
| CPU -t 4 Q4_K | 0.530 | - | Quantized model CPU inference, 596 MB (3.3x compression) |
RTF is processing time divided by audio duration. Lower is faster; RTF < 1 is faster than real time.
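The RTF definition above can be sketched in a few lines of Python. The function names `real_time_factor` and `timed_rtf` are illustrative only and are not part of the RapidSpeech API; `timed_rtf` simply times any callable you pass in.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration. Lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(fn, audio_seconds):
    """Run fn(), time it, and return (result, RTF) for the given audio length."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, real_time_factor(elapsed, audio_seconds)


# Example: 3 seconds of processing for 15 seconds of audio.
print(real_time_factor(3.0, 15.0))  # 0.2
```

An RTF of 0.2 means the audio was processed five times faster than real time.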
| Task | Models | Status |
|---|---|---|
| ASR | SenseVoice-small, FunASR-nano, X-ASR (Zipformer2, streaming) | Stable |
| VAD | Silero VAD, FireRedVAD | Stable |
| TTS | OmniVoice, OpenVoice2, Kokoro, IndexTTS-2 | Active |
| Speaker | CAMPPlus | Stable |
X-ASR — Chinese/English Zipformer2 transducer (icefall/k2). One GGUF serves
both offline full-context decoding and true chunked streaming (per-layer
left-context caches, sub-second partials, --chunk-len 16/32/48/96/192 fbank
frames). It outputs punctuation and casing, uses greedy transducer decoding, runs on
CPU / Metal / CUDA / Vulkan, and quantizes to q4_k_m (99.5 MB).
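To get a feel for what the `--chunk-len` values mean in time, the sketch below converts a chunk length in fbank frames to milliseconds of audio. The 10 ms frame shift is an assumption (a common fbank default) and is not stated in this README; check docs/x-asr.md for the actual value. The function name `chunk_duration_ms` is hypothetical.

```python
# Assumption: 10 ms fbank frame shift. This README does not confirm the value.
FRAME_SHIFT_MS = 10

# Chunk lengths accepted by --chunk-len, as listed above.
CHUNK_LENS = [16, 32, 48, 96, 192]


def chunk_duration_ms(chunk_len, frame_shift_ms=FRAME_SHIFT_MS):
    """Audio covered by one chunk, in milliseconds."""
    return chunk_len * frame_shift_ms


for n in CHUNK_LENS:
    print(f"--chunk-len {n}: about {chunk_duration_ms(n)} ms of audio per chunk")
```

Under that assumption, smaller chunks give earlier partial results, while larger chunks give the model more context per step.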
IndexTTS-2 — expressive zero-shot voice-cloning TTS (GPT + S2Mel CFM + BigVGAN-v2 vocoder) with 4-mode emotion control (reference audio / vector / text / Qwen). See docs/index2tts.md.
CosyVoice3, Qwen3-ASR, Qwen3-TTS.
- Python examples
- Technical Notes: architecture, design tradeoffs, backends, model conversion, and binding surfaces.
- Model guides:
  - ASR: X-ASR (Zipformer2, streaming) · SenseVoice · FunASR-Nano
  - TTS: IndexTTS-2 (voice clone + emotion) · CosyVoice3 · OmniVoice · OpenVoice2 · Kokoro
  - VAD: Silero / FireRedVAD
  - Speaker: CAMPPlus
- Browser / WASM examples
- Node.js example
Models are available on:
- 🤗 Hugging Face: https://huggingface.co/RapidAI/RapidSpeech
- ModelScope: https://www.modelscope.cn/models/RapidAI/RapidSpeech
```bash
git clone https://github.com/RapidAI/RapidSpeech.cpp
cd RapidSpeech.cpp
git submodule sync && git submodule update --init --recursive
cmake -B build
cmake --build build --config Release
```

Build artifacts are located in the `build/` directory:

- `rs-asr-offline`: Offline ASR command-line tool
- `rs-asr-vad-online`: VAD-segmented quasi-streaming ASR command-line tool
- `rs-asr-online`: True chunked streaming ASR (X-ASR; mic or WAV, live partials)
- `rs-tts-offline`: Offline TTS command-line tool
- `rs-quantize`: Model quantization tool
Offline ASR

```bash
./build/rs-asr-offline \
-m /path/to/funasr-nano-fp16.gguf \
-w /path/to/audio.wav \
-t 4 \
--gpu true
```

VAD-segmented ASR

```bash
./build/rs-asr-offline \
-m /path/to/funasr-nano-fp16.gguf \
-v /path/to/silero_vad_v6.gguf \
-w /path/to/audio.wav \
-t 4 \
--vad-threshold 0.5 \
--silence-ms 600
```

Streaming ASR (X-ASR)

```bash
# WAV, real-time paced with live partials (or --fast to run as fast as possible)
./build/rs-asr-online -m /path/to/xasr-q4_k_m.gguf -w /path/to/audio.wav --chunk-len 32
# Microphone
./build/rs-asr-online -m /path/to/xasr-q4_k_m.gguf --mic --chunk-len 16
```

See docs/x-asr.md for the model, chunk-size / latency tradeoffs, and GGUF conversion.
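To illustrate how a probability threshold and a silence duration such as `--vad-threshold` and `--silence-ms` typically interact, here is a generic endpointing sketch. It is not RapidSpeech's implementation: the function `segment_speech`, the per-frame probability input, and the `frame_ms` frame duration are all assumptions made for illustration.

```python
def segment_speech(probs, frame_ms=32, threshold=0.5, silence_ms=600):
    """Group per-frame speech probabilities into (start_ms, end_ms) segments.

    A segment opens on the first frame at or above `threshold` and closes
    once at least `silence_ms` of consecutive below-threshold frames follow.
    """
    segments = []
    start = None          # index of the first frame in the open segment
    last_speech = None    # index of the most recent speech frame
    max_gap = silence_ms // frame_ms  # silent frames needed to close a segment

    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= max_gap:
            segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None

    # Flush a segment that is still open when the audio ends.
    if start is not None:
        segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return segments


# 100 ms frames: 300 ms speech, 600 ms silence, 200 ms speech.
probs = [0.9] * 3 + [0.1] * 6 + [0.8] * 2
print(segment_speech(probs, frame_ms=100))  # [(0, 300), (900, 1100)]
```

In this sketch, raising the threshold makes segmentation stricter about what counts as speech, and raising the silence duration merges utterances separated by short pauses.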
Text to speech

```bash
./build/rs-tts-offline \
-m /path/to/omnivoice-f16.gguf \
-t "Hello, welcome to RapidSpeech!" \
--instruct "male, young adult, moderate pitch" \
--lang English \
--n-steps 32 \
-o output.wav
```

Quantization

```bash
./build/rs-quantize /path/to/input-f16.gguf /path/to/output-q4_k.gguf q4_k
```

See Python examples for offline ASR, streaming ASR, offline TTS, streaming TTS, VAD, and voice cloning.
If you are interested in the following areas, we welcome your PRs or participation in discussions:
- Adapting more models to the framework.
- Refining and optimizing the project architecture.
- Improving inference performance.
