
VibeVoice-ASR

Hugging Face Live Playground

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords and over 50 languages.

Model: VibeVoice-ASR-7B
Demo: VibeVoice-ASR-Demo
Report: VibeVoice-ASR-Report
Finetuning: finetune-guide
vLLM: vLLM-asr

🔥 Key Features

  • 🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to 60 minutes of continuous audio input within a 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.

  • 👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

  • 📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.

  • 🌍 Multilingual & Code-Switching Support: It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. See the Language distribution.
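
The rich Who/When/What output described above can be modeled as a list of timed, speaker-attributed segments. A minimal sketch of that idea in Python (the field names and layout here are illustrative, not the model's actual output schema):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One speaker-attributed span of a transcript (illustrative schema)."""
    speaker: str  # Who
    start: float  # When: segment start, in seconds
    end: float    # When: segment end, in seconds
    text: str     # What

def format_transcript(segments):
    """Render segments as '[start-end] speaker: text' lines."""
    return "\n".join(
        f"[{s.start:.1f}-{s.end:.1f}] {s.speaker}: {s.text}" for s in segments
    )

demo = [
    Segment("Speaker 1", 0.0, 3.2, "Welcome to the meeting."),
    Segment("Speaker 2", 3.4, 6.0, "Thanks, glad to be here."),
]
print(format_transcript(demo))
```

A structure like this makes downstream tasks (per-speaker word counts, subtitle export, DER-style scoring) straightforward to build on top of the transcription.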

🏗️ Model Architecture

VibeVoice ASR Architecture

Demo

(Demo video: small.mp4)

Evaluation

(Evaluation charts: DER, cpWER, tcpWER)

Installation

We recommend using the NVIDIA Deep Learning Container to manage the CUDA environment.

  1. Launch Docker
# NVIDIA PyTorch Container 24.07 ~ 25.12 verified. 
# Previous versions are also compatible.
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it  nvcr.io/nvidia/pytorch:25.12-py3

## If flash attention is not included in your docker environment, you need to install it manually
## Refer to https://github.com/Dao-AILab/flash-attention for installation instructions
# pip install flash-attn --no-build-isolation
  2. Install from GitHub
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

pip install -e .
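
Since flash attention may be absent from some container images, it can help to verify the environment before launching anything. A generic sketch using only the standard library (the package names checked are assumptions about what the demos need):

```python
import importlib.util

def check_packages(names=("torch", "flash_attn")):
    """Map each package name to whether it is importable in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

missing = [name for name, ok in check_packages().items() if not ok]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Environment looks ready.")
```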

Usage

Usage 1: Launch Gradio demo

apt update && apt install ffmpeg -y # for demo

python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --share

Usage 2: Run inference directly on audio files

python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files [add an audio path here] 
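
To transcribe several files, the script above can be driven from Python. A sketch that only builds the command lines, one per file (the script path and flags mirror the command shown above; whether the script accepts several paths in one `--audio_files` argument is not assumed here):

```python
from pathlib import Path

MODEL = "microsoft/VibeVoice-ASR"
SCRIPT = "demo/vibevoice_asr_inference_from_file.py"

def build_commands(audio_paths):
    """Build one inference command per audio file, mirroring the CLI above."""
    return [
        ["python", SCRIPT, "--model_path", MODEL, "--audio_files", str(Path(p))]
        for p in audio_paths
    ]

for cmd in build_commands(["meeting.wav", "interview.mp3"]):
    print(" ".join(cmd))
```

Each command can then be run with `subprocess.run(cmd, check=True)` once the model weights are available locally.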

Finetuning

LoRA (Low-Rank Adaptation) fine-tuning is supported. See Finetuning for a detailed guide.
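
LoRA freezes the base weight W and learns a low-rank update ΔW = (α/r)·B·A, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out. A small numerical sketch of the idea (not the repo's actual training code; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x): base path plus low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B zero-initialized, the adapted layer starts out identical to the base.
assert np.allclose(lora_forward(x), W @ x)

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params} vs full {full_params}")
```

The zero-initialized B matrix is the standard LoRA trick: training starts from the pretrained model's behavior and only gradually moves away from it.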

Results

Multilingual

| Dataset | Language | DER | cpWER | tcpWER | WER |
| --- | --- | --- | --- | --- | --- |
| MLC-Challenge | English | 4.28 | 11.48 | 13.02 | 7.99 |
| MLC-Challenge | French | 3.80 | 18.80 | 19.64 | 15.21 |
| MLC-Challenge | German | 1.04 | 17.10 | 17.26 | 16.30 |
| MLC-Challenge | Italian | 2.08 | 15.76 | 15.91 | 13.91 |
| MLC-Challenge | Japanese | 0.82 | 15.33 | 15.41 | 14.69 |
| MLC-Challenge | Korean | 4.52 | 15.35 | 16.07 | 9.65 |
| MLC-Challenge | Portuguese | 7.98 | 29.91 | 31.65 | 21.54 |
| MLC-Challenge | Russian | 0.90 | 12.94 | 12.98 | 12.40 |
| MLC-Challenge | Spanish | 2.67 | 10.51 | 11.71 | 8.04 |
| MLC-Challenge | Thai | 4.09 | 14.91 | 15.57 | 13.61 |
| MLC-Challenge | Vietnamese | 0.16 | 14.57 | 14.57 | 14.43 |

| Dataset | Language | DER | cpWER | tcpWER | WER |
| --- | --- | --- | --- | --- | --- |
| AISHELL-4 | Chinese | 6.77 | 24.99 | 25.35 | 21.40 |
| AMI-IHM | English | 11.92 | 20.41 | 20.82 | 18.81 |
| AMI-SDM | English | 13.43 | 28.82 | 29.80 | 24.65 |
| AliMeeting | Chinese | 10.92 | 29.33 | 29.51 | 27.40 |
| MLC-Challenge | Average | 3.42 | 14.81 | 15.66 | 12.07 |

Language Distribution

Language Distribution

📄 License

This project is licensed under the MIT License.