dev Diarization
🌐 Language: English | Français
User-facing intro: Diarization.
NVIDIA's Sortformer (ONNX) is a single-transformer, end-to-end diarization model. It streams natively in 10-second chunks with overlap and is hard-capped at 4 speakers (a model property, not a config option). It is language-agnostic. The Rust implementation lives in src/sortformer.rs and runs on ONNX Runtime with the CUDA backend; if CUDA is unavailable it silently falls back to CPU (diarize_only.rs:122-128), which is 5–10× slower.
| Audio duration | VRAM peak |
|---|---|
| 1 min … 10 hours | ~1.2 GB (constant) |
No effective duration limit; memory is bounded by the chunk size.
The single-shot transcribe-diarize runs Parakeet against the full mel spectrogram in one pass, so Parakeet alone hits a hard cap of about 5:20 (320 s) on every GPU. The cap comes from a fixed-shape attention mask in the Parakeet-TDT v3 ONNX export and is independent of VRAM (see project-parakeet-tdt-attention-mask-bug tracking). The chunked pipeline transcribe-diarize-batch keeps each chunk under 320 s and lifts the ceiling. Since v1.3.4, the dictee-transcribe UI auto-routes any file longer than one chunk through it, regardless of GPU build or diarize toggle.
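The routing rule can be pictured as a single duration check. This is a minimal sketch, not the UI's actual code: the function name and the constant are hypothetical, and only the 320 s figure comes from the text above.

```python
# Hypothetical helper: the real dictee-transcribe routing code is not shown here.
SINGLE_CHUNK_LIMIT_S = 320.0  # ~5:20, the single-shot Parakeet cap quoted above


def needs_chunked_pipeline(duration_s: float) -> bool:
    """Return True when a file is longer than one chunk and must be chunked."""
    return duration_s > SINGLE_CHUNK_LIMIT_S


print(needs_chunked_pipeline(29 * 60))  # 29-minute file
print(needs_chunked_pipeline(60))       # 1-minute file
```

The check deliberately ignores GPU build and the diarize toggle, matching the behaviour described for v1.3.4.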
Pipeline:
- Run Sortformer once on full audio (streaming, no cap)
- Chunk audio into overlapping 10-min segments via `ffmpeg`
- Transcribe each chunk with Parakeet-TDT (`transcribe-diarize-batch --no-diarize`)
- Merge speaker labels across chunks via the global Sortformer timestamps (argmax overlap)
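The chunking step can be sketched as computing overlapping time windows and handing each one to `ffmpeg`. The helper names are hypothetical, and the defaults (600 s chunks, 10 s overlap) are taken from the figures on this page. The `-ss`, `-t`, and `-i` options are standard ffmpeg flags, but the exact invocation used by the pipeline is an assumption.

```python
def chunk_windows(total_s, chunk_s=600.0, overlap_s=10.0):
    """Return (start, end) windows in seconds; neighbours share overlap_s."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    if total_s <= 0:
        return []
    step = chunk_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


def ffmpeg_args(src, dst, start, end):
    """Build an ffmpeg argument list that cuts [start, end) out of src (assumed form)."""
    return ["ffmpeg", "-ss", f"{start:.3f}", "-t", f"{end - start:.3f}", "-i", src, dst]


for start, end in chunk_windows(29 * 60.0):
    print(ffmpeg_args("meeting.wav", f"chunk_{int(start)}.wav", start, end))
```

A 29-minute file yields three windows, each starting 10 s before the previous one ends, so a word cut at one boundary appears whole in the neighbouring chunk.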
Field validation:
- 29-min keynote, 8 GB CUDA → 04:56
- 54-min keynote, 3 speakers, RTX 4070 (8 GB) → 122 s, 94 % speaker accuracy vs ground truth
Edge cases:
- 4 GB VRAM, or 8 GB with Ollama already loaded → Parakeet OOM (no minimum-VRAM guard yet)
- Token landing in a silence gap between two global Sortformer segments → tagged `UNKNOWN: <text>` (predictable, non-critical)
| Binary | Purpose | ASR backend | Streaming | Max duration |
|---|---|---|---|---|
| `transcribe-diarize` | File mode, single-shot | Parakeet-TDT | ❌ | ~5:20 min (ONNX attention-mask bug, any GPU) |
| `transcribe-stream-diarize` | File mode, streaming | Nemotron (EN) | ✅ | unlimited |
| `transcribe-diarize-batch` | File mode, chunked | Parakeet-TDT | ⚠ chunked | unlimited |

    transcribe-diarize meeting.wav \
      --sortformer-model /usr/share/dictee/sortformer/ \
      --parakeet-model /usr/share/dictee/tdt/ \
      --lang en

Output goes to stdout.
Default text format with speaker prefix lines:

    Speaker 1: Welcome to the second quarter review.
    Speaker 2: Thanks. Should we dive into the numbers?
    Speaker 3: Quick question — is the call being recorded?

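For downstream scripts, the default text output can be split back into (label, text) pairs. This is a minimal sketch that assumes every line has the form `Speaker N: text` or `UNKNOWN: text`, as in the examples on this page. `parse_turns` is a hypothetical helper, not part of the CLI.

```python
import re

# Matches "Speaker 3: some text" or "UNKNOWN: some text".
TURN_RE = re.compile(r"^(Speaker \d+|UNKNOWN): (.*)$")


def parse_turns(stdout_text):
    """Return a list of (label, text) tuples, skipping lines that do not match."""
    turns = []
    for line in stdout_text.splitlines():
        match = TURN_RE.match(line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


sample = (
    "Speaker 1: Welcome to the second quarter review.\n"
    "Speaker 2: Thanks. Should we dive into the numbers?\n"
)
for label, text in parse_turns(sample):
    print(label, "->", text)
```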
CLI flags:
- `--format=rttm`: frame-level speaker IDs (standard diarization format)
- `--format=json`: timestamps + speaker IDs + confidence
- `--format=srt`: subtitle format with timestamps
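RTTM is a plain-text format with one space-separated record per speaker turn. The sketch below reads such records into (speaker, start, end) tuples. It assumes the conventional ten-field `SPEAKER` layout; the sample line and the `speaker_0` label are invented for illustration, and the exact fields written by `--format=rttm` are not confirmed here.

```python
def parse_rttm(text):
    """Parse RTTM SPEAKER records into (speaker_name, start_s, end_s) tuples."""
    turns = []
    for line in text.splitlines():
        parts = line.split()
        # Conventional layout: SPEAKER file chan onset duration <NA> <NA> name <NA> <NA>
        if len(parts) < 8 or parts[0] != "SPEAKER":
            continue
        onset = float(parts[3])
        duration = float(parts[4])
        turns.append((parts[7], onset, onset + duration))
    return turns


# Invented sample record, not real tool output.
sample_rttm = "SPEAKER meeting 1 0.00 4.25 <NA> <NA> speaker_0 <NA> <NA>\n"
print(parse_rttm(sample_rttm))
```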
Used by transcribe-diarize-batch:
- Sortformer once on full audio → global speaker timestamps
- Chunk audio (10 min, 10 s overlap)
- Parakeet-TDT per chunk → token-level timestamps
- For each transcribed word, look up its global speaker via Sortformer output (argmax overlap)
- Emit with consistent `Speaker N` labels
Source: src/bin/transcribe_diarize_batch.rs.
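The argmax-overlap lookup can be illustrated in a few lines. This is a sketch of the idea only, not a port of the Rust source: the function names, the tuple layouts, and the `chunk_offset` parameter are assumptions. Each word goes to the speaker whose global segments overlap it the most, and a word that overlaps no segment is labelled `UNKNOWN`.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(word_start, word_end, segments):
    """segments: (start, end, speaker_id) on the global timeline.
    Return the speaker with the largest total overlap, or None if there is none."""
    totals = defaultdict(float)
    for seg_start, seg_end, speaker in segments:
        amount = overlap(word_start, word_end, seg_start, seg_end)
        if amount > 0:
            totals[speaker] += amount
    if not totals:
        return None
    return max(totals, key=totals.get)


def label_words(words, segments, chunk_offset=0.0):
    """words: (start, end, text) relative to the chunk; chunk_offset shifts them
    onto the global timeline before the lookup."""
    labelled = []
    for start, end, text in words:
        speaker = assign_speaker(start + chunk_offset, end + chunk_offset, segments)
        label = f"Speaker {speaker}" if speaker is not None else "UNKNOWN"
        labelled.append((label, text))
    return labelled


segments = [(0.0, 2.0, 1), (2.5, 5.0, 2)]
words = [(0.2, 0.6, "Welcome"), (1.9, 2.8, "Thanks"), (2.1, 2.4, "uh")]
print(label_words(words, segments))
```

In this example "uh" falls in the silence gap between the two segments, which reproduces the `UNKNOWN` edge case described earlier. On an exact tie, this sketch keeps the speaker encountered first.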
Meeting mode skips the 12-step pipeline entirely:
| Step | Why it would break diarization |
|---|---|
| 5/7. Regex rules | Voice commands inserted in wrong speaker's turn |
| 6. LLM correction | Reflows text, merges speaker turns into a monologue |
| 8/9. Keepcaps | Acronym handling is per-phrase, not per-speaker |
| 10. Capitalization | Sentence-start vs speaker-start conflict |
| 11. Translation | Multi-speaker translation is rarely coherent |
Only formatting kept on diarize output: whitespace cleanup, speaker prefix insertion, trailing \n.
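The three retained formatting steps are small enough to show together. This is a minimal sketch under assumptions: `format_turn` is a hypothetical name, and appending the newline per turn (rather than once per output) is a guess.

```python
import re


def format_turn(speaker, text):
    """Collapse whitespace runs, prepend the speaker prefix, end with a newline."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return f"Speaker {speaker}: {cleaned}\n"


print(format_turn(1, "  Welcome   to the\tsecond quarter review. "), end="")
```

Nothing here touches casing, punctuation, or word order, so the speaker attribution cannot be disturbed.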
For polished output, use LLM Diarization analysis — built specifically for diarized input.
flowchart TB
A["🎙 Audio (N speakers)"] --> B["Sortformer<br/>(streaming, 10s chunks)"]
B --> C1["Speaker 1 timestamps"]
B --> C2["Speaker 2 timestamps"]
B --> C3["Speaker 3 timestamps"]
B --> C4["Speaker 4 timestamps"]
C1 --> D["Parakeet-TDT<br/>(transcription)"]
C2 --> D
C3 --> D
C4 --> D
D --> E["Merge by timestamps<br/>(argmax overlap)"]
E --> F["Speaker 1: text<br/>Speaker 2: text<br/>Speaker 3: text"]
- Parakeet-TDT-Deep-Dive#vram-usage--duration-limits — VRAM context
- Post-Processing-Overview#diarization-bypass — full bypass rationale
- CLI-Reference#transcribe-diarize — exhaustive flag list
- Troubleshooting — diarize-specific issues