
dev Diarization

rcspam edited this page May 12, 2026 · 2 revisions


Diarization — developer reference

User-facing intro: Diarization.

Sortformer model

Diarization uses NVIDIA's Sortformer (ONNX), a single-transformer, end-to-end diarization model. It streams natively in overlapping 10-second chunks and is language-agnostic. The 4-speaker cap is a property of the model, not a configuration option. The Rust implementation lives in src/sortformer.rs and runs on ONNX Runtime with the CUDA backend; if CUDA is unavailable it silently falls back to CPU (diarize_only.rs:122-128), which is 5–10× slower.

VRAM & duration limits

Streaming mode (transcribe-stream-diarize, EN-only via Nemotron)

| Audio duration | VRAM peak |
|---|---|
| 1 min … 10 hours | ~1.2 GB (constant) |

No effective duration limit; memory is bounded by the chunk size.

Batch chunked mode (transcribe-diarize-batch, default for the UI)

The single-shot transcribe-diarize runs Parakeet against the full mel spectrogram in one pass. Parakeet alone therefore hits a hard cap of about 5:20 min (320 s) on every GPU, caused by a fixed-shape attention mask in the Parakeet-TDT v3 ONNX export; the cap is independent of VRAM (see project-parakeet-tdt-attention-mask-bug tracking). The chunked pipeline transcribe-diarize-batch keeps each chunk under 320 s, which lifts the ceiling. Since v1.3.4, the dictee-transcribe UI auto-routes any file longer than one chunk through it, regardless of GPU build or diarize toggle.
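The routing rule described above can be sketched as follows. This is an illustration only: `Pipeline`, `choose_pipeline` and `MAX_SINGLE_PASS_SECS` are hypothetical names, not identifiers from the dictee source, and sending short files to the single-shot binary is an assumption (the page only states that longer files go through the batch pipeline).

```rust
/// Chunk ceiling quoted in this page (about 5:20 min); hypothetical constant name.
const MAX_SINGLE_PASS_SECS: f64 = 320.0;

/// Which binary a file would be sent to (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Pipeline {
    /// transcribe-diarize: one Parakeet pass over the whole file.
    SingleShot,
    /// transcribe-diarize-batch: chunked Parakeet passes.
    Batch,
}

/// Route by audio duration: anything longer than one chunk goes to the batch pipeline.
/// Assumption: files that fit in one chunk use the single-shot binary.
fn choose_pipeline(duration_secs: f64) -> Pipeline {
    if duration_secs > MAX_SINGLE_PASS_SECS {
        Pipeline::Batch
    } else {
        Pipeline::SingleShot
    }
}
```

Under these assumptions, `choose_pipeline(29.0 * 60.0)` selects `Pipeline::Batch`, while a two-minute clip stays on `Pipeline::SingleShot`.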

Pipeline:

  1. Run Sortformer once on full audio (streaming, no cap)
  2. Chunk audio into overlapping 10-min segments via ffmpeg
  3. Transcribe each chunk with Parakeet-TDT (transcribe-diarize-batch --no-diarize)
  4. Merge speaker labels across chunks via the global Sortformer timestamps (argmax overlap)
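Step 2 amounts to computing overlapping time windows before handing each one to ffmpeg. A minimal sketch of that planning step, assuming half-open windows in seconds; `Window` and `plan_chunks` are hypothetical names, and the chunk length and overlap are left as parameters rather than taken from the real binary:

```rust
/// Time window [start, end) in seconds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Window {
    start: f64,
    end: f64,
}

/// Split `total` seconds of audio into windows of at most `chunk_len` seconds.
/// Each window starts `overlap` seconds before the previous one ends.
fn plan_chunks(total: f64, chunk_len: f64, overlap: f64) -> Vec<Window> {
    assert!(
        overlap >= 0.0 && chunk_len > overlap,
        "chunk_len must be larger than overlap"
    );
    let step = chunk_len - overlap;
    let mut windows = Vec::new();
    let mut start = 0.0;
    while start < total {
        // The last window is clipped to the end of the audio.
        let end = (start + chunk_len).min(total);
        windows.push(Window { start, end });
        if end >= total {
            break;
        }
        start += step;
    }
    windows
}
```

For example, `plan_chunks(700.0, 300.0, 10.0)` yields three windows: 0–300 s, 290–590 s and 580–700 s. Each window's `start` is also the offset needed later to map chunk-relative timestamps back onto the global timeline.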

Field validation:

  • 29-min keynote, 8 GB CUDA → 04:56
  • 54-min keynote, 3 speakers, RTX 4070 (8 GB) → 122 s, 94 % speaker accuracy vs ground truth

Edge cases:

  • 4 GB VRAM, or 8 GB with Ollama already loaded → Parakeet OOM (no minimum-VRAM guard yet)
  • Token landing in a silence gap between two global Sortformer segments → tagged UNKNOWN: <text> (predictable, non-critical)

CLI binaries

| Binary | Purpose | ASR backend | Streaming | Max duration |
|---|---|---|---|---|
| transcribe-diarize | File mode, single-shot | Parakeet-TDT | no | ~5:20 min (ONNX attention-mask bug, any GPU) |
| transcribe-stream-diarize | File mode, streaming | Nemotron (EN) | yes | unlimited |
| transcribe-diarize-batch | File mode, chunked | Parakeet-TDT | ⚠ chunked | unlimited |

```sh
transcribe-diarize meeting.wav \
  --sortformer-model /usr/share/dictee/sortformer/ \
  --parakeet-model   /usr/share/dictee/tdt/ \
  --lang en
```

Output goes to stdout.

Output formats

Default text format with speaker prefix lines:

Speaker 1: Welcome to the second quarter review.
Speaker 2: Thanks. Should we dive into the numbers?
Speaker 3: Quick question — is the call being recorded?
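For downstream tooling, the default text format is simple to parse. A minimal sketch, assuming every labelled line has exactly the `Speaker N: text` shape shown above; `parse_speaker_line` is a hypothetical helper, not part of dictee:

```rust
/// Parse one line of the default text output into (speaker number, text).
/// Returns None for lines that do not match `Speaker N: text`,
/// for example lines tagged `UNKNOWN: ...`.
fn parse_speaker_line(line: &str) -> Option<(u32, &str)> {
    let rest = line.strip_prefix("Speaker ")?;
    // Split at the first ": " so colons inside the spoken text are preserved.
    let (id, text) = rest.split_once(": ")?;
    let id: u32 = id.parse().ok()?;
    Some((id, text))
}
```

With the sample above, `parse_speaker_line("Speaker 2: Thanks. Should we dive into the numbers?")` returns speaker `2` and the sentence without its prefix.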

CLI flags:

  • --format=rttm — frame-level speaker IDs (standard diarization format)
  • --format=json — timestamps + speaker IDs + confidence
  • --format=srt — subtitle format with timestamps
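For readers unfamiliar with RTTM, the conventional record is one space-separated `SPEAKER` line per speaker turn with ten fields. The sketch below builds such a line following the general RTTM convention; it is not confirmed to match byte-for-byte what `--format=rttm` emits, and `rttm_line`, the channel value `1` and the three-decimal precision are assumptions:

```rust
/// Build one RTTM `SPEAKER` record following the usual 10-field convention:
/// type, file id, channel, onset, duration, <NA>, <NA>, speaker name, <NA>, <NA>.
/// Assumption: channel 1 and millisecond precision.
fn rttm_line(file_id: &str, onset: f64, duration: f64, speaker: &str) -> String {
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        file_id, onset, duration, speaker
    )
}
```

A turn starting at 1.5 s and lasting 2.25 s in `meeting` would read `SPEAKER meeting 1 1.500 2.250 <NA> <NA> speaker_1 <NA> <NA>`.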

Speaker merging across chunks

Used by transcribe-diarize-batch:

  1. Sortformer once on full audio → global speaker timestamps
  2. Chunk audio (10 min, 10 s overlap)
  3. Parakeet-TDT per chunk → token-level timestamps
  4. For each transcribed word, look up its global speaker via Sortformer output (argmax overlap)
  5. Emit with consistent Speaker N labels

Source: src/bin/transcribe_diarize_batch.rs.
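The argmax-overlap lookup in step 4 can be illustrated with a short sketch. It assumes word timestamps are chunk-relative and are shifted by the chunk's start offset before the lookup, and that ties go to the lowest speaker number; `Segment`, `assign_speaker` and `label_word` are hypothetical names and do not mirror the actual code in transcribe_diarize_batch.rs.

```rust
use std::collections::BTreeMap;

/// One global Sortformer segment (hypothetical type); times are in seconds.
#[derive(Debug, Clone, Copy)]
struct Segment {
    speaker: u32,
    start: f64,
    end: f64,
}

/// Length of the intersection of [a_start, a_end) and [b_start, b_end), or 0.
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// Pick the speaker whose segments overlap the word the most.
/// Returns None when the word falls entirely in a gap between segments.
fn assign_speaker(word_start: f64, word_end: f64, segments: &[Segment]) -> Option<u32> {
    // Sum the overlap per speaker; BTreeMap iterates in ascending speaker order.
    let mut totals: BTreeMap<u32, f64> = BTreeMap::new();
    for seg in segments {
        let ov = overlap(word_start, word_end, seg.start, seg.end);
        if ov > 0.0 {
            *totals.entry(seg.speaker).or_insert(0.0) += ov;
        }
    }
    // Strict `>` keeps the first (lowest-numbered) speaker on ties.
    let mut best: Option<(u32, f64)> = None;
    for (speaker, total) in totals {
        if best.map_or(true, |(_, t)| total > t) {
            best = Some((speaker, total));
        }
    }
    best.map(|(speaker, _)| speaker)
}

/// Map a chunk-relative word to its label, shifting by the chunk's start offset.
fn label_word(chunk_offset: f64, start: f64, end: f64, segments: &[Segment]) -> String {
    match assign_speaker(chunk_offset + start, chunk_offset + end, segments) {
        Some(n) => format!("Speaker {}", n),
        None => "UNKNOWN".to_string(),
    }
}
```

A word that straddles two segments is assigned to whichever speaker covers more of it, and a word with zero overlap yields `UNKNOWN`, which matches the silence-gap edge case listed under the batch mode.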

Why post-processing is bypassed

Meeting mode skips the 12-step pipeline entirely:

| Step | Why it would break diarization |
|---|---|
| 5/7. Regex rules | Voice commands inserted in wrong speaker's turn |
| 6. LLM correction | Reflows text, merges speaker turns into a monologue |
| 8/9. Keepcaps | Acronym handling is per-phrase, not per-speaker |
| 10. Capitalization | Sentence-start vs speaker-start conflict |
| 11. Translation | Multi-speaker translation is rarely coherent |

Only formatting kept on diarize output: whitespace cleanup, speaker prefix insertion, trailing \n.
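The three formatting steps that remain are small enough to show in a few lines. A minimal sketch, assuming whitespace cleanup means collapsing runs of whitespace and trimming both ends; `format_turn` is a hypothetical helper, not the function used in dictee:

```rust
/// Apply the only formatting kept on diarized output:
/// whitespace cleanup, speaker prefix insertion, trailing newline.
/// `speaker` is None for tokens that could not be attributed.
fn format_turn(speaker: Option<u32>, text: &str) -> String {
    // Collapse any run of spaces, tabs or newlines into a single space and trim.
    let cleaned = text.split_whitespace().collect::<Vec<_>>().join(" ");
    match speaker {
        Some(n) => format!("Speaker {}: {}\n", n, cleaned),
        None => format!("UNKNOWN: {}\n", cleaned),
    }
}
```

Nothing here touches casing, punctuation or wording, which is the point of bypassing the regular pipeline for diarized text.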

For polished output, use LLM Diarization analysis — built specifically for diarized input.

Mermaid pipeline

```mermaid
flowchart TB
    A["🎙 Audio (N speakers)"] --> B["Sortformer<br/>(streaming, 10s chunks)"]
    B --> C1["Speaker 1 timestamps"]
    B --> C2["Speaker 2 timestamps"]
    B --> C3["Speaker 3 timestamps"]
    B --> C4["Speaker 4 timestamps"]
    C1 --> D["Parakeet-TDT<br/>(transcription)"]
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E["Merge by timestamps<br/>(argmax overlap)"]
    E --> F["Speaker 1: text<br/>Speaker 2: text<br/>Speaker 3: text"]
```
