dev Diarization

🌐 Language: English | Français

Diarization — developer reference

User-facing intro: Diarization.

Sortformer model

NVIDIA's Sortformer (ONNX): single-transformer, end-to-end diarization. Streams natively in 10-second chunks with overlap. Hard-capped at 4 speakers (a property of the model, not a configuration option). Language-agnostic. Rust implementation in src/sortformer.rs via ONNX Runtime; CUDA backend with silent CPU fallback (diarize_only.rs:122-128), which runs 5–10× slower.

VRAM & duration limits

Streaming mode (`transcribe-stream-diarize`, EN-only via Nemotron)

| Audio duration | VRAM peak |
|---|---|
| 1 min … 10 hours | ~1.2 GB (constant) |

No effective duration limit; memory is bounded by the chunk size.

Batch chunked mode (`transcribe-diarize-batch`, default for the UI)

The single-shot `transcribe-diarize` runs Parakeet against the full mel spectrogram in one pass, so it hits a hard cap of about 5 min 20 s (320 s) on every GPU. The cause is a fixed-shape attention mask in the Parakeet-TDT v3 ONNX export; the limit is independent of VRAM (see project-parakeet-tdt-attention-mask-bug tracking). The chunked pipeline `transcribe-diarize-batch` keeps each chunk under 320 s, which lifts the ceiling. Since v1.3.4, the dictee-transcribe UI auto-routes any file longer than one chunk through it, regardless of GPU build or diarize toggle.
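The routing rule can be sketched in a few lines. This is an illustration only: the `Route` enum, the `route` function and the choice of the single-shot binary for short files are assumptions, not taken from the dictee-transcribe source. The chunk length is left as a parameter rather than hard-coded.

```rust
/// Hypothetical routing result; the names are illustrative.
#[derive(Debug, PartialEq)]
enum Route {
    /// `transcribe-diarize` (assumed target for files that fit in one chunk)
    SingleShot,
    /// `transcribe-diarize-batch`
    Chunked,
}

/// Pick a pipeline from the audio duration alone. GPU build and diarize
/// toggle are deliberately not inputs, matching the rule described above.
fn route(duration_s: f64, chunk_s: f64) -> Route {
    if duration_s > chunk_s {
        Route::Chunked
    } else {
        Route::SingleShot
    }
}
```

With a 300 s chunk, a 29-minute file (1740 s) maps to `Route::Chunked` and a 2-minute file to `Route::SingleShot`.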

Pipeline:

1. Run Sortformer once on the full audio (streaming, no cap)
2. Chunk the audio into overlapping 10-min segments via ffmpeg
3. Transcribe each chunk with Parakeet-TDT (transcribe-diarize-batch --no-diarize)
4. Merge speaker labels across chunks via the global Sortformer timestamps (argmax overlap)
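The chunking step boils down to computing overlapping time windows. Below is a minimal sketch of that arithmetic, assuming whole-second boundaries; `chunk_bounds` and its parameters are hypothetical names, and the real splitter in the batch binary may choose its boundaries differently.

```rust
/// Split `total_s` seconds of audio into windows of at most `chunk_s` seconds.
/// Each window starts `chunk_s - overlap_s` seconds after the previous one,
/// so consecutive windows share `overlap_s` seconds of audio.
fn chunk_bounds(total_s: u32, chunk_s: u32, overlap_s: u32) -> Vec<(u32, u32)> {
    assert!(chunk_s > overlap_s, "chunk must be longer than the overlap");
    let step = chunk_s - overlap_s;
    let mut bounds = Vec::new();
    let mut start = 0;
    while start < total_s {
        // The last window is clamped to the end of the file.
        let end = (start + chunk_s).min(total_s);
        bounds.push((start, end));
        if end == total_s {
            break;
        }
        start += step;
    }
    bounds
}
```

For example, `chunk_bounds(1500, 600, 10)` yields `[(0, 600), (590, 1190), (1180, 1500)]`: three windows, each pair sharing 10 s.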

Field validation:

29-min keynote, 8 GB CUDA → 04:56
54-min keynote, 3 speakers, RTX 4070 (8 GB) → 122 s, 94 % speaker accuracy vs ground truth
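To put the second measurement in perspective, the ratio of audio length to processing time can be computed directly. This sketch assumes the 122 s figure is wall-clock time for the whole pipeline; `speedup` is an illustrative helper, not part of the codebase.

```rust
/// Ratio of audio length to processing time.
/// Values above 1.0 mean faster than real time.
fn speedup(audio_s: f64, processing_s: f64) -> f64 {
    audio_s / processing_s
}
```

`speedup(54.0 * 60.0, 122.0)` is about 26.6, i.e. roughly 26× faster than real time on that card.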

Edge cases:

4 GB VRAM, or 8 GB with Ollama already loaded → Parakeet OOM (no minimum-VRAM guard yet)
Token landing in a silence gap between two global Sortformer segments → tagged UNKNOWN: <text> (predictable, non-critical)

CLI binaries

| Binary | Purpose | ASR backend | Streaming | Max duration |
|---|---|---|---|---|
| `transcribe-diarize` | File mode, single-shot | Parakeet-TDT | ❌ | ~5:20 min (ONNX attention-mask bug, any GPU) |
| `transcribe-stream-diarize` | File mode, streaming | Nemotron (EN) | ✅ | unlimited |
| `transcribe-diarize-batch` | File mode, chunked | Parakeet-TDT | ⚠ chunked | unlimited |

transcribe-diarize meeting.wav \
  --sortformer-model /usr/share/dictee/sortformer/ \
  --parakeet-model   /usr/share/dictee/tdt/ \
  --lang en

Output goes to stdout.

Output formats

Default text format with speaker prefix lines:

Speaker 1: Welcome to the second quarter review.
Speaker 2: Thanks. Should we dive into the numbers?
Speaker 3: Quick question — is the call being recorded?

CLI flags:

--format=rttm — frame-level speaker IDs (standard diarization format)
--format=json — timestamps + speaker IDs + confidence
--format=srt — subtitle format with timestamps
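For readers unfamiliar with these formats, here is a rough sketch of what one RTTM record and one SRT cue look like. The RTTM field order follows the standard NIST layout and the SRT timestamp uses the usual `HH:MM:SS,mmm` form; whether the binaries emit exactly these fields is not confirmed here. The function names, the file ID and the channel number `1` are assumptions.

```rust
/// One RTTM `SPEAKER` record: type, file ID, channel, onset, duration,
/// then placeholder fields around the speaker name.
fn rttm_line(file_id: &str, start_s: f64, end_s: f64, speaker: &str) -> String {
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        file_id,
        start_s,
        end_s - start_s,
        speaker
    )
}

/// Format milliseconds as an SRT timestamp (`HH:MM:SS,mmm`).
fn srt_timestamp(ms: u64) -> String {
    let (h, m, s, milli) = (ms / 3_600_000, ms / 60_000 % 60, ms / 1_000 % 60, ms % 1_000);
    format!("{:02}:{:02}:{:02},{:03}", h, m, s, milli)
}

/// One SRT cue: index, time range, then the speaker-prefixed text.
fn srt_cue(index: usize, start_ms: u64, end_ms: u64, speaker: &str, text: &str) -> String {
    format!(
        "{}\n{} --> {}\n{}: {}\n",
        index,
        srt_timestamp(start_ms),
        srt_timestamp(end_ms),
        speaker,
        text
    )
}
```

For instance, `rttm_line("meeting", 1.5, 4.0, "speaker_1")` gives `SPEAKER meeting 1 1.500 2.500 <NA> <NA> speaker_1 <NA> <NA>`.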

Speaker merging across chunks

Used by transcribe-diarize-batch:

1. Sortformer once on full audio → global speaker timestamps
2. Chunk audio (10 min, 10 s overlap)
3. Parakeet-TDT per chunk → token-level timestamps
4. For each transcribed word, look up its global speaker via Sortformer output (argmax overlap)
5. Emit with consistent Speaker N labels

Source: src/bin/transcribe_diarize_batch.rs.
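The argmax-overlap lookup can be illustrated as follows. This is a sketch of the technique, not the code in `transcribe_diarize_batch.rs`: the `Segment` struct, the function names, the 0-based speaker index and the tie-breaking rule (lowest index wins) are all assumptions. Word timestamps are expected on the global timeline, i.e. already shifted by the chunk's start offset.

```rust
/// Sortformer's hard cap on speakers, as stated on this page.
const MAX_SPEAKERS: usize = 4;

/// One global Sortformer segment (hypothetical shape), times in seconds.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    speaker: usize, // 0-based index
    start: f64,
    end: f64,
}

/// Return the speaker whose segments overlap the word the most,
/// or `None` when the word falls entirely in a silence gap.
fn speaker_for(word_start: f64, word_end: f64, segments: &[Segment]) -> Option<usize> {
    let mut totals = [0.0_f64; MAX_SPEAKERS];
    for seg in segments.iter().filter(|s| s.speaker < MAX_SPEAKERS) {
        // Length of the intersection of the two intervals, clamped at zero.
        let overlap = (word_end.min(seg.end) - word_start.max(seg.start)).max(0.0);
        totals[seg.speaker] += overlap;
    }
    let mut best: Option<usize> = None;
    for (speaker, &total) in totals.iter().enumerate() {
        if total > 0.0 && best.map_or(true, |b| total > totals[b]) {
            best = Some(speaker);
        }
    }
    best
}

/// Prefix a word with its label; gaps get the UNKNOWN tag.
fn label(word: &str, speaker: Option<usize>) -> String {
    match speaker {
        Some(i) => format!("Speaker {}: {}", i + 1, word),
        None => format!("UNKNOWN: {}", word),
    }
}
```

A word spanning 4.6–5.2 s against segments `0: 0–5 s` and `1: 5–9 s` overlaps speaker 0 for about 0.4 s and speaker 1 for about 0.2 s, so it is assigned to speaker 0. A word at 10.0–10.5 s with no covering segment returns `None` and is rendered as `UNKNOWN: <text>`.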

Why post-processing is bypassed

Meeting mode skips the 12-step pipeline entirely:

| Step | Why it would break diarization |
|---|---|
| 5/7. Regex rules | Voice commands inserted in wrong speaker's turn |
| 6. LLM correction | Reflows text, merges speaker turns into a monologue |
| 8/9. Keepcaps | Acronym handling is per-phrase, not per-speaker |
| 10. Capitalization | Sentence-start vs speaker-start conflict |
| 11. Translation | Multi-speaker translation is rarely coherent |

Only formatting kept on diarize output: whitespace cleanup, speaker prefix insertion, trailing \n.
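Those three remaining steps are small enough to sketch in one function. This is an illustrative version, assuming one line per speaker turn with a newline after each; `format_transcript` and the decision to skip empty turns are assumptions rather than documented behaviour.

```rust
/// Apply the minimal formatting to a list of `(speaker number, raw text)` turns:
/// collapse runs of whitespace, add the speaker prefix, end each line with `\n`.
fn format_transcript(turns: &[(usize, &str)]) -> String {
    let mut out = String::new();
    for (speaker, text) in turns {
        // split_whitespace drops leading, trailing and repeated whitespace.
        let cleaned = text.split_whitespace().collect::<Vec<_>>().join(" ");
        if cleaned.is_empty() {
            continue; // assumption: turns with no text are not emitted
        }
        out.push_str(&format!("Speaker {}: {}\n", speaker, cleaned));
    }
    out
}
```

Nothing in this path touches casing, punctuation or wording, which is why speaker turns survive intact.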

For polished output, use LLM Diarization analysis — built specifically for diarized input.

Mermaid pipeline

flowchart TB
    A["🎙 Audio (N speakers)"] --> B["Sortformer<br/>(streaming, 10s chunks)"]
    B --> C1["Speaker 1 timestamps"]
    B --> C2["Speaker 2 timestamps"]
    B --> C3["Speaker 3 timestamps"]
    B --> C4["Speaker 4 timestamps"]
    C1 --> D["Parakeet-TDT<br/>(transcription)"]
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E["Merge by timestamps<br/>(argmax overlap)"]
    E --> F["Speaker 1: text<br/>Speaker 2: text<br/>Speaker 3: text"]
