How It Works
🌐 Language: English | Français
This page explains what happens between pressing F9 and the text appearing. No programming background needed — the goal is to give you a clear mental picture of the system.
In one sentence: your voice is recorded, sent to a program that understands speech, the transcribed text is returned, then typed into whichever window had your cursor. All in under a second on a decent machine.
But behind this apparent simplicity, 9 layers of software pass the baton to each other. Each has a specific role. Understanding these layers helps with troubleshooting and picking the right settings.
┌─────────────────────────────────────────────────────┐
│ 1. You press F9 │ ← You
│ │
│ 2. The dictee script (the conductor) │
│ │
│ 3. The messenger (transcribe-client) │
│ ↓ via a pipe (Unix socket) │
│ 4. The transcription server (transcribe-daemon) │
│ — always ready in the background │
│ │
│ 5. The in-house library (parakeet-rs) │ ← dictee's code
│ │
│ 6. The bridge to the engine (ort) │
│ │
│ 7. The inference engine (libonnxruntime) │ ← Microsoft's code
│ │
│ 8. The hardware driver (CPU or NVIDIA GPU) │ ← Hardware
│ │
│ 9. The AI model (.onnx file) │ ← Trained "brain"
└─────────────────────────────────────────────────────┘
The user presses F9 (or clicks the taskbar widget, or uses dictee-transcribe for an audio file). That's the starting point.
The dictee script is a text program (a "shell script") that lives at /usr/bin/dictee. When invoked, it:
- records audio from your mic while you speak (using a system tool called pw-record)
- reads your configuration (the file ~/.config/dictee.conf) to know which speech-recognition engine to use (Parakeet? Canary? Whisper?)
- forwards the recorded audio file to the transcription server (step 4)
- once it gets the text back, types it into your active window (using a tool called dotool, which acts as a virtual keyboard)
- handles post-processing too: adding missing punctuation, applying language rules, optionally triggering translation…
That's the program you indirectly drive. All app settings flow through it.
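To make the configuration step more concrete, here is a minimal sketch of reading a settings file and picking an engine. It assumes shell-style `KEY=value` lines; that file format, the key name `DICTEE_ASR_BACKEND` and the default value are assumptions made for this example, not taken from dictee's source.

```python
def parse_conf(text):
    """Parse shell-style KEY=value lines; skip blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def pick_engine(settings, default="parakeet"):
    # DICTEE_ASR_BACKEND is a hypothetical key name used for illustration.
    return settings.get("DICTEE_ASR_BACKEND", default).lower()


sample = '''
# hypothetical contents of ~/.config/dictee.conf
DICTEE_ASR_BACKEND="canary"
DICTEE_LANG=fr
'''
print(pick_engine(parse_conf(sample)))  # canary
```

The real script is a shell script, so it would more likely source the file directly; the point here is only that one small text file decides which engine the rest of the chain uses.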
The messenger, transcribe-client, is a tiny program installed at /usr/bin/transcribe-client. Its only job: carry your audio file to the transcription server and bring back the transcribed text.
Picture a bike courier who delivers a package to a workshop and comes back with the answer. It's deliberately minimalist: starts instantly, almost zero resource usage, knows nothing about how transcription works. It just carries.
Why a separate courier? Because the actual transcription server (step 4) is big and slow to start (it needs to load an AI model into memory). If we had to start that server for every dictation, you'd wait 2 to 5 seconds each time instead of getting a near-instant response.
The transcription server, transcribe-daemon, is the real brain of transcription. It runs in the background for your whole session, managed by systemd as a system service (dictee.service).
This server:
- loads the AI model exactly once at startup (takes 2-5 seconds, but only paid once per session)
- keeps the model in memory (RAM if CPU, VRAM if GPU)
- listens on a communication channel (a Unix "socket", which is just a special file acting as a pipe between programs)
- for each audio file received: analyzes, transcribes, returns the text
- only stops when you close your session
This "permanent server + tiny client" architecture is why dictee is responsive: the heavy lifting of loading happens once, not per dictation.
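The pattern can be sketched in a few dozen lines. The example below is a toy, not dictee's actual protocol: the fake "model", the plain-text message format (the client sends a file path and reads back text) and all function names are assumptions for illustration. What it does show is the key property: the slow load happens once in the server, and each client call is a short round trip over a Unix socket.

```python
import os
import socket
import tempfile
import threading
import time


def load_model():
    """Stand-in for the expensive one-time model load."""
    time.sleep(0.2)  # pretend this is the 2-5 s load
    return lambda audio_path: f"(transcript of {audio_path})"


def serve(sock_path, ready, max_requests):
    model = load_model()  # paid once, before any request arrives
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(sock_path)
        server.listen()
        ready.set()
        for _ in range(max_requests):
            conn, _addr = server.accept()
            with conn:
                audio_path = conn.recv(4096).decode()
                conn.sendall(model(audio_path).encode())


def transcribe(sock_path, audio_path):
    """The tiny client: send a path, return the text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(audio_path.encode())
        return client.recv(4096).decode()


sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")
ready = threading.Event()
threading.Thread(target=serve, args=(sock_path, ready, 2), daemon=True).start()
ready.wait()

for name in ("first.wav", "second.wav"):
    start = time.perf_counter()
    text = transcribe(sock_path, name)
    print(f"{text} in {time.perf_counter() - start:.3f} s")
```

Neither request pays the 0.2 s load, because the server finished loading before it started accepting connections.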
At this point, the transcription server has received an audio file. It calls into a code library (a collection of reusable functions) that knows how to turn audio into text. That library is parakeet-rs. It's the code maintained by dictee's developers.
It handles:
- transforming raw sound (vibrations sampled 16,000 times per second) into a mathematical representation the AI model understands (a "mel-spectrogram" — think of a color photo of the sound's frequency over time)
- loading the ONNX model (the "brain", step 9) from disk
- feeding the model the mel-spectrogram
- getting the model's outputs (numbers representing the probability of each character/word)
- looping character by character to reconstruct the final text
That's the "business logic" part: everything specific to the Parakeet or Canary formats (the two models dictee supports in Rust). If tomorrow we wanted to add support for a new model, we'd add code here.
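The last two bullets of the list above (probabilities in, text out) can be illustrated with a deliberately simplified decoder. This sketch uses CTC-style greedy decoding, which is a simpler scheme than the TDT transducer decoding Parakeet actually uses; the four-symbol vocabulary and the probability values are made up for the example.

```python
BLANK = "_"  # special "no character" symbol used by CTC-style models
VOCAB = [BLANK, "h", "i", " "]


def greedy_decode(frames):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    text = []
    previous = None
    for probs in frames:
        best = max(range(len(probs)), key=probs.__getitem__)
        symbol = VOCAB[best]
        if symbol != previous and symbol != BLANK:
            text.append(symbol)
        previous = symbol
    return "".join(text)


frames = [
    [0.10, 0.80, 0.05, 0.05],  # "h"
    [0.10, 0.70, 0.10, 0.10],  # "h" again (same sound, next frame)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "i"
]
print(greedy_decode(frames))  # hi
```

Each row stands for one short slice of audio. The model emits one row of probabilities per slice, and the decoding loop turns those rows into characters.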
parakeet-rs is written in Rust, but the inference engine (step 7) that actually runs the model is written in C++ (a different language). For them to talk, we need a translator: the ort library.
Picture a meeting between a French executive and a Chinese executive: they need an interpreter to understand each other. ort is that interpreter between dictee's Rust code and Microsoft's C++ engine.
Without this layer, developers would have to write all the low-level calls manually (tedious and risky). With ort, we just say in Rust: "load this .onnx file, feed it this data, give me the outputs".
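The same idea exists in most languages. As an analogy only (this is not ort's API, and dictee does not use Python here), the snippet below uses Python's standard `ctypes` module to call `strlen` from the C library on Linux. The binding layer declares the argument and return types, then the foreign function is called like a native one.

```python
import ctypes

# On Linux, CDLL(None) exposes the symbols already loaded in the process,
# which includes the C standard library.
libc = ctypes.CDLL(None)

# Tell the bridge how to translate values in both directions.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"dictee"))  # 6
```

ort plays this role for Rust and the C++ engine, with far more types to translate (tensors, sessions, execution providers) than a single string.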
The inference engine, ONNX Runtime, is a big library (~50 to 80 MB) developed by Microsoft. It is free and open-source, and used by countless projects worldwide (not just dictee). On your machine it is installed at /usr/lib/dictee/libonnxruntime.so.
Its role: run AI models in the ONNX format (a standard format, like PDF for documents).
When asked to "run this model with this data":
- Reads the .onnx file: it's a bundle of blueprints (the model's structure) and values (the hundreds of millions or billions of parameters the model learned during training)
- Builds the operation graph: "first this matrix multiplication, then this convolution, then this probability function, etc."
- Optimizes the graph: eliminate redundant calculations, fuse operations that can be fused
- Delegates each operation to the appropriate hardware driver (step 8)
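To give a feel for the "optimize the graph" step, here is a toy optimizer. It is a sketch under strong simplifications: a real ONNX graph is a network of nodes connected by tensors, not a flat list, and the real engine applies many more rewrites. The operator names (MatMul, Add, Gemm, Identity, Relu, Softmax) are genuine ONNX operators; the two passes shown are simplified versions of the kinds of rewrites described above.

```python
def optimize(graph):
    """Toy optimizer over a linear list of op names."""
    # Pass 1: drop no-op nodes (redundant work).
    ops = [op for op in graph if op != "Identity"]
    # Pass 2: fuse MatMul followed by Add into a single Gemm node.
    fused = []
    i = 0
    while i < len(ops):
        if ops[i] == "MatMul" and i + 1 < len(ops) and ops[i + 1] == "Add":
            fused.append("Gemm")
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


graph = ["MatMul", "Identity", "Add", "Relu", "MatMul", "Add", "Softmax"]
print(optimize(graph))  # ['Gemm', 'Relu', 'Gemm', 'Softmax']
```

Seven operations become four, and the result is mathematically the same. Fewer operations means fewer round trips to the hardware.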
You never touch this code — it's a binary dependency installed alongside dictee. But it's useful to know it exists: if you ever see an error message mentioning "ONNX Runtime" or "ORT", it's this.
The previous step turned the model into a list of math operations to perform. But each operation has to be translated into the language of the underlying hardware:
- On CPU (central processor, present in every machine): operations are translated into processor instructions. Modern processors have special AI-optimized instructions (AVX2, AVX-VNNI) that speed up matrix math.
- On NVIDIA GPU (graphics card with CUDA acceleration): operations are translated into "CUDA kernels", small programs that run in parallel on the GPU's thousands of cores. Much faster for AI models.
The choice between CPU and GPU is made at transcription server startup:
- If dictee-cuda is installed and an NVIDIA card is detected → GPU
- If dictee-cpu is installed, no NVIDIA card, or DICTEE_FORCE_CPU=1 is set → CPU
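Written as code, the two rules look roughly like this. The function name and its parameters are hypothetical, and the assumption that `DICTEE_FORCE_CPU=1` takes precedence over everything else is ours; only the variable name and the two rules come from this page.

```python
def choose_backend(cuda_build_installed, nvidia_gpu_detected, env):
    """Return "gpu" or "cpu" following the two rules above."""
    if env.get("DICTEE_FORCE_CPU") == "1":
        return "cpu"  # assumption: the override wins over everything else
    if cuda_build_installed and nvidia_gpu_detected:
        return "gpu"
    return "cpu"


print(choose_backend(True, True, {}))                         # gpu
print(choose_backend(True, True, {"DICTEE_FORCE_CPU": "1"}))  # cpu
print(choose_backend(True, False, {}))                        # cpu
```

In a real program, `env` would be `os.environ`. Passing it in as a plain dictionary keeps the decision easy to test.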
That's why you see speed differences: on the same machine, Parakeet takes 0.18 s on GPU vs 1.17 s on CPU for 16 s of audio.
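A quick calculation puts those two timings in perspective. It uses only the figures quoted above (16 s of audio, 0.18 s on GPU, 1.17 s on CPU).

```python
audio_s = 16.0
gpu_s, cpu_s = 0.18, 1.17

print(f"GPU: {audio_s / gpu_s:.0f}x faster than real time")  # GPU: 89x faster than real time
print(f"CPU: {audio_s / cpu_s:.0f}x faster than real time")  # CPU: 14x faster than real time
print(f"GPU vs CPU speedup: {cpu_s / gpu_s:.1f}x")           # GPU vs CPU speedup: 6.5x
```

So even the CPU path transcribes much faster than you speak. The GPU mainly matters for long recordings or when you want the shortest possible delay.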
The model is the speech recognition "brain". It's what knows, after weeks of training by NVIDIA's researchers, how to match sounds to text.
Concretely, it's a file on your disk (typically in /usr/share/dictee/tdt/) containing:
- The structure of the neural network: how many layers, how they're connected, etc.
- The weights: hundreds of millions to billions of numbers tuned during training. They "know" how to map a sound pattern to a word.
- The precision: 32 bits per weight (FP32, more accurate but heavy) or 8 bits per weight (INT8, lighter but slightly less precise).
dictee supports two main models:
- Parakeet-TDT 0.6B v3: 25 languages, ~600 million parameters, .onnx format, ~2.4 GB FP32 or ~670 MB INT8
- Canary 1B v2: 7 languages with built-in translation, ~1 billion parameters, ~4 GB FP32 or ~1 GB INT8
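The file sizes follow almost directly from the parameter counts and the precision: size is roughly parameters times bytes per weight. The helper below is a back-of-the-envelope sketch (the function name is ours) that ignores everything in the file other than the weights.

```python
def weights_size_gb(parameters, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


for name, params in [("Parakeet-TDT 0.6B", 600e6), ("Canary 1B", 1e9)]:
    fp32 = weights_size_gb(params, 32)
    int8 = weights_size_gb(params, 8)
    print(f"{name}: ~{fp32:.1f} GB FP32, ~{int8:.1f} GB INT8")
# Parakeet-TDT 0.6B: ~2.4 GB FP32, ~0.6 GB INT8
# Canary 1B: ~4.0 GB FP32, ~1.0 GB INT8
```

The estimate matches the FP32 figures. The quoted INT8 size for Parakeet (~670 MB) is a little above the 0.6 GB estimate, presumably because parameter counts are rounded and not every part of a quantized file is stored in 8 bits.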
These models are downloaded once from Hugging Face (a public AI model repository) during setup.
It may seem complicated, but each layer has a good reason to exist:
Client/server split (steps 3 and 4): the client is instant, the server keeps the model loaded. Without this split, every dictation would take 2-5 extra seconds (model loading).
parakeet-rs library (step 5) vs ORT engine (step 7): parakeet-rs handles Parakeet/Canary specifics (formats, audio pre-processing, text post-processing). ORT handles generic model execution. This split lets us:
- update ORT without rewriting dictee
- use ORT for other models (whisper, computer vision, etc.) without changing its code
- benefit from Microsoft's optimizations effortlessly
Hardware driver (step 8): the same ONNX model can run on Linux/Windows/Mac, on Intel/AMD/Apple Silicon, on CPU or GPU. That portability is the main strength of the ONNX format.
| Layer | Location |
|---|---|
| dictee (script) | /usr/bin/dictee |
| transcribe-client | /usr/bin/transcribe-client |
| transcribe-daemon | /usr/bin/transcribe-daemon |
| parakeet-rs (compiled into transcribe-daemon) | bundled in the binary above |
| ort (ditto) | bundled in the binary above |
| libonnxruntime.so | /usr/lib/dictee/libonnxruntime.so |
| NVIDIA driver + libcudnn | /usr/lib/x86_64-linux-gnu/ or /usr/lib/dictee/ |
| Parakeet TDT model | /usr/share/dictee/tdt/ (or ~/.local/share/dictee/tdt/) |
| Canary model | /usr/share/dictee/canary/ |
| Sortformer model (diarization) | /usr/share/dictee/sortformer/ |
| User config | ~/.config/dictee.conf |
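When troubleshooting, it can help to check which of these pieces are actually present. The sketch below is a convenience script of our own, not a tool shipped with dictee; it tests a subset of the paths from the table above and reports what it finds.

```python
import os

# Paths copied from the table above; adjust if your install differs.
EXPECTED = {
    "dictee (script)": "/usr/bin/dictee",
    "transcribe-client": "/usr/bin/transcribe-client",
    "transcribe-daemon": "/usr/bin/transcribe-daemon",
    "libonnxruntime.so": "/usr/lib/dictee/libonnxruntime.so",
    "Parakeet TDT model": "/usr/share/dictee/tdt/",
    "Canary model": "/usr/share/dictee/canary/",
    "User config": "~/.config/dictee.conf",
}


def check_install(paths=EXPECTED):
    """Map each layer name to True if its path exists on this machine."""
    return {
        layer: os.path.exists(os.path.expanduser(path))
        for layer, path in paths.items()
    }


for layer, present in check_install().items():
    print(f"{'ok     ' if present else 'missing'}  {layer}")
```

A missing model directory is not necessarily an error: the Parakeet model may live under ~/.local/share/dictee/tdt/ instead, and Canary is only needed if you selected it.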
Imagine ordering a dish at a restaurant:
| Layer | Metaphor |
|---|---|
| You pressing F9 | The customer placing the order |
| The dictee script | The waiter, who writes down the order |
| The messenger transcribe-client | The runner who takes the order to the kitchen |
| The server transcribe-daemon | The head chef, already on duty, always ready |
| The library parakeet-rs | The restaurant's own recipe book |
| The bridge ort | The interpreter between the recipe (in French) and the chef (English-speaking) |
| The engine libonnxruntime.so | General cooking know-how (peeling, chopping, frying…) |
| The hardware driver | The cookware: gas stove (CPU) or induction (GPU) |
| The .onnx model | The detailed recipe, fruit of long research |
All these layers collaborate. Each does one thing well. That's what makes the system robust and maintainable.
- Parakeet-TDT-Deep-Dive — technical details of the Parakeet model
- Canary-1B-Deep-Dive — technical details of the Canary model
- GPU-Setup — how to enable NVIDIA GPU, CUDA and cuDNN prerequisites
- CLI-Reference — all the environment variables (DICTEE_*) that drive this behavior
- Developer-Guide — for those who want to modify the Rust code