How It Works

rcspam edited this page May 15, 2026 · 1 revision

How dictee works — under the hood

This page explains what happens between pressing F9 and the text appearing. No programming background needed — the goal is to give you a clear mental picture of the system.

What happens when you press F9?

In one sentence: your voice is recorded, sent to a program that understands speech, the transcribed text is returned, then typed into whichever window had your cursor. All in under a second on a decent machine.

But behind this apparent simplicity, 9 layers of software pass the baton to each other. Each has a specific role. Understanding these layers helps with troubleshooting and picking the right settings.

The stack at a glance

┌─────────────────────────────────────────────────────┐
│  1.  You press F9                                   │  ← You
│                                                     │
│  2.  The dictee script (the conductor)              │
│                                                     │
│  3.  The messenger (transcribe-client)              │
│                       ↓ via a pipe (Unix socket)    │
│  4.  The transcription server (transcribe-daemon)   │
│      — always ready in the background               │
│                                                     │
│  5.  The in-house library (parakeet-rs)             │  ← dictee's code
│                                                     │
│  6.  The bridge to the engine (ort)                 │
│                                                     │
│  7.  The inference engine (libonnxruntime)          │  ← Microsoft's code
│                                                     │
│  8.  The hardware driver (CPU or NVIDIA GPU)        │  ← Hardware
│                                                     │
│  9.  The AI model (.onnx file)                      │  ← Trained "brain"
└─────────────────────────────────────────────────────┘

The 9 layers explained

1. You pressing F9

You press F9 (or click the taskbar widget, or run dictee-transcribe on an audio file). That's the starting point.

2. The "dictee" script — the conductor

A text program (a "shell script") living at /usr/bin/dictee. When invoked, it:

  • records audio from your mic while you speak (using a system tool called pw-record)
  • reads your configuration (the file ~/.config/dictee.conf) to know which speech-recognition engine to use (Parakeet? Canary? Whisper?)
  • forwards the recorded audio file to the transcription server (step 4)
  • once it gets the text back, types it into your active window (using a tool called dotool, which acts as a virtual keyboard)
  • handles post-processing too: adding missing punctuation, applying language rules, optionally triggering translation…

That's the program you indirectly drive. All app settings flow through it.
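
The order of those steps can be sketched in a few lines of Python. This is only an illustration of the flow, not the real script (which is a shell script): the function names, the KEY=value config format and the ASR_BACKEND key are all assumptions made for the example. The recording, transcription and typing steps are passed in as plain functions so the sketch runs without a microphone.

```python
from typing import Callable, Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse shell-style KEY=value lines, skipping blanks and # comments.

    The file format is an assumption; the real dictee.conf may differ.
    """
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def dictate(
    record: Callable[[], str],
    transcribe: Callable[[str, str], str],
    post_process: Callable[[str], str],
    type_text: Callable[[str], None],
    config_text: str,
) -> str:
    """Run one dictation: read config, record, transcribe, clean up, type."""
    settings = parse_config(config_text)
    # "ASR_BACKEND" is a hypothetical key name used only for this sketch.
    backend = settings.get("ASR_BACKEND", "parakeet")
    audio_path = record()                       # real script: pw-record
    raw_text = transcribe(audio_path, backend)  # real script: hands off to the server
    final_text = post_process(raw_text)         # punctuation, language rules, ...
    type_text(final_text)                       # real script: dotool
    return final_text
```

With stand-in functions for each step, one call to dictate(...) walks through the whole chain and returns the text that would be typed.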

3. The messenger (transcribe-client) — the courier

A tiny program installed at /usr/bin/transcribe-client. Its only job: carry your audio file to the transcription server and bring back the transcribed text.

Picture a bike courier who delivers a package to a workshop and comes back with the answer. It's deliberately minimalist: starts instantly, almost zero resource usage, knows nothing about how transcription works. It just carries.

Why a separate courier? Because the actual transcription server (step 4) is big and slow to start: it has to load an AI model into memory, which takes 2-5 seconds. If that server had to start for every dictation, you would wait those seconds each time instead of getting a near-instant response.

4. The transcription server (transcribe-daemon) — the resident scribe

The real brain of transcription. It runs permanently in the background from the moment your session starts, managed by systemd as a service (dictee.service).

This server:

  • loads the AI model exactly once at startup (this takes 2-5 seconds, a cost paid only once per session)
  • keeps the model in memory (RAM if CPU, VRAM if GPU)
  • listens on a communication channel (a Unix "socket", which is just a special file acting as a pipe between programs)
  • for each audio file received: analyzes, transcribes, returns the text
  • only stops when you close your session

This "permanent server + tiny client" architecture is why dictee is responsive: the heavy lifting of loading happens once, not per dictation.

5. The in-house library (parakeet-rs) — the classroom

At this point, the transcription server has received an audio file. It calls into a code library (a collection of reusable functions) that knows how to turn audio into text. That library is parakeet-rs. It's the code maintained by dictee's developers.

It handles:

  • transforming raw sound (vibrations sampled 16,000 times per second) into a mathematical representation the AI model understands (a "mel-spectrogram": think of a color image showing how strong each frequency of the sound is over time)
  • loading the ONNX model (the "brain", step 9) from disk
  • feeding the model the mel-spectrogram
  • getting the model's outputs (numbers representing the probability of each possible character or word piece)
  • looping over those outputs step by step to reconstruct the final text

That's the "business logic" part: everything specific to the Parakeet or Canary formats (the two models dictee supports in Rust). If tomorrow we wanted to add support for a new model, we'd add code here.
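
To make the last two steps of the list above concrete, here is a deliberately simplified decoding loop in Python. It is not parakeet-rs code and not the TDT decoding that Parakeet really uses. It is the simplest greedy scheme (often called CTC-style): pick the most likely symbol for each slice of audio, then drop repeats and "blank" slices. The tiny vocabulary and the BLANK symbol are assumptions made for the example.

```python
from typing import List, Optional, Sequence

BLANK = "_"  # placeholder symbol meaning "no new output for this frame"


def greedy_decode(frame_probs: Sequence[Sequence[float]], vocab: Sequence[str]) -> str:
    """Pick the most likely symbol per frame, then drop repeats and blanks."""
    pieces: List[str] = []
    previous: Optional[str] = None
    for probs in frame_probs:
        # Index of the highest probability in this frame.
        best_index = max(range(len(probs)), key=probs.__getitem__)
        symbol = vocab[best_index]
        if symbol != previous and symbol != BLANK:
            pieces.append(symbol)
        previous = symbol
    return "".join(pieces)
```

Given four frames whose most likely symbols are "h", "h", blank, "i", the function returns "hi": the repeated "h" is merged and the blank frame is skipped.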

6. The bridge to the engine (ort) — the interpreter

parakeet-rs is written in Rust, but the inference engine (step 7) that actually runs the model is written in C++ (a different language). For them to talk, we need a translator: the ort library.

Picture a meeting between a French executive and a Chinese executive: they need an interpreter to understand each other. ort is that interpreter between dictee's Rust code and Microsoft's C++ engine.

Without this layer, developers would have to write all the low-level calls manually (tedious and risky). With ort, we just say in Rust: "load this .onnx file, feed it this data, give me the outputs".
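
The same kind of bridge exists in other languages. As an analogy only (this is Python talking to C, not Rust talking to ONNX Runtime, and it has nothing to do with dictee's code), the sketch below uses Python's standard ctypes module to call strlen, a function from the C library. It assumes Linux or macOS, where ctypes.CDLL(None) gives access to the C library already loaded in the process.

```python
import ctypes

# Load the symbols already linked into the running process (Linux/macOS).
libc = ctypes.CDLL(None)

# Describe the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_string_length(text: str) -> int:
    """Call C's strlen on the UTF-8 bytes of text."""
    return libc.strlen(text.encode("utf-8"))
```

The two lines that describe argument and return types are exactly the tedious, error-prone part that a bridge library takes care of for you.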

7. The inference engine (libonnxruntime.so) — the engine room

A big library (~50 to 80 MB) developed by Microsoft, free and open-source, used by countless projects worldwide (not just dictee). Installed on your machine at /usr/lib/dictee/libonnxruntime.so.

Its role: run AI models in the ONNX format (a standard format, like PDF for documents).

When asked to "run this model with this data":

  1. Reads the .onnx file: it's a bundle of blueprints (the model's structure) and values (the hundreds of millions of parameters, or more, that the model learned during training)
  2. Builds the operation graph: "first this matrix multiplication, then this convolution, then this probability function, etc."
  3. Optimizes the graph: eliminate redundant calculations, fuse operations that can be fused
  4. Delegates each operation to the appropriate hardware driver (step 8)
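
Step 3 can be pictured with a toy example in Python. This is not how ONNX Runtime is implemented: a real model is a graph rather than a flat list, and the real optimizer applies many more rules. The sketch only shows the idea of "fusing" two operations into one, and the name FusedMatMulAdd is made up for the illustration.

```python
from typing import List, Tuple

Op = Tuple[str, str]  # (operation kind, output name)


def fuse_matmul_add(ops: List[Op]) -> List[Op]:
    """Replace each MatMul immediately followed by Add with one fused op."""
    fused: List[Op] = []
    i = 0
    while i < len(ops):
        kind, _name = ops[i]
        if kind == "MatMul" and i + 1 < len(ops) and ops[i + 1][0] == "Add":
            # The fused op produces what the Add would have produced.
            fused.append(("FusedMatMulAdd", ops[i + 1][1]))
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused
```

A list of four operations (MatMul, Add, Relu, MatMul) comes out as three: the first pair is merged, the rest is left untouched.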

You never touch this code — it's a binary dependency installed alongside dictee. But it's useful to know it exists: if you ever see an error message mentioning "ONNX Runtime" or "ORT", it's this.

8. The hardware driver (Execution Provider) — the skilled worker

The previous step turned the model into a list of math operations to perform. But each operation has to be translated into the language of the underlying hardware:

  • On CPU (central processor, present in every machine): operations are translated into processor instructions. Modern processors have vector instructions (AVX2, AVX-VNNI) that process many numbers at once and speed up matrix math.

  • On NVIDIA GPU (graphics card with CUDA acceleration): operations are translated into "CUDA kernels", small programs that run in parallel on the GPU's thousands of cores. Much faster for AI models.

The choice between CPU and GPU is made at transcription server startup:

  • If dictee-cuda is installed and an NVIDIA card is detected → GPU
  • If dictee-cpu is installed, no NVIDIA card is detected, or DICTEE_FORCE_CPU=1 is set → CPU
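
The two rules above fit in a small function. This is a sketch of the decision as described on this page, not the server's actual code: the function and parameter names are hypothetical, and it assumes that DICTEE_FORCE_CPU=1 wins even when a GPU is available.

```python
import os
from typing import Mapping, Optional


def choose_device(
    cuda_package_installed: bool,
    nvidia_gpu_detected: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "GPU" or "CPU" following the two rules listed above."""
    env = os.environ if env is None else env
    # Assumption: the override takes priority over everything else.
    if env.get("DICTEE_FORCE_CPU") == "1":
        return "CPU"
    if cuda_package_installed and nvidia_gpu_detected:
        return "GPU"
    return "CPU"
```

Only one combination leads to the GPU: the CUDA package is installed, an NVIDIA card is detected, and the override is not set.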

That's why you see speed differences: on the same machine, Parakeet takes 0.18 s on GPU vs 1.17 s on CPU for 16 s of audio.
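
To put those two timings in perspective, a few lines of Python turn them into "times faster than real time". The numbers are the ones quoted above; the function name is just a label chosen for this example.

```python
AUDIO_SECONDS = 16.0
GPU_SECONDS = 0.18
CPU_SECONDS = 1.17


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing."""
    return audio_seconds / processing_seconds


gpu_speed = speed_factor(AUDIO_SECONDS, GPU_SECONDS)  # about 89x real time
cpu_speed = speed_factor(AUDIO_SECONDS, CPU_SECONDS)  # about 14x real time
gpu_vs_cpu = CPU_SECONDS / GPU_SECONDS                # GPU is 6.5x faster
```

Both are comfortably faster than real time. The GPU mostly matters for long recordings or when you want the text to appear with no perceptible delay.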

9. The ONNX model — the trained brain

The model is the speech recognition "brain". It's what knows, after weeks of training by NVIDIA's researchers, how to match sounds to text.

Concretely, it's a file on your disk (typically in /usr/share/dictee/tdt/) containing:

  • The structure of the neural network: how many layers, how they're connected, etc.
  • The weights: hundreds of millions of numbers (about a billion for the largest model) tuned during training. They "know" how to map a sound pattern to a word.
  • The precision: 32 bits per weight (FP32, more accurate but heavy) or 8 bits per weight (INT8, lighter but slightly less precise).

dictee supports two main models:

  • Parakeet-TDT 0.6B v3: 25 languages, ~600 million parameters, .onnx format, ~2.4 GB FP32 or ~670 MB INT8
  • Canary 1B v2: 7 languages with built-in translation, ~1 billion parameters, ~4 GB FP32 or ~1 GB INT8
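
Those file sizes follow almost directly from the parameter counts: one weight takes 4 bytes in FP32 and 1 byte in INT8. The short Python sketch below does the arithmetic. It is a rough estimate that counts only the weights, so a real file can be somewhat larger (the quoted ~670 MB for Parakeet INT8 is a little above the 0.6 GB computed here).

```python
BYTES_PER_WEIGHT = {"FP32": 4, "INT8": 1}


def weights_size_gb(parameters: int, precision: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return parameters * BYTES_PER_WEIGHT[precision] / 1e9


parakeet_fp32 = weights_size_gb(600_000_000, "FP32")   # 2.4
parakeet_int8 = weights_size_gb(600_000_000, "INT8")   # 0.6
canary_fp32 = weights_size_gb(1_000_000_000, "FP32")   # 4.0
canary_int8 = weights_size_gb(1_000_000_000, "INT8")   # 1.0
```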

These models are downloaded once from Hugging Face (a public AI model repository) during setup.

Why this layered architecture?

It may seem complicated, but each layer has a good reason to exist:

Client/server split (steps 3 and 4): the client starts instantly, while the server keeps the model loaded. Without this split, every dictation would cost an extra 2-5 seconds of model loading.

parakeet-rs library (step 5) vs ORT engine (step 7): parakeet-rs handles Parakeet/Canary specifics (formats, audio pre-processing, text post-processing). ORT handles generic model execution. This split lets us:

  • update ORT without rewriting dictee
  • use ORT for other models (Whisper, computer vision, etc.) without changing its code
  • benefit from Microsoft's optimizations effortlessly

Hardware driver (step 8): because the hardware-specific work is isolated in this layer, the same ONNX model can run on Linux, Windows, or macOS, on Intel, AMD, or Apple Silicon, on CPU or GPU. That portability is the strength of the ONNX format.

Where each layer lives on your disk

Layer                                          Location
dictee (script)                                /usr/bin/dictee
transcribe-client                              /usr/bin/transcribe-client
transcribe-daemon                              /usr/bin/transcribe-daemon
parakeet-rs (compiled into transcribe-daemon)  bundled in the binary above
ort (ditto)                                    bundled in the binary above
libonnxruntime.so                              /usr/lib/dictee/libonnxruntime.so
NVIDIA driver + libcudnn                       /usr/lib/x86_64-linux-gnu/ or /usr/lib/dictee/
Parakeet TDT model                             /usr/share/dictee/tdt/ (or ~/.local/share/dictee/tdt/)
Canary model                                   /usr/share/dictee/canary/
Sortformer model (diarization)                 /usr/share/dictee/sortformer/
User config                                    ~/.config/dictee.conf
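
When troubleshooting, a quick way to see which of these pieces are actually present is to test each path. The Python sketch below is a hypothetical helper, not a tool shipped with dictee. The default list simply copies a few locations from the table above, and a missing entry is not necessarily an error (for example, the model may live under ~/.local/share/dictee/tdt/ instead).

```python
import os
from typing import Dict, Iterable

# A few locations copied from the table above.
EXPECTED_PATHS = [
    "/usr/bin/dictee",
    "/usr/bin/transcribe-client",
    "/usr/bin/transcribe-daemon",
    "/usr/lib/dictee/libonnxruntime.so",
    "/usr/share/dictee/tdt",
    "~/.config/dictee.conf",
]


def check_paths(paths: Iterable[str] = EXPECTED_PATHS) -> Dict[str, bool]:
    """Map each path to True if it exists on this machine."""
    return {path: os.path.exists(os.path.expanduser(path)) for path in paths}
```

Calling check_paths() returns a dictionary such as {"/usr/bin/dictee": True, ...}, which makes a missing layer easy to spot.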

Restaurant metaphor

Imagine ordering a dish at a restaurant:

Layer                            Metaphor
You pressing F9                  The customer placing the order
The dictee script                The waiter, who writes down the order
The messenger transcribe-client  The runner who takes the order to the kitchen
The server transcribe-daemon     The head chef, already on duty, always ready
The library parakeet-rs          The restaurant's own recipe book
The bridge ort                   The interpreter between the recipe (in French) and the chef (English-speaking)
The engine libonnxruntime.so     General cooking know-how (peeling, chopping, frying…)
The hardware driver              The cookware: gas stove (CPU) or induction (GPU)
The .onnx model                  The detailed recipe, fruit of long research

All these layers collaborate. Each does one thing well. That's what makes the system robust and maintainable.
