Speaking is just easier.
Speak freely, type instantly on Wayland (X11 compatible) — 100% local voice dictation for Linux with 25+ languages, 5 translation backends, speaker diarization, and real-time visual feedback. Text appears right where your cursor is.
What is dictee? • System requirements • Quick start • Features • Installation • Configuration • Usage • Post-processing • Limitations • Roadmap • Wiki
dictee is a complete voice dictation system for Linux. Press a shortcut, speak, and the text is typed directly into the active application — any application, any window, any text field.
Transcription is performed 100% locally by default: no audio ever leaves your machine unless you explicitly choose a cloud translation backend.
- 100% local processing by default — no audio leaves the machine unless you explicitly enable a cloud translation backend. Frozen ONNX models, no training on your data.
- 4 ASR backends to choose from — Parakeet-TDT and Canary run as native Rust binaries (ONNX Runtime, low GPU latency), faster-whisper (99 languages) and Vosk (lightweight CPU) in Python. Transparent switching via Unix socket depending on language, latency or hardware. → 4 ASR backends
- 5 translation backends to choose from — from fully local (Canary, LibreTranslate, Ollama) to cloud (Google, Bing), with an explicit privacy table for each option. → Translation backends
- No duration limit on audio files — the chunked pipeline shipped in v1.3 (
dictee-transcribe) diarizes a 54-min keynote in 122 s on an 8 GB GPU, where direct mel loading caps at 10-15 min. Ideal for meeting minutes and long interviews. - Native Linux integration — KDE Plasma 6 plasmoid + PyQt6 system tray (compatible with GNOME, XFCE, Sway via AppIndicator fallback).
| Backend | Min RAM | CPU mode | GPU | Disk |
|---|---|---|---|---|
| Parakeet-TDT (default) | 4 GB | Yes — ~0.8 s per utterance (recent CPU) | NVIDIA 4 GB+ VRAM (~5× faster) | 3 GB |
| Canary-1B v2 | 6 GB | No — encoder too heavy | NVIDIA 6 GB+ VRAM required | 6 GB |
| faster-whisper | 4 GB | Yes — turbo or small |
NVIDIA 4 GB+ VRAM (large-v3) |
3 GB |
| Vosk | 2 GB | Yes — by design | — | 50 MB |
Distributions tested: Ubuntu 22.04 / 24.04 · Debian 12 · Fedora 40 / 44 · openSUSE Tumbleweed · Arch Linux · KDE Neon.
Desktop environments: KDE Plasma 6 (full integration via native plasmoid widget) · GNOME, Xfce, Cinnamon (system tray only — GNOME requires the AppIndicator extension).
Three steps to go from zero to dictation in under two minutes:
1. Install
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bashPrefer to audit before running?
install.shandinstall.sh.sha256are published as release assets — download both, verify withsha256sum -c install.sh.sha256, read the script, then run it.
2. Configure
The first-run wizard walks you through backend selection, model download and keyboard shortcut binding. Re-run anytime with dictee --setup.
3. Speak
Press your shortcut (default F9), speak, release. The transcription appears at your cursor.
For detailed install paths (manual .deb/.rpm, GPU prerequisites, AUR, from source), see Installation below or the wiki's Installation and GPU-Setup pages.
| Backend | Languages | Model size | Warm latency | Notes |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 | 25 | ~2.5 GB | ~0.8s CPU · ~0.16s GPU | Default, native punctuation |
| Canary-1B v2 | 25 | ~5 GB | ~0.7s GPU | Built-in translation (25 ↔ EN, 48 pairs) |
| faster-whisper | 99 | ~500 MB–3 GB | ~0.3s | Wide language coverage |
| Vosk | 20+ | ~50 MB | ~1.5s | Lightweight, strict offline |
Each backend runs as a systemd user service with the same Unix socket protocol — switching is transparent. → ASR-Backends wiki
dictee uses Parakeet-TDT 0.6B v3 by default. On the Open ASR Leaderboard, it outperforms Whisper-large-v3 on multilingual transcription while being significantly smaller and faster:
| Model | Size | English WER | FLEURS multilingual (avg) | Relative speed |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 (dictee default) | 600M | ~6.5 % | 12.0 % | ~10× Whisper-large-v3 |
| Whisper-large-v3 | 1.55B | 7.4 % | 12.6 % | baseline |
| Canary-1B v2 (also bundled) | 1B | 7.2 % | – | ~5× Whisper-large-v3 |
| Whisper-large-v3-turbo | 809M | ~7.8 % | – | ~3-4× |
| Vosk (CPU fallback) | 50 MB | ~12-18 % | – | – |
Parakeet-TDT v3 wins particularly on French, Greek, Estonian and Maltese. For maximum language coverage (99 languages), switch to faster-whisper; for built-in translation, switch to Canary-1B.
Sources: NVIDIA Parakeet-TDT v3 · Open ASR Leaderboard 2025.
| Backend | Privacy | Speed | Quality | Languages |
|---|---|---|---|---|
| Canary-1B | 🔒 Local | Built-in | Excellent | 4 |
| LibreTranslate | 🔒 Local | 0.1–0.3s | Good | 30+ |
| Ollama | 🔒 Local | 2–3s | Excellent | Any (LLM) |
| Google Translate | 🌐 Cloud | 0.2–0.7s | Excellent | 130+ |
| Bing Translator | 🌐 Cloud | 1.7–2.2s | Very good | 100+ |
→ Translation wiki · Ollama-Setup
A 12-step configurable pipeline transforms raw ASR output before it hits your cursor:
- Regex rules + dictionary — 7 languages, ASR variants, voice commands → Rules-and-Dictionary
- LLM correction — optional fluency polish via local Ollama (first / last / hybrid position) → LLM-Correction
- Numbers & dates — cardinal, ordinal, versions, decimals, French times → Numbers-Dates-Continuation
- Continuation buffer — continue a sentence across dictations with last-word memory
- Short-text keepcaps — per-language exceptions for acronyms and names (new in v1.3)
Answer "who spoke when?" in multi-speaker recordings via NVIDIA's Sortformer model. Up to 4 speakers, ideal for meeting notes and interviews. Triggered via Meeting mode or dictee --meeting. → Diarization wiki
Push-to-talk is dictee's main flow, but the bundled dictee-transcribe window also handles offline transcription of any audio or video file you already have. Multi-tab interface, audio player synchronised with the timeline, per-tab translation and LLM analysis, export to PDF / SRT / JSON / Markdown.
- Any input format (mp3, mp4, wav, opus, flac, mkv…) — auto-converted via ffmpeg
- Multi-tab — keep the original transcription side-by-side with translations and LLM analyses (summary, chapters, ASR cleanup…)
- Speaker diarization built-in — toggle on, get up to 4 speakers labelled and renamable
- LLM analysis — 14 providers configurable side by side (Ollama, OpenAI, Claude, Gemini, Mistral, DeepSeek, Groq, Cerebras, OpenRouter…)
- Per-tab translation — Canary / LibreTranslate / Ollama / Google / Bing
- KDE Plasma 6 widget — native QML plasmoid, 5 animation styles, live state → Plasmoid-Widget
- System tray icon — PyQt6, works on GNOME/XFCE/Sway (AppIndicator fallback) → Tray-Icon
- animation-speech (external) — fullscreen overlay on
wlr-layer-shellcompositors
All three share state via a filesystem watcher — any change is reflected instantly across interfaces (multi-user safe with UID suffix).
animation-speech is a standalone project that provides a fullscreen visual animation during recording, with cancellation via the Escape key. It works on any Wayland compositor supporting wlr-layer-shell (KDE Plasma, Sway, Hyprland…).
sudo dpkg -i animation-speech_1.2.0_all.debDownload: animation-speech releases
Note: animation-speech is not compatible with GNOME (no
wlr-layer-shellsupport). GNOME users can rely ondictee-trayfor visual feedback. Contributions for a GNOME Shell extension are welcome — see the plasmoid source for reference architecture.
Auto-detects distro and GPU, adds the NVIDIA CUDA repo if needed, installs the right package:
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bashSupported: Ubuntu, Debian, Fedora, openSUSE, Arch Linux. Other distros fall back to the tarball path.
Options (after --):
# Force CPU (skip GPU detection)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --cpu
# Force GPU (CUDA)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --gpu
# Pin a specific version
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --version 1.3.5
# Non-interactive
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --non-interactiveDownload from Releases.
Ubuntu / Debian (CPU):
sudo apt install ./dictee-cpu_1.3.5_amd64.debUbuntu / Debian (GPU): requires the NVIDIA CUDA APT repo — see GPU-Setup for the one-time setup, then:
sudo apt install ./dictee-cuda_1.3.5_amd64.debFedora / openSUSE (CPU):
sudo dnf install ./dictee-cpu-1.3.5-1.x86_64.rpmFedora / openSUSE (GPU): add the CUDA repo first (see GPU-Setup), then dictee-cuda-1.3.5-1.x86_64.rpm.
Arch Linux (AUR): PKGBUILD in the repo root (x86_64 + aarch64). Clone + makepkg -si.
aarch64 / Jetson: no pre-built package — build from source. CUDA limited to NVIDIA Jetson boards.
Other distros (tarball):
tar xzf dictee-1.3.5_amd64.tar.gz
cd dictee-1.3.5
sudo ./install.shThe tarball ships binaries only — system dependencies must be installed beforehand via your distro's package manager. Names vary by distro; pick the equivalents in yours:
python3(≥3.10),python3-pip,python3-venvpython3-evdev,python3-pyqt6(+qtmultimedia+qtsvg),python3-numpypulseaudio-utils,pipewire(oralsa-utils),libnotify(-bin),soxwl-clipboard(Wayland),xclip(X11)translate-shell,curl
Example for Debian-derived distros (Mint, MX, Pop!_OS…):
sudo apt install python3 python3-pip python3-venv python3-evdev \
python3-pyqt6 python3-pyqt6.qtmultimedia python3-pyqt6.qtsvg python3-numpy \
pulseaudio-utils pipewire libnotify-bin sox wl-clipboard xclip \
translate-shell curlFrom source: cargo build --release --features sortformer then sudo ./install.sh. See Developer-Guide for full Cargo features and package build scripts.
First launch triggers a setup wizard (backend, model, shortcuts).
Reconfigure anytime from the application menu, tray icon, Plasma widget, or by running:
dictee --setup# Show current backends
dictee-switch-backend status
# Switch ASR (parakeet · canary · whisper · vosk)
dictee-switch-backend asr canary
# Switch translation (canary · libretranslate · ollama · google · bing)
dictee-switch-backend translate ollamaThe tray and plasmoid include backend sub-menus — no terminal required.
For detailed configuration (all ASR backends, translation matrix, plasmoid settings, keyboard shortcuts on tiling WMs), see the wiki:
- ASR-Backends · Translation
- Plasmoid-Widget · Tray-Icon
- Keyboard-Shortcuts (KDE/GNOME/Sway/i3/Hyprland)
# Simple dictation — transcribe and type
dictee
# Dictate + translate (default: system language → English)
dictee --translate
dictee --translate --ollama # 100% local via Ollama
# Change target language
DICTEE_LANG_TARGET=es dictee --translate # → Spanish
# Meeting mode (diarization, up to 4 speakers)
dictee --meeting
# Cancel ongoing dictation
dictee --cancel
# Test post-processing rules live
dictee-test-rules # interactive
dictee-test-rules --loop # continuous loop
dictee-test-rules --wav file.wav # from audio file→ Full command reference: CLI-Reference wiki
dictee runs a configurable 12-step pipeline after transcription and before paste:
- ASR variants normalization
- Dictionary substitution
- Numbers & dates conversion
- Continuation buffer merge
- Regex rules (pre-LLM)
- LLM correction (optional, first position)
- Regex rules (post-LLM)
- Short-text exceptions (keepcaps)
- Extended match mode
- Final capitalization
- Translation (optional)
- Paste / inject
Configure via dictee --setup → Post-processing tab, or test rules live with dictee-test-rules.
→ Deep dives: Post-Processing-Overview · Rules-and-Dictionary · LLM-Correction · Numbers-Dates-Continuation
- Live-dictation length: file transcription has no duration limit —
dictee-transcribechunks files of any length (a 54-min keynote diarizes in ~122 s on 8 GB). Continuous push-to-talk dictation is capped per backend — Parakeet ~4 min 30, Canary ~2 min 30; Whisper and Vosk are uncapped. Past the cap the recording is saved and opened indictee-transcribe. v1.4 plans to extend the chunked pipeline to live dictation. → Diarization wiki - Speaker count: diarization labels up to 4 speakers (NVIDIA Sortformer); this limit should be lifted in v1.4.
- AMD / Intel GPUs are not currently supported — dictee falls back to CPU.
- No real-time streaming — Parakeet-TDT and Canary require the full utterance; only Nemotron (EN-only, via Rust binary) streams natively.
- Wayland clipboard breaks Electron typing — On Wayland with
wl-clipboard < 2.3.0(still the default on Ubuntu and other LTS distros), the optional "Copy transcription to clipboard" toggle briefly steals keyboard focus and can prevent the transcript from being typed into some Electron apps (e.g. Claude Desktop). The toggle is disabled by default since v1.3.5; enable it only on distros shippingwl-clipboard ≥ 2.3.0or if you never dictate into Electron apps. → Troubleshooting wiki
For bug reports and workarounds, see Troubleshooting.
v1.3.5 (current) — Int8 Parakeet model + fixes + reliability:
- New Int8 Parakeet model — snappier on CPU — better out-of-the-box performance, with the compact Parakeet model now running where it's fastest.
- Clearer GPU/CPU switch — the GPU/CPU toggle in settings now greys out and shows CPU whenever the GPU can't actually be used (no NVIDIA card, CPU-only build, or the int8 model), so it always reflects what really runs.
- Smoother model switching — the int8 / FP32 switch becomes available as soon as both variants are installed, with a notification confirming each change.
- Push-to-talk typing fix (#8) — the last character no longer repeats itself after dictating for a while, on setups with several keyboards or on Wayland.
- Push-to-talk with remapping tools (#10) — keyboard remappers like logiops, keyd and kanata can now trigger dictation, with a new option in the settings.
- Push-to-talk no longer freezes the mouse — on some keyboard/mouse setups, push-to-talk could lock up the pointer until the app was removed; it now leaves pointing devices untouched.
- Safer model downloads — an interrupted download is now detected instead of leaving a broken model that failed silently the next time you started dictee.
- More reliable Whisper — better automatic CPU/GPU selection and fewer made-up words in the transcription.
- Lighter desktop widget — lower CPU usage when idle.
- Plus smaller fixes — settings carried over more reliably, wider Fedora compatibility, and steadier speaker diarization.
v1.3.4 — Universal chunked transcription + dictee-transcribe UX hardening:
- Universal chunked transcription in
dictee-transcribe— files of any length now split automatically into 180 s chunks on every host (CPU and GPU). New per-backend cap on live-dictation recording duration (Canary 2:30, Parakeet 4:30; Whisper / Vosk uncapped) to prevent silent crashes. - Five-site target-tab UI hardening in
dictee-transcribe— text editor, rename panel, timeline markers, audio player swap, and transcription render now only update the global UI when the target tab is visible. No more cross-tab corruption when transcribing one file while reviewing another. - Translate skip surfacing — silent translate-skip cases now show a colored status message (i18n in 6 languages: fr / de / es / it / pt / uk).
- Diarize falls back to standalone Parakeet + Sortformer when the PTT daemon is Canary — avoids silent mistranscription on files whose language ≠
DICTEE_LANG_SOURCE(Canary daemon is locked at startup). Standalone binary costs ~5–10 s extra model load. - Socket read timeout bumped 30 → 120 s for large files.
- GPU-fallback warning suppressed when stderr is piped.
- Default cheatsheet shortcut now "Same key + Shift" (was Disabled).
- PTT / dictee stale-state cleanup on next F9 keypress (recovers from a daemon killed mid-flight, OOM, signal).
v1.3.0 → v1.3.3 — The v1.3 series. Major additions over v1.2:
dictee-transcribe— dedicated window for offline transcription of audio/video files (timeline player, multi-tab, per-tab translation and LLM analysis, export to PDF / SRT / JSON / Markdown).- Speaker diarization up to 4 speakers via NVIDIA Sortformer, plus a chunked pipeline that lifts the VRAM cap on long files (54-min keynote diarized in 122 s).
- LLM analysis on diarized transcripts — synthesis, chapters, ASR cleanup; 14 providers configurable side by side (Ollama, OpenAI, Claude, Gemini, Mistral, DeepSeek, Groq, Cerebras, OpenRouter…).
- Canary-1B v2 ASR backend (NVIDIA AED) with built-in translation on 12 native pairs.
- Portable CUDA libs via pip venv at postinst — no NVIDIA repo required.
- CUDA → CPU runtime fallback +
DICTEE_FORCE_CPU=1override (v1.3.2). - Cross-distro packaging consistency — Arch
.installgroup hooks,python-evdevhard depends,sg docker/sg inputwrappers, udev0660direct (v1.3.3, closes #5 + #6). - Short-text keepcaps exceptions (7 languages), extended match mode, version-number dictation, multi-user safe state, plasmoid cross-process toggles, 682 postprocess + 148 pipeline tests (v1.3.0).
v1.4+ (planned)
- Hotword boosting — bias ASR decoding toward custom names (shallow fusion on TDT logits, Parakeet only)
- Whisper translate — multi-target translation via
task="translate"(EN-only, offline) - Moonshine CPU backend
- CLI speech-to-text — pipe audio, get text
- VAD — hands-free dictation without push-to-talk
- Streaming transcription with live text display
- Built-in overlay — replace external
animation-speech - AppImage / Flatpak packaging
- COSMIC / GNOME Shell applets (contributions welcome!)
→ Full history: Changelog wiki
The transcription engine builds on parakeet-rs by Enes Altun — Rust library for NVIDIA Parakeet inference via ONNX Runtime. The Rust Canary implementation was originally ported from onnx-asr by Ivan Stupakov and is now fully self-contained. Parakeet and Canary ONNX models are provided by NVIDIA (downloaded separately from HuggingFace, not redistributed by this project).
Keyboard input simulation uses dotool by geb (GPL-3.0).
This project is distributed under the GPL-3.0-or-later license (see LICENSE).
The original parakeet-rs code by Enes Altun is under the MIT license (see LICENSE-MIT). dotool is bundled under GPL-3.0.









