A clean, terminal-based video and audio transcription tool powered by OpenAI Whisper — fully open-source, runs locally, no API keys required.
- 🎬 Transcribe video or audio files directly from the terminal
- ⚡ Choose from 5 Whisper model sizes — from blazing-fast to highly accurate
- 📄 Export to plain text, SRT, WebVTT, or JSON (with timestamps)
- 🌍 Automatic language detection — no configuration needed
- 🔒 Runs entirely offline — your files never leave your machine
- 💅 Clean, interactive UI powered by Rich
- Python 3.8+
- ffmpeg (system-level)
git clone https://github.com/WiseArts/transcribe.git
cd transcribe

# macOS
brew install ffmpeg
# Ubuntu / Debian
sudo apt install ffmpeg
# Windows (via Chocolatey)
choco install ffmpeg

pip install openai-whisper rich

Note: The first time you run the tool, Whisper automatically downloads the selected model weights and caches them locally. This is a one-time download per model.
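
Since ffmpeg is a system-level dependency that pip cannot install, a quick check that it is on your PATH can save a confusing error later. The snippet below is a minimal sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name and is not part of this tool.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found")
else:
    print("ffmpeg not found; install it with one of the commands above")
```
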
python transcribe.py

The tool walks you through three steps:
- File — enter the path to your video or audio file (drag & drop into the terminal works on most systems)
- Model — pick a size based on how fast vs. accurate you need it
- Output format — choose how you want the transcript saved
The output file is saved alongside your source file (e.g. interview.mp4 → interview.srt).
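
The naming rule above can be sketched with `pathlib`: keep the source directory and file stem, and swap only the extension. This is an illustration of the rule, not necessarily how the script implements it; `output_path` and `EXTENSIONS` are hypothetical names.

```python
from pathlib import Path

# Assumed mapping from a format key to the extension of the saved transcript.
EXTENSIONS = {"txt": ".txt", "srt": ".srt", "vtt": ".vtt", "json": ".json"}


def output_path(source: str, fmt: str) -> Path:
    """Return the transcript path next to the source file."""
    # with_suffix replaces only the final extension and keeps the directory.
    return Path(source).with_suffix(EXTENSIONS[fmt])


print(output_path("interview.mp4", "srt"))  # interview.srt
```
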
| # | Model | Speed | Quality | VRAM | Best for |
|---|---|---|---|---|---|
| 1 | tiny | ██████████ | ███░░░░░░░ | ~1 GB | Quick drafts, fast machines |
| 2 | base | ████████░░ | █████░░░░░ | ~1 GB | Everyday use (default) |
| 3 | small | ██████░░░░ | ███████░░░ | ~2 GB | Better accuracy, still fast |
| 4 | medium | ████░░░░░░ | █████████░ | ~5 GB | High quality, multilingual |
| 5 | large | ██░░░░░░░░ | ██████████ | ~10 GB | Best possible accuracy |
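
If you are unsure which size to pick on a GPU machine, one rough approach is to take the largest model whose approximate VRAM requirement from the table fits your card. The sketch below assumes the table's figures; `largest_model_for` is a hypothetical helper, not something the tool provides.

```python
# (model name, approximate VRAM in GB), ordered from smallest to largest,
# using the figures from the table above.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]

# Menu number -> model name, matching the "#" column of the table.
MENU = {number: name for number, (name, _) in enumerate(MODELS, start=1)}


def largest_model_for(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits vram_gb."""
    fitting = [name for name, need in MODELS if need <= vram_gb]
    # Fall back to the smallest model when nothing fits.
    return fitting[-1] if fitting else "tiny"


print(largest_model_for(6))  # medium
```
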
Video: .mp4 .mov .avi .mkv .webm .flv
Audio: .mp3 .wav .m4a .aac .ogg .flac
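
A simple way to validate an input path against these lists is a case-insensitive check of the file extension. This is a minimal sketch; `media_kind` is a hypothetical helper and the actual script may validate input differently.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm", ".flv"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac"}


def media_kind(path: str):
    """Return "video", "audio", or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()  # ".MP4" and ".mp4" are treated alike
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return None
```
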
| Format | Extension | Description |
|---|---|---|
| Plain text | .txt | Clean transcript, one line per segment |
| SRT | .srt | Subtitles with timestamps (video players, Premiere, etc.) |
| WebVTT | .vtt | Web subtitles for HTML5 `<video>` tags |
| JSON | .json | Full Whisper output with segment-level confidence data |
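
To make the SRT format concrete, here is a sketch of how Whisper-style segments (dicts with `start`, `end`, and `text`, with times in seconds) could be turned into SRT cues. The helper names `srt_timestamp` and `to_srt` are hypothetical, and the tool's own writer may differ in detail.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


print(to_srt([{"start": 0.0, "end": 2.5, "text": " Hello."}]))
```

WebVTT uses a very similar cue layout but starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.
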
- CPU vs GPU: The script runs on the CPU by default (`fp16=False`), so it works on any machine. If you have an NVIDIA GPU with CUDA, remove the `fp16=False` flag in the `transcribe()` call for a significant speedup.
- Speed: As a rough guide on CPU, `base` transcribes at roughly 4–8× real-time speed, so a 10-minute video takes around 2–3 minutes.
- Accuracy: Whisper performs best on clear speech with minimal background noise. The `medium` and `large` models handle accents and technical vocabulary noticeably better.
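
Instead of editing the flag by hand, the CPU/GPU choice could be made at runtime. The sketch below assumes `torch` is installed (it comes with `openai-whisper`) and uses `torch.cuda.is_available()`, `whisper.load_model()`, and the `fp16` option of `transcribe()`; the helpers `pick_device_options` and `run_whisper` are hypothetical and not part of the script.

```python
def pick_device_options(cuda_available: bool) -> dict:
    """Choose the device and fp16 flag: half precision only on a CUDA GPU."""
    if cuda_available:
        return {"device": "cuda", "fp16": True}
    return {"device": "cpu", "fp16": False}


def run_whisper(path: str, model_name: str = "base") -> dict:
    """Transcribe a file, using the GPU when one is available."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    import whisper

    opts = pick_device_options(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=opts["device"])
    return model.transcribe(path, fp16=opts["fp16"])
```

Calling `run_whisper("interview.mp4")` would then return Whisper's result dictionary, using half precision only when CUDA is detected.
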
| Package | Purpose |
|---|---|
| openai-whisper | Speech-to-text transcription |
| rich | Terminal UI |
| ffmpeg | Audio extraction from video files |
MIT — do whatever you like with it.