Local video scene intelligence for Apple Silicon. Processes screen recordings through a LangGraph-orchestrated pipeline: smart keyframe extraction, Qwen3.5-VL captioning via oMLX, CLIP embedding, vector search, and LLM-powered summarization — all running on your machine with no cloud dependencies. Inspired by NVIDIA VSS, rebuilt from scratch for Apple Silicon.
graph TD
A[Video Input<br/>.mov / .mp4] --> B[Hybrid Keyframe Detection<br/>SSIM + pHash + HSV histogram]
B --> C[Vision Captioning<br/>Qwen3.5-VL via oMLX]
C --> D[Caption Archive<br/>frames + timestamps + descriptions]
D --> E[Content Classification<br/>code / docs / PDF / GUI demo]
E --> F[Reconstruction Plan<br/>files, sections, tasks]
F --> G[Hierarchical Reconstruction<br/>oMLX agents]
G --> H[QA Reflection<br/>retry when incomplete]
H --> I[Reconstructed Artifacts<br/>data/*/output]
style A fill:#4a90d9,color:#fff
style B fill:#8e44ad,color:#fff
style C fill:#e74c3c,color:#fff
style E fill:#2ecc71,color:#fff
style G fill:#e67e22,color:#fff
style I fill:#27ae60,color:#fff
stateDiagram-v2
[*] --> Ingest: video_path
Ingest --> Caption: keyframes extracted
Caption --> Embed: captions generated
Embed --> Classify: ingested folder
Classify --> Plan: content type
Plan --> Reconstruct: reconstruction tasks
Reconstruct --> QA: draft artifacts
QA --> Save: accepted or retried
Save --> [*]: output files
| Component | Technology | Purpose |
|---|---|---|
| Frame Extraction | Hybrid keyframe detection (SSIM + pHash + HSV) | Only captures distinct screens — skips duplicates |
| Vision Captioning | Qwen3.5-VL via oMLX (Apple Silicon native) | Dense, high-fidelity frame descriptions through oMLX's OpenAI-compatible VLM API |
| Fallback Captioning | Ollama (llama3.2-vision) | Cross-platform alternative |
| Visual Embeddings | OpenCLIP ViT-B-32 | Semantic vector representations |
| Vector Storage | ChromaDB | Optional inspection index for ingested frames |
| Reconstruction | LangGraph + oMLX | Classify recordings and rebuild source/docs/demo references |
| Orchestration | LangGraph StateGraph | Pipeline state management |
| CLI | Typer + Rich | User interface |
- Hardware: Apple Silicon Mac (M1+). Optimized for M3 Ultra with 512GB unified memory.
- Python 3.11+
- ffmpeg: brew install ffmpeg
- oMLX: Start the local OpenAI-compatible server on http://127.0.0.1:8000/v1.
- oMLX API key: Set MLX_API_KEY or OMLX_API_KEY if authentication is enabled. ScreenLens loads .env automatically, and shell exports take precedence.
- Ollama (for summarization): Install from ollama.com and pull:

ollama pull llama3.2 # Text model for summarization
ollama pull llama3.2-vision # Only needed if using --backend ollama for captioning

oMLX can reuse existing MLX-format model directories and exposes models through /v1/models. MLX_MODEL, OMLX_MODEL, or --omlx-model can select the served model; otherwise ScreenLens uses default. Captioning requires a vision-capable model; text-only models such as DeepSeek V3/V4/R1 and GPT-OSS cannot process frames.
cd screenlens
pip install -e .

The terminal GUI is optional:

pip install -e ".[tui]"

cp .env.example .env
# Edit .env with your oMLX key and model, or export these variables in your shell.
python -m src.cli ingest "Screen Recording 2026-04-04 at 8.33.55 AM.mov"

This uses smart keyframe detection (only captures when the screen actually changes) and the configured oMLX VLM for high-fidelity captions. Dashboard URLs such as http://127.0.0.1:8000/admin/dashboard are normalized to http://127.0.0.1:8000/v1.

python -m src.cli ingest "video.mov" --backend ollama --strategy fixed_fps --fps 1.0

python -m src.cli ingest "video.mov" --omlx-model mlx-community/Qwen3.5-35B-A3B-4bit

Captioning submits up to 4 concurrent oMLX requests per chunk by default. To override:

python -m src.cli ingest "video.mov" --batch-size 8

python -m src.cli batch "/path/to/recordings/"

Each video gets its own data directory under ./data/<video_name>/ with separate frames, captions, embeddings, and ChromaDB collections.

python -m src.cli reconstruct

Scans all folders in ./data/, classifies each recording (Python code, Markdown doc, PDF, or GUI demo), and uses LangGraph deep agents to reconstruct the original artifacts. Features:
- Classification: Auto-detects content type from captions
- Parallel sub-agents: Fan-out via LangGraph Send when tasks are independent
- Reflection QA: Up to 3 iterations of quality review before saving
- Output: Reconstructed files saved to ./data/<video_name>/output/
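The reflection step in the list above can be pictured as a bounded review loop. The sketch below is a framework-free illustration, not the project's LangGraph code: `reflect`, `review`, and `revise` are hypothetical names, and only the cap of 3 iterations comes from the feature list.

```python
from typing import Callable, Tuple

MAX_QA_ITERATIONS = 3  # from the feature list; everything else is illustrative


def reflect(
    draft: str,
    review: Callable[[str], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    max_iterations: int = MAX_QA_ITERATIONS,
) -> Tuple[str, int]:
    """Review a draft up to max_iterations times, revising after each rejection.

    Returns the final draft and the number of reviews performed.
    """
    reviews = 0
    for _ in range(max_iterations):
        accepted, feedback = review(draft)
        reviews += 1
        if accepted:
            break
        # Hypothetical revision hook: a real pipeline would call an LLM here.
        draft = revise(draft, feedback)
    return draft, reviews


if __name__ == "__main__":
    # Toy reviewer: accept once the draft contains a closing marker.
    final, count = reflect(
        "print('hello')",
        review=lambda text: ("END" in text, "missing END marker"),
        revise=lambda text, feedback: text + "\nEND",
    )
    print(count, repr(final))
```

In this sketch a draft that is still rejected on the last review is revised once more and returned, so the loop always terminates with something to save.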
# List served oMLX models, labeled vision / text-only / draft
python -m src.cli models
# Transcribe a recording character-for-character
python -m src.cli transcribe input/policies.mov
# Code recordings: add the Apple Vision deterministic cross-check
python -m src.cli transcribe input/code.mov --deterministic # requires: pip install ocrmac
# Opt in to the LLM seam/indent cleanup pass (off by default)
python -m src.cli transcribe input/doc.mov --cleanup

A separate pipeline from captioning. Instead of describing frames it copies them: it densely samples frames, OCRs each with a vision model (transcribe, never paraphrase), and stitches them in text space to undo scroll overlap. Designed for faithfully recovering source code, docs, and dense text from a scrolling screen recording.

- Two models, two jobs: a vision model reads pixels (OCR_MODEL, default Qwen3.6-27B-bf16); a text model optionally tidies seams (LLM_MODEL). The OCR model is probed with one real frame before processing, so a text-only choice fails instantly instead of producing empty output.
- Thinking disabled for OCR: a reasoning model would otherwise burn its whole token budget on chain-of-thought and never emit the transcription.
- Cleanup is off by default: the raw stitched OCR is already verbatim. When enabled with --cleanup, a per-chunk coverage guard discards any LLM output that drops content and keeps the raw chunk, so transcript.md can never lose text vs. transcript.raw.md.
- Output: data/<slug>/output/transcript.md (+ transcript.raw.md, ocr/all_ocr.json).
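Two ideas from the list above lend themselves to a small illustration: stitching in text space and the coverage guard. The sketch below is a simplified stand-in, not the contents of stitch.py or transcribe.py. The function names are hypothetical, overlap is assumed to be detectable by exact line matches (real OCR output may need fuzzy matching), and "coverage" is assumed to mean that every whitespace-separated token of the raw chunk survives cleanup.

```python
from collections import Counter
from typing import List


def stitch(frames: List[List[str]]) -> List[str]:
    """Merge per-frame OCR lines, dropping lines repeated by scroll overlap."""
    merged: List[str] = []
    for lines in frames:
        overlap = 0
        # Find the longest suffix of the merged text that equals a prefix
        # of the new frame; those lines were already captured.
        for k in range(min(len(merged), len(lines)), 0, -1):
            if merged[-k:] == lines[:k]:
                overlap = k
                break
        merged.extend(lines[overlap:])
    return merged


def guarded_cleanup(raw_chunk: str, cleaned_chunk: str) -> str:
    """Return the cleaned chunk only if it keeps every token of the raw chunk."""
    # Counter subtraction keeps only tokens that occur more often in raw.
    missing = Counter(raw_chunk.split()) - Counter(cleaned_chunk.split())
    return raw_chunk if missing else cleaned_chunk


if __name__ == "__main__":
    print(stitch([["a", "b", "c"], ["b", "c", "d"], ["d", "e"]]))
    print(guarded_cleanup("x = 1  # set x", "x = 1"))
```

Because the guard falls back to the raw chunk whenever a token goes missing, the cleaned transcript in this sketch can reflow whitespace but cannot drop text.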
python -m src.cli info

python -m src.cli tui

The Textual/Rich GUI provides inputs for video paths, data/output directories, and oMLX model selection. Use Ingest + Reconstruct for the main workflow: ingest the video, classify what it shows, and reconstruct the matching output from the new ingested folder.
The hybrid change detector uses three complementary signals to decide when the screen has actually changed:
| Signal | What it detects | Threshold |
|---|---|---|
| SSIM (Structural Similarity) | Pixel-level structural changes | < 0.97 |
| pHash (Perceptual Hash) | Perceptual content changes via DCT | hamming >= 8 |
| HSV Histogram | Color distribution shifts | correlation <= 0.90 |
A keyframe is emitted when any signal triggers AND enough time has passed (min 0.5s). A forced keyframe is always emitted every 4s (configurable) to catch slow scrolls.
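The decision rule above can be written down compactly. The following is a minimal sketch that assumes the three signals have already been computed for a candidate frame against the last keyframe. The threshold values come from the table; `Thresholds`, `hamming`, and `is_keyframe` are hypothetical names, not the API of frame_extractor.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    ssim_below: float = 0.97         # SSIM < 0.97 counts as a change
    hamming_at_least: int = 8        # pHash Hamming distance >= 8
    hist_corr_at_most: float = 0.90  # HSV histogram correlation <= 0.90
    min_gap_s: float = 0.5           # minimum time between keyframes
    max_gap_s: float = 4.0           # forced keyframe interval


def hamming(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_keyframe(
    ssim: float,
    phash_distance: int,
    hist_corr: float,
    seconds_since_last: float,
    t: Thresholds = Thresholds(),
) -> bool:
    """Apply the any-signal-plus-minimum-gap rule, with a forced interval."""
    if seconds_since_last >= t.max_gap_s:
        return True  # forced keyframe, catches slow scrolls
    changed = (
        ssim < t.ssim_below
        or phash_distance >= t.hamming_at_least
        or hist_corr <= t.hist_corr_at_most
    )
    return changed and seconds_since_last >= t.min_gap_s


if __name__ == "__main__":
    print(is_keyframe(0.95, 2, 0.99, seconds_since_last=1.0))
    print(is_keyframe(0.95, 2, 0.99, seconds_since_last=0.2))
```

Keeping the thresholds in one frozen dataclass makes it straightforward to experiment with different values without touching the decision logic.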
For a typical screen recording, this captures roughly 5-15% as many frames as fixed-FPS extraction, which sharply reduces captioning time while still recording each distinct screen.
All settings live in src/config.py (Pydantic models). Key parameters:
| Parameter | Default | Description |
|---|---|---|
| frame_extraction.strategy | keyframe | keyframe (smart) or fixed_fps |
| frame_extraction.max_interval_seconds | 4.0 | Max gap between keyframes |
| captioning.backend | omlx | omlx or ollama |
| captioning.omlx_base_url | http://127.0.0.1:8000/v1 | oMLX OpenAI-compatible API URL; dashboard/root URLs are normalized |
| captioning.omlx_model | null | oMLX model ID; falls back to MLX_MODEL/OMLX_MODEL/LLM_MODEL env vars or default |
| captioning.batch_size | 4 | Concurrent oMLX caption requests per chunk |
| captioning.max_tokens | 1024 | Max tokens per caption |
| embedding.model_name | ViT-B-32 | CLIP model |
| embedding.device | mps | Apple Silicon GPU |
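Two of the settings above involve a fallback or normalization step. The sketch below shows one way such resolution could work; it is not the code in src/config.py. `normalize_base_url` and `resolve_model` are hypothetical helpers, the API is assumed to live at /v1 on the host root, and the precedence among the three environment variables is assumed to follow the order in which the table lists them.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit


def normalize_base_url(url: str) -> str:
    """Map a root or dashboard URL to the API base URL on the same host."""
    parts = urlsplit(url)
    # Assumption: the OpenAI-compatible API is always served under /v1.
    return f"{parts.scheme}://{parts.netloc}/v1"


def resolve_model(
    configured: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the model ID: explicit setting, then env vars, then 'default'."""
    if configured:
        return configured
    # Assumed precedence: the order in which the table names the variables.
    for name in ("MLX_MODEL", "OMLX_MODEL", "LLM_MODEL"):
        value = env.get(name)
        if value:
            return value
    return "default"


if __name__ == "__main__":
    print(normalize_base_url("http://127.0.0.1:8000/admin/dashboard"))
    print(resolve_model(None, env={"OMLX_MODEL": "my-vlm"}))
```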
The default oMLX backend sends each caption as an OpenAI-compatible vision request and lets oMLX handle scheduling, continuous batching, and KV caching server-side. captioning.batch_size controls how many frame requests ScreenLens submits concurrently per chunk.
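The per-chunk concurrency described above can be sketched with a thread pool. This is an illustration under assumptions rather than the project's captioner: `caption_frames` and `caption_one` are hypothetical names, and the real client may well use asyncio instead of threads. Only the default of 4 concurrent requests per chunk comes from the configuration table.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def caption_frames(
    frames: Sequence[str],
    caption_one: Callable[[str], str],
    batch_size: int = 4,
) -> List[str]:
    """Caption frames chunk by chunk, with up to batch_size requests in flight."""
    captions: List[str] = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(frames), batch_size):
            chunk = frames[start:start + batch_size]
            # pool.map preserves input order; extend() waits for the whole
            # chunk to finish before the next chunk is submitted.
            captions.extend(pool.map(caption_one, chunk))
    return captions


if __name__ == "__main__":
    # Stand-in for a real vision request: just echo the frame name.
    fake_caption = lambda path: f"caption for {path}"
    print(caption_frames(["f1.png", "f2.png", "f3.png"], fake_caption, batch_size=2))
```

Because the server handles scheduling and batching, the client side only has to bound how many requests it keeps open at once.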
On Apple Silicon with large vision inputs, prefill (vision encoder + prompt) dominates per-frame time, not decode. The main levers for wall-clock improvement are a smaller VLM, a smaller frame_extraction.max_dimension, and an oMLX concurrency value that matches the host.
src/
config.py # Pydantic configuration (extraction, captioning, embedding, search)
frame_extractor.py # Hybrid keyframe detection + fixed FPS fallback
captioner.py # Backends: oMLX (default) and Ollama
embedder.py # CLIP embedding via OpenCLIP
vector_store.py # ChromaDB storage + search
pipeline.py # LangGraph StateGraph orchestration (ingest/search/summarize)
reconstruct.py # LangGraph deep agents — artifact reconstruction with QA reflection
frame_select.py # Scroll-safe frame selection for transcribe (dense sample + drop near-dupes)
ocr.py # VerbatimOCR — vision OCR with capability guard, probe, anti-loop controls
stitch.py # Text-space stitcher — undo scroll overlap, strip headers/footers
transcribe.py # Verbatim transcription orchestrator + optional coverage-guarded cleanup
omlx_client.py # oMLX OpenAI-compatible client (shared by all pipelines)
cli.py # Typer CLI interface
tui.py # Optional Textual/Rich terminal GUI
data/
frames/ # Extracted keyframe images
captions/ # JSON caption files
chromadb/ # Persistent vector database
tests/
test_pipeline.py # Integration tests
test_cases.yaml # Use-case definitions + computer-use agent script
| Feature | NVIDIA VSS | ScreenLens |
|---|---|---|
| Frame extraction | Custom + TensorRT | Hybrid keyframe detection (SSIM/pHash/HSV) |
| Vision model | NVIDIA VILA | Qwen3.5-VL via oMLX |
| Embeddings | TensorRT Visual Encoder | OpenCLIP ViT-B-32 |
| Vector DB | Milvus | ChromaDB |
| LLM | Llama 3.1 70B (NIM) | Ollama (configurable) |
| Hardware | NVIDIA GPU (DGX) | Apple Silicon (M-series) |
| Deployment | Docker + NIM | pip install |
| Cloud dependency | None (self-hosted) | None (fully local) |
- Harden near-duplicate keyframe filtering (perceptual hash + SSIM fusion threshold tuning)
- Cross-video deduplication for multi-file ingestion
- Consider leveraging Karpathy's autoresearch — its autonomous agent architecture is a natural fit for iterating on dedup thresholds and evaluating detection quality at scale
Pre-configured extraction & captioning strategies tailored to content type:
| Profile | Description | Audio | Typical Source |
|---|---|---|---|
| code | Silent screen recording of browsing / editing code | No | IDE walkthroughs, code reviews |
| demo | Screencast with voice-over demonstrating software | Yes | Product demos, tutorials, onboarding videos |
| pdf | Continuous scroll/browse of a PDF document | No | Recorded PDF read-throughs, slide decks |
| meeting | Video call or presentation recording | Yes | Zoom/Teams recordings, webinars |
Each profile auto-tunes: frame extraction strategy, captioning prompt, chunking window, and whether the audio pipeline is activated.
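Since profiles are a planned feature, the following is only a sketch of how the table above might be expressed in code. `Profile`, `PROFILES`, and `audio_enabled` are hypothetical names. Only the audio flag and typical sources are taken from the table; extraction strategy, prompt, and chunking window are omitted because their values are not specified.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Profile:
    name: str
    audio: bool          # whether the audio pipeline is activated
    typical_source: str  # copied from the profile table


# Hypothetical registry keyed by profile name.
PROFILES: Dict[str, Profile] = {
    p.name: p
    for p in (
        Profile("code", False, "IDE walkthroughs, code reviews"),
        Profile("demo", True, "Product demos, tutorials, onboarding videos"),
        Profile("pdf", False, "Recorded PDF read-throughs, slide decks"),
        Profile("meeting", True, "Zoom/Teams recordings, webinars"),
    )
}


def audio_enabled(profile_name: str) -> bool:
    """Look up whether a profile turns on the audio pipeline."""
    return PROFILES[profile_name].audio


if __name__ == "__main__":
    for name in PROFILES:
        print(name, audio_enabled(name))
```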
- Integrate Whisper speech-to-text via ONNX Runtime and/or MLX
- Support model sizes: small, medium, large
- Word-level timestamps aligned to keyframe timeline
- Fused caption+transcript context for richer semantic search
- Profile-aware activation: auto-enabled for demo and meeting, skipped for code and pdf
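Aligning word-level timestamps to the keyframe timeline, as planned in the list above, could be done with a binary search. The sketch below is one possible approach and not an existing module: `align_words` is a hypothetical name, keyframe times are assumed to be sorted and non-empty, and each word is assumed to arrive as a (start_seconds, text) pair.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def align_words(
    keyframe_times: Sequence[float],
    words: Sequence[Tuple[float, str]],
) -> Dict[int, List[str]]:
    """Group words under the latest keyframe at or before each word's start.

    keyframe_times must be sorted ascending and non-empty. Words spoken
    before the first keyframe are attached to the first keyframe.
    """
    grouped: Dict[int, List[str]] = {i: [] for i in range(len(keyframe_times))}
    for start, text in words:
        index = max(bisect_right(keyframe_times, start) - 1, 0)
        grouped[index].append(text)
    return grouped


if __name__ == "__main__":
    keyframes = [0.0, 4.0, 8.0]
    spoken = [(0.5, "open"), (3.9, "the"), (4.0, "file"), (9.2, "now")]
    print(align_words(keyframes, spoken))
```

Grouping words per keyframe would give each caption a matching slice of transcript, which is the kind of fused context the list mentions for semantic search.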
Agentic pipelines that consume ingestion results and produce structured deliverables:
- Manual Generator (demo profile): Watch a software demo and auto-generate a step-by-step user manual with extracted screenshots, annotated UI elements, and navigation flow
- PDF Summary (pdf profile): Ingest a screen-recorded PDF browse and produce a structured summary document preserving headings, key points, and referenced figures
- Source Code Reconstruction (code profile): Scan a code walkthrough video and reconstruct/export the visible source files, function signatures, and project structure
- Meeting Notes (meeting profile): Transcribe + summarize a recorded meeting with action items, decisions, and speaker attribution
Each generator is implemented as a LangGraph sub-graph with its own state machine, allowing composition, retry, and human-in-the-loop review before final export.
