feat: add VideoModalProcessor with visual + audio dual-channel analysis #281
liuruing wants to merge 2 commits into
Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.) using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.
Add support for video file processing (MP4, MOV, WebM, AVI, MKV, etc.) using a dual-channel approach:
- Visual: SceneDetect for scene boundaries + OpenCV keyframe + VLM description
- Audio: moviepy audio extraction + faster-whisper transcription
- Merged output aligned by timestamps

Key features:
- Scene-aware splitting (not fixed intervals)
- Dual-channel merged output: [时间] 画面:... | 语音:... (that is, [time] visual: ... | speech: ...)
- Handles video-only (surveillance) and audio-only gracefully
- Lazy dependency loading
- Added as optional dependency: pip install raganything[video]

Use cases: meeting recordings, lectures, product demos, surveillance.
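One way the timestamp-aligned merge could work is sketched below. It is an illustration under stated assumptions, not the PR's implementation: the `Scene` and `Segment` types, the midpoint rule for assigning speech to a scene, and the English `Visual:` / `Audio:` labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float
    end: float
    description: str  # e.g. a VLM description of the scene's keyframe


@dataclass
class Segment:
    start: float
    end: float
    text: str  # transcribed speech


def fmt(seconds: float) -> str:
    """Render seconds as M:SS."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"


def merge_channels(scenes: list[Scene], segments: list[Segment]) -> list[str]:
    """Produce one merged line per scene, aligned by timestamps."""
    if not scenes:
        # Audio-only input: fall back to one line per speech segment.
        return [
            f"[{fmt(s.start)}-{fmt(s.end)}] Audio: {s.text.strip()}"
            for s in segments
        ]
    lines = []
    for scene in scenes:
        # Assumed rule: a segment belongs to the scene containing its midpoint.
        spoken = " ".join(
            seg.text.strip()
            for seg in segments
            if scene.start <= (seg.start + seg.end) / 2 < scene.end
        )
        parts = []
        if scene.description:
            parts.append(f"Visual: {scene.description}")
        if spoken:
            # Video-only input simply never reaches this branch.
            parts.append(f"Audio: {spoken}")
        lines.append(f"[{fmt(scene.start)}-{fmt(scene.end)}] " + " | ".join(parts))
    return lines
```

The midpoint rule keeps each speech segment in exactly one scene, which avoids duplicated text when a sentence straddles a scene cut; splitting segments at the boundary would be an alternative.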
Thanks for extending the audio work into video processing. The dual visual/audio-channel direction is useful and worth continuing.

I retested this PR against the current [...] The immediate blocker is that [...]

Because video support is optional, heavyweight video dependencies such as OpenCV, SceneDetect, and MoviePy should be imported lazily inside the methods that actually need them. Basic imports and helpers such as [...]

This PR also includes the audio processor from #280, so it inherits the current audio test failures there as well. A good next step would be to first get #280's audio tests passing, then update this video module so optional dependencies are lazy and the video tests can be collected reliably.

Thanks again for the substantial work here.
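The lazy-import pattern requested in this review might look like the sketch below. The `lazy_import` helper and the `KeyframeExtractor` class are hypothetical names used for illustration; only the `cv2` calls are real OpenCV API.

```python
import importlib


def lazy_import(module_name: str, extra: str):
    """Import an optional module on demand, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature; "
            f"install it with: pip install raganything[{extra}]"
        ) from exc


class KeyframeExtractor:
    """Hypothetical helper: defining and constructing it never imports OpenCV."""

    def read_frame(self, video_path: str, frame_index: int):
        # OpenCV is imported only when a frame is actually requested.
        cv2 = lazy_import("cv2", "video")
        capture = cv2.VideoCapture(video_path)
        try:
            capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
            ok, frame = capture.read()
            return frame if ok else None
        finally:
            capture.release()
```

With imports deferred this way, a test module can import the processor and be collected even when the video extras are not installed; only tests that call `read_frame` need OpenCV.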
Thanks @liuruing — this is a great, clean implementation and it's the basis for #292. 🙌 #292 keeps your processing engine and unit tests, and adds the pipeline integration this PR was missing so the processors are actually invoked end-to-end:
It's verified end-to-end (real transcription/scene-analysis → insert → retrieve). You're credited via co-authorship in #292. Leaving this open for now; it can be closed once #292 lands.
Summary
Add support for video file processing (MP4, MOV, WebM, AVI, MKV, FLV, WMV, M4V) using a visual + audio dual-channel approach.
This builds on #280 (AudioModalProcessor) and enables RAG-Anything to index and retrieve content from:
Architecture
Key Design Decisions
pip install raganything[video]
Changes
- raganything/modalprocessors_video.py: New VideoModalProcessor class
- raganything/__init__.py: Optional export of VideoModalProcessor
- raganything/config.py: Extended SUPPORTED_FILE_EXTENSIONS with video formats
- raganything/processor.py: Added "video" content type in _apply_chunk_template
- pyproject.toml: Added [video] optional dependency group
- env.example: Added VIDEO_SCENE_THRESHOLD / VIDEO_MIN_SCENE_DURATION / VIDEO_MAX_SCENES
- tests/test_video_processor.py: Unit tests
Dependencies
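Because the video packages are optional, a quick check of which ones are importable can help when diagnosing a failed setup. The sketch below is not part of the PR; the function name is hypothetical and the package list is an assumption based on the libraries named in this thread (OpenCV, SceneDetect, MoviePy, faster-whisper), since the exact contents of the `[video]` group are not shown here.

```python
from importlib.util import find_spec

# Assumed mapping of package label -> importable module name.
VIDEO_MODULES = {
    "opencv": "cv2",
    "scenedetect": "scenedetect",
    "moviepy": "moviepy",
    "faster-whisper": "faster_whisper",
}


def missing_video_dependencies(modules: dict[str, str] = VIDEO_MODULES) -> list[str]:
    """Return the labels of modules that cannot be found, without importing them."""
    return sorted(
        label for label, module in modules.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_video_dependencies()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try: pip install raganything[video]")
    else:
        print("All video dependencies found.")
```

`find_spec` only locates the module, so the check stays cheap and does not load heavyweight libraries such as OpenCV.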
Usage
Applicable Scenarios
[0:00-0:45] Customer reports a latency issue...
Visual: crack in the insulator on tower No. 10
Test plan
Related
🤖 Generated with Claude Code