feat: add VideoModalProcessor with visual + audio dual-channel analysis by liuruing · Pull Request #281 · HKUDS/RAG-Anything

liuruing · 2026-05-22T13:26:15Z

Summary

Add support for video file processing (MP4, MOV, WebM, AVI, MKV, FLV, WMV, M4V) using a visual + audio dual-channel approach.

This builds on #280 (AudioModalProcessor) and enables RAG-Anything to index and retrieve content from:

🎥 Meeting recordings (screen share + voice)
🎓 Lectures and tutorials (slides + narration)
📱 Product demos (UI operations + voiceover)
📹 Surveillance / inspection videos (visual scenes)
🎬 Any video content

Architecture

Video File
    │
    ├── Visual Channel                    Audio Channel
    │   SceneDetect (scene boundaries)    moviepy (extract audio track)
    │   OpenCV (keyframe at mid-scene)    faster-whisper (transcribe)
    │   VLM (describe frame)
    │                                     
    └── Merge by timestamp alignment
        ↓
    [0:00-0:30] 画面 (visual)：PPT showing chart | 语音 (speech)：Revenue grew 23%...
    [0:30-1:05] 画面 (visual)：Demo of product   | 语音 (speech)：Click settings then...
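
The "merge by timestamp alignment" step could look roughly like the sketch below. It is not taken from the PR: the Scene and Utterance types, the merge_channels name, and the rule "assign each utterance to the scene it overlaps most" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float        # seconds
    end: float          # seconds
    description: str    # VLM caption of the mid-scene keyframe


@dataclass
class Utterance:
    start: float        # seconds
    end: float          # seconds
    text: str           # transcribed speech


def merge_channels(scenes, utterances):
    """Attach each utterance to the scene it overlaps most (hypothetical rule)."""
    buckets = [[] for _ in scenes]
    for u in utterances:
        best, best_overlap = None, 0.0
        for i, s in enumerate(scenes):
            # Length of the intersection of the two time intervals.
            overlap = min(s.end, u.end) - max(s.start, u.start)
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        if best is not None:
            buckets[best].append(u.text)
    return [
        {
            "start": s.start,
            "end": s.end,
            "visual": s.description,
            "speech": " ".join(texts),
        }
        for s, texts in zip(scenes, buckets)
    ]
```

Under this rule, an utterance that straddles a scene boundary goes to the scene holding most of it, and an utterance that overlaps no scene is dropped. A real implementation may prefer to split or keep such utterances.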

Key Design Decisions

Scene-aware splitting (SceneDetect) instead of fixed intervals — produces natural segments aligned with content changes
Dual-channel merge — visual + audio aligned by time, giving complete context
Graceful degradation — works with video-only (surveillance) or audio-only content
Lazy imports — scenedetect/moviepy/cv2 only imported when first video is processed
Reuses AudioModalProcessor — shares whisper model instance for transcription
Optional dependency — pip install raganything[video]
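
As an illustration of the graceful-degradation point, a segment renderer might simply omit whichever channel is missing. The render_segment helper and the exact line layout are assumptions modeled on the Architecture example, not the PR's actual formatting code.

```python
def render_segment(time_range, visual=None, speech=None):
    """Render one merged segment, skipping an absent channel (hypothetical helper)."""
    parts = []
    if visual:
        parts.append(f"画面：{visual}")   # 画面 = visual channel
    if speech:
        parts.append(f"语音：{speech}")   # 语音 = speech channel
    if not parts:
        return None  # nothing to index for this segment
    return f"[{time_range}] " + " | ".join(parts)
```

A surveillance clip with no audio track then yields lines carrying only the visual part, and an audio-only recording yields only the speech part.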

Changes

raganything/modalprocessors_video.py — New VideoModalProcessor class
raganything/__init__.py — Optional export of VideoModalProcessor
raganything/config.py — Extended SUPPORTED_FILE_EXTENSIONS with video formats
raganything/processor.py — Added "video" content type in _apply_chunk_template
pyproject.toml — Added [video] optional dependency group
env.example — Added VIDEO_SCENE_THRESHOLD / VIDEO_MIN_SCENE_DURATION / VIDEO_MAX_SCENES
tests/test_video_processor.py — Unit tests
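
One way the three environment variables added to env.example could be read is sketched below. The VideoSettings class, the load_video_settings function, and all default values are assumptions: 5.0 and 50 are borrowed from the Usage example, and 27.0 is the default threshold of PySceneDetect's ContentDetector. The PR's real defaults may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSettings:
    scene_threshold: float
    min_scene_duration: float
    max_scenes: int


def load_video_settings(env=None):
    """Read the video settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return VideoSettings(
        # All fallback values below are assumed, not confirmed by the PR.
        scene_threshold=float(env.get("VIDEO_SCENE_THRESHOLD", "27.0")),
        min_scene_duration=float(env.get("VIDEO_MIN_SCENE_DURATION", "5.0")),
        max_scenes=int(env.get("VIDEO_MAX_SCENES", "50")),
    )
```

Passing a plain dict instead of os.environ keeps the function easy to unit test.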

Dependencies

scenedetect[opencv]>=0.6.0  # Scene boundary detection
moviepy>=2.0.0              # Audio track extraction
faster-whisper>=1.0.0       # Speech transcription (shared with audio PR)
opencv-python>=4.8.0        # Keyframe extraction

Usage

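import asyncio

from raganything import VideoModalProcessor


async def main():
    # rag_instance (an initialized LightRAG instance) and caption_func
    # (the VLM captioning callable) are assumed to be created elsewhere.
    processor = VideoModalProcessor(
        lightrag=rag_instance,
        modal_caption_func=caption_func,
        whisper_model="large-v3",
        min_scene_duration=5.0,
        max_scenes=50,
    )

    result = await processor.process_multimodal_content(
        modal_content={"video_path": "/path/to/meeting.mp4"},
        content_type="video",
    )
    return result


asyncio.run(main())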

Applicable Scenarios

Scenario	Visual	Audio	Output
Meeting recording	Screen share	Discussion	`画面 (visual)：Q3 chart
Phone call	—	Full transcript	`[0:00-0:45] Customer reports a latency issue...`
Tutorial	Slides/code	Narration	`画面 (visual)：Architecture diagram
Product demo	UI steps	Voiceover	`画面 (visual)：Click settings
Surveillance	Visual scenes	(none)	`画面 (visual)：Insulator crack on tower No. 10`

Test plan

Unit tests for file detection, timestamp formatting, transcript filtering, channel merging
Mock-based integration tests for generate_description_only
End-to-end test with real video (requires full dependency install)
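
For the timestamp-formatting unit tests, a helper of roughly the following shape would reproduce the ranges shown in the Architecture example. The names format_timestamp and format_range are hypothetical; the PR's own helper may be named and behave differently.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS once an hour is reached."""
    total = max(0, int(seconds))  # truncate fractions, clamp negatives to 0
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def format_range(start: float, end: float) -> str:
    """Format a segment as [start-end], e.g. for the merged output lines."""
    return f"[{format_timestamp(start)}-{format_timestamp(end)}]"
```

Because the helper is pure Python, tests for it need none of the optional video dependencies.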

Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.) using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.

Add support for video file processing (MP4, MOV, WebM, AVI, MKV, etc.) using a dual-channel approach:
- Visual: SceneDetect for scene boundaries + OpenCV keyframe + VLM description
- Audio: moviepy audio extraction + faster-whisper transcription
- Merged output aligned by timestamps

Key features:
- Scene-aware splitting (not fixed intervals)
- Dual-channel merged output: [时间 (time)] 画面 (visual)：... | 语音 (speech)：...
- Handles video-only (surveillance) and audio-only gracefully
- Lazy dependency loading
- Added as optional dependency: pip install raganything[video]

Use cases: meeting recordings, lectures, product demos, surveillance.

LarFii · 2026-06-01T06:42:57Z

Thanks for extending the audio work into video processing. The dual visual/audio-channel direction is useful and worth continuing.

I retested this PR against the current main. The branch merges cleanly and git diff --check passes, but the focused video/audio tests are still blocked during test collection:

PYTHONPATH=. python -m pytest -q tests/test_audio_processor.py tests/test_video_processor.py
# collection error in tests/test_video_processor.py

The immediate blocker is that raganything/modalprocessors_video.py imports cv2 at module import time. In an environment with NumPy 2.x and an incompatible OpenCV wheel, simply importing the module fails before tests can run:

ImportError: numpy.core.multiarray failed to import
AttributeError: _ARRAY_API not found

Because video support is optional, heavyweight video dependencies such as OpenCV, SceneDetect, and MoviePy should be imported lazily inside the methods that actually need them. Basic imports and helpers such as is_video_file() should work even when optional video dependencies are not installed or are incompatible.
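
A minimal sketch of the lazy-import pattern described above follows. The helper names (_require_cv2, read_mid_scene_frame) and the VIDEO_EXTENSIONS constant are illustrative assumptions and not the module's actual contents; only is_video_file is named in this thread, and its real signature may differ.

```python
from pathlib import Path

# Extensions listed in the PR summary.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv", ".flv", ".wmv", ".m4v"}


def is_video_file(path) -> bool:
    """Pure-Python check: works with no optional video dependency installed."""
    return Path(path).suffix.lower() in VIDEO_EXTENSIONS


def _require_cv2():
    """Import OpenCV only when a frame is actually needed (hypothetical helper)."""
    try:
        import cv2
    except (ImportError, AttributeError) as exc:
        # A missing package or a NumPy/OpenCV binary mismatch both land here.
        raise ImportError(
            "Video support needs a working OpenCV install: "
            "pip install raganything[video]"
        ) from exc
    return cv2


def read_mid_scene_frame(video_path, start_sec, end_sec):
    """Grab the frame at the midpoint of a scene, or None if it cannot be read."""
    cv2 = _require_cv2()
    cap = cv2.VideoCapture(str(video_path))
    try:
        cap.set(cv2.CAP_PROP_POS_MSEC, (start_sec + end_sec) / 2 * 1000.0)
        ok, frame = cap.read()
        return frame if ok else None
    finally:
        cap.release()
```

With this layout, importing the module and calling is_video_file() never touches cv2, so test collection no longer depends on the state of the OpenCV wheel.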

This PR also includes the audio processor from #280, so it inherits the current audio test failures there as well. A good next step would be to first get #280's audio tests passing, then update this video module so optional dependencies are lazy and the video tests can be collected reliably.

Thanks again for the substantial work here.

LarFii · 2026-06-01T08:46:15Z

Thanks @liuruing — this is a great, clean implementation and it's the basis for #292. 🙌

#292 keeps your processing engine and unit tests, and adds the pipeline integration this PR was missing so the processors are actually invoked end-to-end:

registers audio/video in _initialize_processors behind guarded optional imports
routes audio/video in get_processor_for_type (these previously fell back to the generic processor)
adds enable_audio_processing / enable_video_processing config flags
makes cv2 a lazy import so that import raganything does not break when OpenCV is absent
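
The guarded registration and routing could be sketched as below. This is an assumption-laden illustration, not the code in #292: the OPTIONAL_PROCESSORS table, the _load_optional helper, the function signatures, and the audio module path are all hypothetical. Only raganything/modalprocessors_video.py and the two class names appear in this thread.

```python
import importlib

# content type -> (module path, class name); the audio module path is assumed.
OPTIONAL_PROCESSORS = {
    "audio": ("raganything.modalprocessors_audio", "AudioModalProcessor"),
    "video": ("raganything.modalprocessors_video", "VideoModalProcessor"),
}


def _load_optional(module_name, class_name):
    """Return the class, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Broad on purpose: a broken binary wheel can raise more than ImportError.
        return None
    return getattr(module, class_name, None)


def initialize_processors(enabled, specs=OPTIONAL_PROCESSORS):
    """Register only the processors that are enabled and importable."""
    processors = {}
    for content_type, (module_name, class_name) in specs.items():
        if not enabled.get(content_type, False):
            continue
        cls = _load_optional(module_name, class_name)
        if cls is not None:
            processors[content_type] = cls
    return processors


def get_processor_for_type(processors, content_type):
    """Route to the specific processor, falling back to 'generic' if present."""
    return processors.get(content_type, processors.get("generic"))
```

Here the enabled mapping stands in for the enable_audio_processing / enable_video_processing flags, for example {"audio": True, "video": False}.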

It's verified end-to-end (real transcription/scene-analysis → insert → retrieve). You're credited via co-authorship in #292. Leaving this open for now — it can be closed once #292 lands.

array added 2 commits May 22, 2026 16:53

This was referenced Jun 1, 2026:

feat: audio & video modal processors (integrated, end-to-end tested) #292 (Open)
feat: add AudioModalProcessor for speech-to-text transcription #280 (Closed)

liuruing wants to merge 2 commits into HKUDS:main from liuruing:feature/video-modal-processor
