Skip to content

feat: add VideoModalProcessor with visual + audio dual-channel analysis#281

Open
liuruing wants to merge 2 commits into
HKUDS:mainfrom
liuruing:feature/video-modal-processor
Open

feat: add VideoModalProcessor with visual + audio dual-channel analysis#281
liuruing wants to merge 2 commits into
HKUDS:mainfrom
liuruing:feature/video-modal-processor

Conversation

@liuruing

Copy link
Copy Markdown

Summary

Add support for video file processing (MP4, MOV, WebM, AVI, MKV, FLV, WMV, M4V) using a visual + audio dual-channel approach.

This builds on #280 (AudioModalProcessor) and enables RAG-Anything to index and retrieve content from:

  • 🎥 Meeting recordings (screen share + voice)
  • 🎓 Lectures and tutorials (slides + narration)
  • 📱 Product demos (UI operations + voiceover)
  • 📹 Surveillance / inspection videos (visual scenes)
  • 🎬 Any video content

Architecture

Video File
    │
    ├── Visual Channel                    Audio Channel
    │   SceneDetect (scene boundaries)    moviepy (extract audio track)
    │   OpenCV (keyframe at mid-scene)    faster-whisper (transcribe)
    │   VLM (describe frame)
    │                                     
    └── Merge by timestamp alignment
        ↓
    [0:00-0:30] 画面:PPT showing chart | 语音:Revenue grew 23%...
    [0:30-1:05] 画面:Demo of product   | 语音:Click settings then...

Key Design Decisions

  • Scene-aware splitting (SceneDetect) instead of fixed intervals — produces natural segments aligned with content changes
  • Dual-channel merge — visual + audio aligned by time, giving complete context
  • Graceful degradation — works with video-only (surveillance) or audio-only content
  • Lazy imports — scenedetect/moviepy/cv2 only imported when first video is processed
  • Reuses AudioModalProcessor — shares whisper model instance for transcription
  • Optional dependencypip install raganything[video]

Changes

  • raganything/modalprocessors_video.py — New VideoModalProcessor class
  • raganything/__init__.py — Optional export of VideoModalProcessor
  • raganything/config.py — Extended SUPPORTED_FILE_EXTENSIONS with video formats
  • raganything/processor.py — Added "video" content type in _apply_chunk_template
  • pyproject.toml — Added [video] optional dependency group
  • env.example — Added VIDEO_SCENE_THRESHOLD / VIDEO_MIN_SCENE_DURATION / VIDEO_MAX_SCENES
  • tests/test_video_processor.py — Unit tests

Dependencies

scenedetect[opencv]>=0.6.0  # Scene boundary detection
moviepy>=2.0.0              # Audio track extraction
faster-whisper>=1.0.0       # Speech transcription (shared with audio PR)
opencv-python>=4.8.0        # Keyframe extraction

Usage

from raganything import VideoModalProcessor

processor = VideoModalProcessor(
    lightrag=rag_instance,
    modal_caption_func=caption_func,
    whisper_model="large-v3",
    min_scene_duration=5.0,
    max_scenes=50,
)

result = await processor.process_multimodal_content(
    modal_content={"video_path": "/path/to/meeting.mp4"},
    content_type="video",
)

Applicable Scenarios

Scenario Visual Audio Output
Meeting recording Screen share Discussion `画面:Q3 chart
Phone call Full transcript [0:00-0:45] 客户反馈延迟问题...
Tutorial Slides/code Narration `画面:Architecture diagram
Product demo UI steps Voiceover `画面:Click settings
Surveillance Visual scenes (none) 画面:10号杆塔绝缘子裂纹

Test plan

  • Unit tests for file detection, timestamp formatting, transcript filtering, channel merging
  • Mock-based integration tests for generate_description_only
  • End-to-end test with real video (requires full dependency install)

Related

🤖 Generated with Claude Code

array added 2 commits May 22, 2026 16:53
Add support for audio file processing (MP3, WAV, FLAC, M4A, OGG, etc.)
using faster-whisper for local ASR transcription.

Key features:
- Timestamped transcription output for precise retrieval
- VAD filtering to skip silence
- Lazy model loading (only loads whisper when first audio is processed)
- Configurable via WHISPER_MODEL and WHISPER_LANGUAGE env vars
- Added as optional dependency: pip install raganything[audio]

Use cases: meeting recordings, phone calls, podcasts, lectures.
Add support for video file processing (MP4, MOV, WebM, AVI, MKV, etc.)
using a dual-channel approach:
- Visual: SceneDetect for scene boundaries + OpenCV keyframe + VLM description
- Audio: moviepy audio extraction + faster-whisper transcription
- Merged output aligned by timestamps

Key features:
- Scene-aware splitting (not fixed intervals)
- Dual-channel merged output: [时间] 画面:... | 语音:...
- Handles video-only (surveillance) and audio-only gracefully
- Lazy dependency loading
- Added as optional dependency: pip install raganything[video]

Use cases: meeting recordings, lectures, product demos, surveillance.
@LarFii

LarFii commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Thanks for extending the audio work into video processing. The dual visual/audio-channel direction is useful and worth continuing.

I retested this PR against the current main. The branch merges cleanly and git diff --check passes, but the focused video/audio tests are still blocked during test collection:

PYTHONPATH=. python -m pytest -q tests/test_audio_processor.py tests/test_video_processor.py
# collection error in tests/test_video_processor.py

The immediate blocker is that raganything/modalprocessors_video.py imports cv2 at module import time. In an environment with NumPy 2.x and an incompatible OpenCV wheel, simply importing the module fails before tests can run:

ImportError: numpy.core.multiarray failed to import
AttributeError: _ARRAY_API not found

Because video support is optional, heavyweight video dependencies such as OpenCV, SceneDetect, and MoviePy should be imported lazily inside the methods that actually need them. Basic imports and helpers such as is_video_file() should work even when optional video dependencies are not installed or are incompatible.

This PR also includes the audio processor from #280, so it inherits the current audio test failures there as well. A good next step would be to first get #280's audio tests passing, then update this video module so optional dependencies are lazy and the video tests can be collected reliably.

Thanks again for the substantial work here.

@LarFii

LarFii commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Thanks @liuruing — this is a great, clean implementation and it's the basis for #292. 🙌

#292 keeps your processing engine and unit tests, and adds the pipeline integration this PR was missing so the processors are actually invoked end-to-end:

  • register audio/video in _initialize_processors behind guarded optional imports
  • route audio/video in get_processor_for_type (was falling back to generic)
  • enable_audio_processing / enable_video_processing config flags
  • made cv2 a lazy import so import raganything doesn't break when opencv is absent

It's verified end-to-end (real transcription/scene-analysis → insert → retrieve). You're credited via co-authorship in #292. Leaving this open for now — it can be closed once #292 lands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants