CCTV Video Filtering Module

Note: This project was completed during an internship at PIASpace. Source code is not publicly available due to confidentiality restrictions.

A VLM-powered video classification module that automatically identifies CCTV and surveillance footage using frame-level analysis with InternVL and Qwen-VL models.

Models: InternVL3-1B, InternVL3-2B, Qwen3-VL-2B
Input: Single videos or batch JSONL datasets
Output: JSONL predictions with class labels, confidence scores, and reasoning
Target Anomaly Classes: Fire, Smoke, Violence, Falldown

The first half of this README covers the Filtering Module in detail. The second half, starting at "Automatic CCTV Video Dataset Construction", covers the full pipeline (Downloader → Filtering → Refinement).

Demo Samples

Module Overview

Input Example

Input video sample: YouTube demo

Prompt Example:

Prompt design had a strong impact on accuracy: prompts that used CCTV-specific terminology (surveillance camera, security footage, fixed camera, incident context) classified videos more accurately than generic prompts.

Output Example

Lightweight Baseline

A grayscale-difference heuristic was also tested as a fast, non-VLM baseline: videos with little pixel-level change between consecutive frames are more likely to come from fixed surveillance cameras, because the background stays still. While useful for pre-filtering, the VLM approach ultimately delivered higher accuracy.
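A minimal sketch of a grayscale-difference score, assuming frames are already decoded into RGB NumPy arrays. The function names, the BT.601 grayscale weights, and the synthetic clips are illustrative assumptions, not the project's implementation; how the score is thresholded is left to the caller.

```python
import numpy as np

def to_gray(frame: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB frame to float32 grayscale (BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return frame.astype(np.float32) @ weights

def mean_gray_diff(frames: list) -> float:
    """Mean absolute grayscale difference between consecutive frames (0 to 255)."""
    if len(frames) < 2:
        return 0.0
    grays = [to_gray(f) for f in frames]
    diffs = [np.abs(b - a).mean() for a, b in zip(grays, grays[1:])]
    return float(np.mean(diffs))

# Synthetic demo: a fixed view with one small moving patch vs. unrelated frames.
rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)
static_clip = []
for i in range(5):
    frame = background.copy()
    frame[10:14, 4 * i:4 * i + 4] = 255  # small object crossing a fixed scene
    static_clip.append(frame)
dynamic_clip = [rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8) for _ in range(5)]

static_score = mean_gray_diff(static_clip)
dynamic_score = mean_gray_diff(dynamic_clip)
print(f"static={static_score:.2f} dynamic={dynamic_score:.2f}")
```

In this toy setup the fixed-view clip scores far lower than the clip of unrelated frames, since only the small patch changes between frames.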

Performance

Dataset	Model	Videos	Accuracy
Small labeled test set	Qwen3-VL-2B	18	94.44%
Large-scale internal dataset	InternVL3-2B	425	83.53%

Qwen3-VL-2B achieved the highest accuracy, measured on the small curated test set with optimized prompts.
InternVL3-2B performed best on the larger, real-world internal dataset.
Prompt engineering was critical: prompts tailored with surveillance-specific vocabulary consistently outperformed generic ones.
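The reported percentages are consistent with 17 of 18 and 355 of 425 correct predictions. These counts are inferred from the rounded figures in the table and are not stated in the source. A quick check:

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)

small_set = accuracy_pct(17, 18)     # inferred count, small labeled test set
large_set = accuracy_pct(355, 425)   # inferred count, internal dataset
print(small_set, large_set)
```

On the 18-video set a single video changes accuracy by about 5.6 percentage points, so the two rows are not directly comparable.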

Prerequisites

A GPU with sufficient VRAM and the following dependencies:

pip install torch torchvision opencv-python numpy pillow pyyaml transformers accelerate

The base_model.py file containing InternVLModel and QwenVLModel class definitions is also required (included in filtering_module/).

Project Structure

.
├── filtering_module/
│   ├── base_model.py       # Model wrappers for InternVL/Qwen
│   └── main.py             # Main execution script
├── config/
│   └── config.yaml         # Configuration parameters
├── prompt/                 # The instruction sent to the model
├── dataset.jsonl           # List of videos and ground truth
├── videos/                 # Folder containing .mp4/.avi files
└── results/                # Generated JSONL prediction logs

Configuration

1. config.yaml

model_type: "both"          # Options: internvl, qwenvl, both
output_mode: json           # Options: json, text
num_frames: 8
prompt_path: "prompt.txt"
dataset_path: "dataset.jsonl"
video_folder: "videos"

models:
  - name: "InternVL3-1B"
    type: "internvl"
    path: "path/to/internvl/model"
  - name: "Qwen3-VL-2B"
    type: "qwenvl"
    path: "path/to/qwen/model"
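A minimal sketch of how `model_type` could select entries from `models`, assuming the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). The helper name `select_models` is hypothetical, and the real `main.py` may resolve this differently.

```python
VALID_MODEL_TYPES = {"internvl", "qwenvl", "both"}

def select_models(config: dict) -> list:
    """Return the model entries that match config['model_type']."""
    model_type = config.get("model_type", "both")
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    models = config.get("models", [])
    if model_type == "both":
        return list(models)
    return [m for m in models if m.get("type") == model_type]

# Same structure as the config.yaml example, written as a parsed dict.
config = {
    "model_type": "qwenvl",
    "num_frames": 8,
    "models": [
        {"name": "InternVL3-1B", "type": "internvl", "path": "path/to/internvl/model"},
        {"name": "Qwen3-VL-2B", "type": "qwenvl", "path": "path/to/qwen/model"},
    ],
}
selected = select_models(config)
print([m["name"] for m in selected])
```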

2. dataset.jsonl

{"video": "sample1.mp4", "is_cctv": true}
{"video": "sample2.mp4", "is_cctv": false}
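A minimal sketch of reading this file, assuming one JSON object per line with a string `video` and a boolean `is_cctv`. The helper `load_dataset` and its validation rules are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

def load_dataset(dataset_path, video_folder="videos"):
    """Return (video_path, is_cctv) pairs from a JSONL file, skipping blank lines."""
    samples = []
    with open(dataset_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if not isinstance(record.get("video"), str) or not isinstance(record.get("is_cctv"), bool):
                raise ValueError(f"{dataset_path}:{line_no}: expected 'video' (str) and 'is_cctv' (bool)")
            samples.append((Path(video_folder) / record["video"], record["is_cctv"]))
    return samples

# Demo with a temporary file holding the two example records.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.jsonl"
    path.write_text(
        '{"video": "sample1.mp4", "is_cctv": true}\n'
        '{"video": "sample2.mp4", "is_cctv": false}\n',
        encoding="utf-8",
    )
    samples = load_dataset(path)
print(samples)
```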

3. Prompt

The prompt instructs the model to return a JSON object. For best results, include CCTV-specific terminology:

Analyze these frames from a video. Determine if this is CCTV/surveillance footage.
Look for indicators such as: fixed camera angle, surveillance camera placement,
security footage characteristics, timestamp overlays, and static perspective.
Return a JSON object with keys: 'class' (either 'cctv' or 'non-cctv'),
'confidence' (0-1), and 'reasoning' (string).
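Small models do not always return clean JSON, so the response usually needs defensive parsing. The sketch below is one possible approach, not the project's parser: it extracts the first `{...}` span, accepts either JSON or a Python-style dict with single quotes, and validates the three keys. The function name and the clamping of `confidence` are assumptions.

```python
import ast
import json
import re

ALLOWED_CLASSES = {"cctv", "non-cctv"}

def parse_response(text: str) -> dict:
    """Extract and validate the classification object from raw model text."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no object found in model response")
    raw = match.group(0)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        try:
            data = ast.literal_eval(raw)  # tolerate single-quoted keys
        except (ValueError, SyntaxError) as exc:
            raise ValueError("model response is not a parseable object") from exc
    if not isinstance(data, dict):
        raise ValueError("model response is not an object")
    label = str(data.get("class", "")).strip().lower()
    if label not in ALLOWED_CLASSES:
        raise ValueError(f"unexpected class: {label!r}")
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return {"class": label, "confidence": confidence, "reasoning": str(data.get("reasoning", ""))}

r1 = parse_response('Here is the result:\n{"class": "CCTV", "confidence": 1.2, "reasoning": "fixed angle"}')
r2 = parse_response("{'class': 'non-cctv', 'confidence': 0.4, 'reasoning': 'handheld'}")
print(r1, r2)
```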

Usage

python filtering_module/main.py --config config/config.yaml

Command Line Arguments

Argument	Default	Description
`--config`	`config/config.yaml`	Path to YAML configuration file
`--quiet` / `-q`	`False`	Reduce output to final summaries only
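The two arguments in the table map directly onto `argparse`. This is a sketch of an equivalent parser, not the actual code in `main.py`; the description and help strings are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented --config and --quiet/-q arguments."""
    parser = argparse.ArgumentParser(description="CCTV video filtering")
    parser.add_argument("--config", default="config/config.yaml",
                        help="Path to YAML configuration file")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Reduce output to final summaries only")
    return parser

defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--config", "custom.yaml", "-q"])
print(defaults.config, defaults.quiet, custom.config, custom.quiet)
```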

Output & Results

The script provides detailed terminal output for every video processed:

======================================================================
CCTV DETECTION RESULT - InternVL3-1B
======================================================================
Video File:        office_cam.mp4
Resolution:        1920x1080
Is CCTV:           True
Confidence:        0.95
Correctness:       ✓
Model Response:    {'class': 'cctv', 'confidence': 0.95, 'reasoning': '...'}
----------------------------------------------------------------------
PERFORMANCE METRICS
Preprocessing:         0.45s  ( 12.3%)
Model Inference:       3.20s  ( 87.7%)
======================================================================

Results are also saved to results_[model_name].jsonl containing all predictions, ground truths, and timing data for further analysis.
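For further analysis, the saved records can be reduced to a confusion count and an accuracy figure. The sketch below assumes each record carries `is_cctv` (ground truth) and `predicted_class`; these key names are hypothetical, since the exact schema of the results file is not documented here.

```python
import json

def summarize(lines) -> dict:
    """Count TP/TN/FP/FN and accuracy from JSONL result lines."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        predicted = record["predicted_class"] == "cctv"  # assumed key name
        actual = bool(record["is_cctv"])
        if predicted and actual:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    total = sum(counts.values())
    counts["accuracy"] = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return counts

sample_lines = [
    '{"video": "a.mp4", "is_cctv": true, "predicted_class": "cctv"}',
    '{"video": "b.mp4", "is_cctv": false, "predicted_class": "cctv"}',
    '{"video": "c.mp4", "is_cctv": false, "predicted_class": "non-cctv"}',
    '{"video": "d.mp4", "is_cctv": true, "predicted_class": "cctv"}',
]
summary = summarize(sample_lines)
print(summary)
```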

Automatic CCTV Video Dataset Construction

Description

This project provides an end-to-end pipeline for automatically collecting, filtering, and refining CCTV anomaly video datasets. It consists of three modules:

Downloader — generates search queries with CCTV terminology and downloads candidate videos from the web
Filtering — classifies videos as CCTV or non-CCTV using VLMs (see above)
Refinement — splits full-length CCTV videos into short anomaly clips (1–5 seconds) using VQA-based frame detection

The pipeline targets four anomaly classes — Fire, Smoke, Violence, and Falldown — producing 20–30 high-quality clips per class for downstream computer vision research.

Installation

# Create environment with Python 3.10+
conda create -n piaspace python=3.10
conda activate piaspace

# Install PyTorch with CUDA support (adjust version as needed)
conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia

# Install all project dependencies
pip install -r requirements/all.txt

# Optional: flash attention for filtering module
pip install --extra-index-url https://miropsota.github.io/torch_packages_builder flash_attn==2.8.3+pt2.6.0cu124

For OpenAI-powered search query generation, set your API key:

export OPENAI_API_KEY='your-api-key-here'

How to Use

Before running, adjust hyperparameters in the configuration files under assets/cfg/ (see Configuration & Hyperparameters).

Complete Pipeline

bash scripts/all.sh

Runs downloader → filtering → refinement sequentially.

Module 1: Downloader

bash scripts/run_downloader.sh

Searches and downloads videos from YouTube, Google, or other sources using LLM-generated queries enhanced with CCTV-specific terminology.

Module 2: Filtering

bash scripts/filtering_module.sh

Samples frames from downloaded videos and classifies them as CCTV or non-CCTV using InternVL or Qwen-VL models. Outputs classification results with confidence scores and reasoning.
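A minimal sketch of frame selection, assuming uniform sampling at the midpoint of `num_frames` equal segments. The module also has an `importance_sampling` option whose strategy is not described here, so this only illustrates the plain uniform case; the function name is hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> list:
    """Pick num_frames indices spread evenly over a video of total_frames frames."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        return list(range(total_frames))  # short video: use every frame
    step = total_frames / num_frames
    # Take the midpoint of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_frames)]

indices = uniform_frame_indices(240, 8)
print(indices)
```

The returned indices can then be passed to any decoder (for example OpenCV, which the Prerequisites list) to read just those frames.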

Module 3: Refinement

bash scripts/refinement_module.sh

Samples 64 frames per CCTV video, uses VQA to identify frames containing the target anomaly (fire, smoke, violence, or falldown), maps frame IDs to timestamps, and cuts 1–5 second clips around detected moments.
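The frame-to-clip step can be sketched as follows, under stated assumptions: sampled frames are evenly spaced, consecutive flagged frames are merged into one segment, and each segment is stretched or trimmed around its center to stay within 1 to 5 seconds. The function name and the merging rule are illustrative; the actual module may group and cut differently.

```python
def frames_to_clips(hit_ids, num_sampled, duration,
                    min_len=1.0, max_len=5.0, padding=0.0):
    """Map flagged sampled-frame indices to (start, end) clip times in seconds."""
    spacing = duration / num_sampled  # seconds between sampled frames
    runs = []
    for i in sorted(set(hit_ids)):
        if runs and i == runs[-1][1] + 1:
            runs[-1][1] = i          # extend the current run of consecutive hits
        else:
            runs.append([i, i])      # start a new run
    clips = []
    for first, last in runs:
        start = (first + 0.5) * spacing - padding
        end = (last + 0.5) * spacing + padding
        center = (start + end) / 2
        length = min(max(end - start, min_len), max_len)
        start = max(0.0, center - length / 2)
        end = min(duration, start + length)  # may be shorter near the video end
        clips.append((round(start, 2), round(end, 2)))
    return clips

clips = frames_to_clips([10, 11, 12, 40], num_sampled=64, duration=64.0)
print(clips)
```

Here frames 10 to 12 become one 2-second clip, and the isolated hit at frame 40 is widened to the 1-second minimum.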

Configuration & Hyperparameters

All modules are configured via YAML files in assets/cfg/. Edit these before running. Command-line arguments override config file values.
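One way to implement "command-line arguments override config file values" is to overlay only the arguments that were actually supplied. The sketch below assumes unset arguments arrive as `None` and that dotted names address nested keys, as in the parameter tables; the helper name and the example values are hypothetical.

```python
import copy

def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with non-None overrides applied (dotted keys nest)."""
    merged = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        if value is None:
            continue  # argument not given on the command line; keep the file value
        *parents, leaf = dotted_key.split(".")
        node = merged
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return merged

file_config = {"num_frames": 8, "search": {"videos_per_keyword": 10, "max_results": 50}}
cli_values = {"num_frames": 16, "search.videos_per_keyword": None, "search.max_results": 20}
effective = apply_overrides(file_config, cli_values)
print(effective)
```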

Downloader Module (`assets/cfg/downloader_module/config.yaml`)

Parameter	Type	Description
`use_llm`	bool	Use LLM to auto-generate search queries (set `false` for manual queries)
`llm.backend`	str	LLM backend: `openai` or `qwen`
`llm.model`	str	Model path (e.g., `gpt-4o-mini`, `Qwen/Qwen2.5-1.5B-Instruct`)
`search_queries`	list	Manual search queries when `use_llm` is false
`search_queries_file`	str	Path to file with queries (one per line)
`search_queries_per_class`	int	Queries per class when using LLM (default: 5)
`data_source`	str	Video source: `youtube`, `google`, `ddgs`, or `browser_use`
`search.videos_per_keyword`	int	Videos to download per query (default: 10)
`search.max_results`	int	Max results per query
`download.output_dir`	str	Output directory for downloaded videos
`class_name`	list	Anomaly classes: `["Fire", "Violence", "Smoke", "Falldown"]`
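When `use_llm` is false, queries have to be supplied manually. As an illustration only, a template-based generator could pair each class with CCTV-specific terms. The term list and the helper below are assumptions and do not reproduce the LLM-generated queries.

```python
# Illustrative CCTV-specific terms; not the project's actual vocabulary list.
CCTV_TERMS = [
    "CCTV footage",
    "surveillance camera",
    "security camera footage",
    "security footage",
    "fixed camera",
]

def build_queries(class_names, per_class=5) -> dict:
    """Return up to per_class search queries for each anomaly class."""
    return {
        name: [f"{name.lower()} {term}" for term in CCTV_TERMS[:per_class]]
        for name in class_names
    }

queries = build_queries(["Fire", "Violence", "Smoke", "Falldown"], per_class=2)
print(queries["Fire"])
```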

Filtering Module (`assets/cfg/filtering_module/config4.yaml`)

Parameter	Type	Description
`model_type`	str	VLM(s) to use: `internvl`, `qwenvl`, or `both`
`models`	list	Model entries with `name`, `type`, and `path` fields
`num_frames`	int	Frames to extract per video (default: 8)
`importance_sampling`	bool	Use importance-based frame sampling
`prompt_path`	str	Path to prompt file for CCTV classification
`output_mode`	str	Output format: `json` or `text`
`output_file`	str	Path to save classification results
`mode`	str	`test` (classify videos in `downloader_folder`) or `eval` (score against the labeled dataset at `dataset_path`)
`downloader_folder`	str	Input folder for downloaded videos (mode=test)
`dataset_path`	str	Path to dataset JSONL (mode=eval)
`video_folder`	str	Folder containing video files (mode=eval)

Refinement Module (`assets/cfg/refinement_module/exp_1.yaml`)

Parameter	Type	Description
`refinement_module.enabled`	bool	Enable or disable refinement
`refinement_module.input_dir`	str	Input path (filtering module output JSONL)
`refinement_module.output_dir`	str	Output directory for refined clips
`sampling.num_frames`	int	Frames to extract for VQA (default: 64)
`sampling.chunk_size`	int	Chunk size for processing (default: 16)
`vqa.max_new_tokens`	int	Max tokens for VQA output (default: 512)
`vqa.output_mode`	str	VQA output mode — `indices_scores_v4` recommended
`vqa.score_threshold`	float	Score threshold for detection (default: 0.3)
`refinement_module.N`	float	Padding seconds before/after detected segments (default: 0.0)
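To make `sampling.chunk_size` and `vqa.score_threshold` concrete, here is a sketch assuming frames are sent to the VQA model in fixed-size chunks and that a frame counts as detected when its score is at or above the threshold. Both helper names and the inclusive comparison are assumptions.

```python
def chunked(items: list, chunk_size: int = 16) -> list:
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def select_hits(scores: dict, threshold: float = 0.3) -> list:
    """Return sorted frame indices whose score meets the threshold."""
    return sorted(idx for idx, score in scores.items() if score >= threshold)

frame_ids = list(range(64))                  # sampling.num_frames = 64
chunks = chunked(frame_ids, chunk_size=16)   # four VQA calls of 16 frames each
hits = select_hits({0: 0.1, 5: 0.3, 9: 0.8, 12: 0.29})
print(len(chunks), hits)
```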
