Note: This project was completed during an internship at PIASpace. Source code is not publicly available due to confidentiality restrictions.
A VLM-powered video classification module that automatically identifies CCTV and surveillance footage using frame-level analysis with InternVL and Qwen-VL models.
- Models: InternVL3-1B, InternVL3-2B, Qwen3-VL-2B
- Input: Single videos or batch JSONL datasets
- Output: JSONL predictions with class labels, confidence scores, and reasoning
- Target Anomaly Classes: Fire, Smoke, Violence, Falldown
The first half of this README covers the Filtering Module in detail. Scroll past the divider for the full pipeline (Downloader → Filtering → Refinement).
Input video sample: YouTube demo
Prompt Example:
Prompt design had a strong impact on accuracy — optimized prompts incorporating CCTV-specific terminology (surveillance camera, security footage, fixed camera, incident context) significantly improved classification performance.
A grayscale-difference heuristic was also tested as a fast, non-VLM baseline: frames with high pixel-level change over time are more likely to originate from static surveillance cameras. While useful for pre-filtering, the VLM approach ultimately delivered higher accuracy.
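The grayscale-difference baseline can be sketched roughly as below. This is an illustration under stated assumptions, not the project's code: the function names `to_gray` and `mean_gray_diff`, the BT.601 luma weights, and the choice to average over consecutive frame pairs are all hypothetical. How the resulting score maps to a CCTV decision (direction and threshold) would have to be tuned on labeled data.

```python
import numpy as np

def to_gray(frame: np.ndarray) -> np.ndarray:
    """Convert an RGB frame of shape (H, W, 3) to float grayscale (BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return frame[..., :3].astype(np.float64) @ weights

def mean_gray_diff(frames: list) -> float:
    """Mean absolute grayscale difference between consecutive frames (0-255 scale)."""
    if len(frames) < 2:
        return 0.0  # nothing to compare
    grays = [to_gray(f) for f in frames]
    # Cast to float happens in to_gray, so uint8 subtraction cannot wrap around.
    diffs = [np.abs(b - a).mean() for a, b in zip(grays, grays[1:])]
    return float(np.mean(diffs))
```

Because the score only needs a handful of decoded frames and no model inference, it is cheap enough to run before the VLM as a pre-filter.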
| Dataset | Model | Videos | Accuracy |
|---|---|---|---|
| Small labeled test set | Qwen3-VL-2B | 18 | 94.44% |
| Large-scale internal dataset | InternVL3-2B | 425 | 83.53% |
- Qwen3-VL-2B achieved the highest accuracy, 94.44% (17 of 18 videos), on the curated test set with optimized prompts.
- InternVL3-2B performed best on the larger, real-world internal dataset, at 83.53% (355 of 425 videos).
- Prompt engineering was critical — tailored prompts incorporating surveillance-specific vocabulary consistently outperformed generic ones.
Running the module requires a GPU with sufficient VRAM and the following dependencies:
pip install torch torchvision opencv-python numpy pillow pyyaml transformers accelerate

The base_model.py file containing InternVLModel and QwenVLModel class definitions is also required (included in filtering_module/).
.
├── filtering_module/
│ ├── base_model.py # Model wrappers for InternVL/Qwen
│ └── main.py # Main execution script
├── config/
│ └── config.yaml # Configuration parameters
├── prompt/ # The instruction sent to the model
├── dataset.jsonl # List of videos and ground truth
├── videos/ # Folder containing .mp4/.avi files
└── results/ # Generated JSONL prediction logs
model_type: "both" # Options: internvl, qwenvl, both
output_mode: json # Options: json, text
num_frames: 8
prompt_path: "prompt.txt"
dataset_path: "dataset.jsonl"
video_folder: "videos"
models:
  - name: "InternVL3-1B"
    type: "internvl"
    path: "path/to/internvl/model"
  - name: "Qwen3-VL-2B"
    type: "qwenvl"
    path: "path/to/qwen/model"

{"video": "sample1.mp4", "is_cctv": true}
{"video": "sample2.mp4", "is_cctv": false}

The prompt instructs the model to return a JSON object. For best results, include CCTV-specific terminology:
Analyze these frames from a video. Determine if this is CCTV/surveillance footage.
Look for indicators such as: fixed camera angle, surveillance camera placement,
security footage characteristics, timestamp overlays, and static perspective.
Return a JSON object with keys: 'class' (either 'cctv' or 'non-cctv'),
'confidence' (0-1), and 'reasoning' (string).
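Model output does not always arrive as clean JSON, so the response usually needs a tolerant parser. The sketch below is one possible approach, not the project's implementation: `parse_response` is a hypothetical helper, and the fallback to `ast.literal_eval` is an assumption motivated by the single-quoted dictionary shown in the sample terminal output.

```python
import ast
import json
import re

def parse_response(text: str) -> dict:
    """Extract the outermost {...} span from model output and normalize its fields."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    raw = match.group(0)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback for Python-style dicts with single-quoted keys and values.
        data = ast.literal_eval(raw)
    if not isinstance(data, dict):
        raise ValueError("model output is not an object")
    label = str(data.get("class", "")).strip().lower()
    if label not in {"cctv", "non-cctv"}:
        raise ValueError(f"unexpected class label: {label!r}")
    # Clamp confidence into the 0-1 range requested by the prompt.
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return {"class": label, "confidence": confidence,
            "reasoning": str(data.get("reasoning", ""))}
```

Rejecting unknown labels instead of guessing keeps malformed responses out of the accuracy figures; a caller could catch `ValueError` and log the raw text for review.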
python main.py --config config/config.yaml

| Argument | Default | Description |
|---|---|---|
| --config | config/config.yaml | Path to YAML configuration file |
| --quiet / -q | False | Reduce output to final summaries only |
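For reference, the two arguments above could be declared with the standard-library `argparse` module as follows. Since the source is not public, this is only a sketch of an equivalent interface: the function name `build_parser` and the parser description are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented --config and --quiet/-q arguments."""
    parser = argparse.ArgumentParser(description="CCTV filtering module")
    parser.add_argument("--config", default="config/config.yaml",
                        help="Path to YAML configuration file")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Reduce output to final summaries only")
    return parser

# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--config", "config/config.yaml", "-q"])
```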
The script provides detailed terminal output for every video processed:
======================================================================
CCTV DETECTION RESULT - InternVL3-1B
======================================================================
Video File: office_cam.mp4
Resolution: 1920x1080
Is CCTV: True
Confidence: 0.95
Correctness: ✓
Model Response: {'class': 'cctv', 'confidence': 0.95, 'reasoning': '...'}
----------------------------------------------------------------------
PERFORMANCE METRICS
Preprocessing: 0.45s ( 12.1%)
Model Inference: 3.20s ( 87.9%)
======================================================================
Results are also saved to results_[model_name].jsonl containing all predictions, ground truths, and timing data for further analysis.
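A results file of this kind can be re-scored offline. The sketch below assumes each JSONL record stores the ground truth under `is_cctv` (as in dataset.jsonl) and the prediction under `predicted_is_cctv`; the second key name and the helper `accuracy_from_jsonl` are hypothetical, so adjust them to the actual schema.

```python
import json
from pathlib import Path

def accuracy_from_jsonl(path: str, truth_key: str = "is_cctv",
                        pred_key: str = "predicted_is_cctv") -> float:
    """Fraction of records whose prediction matches the ground truth."""
    correct = total = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            correct += int(record[pred_key] == record[truth_key])
    if total == 0:
        raise ValueError(f"no records found in {path}")
    return correct / total
```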
This project provides an end-to-end pipeline for automatically collecting, filtering, and refining CCTV anomaly video datasets. It consists of three modules:
- Downloader — generates search queries with CCTV terminology and downloads candidate videos from the web
- Filtering — classifies videos as CCTV or non-CCTV using VLMs (see above)
- Refinement — splits full-length CCTV videos into short anomaly clips (1–5 seconds) using VQA-based frame detection
The pipeline targets four anomaly classes — Fire, Smoke, Violence, and Falldown — producing 20–30 high-quality clips per class for downstream computer vision research.
# Create environment with Python 3.10+
conda create -n piaspace python=3.10
conda activate piaspace
# Install PyTorch with CUDA support (adjust version as needed)
conda install pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia
# Install all project dependencies
pip install -r requirements/all.txt
# Optional: flash attention for filtering module
pip install --extra-index-url https://miropsota.github.io/torch_packages_builder flash_attn==2.8.3+pt2.6.0cu124

For OpenAI-powered search query generation, set your API key:

export OPENAI_API_KEY='your-api-key-here'

Before running, adjust hyperparameters in the configuration files under assets/cfg/ (see Configuration & Hyperparameters).
bash scripts/all.sh

Runs downloader → filtering → refinement sequentially.
bash scripts/run_downloader.sh

Searches and downloads videos from YouTube, Google, or other sources using LLM-generated queries enhanced with CCTV-specific terminology.
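When no LLM is used, queries with CCTV terminology can also be produced from simple templates. The following is a hypothetical fallback, not the downloader's actual logic: the term list, the two templates, and the function `build_queries` are illustrative assumptions.

```python
from itertools import product

# Illustrative CCTV terms; the real vocabulary is not published.
CCTV_TERMS = ["CCTV", "surveillance camera", "security footage"]
TEMPLATES = ["{term} {cls}", "{cls} caught on {term}"]

def build_queries(class_names: list, per_class: int = 5) -> dict:
    """Return up to `per_class` queries per class, pairing each class with CCTV terms."""
    queries = {}
    for cls in class_names:
        candidates = [template.format(term=term, cls=cls.lower())
                      for term, template in product(CCTV_TERMS, TEMPLATES)]
        queries[cls] = candidates[:per_class]
    return queries

queries = build_queries(["Fire", "Smoke", "Violence", "Falldown"])
```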
bash scripts/filtering_module.sh

Samples frames from downloaded videos and classifies them as CCTV or non-CCTV using InternVL or Qwen-VL models. Outputs classification results with confidence scores and reasoning.
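Frame sampling can be illustrated with a uniform strategy that picks one frame from the center of each equal-length segment. This is a sketch of a common default, assuming uniform sampling; the importance-based sampling option is not covered, and `uniform_frame_indices` is a hypothetical name.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> list:
    """Pick `num_frames` indices at the centers of equal-length segments."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))  # short video: use every frame
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]
```

The returned indices could then be decoded with any video reader, for example by seeking with OpenCV's `cv2.VideoCapture`.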
bash scripts/refinement_module.sh

Samples 64 frames per CCTV video, uses VQA to identify frames containing the target anomaly (fire, smoke, violence, or falldown), maps frame IDs to timestamps, and cuts 1–5 second clips around detected moments.
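The step from scored frames to clip boundaries can be sketched as below. Several details are assumptions rather than documented behavior: frame IDs are taken to be indices into the original video, detections closer than `max_gap` seconds are merged, short clips are widened around their center, and long clips are truncated. The function `frames_to_clips` and the `max_gap` parameter are hypothetical.

```python
def frames_to_clips(detections, fps, duration, threshold=0.3, pad=0.0,
                    min_len=1.0, max_len=5.0, max_gap=1.0):
    """Map (frame_index, score) pairs to (start, end) clip times in seconds."""
    # Keep frames at or above the score threshold and convert to timestamps.
    times = sorted(idx / fps for idx, score in detections if score >= threshold)
    groups = []
    for t in times:
        if groups and t - groups[-1][1] <= max_gap:
            groups[-1][1] = t        # extend the current group
        else:
            groups.append([t, t])    # start a new group
    clips = []
    for first, last in groups:
        start, end = first - pad, last + pad
        if end - start < min_len:    # widen short clips around their center
            center = (start + end) / 2
            start, end = center - min_len / 2, center + min_len / 2
        end = min(end, start + max_len)  # cap long clips at max_len
        # Clamp to the video; clips at the very edges may end up shorter than min_len.
        start, end = max(0.0, start), min(duration, end)
        clips.append((round(start, 3), round(end, 3)))
    return clips
```

The resulting `(start, end)` pairs can be handed to any cutting tool; the `pad` argument plays the role of the padding parameter in the refinement configuration.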
All modules are configured via YAML files in assets/cfg/. Edit these before running. Command-line arguments override config file values.
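One way to implement "command-line arguments override config file values" is to merge only the flags that were actually supplied. The sketch below works on plain dictionaries (for example, the result of loading a YAML file) and treats `None` as "not given"; the helper `apply_overrides` and the dotted-key convention for nested sections are assumptions.

```python
import copy

def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of `config` with non-None overrides applied.

    Dotted keys such as "search.max_results" address nested sections.
    """
    merged = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        if value is None:
            continue  # flag not given on the command line; keep the file value
        node = merged
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged
```

Returning a copy leaves the loaded configuration untouched, which makes it easy to log both the file values and the effective values.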

| Parameter | Type | Description |
|---|---|---|
| use_llm | bool | Use LLM to auto-generate search queries (set false for manual queries) |
| llm.backend | str | LLM backend: openai or qwen |
| llm.model | str | Model path (e.g., gpt-4o-mini, Qwen/Qwen2.5-1.5B-Instruct) |
| search_queries | list | Manual search queries when use_llm is false |
| search_queries_file | str | Path to file with queries (one per line) |
| search_queries_per_class | int | Queries per class when using LLM (default: 5) |
| data_source | str | Video source: youtube, google, ddgs, or browser_use |
| search.videos_per_keyword | int | Videos to download per query (default: 10) |
| search.max_results | int | Max results per query |
| download.output_dir | str | Output directory for downloaded videos |
| class_name | list | Anomaly classes: ["Fire", "Violence", "Smoke", "Falldown"] |

| Parameter | Type | Description |
|---|---|---|
| model_type | str | VLM(s) to use: internvl, qwenvl, or both |
| models | list | Model entries with name, type, and path fields |
| num_frames | int | Frames to extract per video (default: 8) |
| importance_sampling | bool | Use importance-based frame sampling |
| prompt_path | str | Path to prompt file for CCTV classification |
| output_mode | str | Output format: json or text |
| output_file | str | Path to save classification results |
| mode | str | test (from downloader_folder) or eval (from dataset) |
| downloader_folder | str | Input folder for downloaded videos (mode=test) |
| dataset_path | str | Path to dataset JSONL (mode=eval) |
| video_folder | str | Folder containing video files (mode=eval) |

| Parameter | Type | Description |
|---|---|---|
| refinement_module.enabled | bool | Enable or disable refinement |
| refinement_module.input_dir | str | Input path (filtering module output JSONL) |
| refinement_module.output_dir | str | Output directory for refined clips |
| sampling.num_frames | int | Frames to extract for VQA (default: 64) |
| sampling.chunk_size | int | Chunk size for processing (default: 16) |
| vqa.max_new_tokens | int | Max tokens for VQA output (default: 512) |
| vqa.output_mode | str | VQA output mode (indices_scores_v4 recommended) |
| vqa.score_threshold | float | Score threshold for detection (default: 0.3) |
| refinement_module.N | float | Padding seconds before/after detected segments (default: 0.0) |
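
The interaction between `sampling.num_frames` and `sampling.chunk_size` can be pictured as splitting the sampled frames into fixed-size batches, so that the defaults (64 frames, chunks of 16) would yield four VQA calls per video. This reading of `chunk_size` is an assumption, and `chunked` is a hypothetical helper.

```python
def chunked(items: list, chunk_size: int = 16) -> list:
    """Split `items` into consecutive chunks of at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# With the default settings, 64 sampled frame IDs form four chunks of 16.
frame_chunks = chunked(list(range(64)), chunk_size=16)
```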


