This repository accompanies the research paper SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation.
What exactly makes a particular image unsafe?
Subtle changes—such as an insulting gesture or a problematic symbol—can drastically alter the safety implications of an image. Yet, most existing image safety datasets only provide coarse labels without pinpointing the specific features that drive these differences. SafetyPairs introduces a scalable framework for generating counterfactual pairs of images that differ only in features relevant to a given safety policy. By leveraging image editing models, we systematically flip an image’s safety label while keeping safety-irrelevant details unchanged.
- We introduce the SafetyPairs pipeline, a systematic approach to isolating safety-critical image features through counterfactual generation.
- Our method addresses the limitation of existing safety datasets that provide only coarse labels without identifying specific problematic elements.
- The resulting dataset serves as both an evaluation resource for vision-language models and an effective data augmentation strategy for training safety classifiers.
ml-safetypairs/
├── pyproject.toml # Python project configuration and dependencies
├── README.md
├── LICENSE # Apple Sample Code License
├── CONTRIBUTING.md # Contribution guidelines
├── ACKNOWLEDGEMENTS # Acknowledgements
└── src/
├── safetypairs/ # Main Python package
│ ├── clip/ # CLIP models for similarity measurement
│ ├── datasets/ # Dataset loaders (LlavaGuard, SafetyPairs)
│ ├── editor/ # Image editing models (Flux-Kontext)
│ ├── llm/ # Language model clients (GPT)
│ ├── pipeline/ # Core pipeline implementation
│ └── vlm/ # Vision-language model clients
├── run_example.py # Quick start demo script
└── generate_safetypairs.py # Full data generation pipeline
Data Synthesis Pipeline: A scalable framework for generating counterfactual safety pairs using image editing models, enabling targeted modifications that flip safety labels while preserving safety-irrelevant details. Check src/generate_safetypairs.py.
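As a rough illustration of what the pipeline produces, each counterfactual pair can be thought of as a record linking the original image to its minimally edited counterpart. This is a hypothetical sketch — the field names and the policy label are illustrative, not the package's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SafetyPair:
    """One counterfactual pair: two images differing only in policy-relevant features."""
    original_path: str     # image labeled unsafe under the policy
    edited_path: str       # counterfactual edited to be safe
    policy: str            # safety policy the pair targets (illustrative label below)
    rationale: str         # why the original violates the policy
    edit_instruction: str  # targeted edit that flips the safety label

pair = SafetyPair(
    original_path="src/examples/output/original.png",
    edited_path="src/examples/output/edited.png",
    policy="Violence",
    rationale="The image shows a building on fire.",
    edit_instruction="Remove the smoke and fire so the building appears undamaged.",
)
```

Everything except the two image paths is metadata that makes the pair useful for evaluation: the rationale explains the unsafe label, and the edit instruction documents exactly which feature was changed.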
- Python 3.11, 3.12, or 3.13
- CUDA-capable GPU (for Flux-Kontext model)
- OpenAI API key
- HuggingFace token
git clone https://github.com/apple/ml-safetypairs.git
cd ml-safetypairs
# Install dependencies using uv
uv sync
# Activate virtual environment
source .venv/bin/activate
# Set required environment variables
export OPENAI_KEY=your_openai_api_key
export HF_TOKEN=your_huggingface_token # Get access to the gated black-forest-labs/FLUX.1-Kontext-dev model on HuggingFace
Try the example script to edit a single image:
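Before invoking the scripts, you can sanity-check that the required credentials are set. This is a convenience sketch using only the standard library, not a script shipped with the repository:

```python
import os

REQUIRED = ("OPENAI_KEY", "HF_TOKEN")

def missing_env_vars(required=REQUIRED):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

for name in missing_env_vars():
    print(f"Missing environment variable: {name}")
```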
First, prepare your image:
# Create an examples directory and add your image
mkdir -p src/examples
cp /path/to/your/image.jpg src/examples/example_source.jpg
Then run the example script:
python src/run_example.py
This will:
- Load your image from src/examples/example_source.jpg
- Ask you to provide a rationale for why the image is harmful
- Use the SafetyPairs pipeline to edit the image to make it safe
- Save the original and edited images to src/examples/output/
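The flow the steps above describe can be sketched roughly as follows. This is a hypothetical stand-in, not the actual script: the real pipeline drives the Flux-Kontext editing model, which this sketch replaces with a pluggable `edit_fn` so it stays self-contained:

```python
from pathlib import Path

def run_example(source: Path, out_dir: Path, rationale: str, edit_fn) -> dict:
    """Mimic the example flow: load an image, derive a targeted edit, save both versions.

    `edit_fn` stands in for the SafetyPairs pipeline; it maps
    (image_bytes, rationale) -> (edit_instruction, edited_bytes).
    """
    image = source.read_bytes()                      # load the source image
    instruction, edited = edit_fn(image, rationale)  # pipeline would generate the edit
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "original.png").write_bytes(image)    # save an unmodified copy
    (out_dir / "edited.png").write_bytes(edited)     # save the safe counterfactual
    return {
        "edit_instruction": instruction,
        "original": out_dir / "original.png",
        "edited": out_dir / "edited.png",
    }
```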
Example:
$ python src/run_example.py
Loaded image from src/examples/example_source.jpg
Initializing pipeline...
Processing image...
Enter the rationale for why this image is harmful: The image shows a building on fire.
✅ Success!
Edit instruction: Remove the smoke, fire, debris, and airplane so the towers appear undamaged.
Saved results:
Original: src/examples/output/original.png
Edited: src/examples/output/edited.png
To run the data generation pipeline on the LlavaGuard dataset:
# Set the dataset directory
export DATASET_DIRECTORY=/path/to/llavaguard
# Run the pipeline
python src/generate_safetypairs.py --save_dir /path/to/output --dataset_split test
This will:
- Download the LlavaGuard dataset (if not already present)
- Process each image to generate safe counterfactual pairs
- Save the results to the specified output directory
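The batch flow above can be sketched as follows. This is an illustrative stand-in for `src/generate_safetypairs.py`, not the script itself: `edit_image` represents the real editing pipeline, and the JSON manifest format is an assumption made for the sketch:

```python
import json
from pathlib import Path

def generate_pairs(dataset_dir: Path, save_dir: Path, edit_image) -> list:
    """Walk a dataset directory, produce a safe counterfactual per image,
    and record the resulting pairs in a JSON manifest."""
    save_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for src in sorted(dataset_dir.glob("*.jpg")):
        edited_path = save_dir / f"{src.stem}_edited.png"
        edited_path.write_bytes(edit_image(src.read_bytes()))  # pipeline edit
        manifest.append({"original": str(src), "edited": str(edited_path)})
    (save_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```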
This software and accompanying data and models have been released under the following licenses:
@article{helbling2025safetypairs,
title={SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation},
author={Helbling, Alec and Palaskar, Shruti and Krishna, Kundan and Chau, Polo and Gatys, Leon and Cheng, Joseph Yitan},
journal={arXiv preprint arXiv:2510.18214},
year={2025}
}