Evaluate LLM robustness under lexical distortions using semantic paraphrasing.
Chameleon tests how well language models handle semantically equivalent but lexically varied questions. It applies controlled distortions (μ=0.0 to μ=0.9) while preserving meaning and correct answers, then measures performance degradation.
| Component | Provider | Purpose |
|---|---|---|
| Distortion Engine | Mistral AI | Generates semantic paraphrases |
| Target Model | OpenAI | Model being evaluated (GPT-4o, GPT-5.1, etc.) |
Note: You need API keys from both providers. Get your Mistral key at console.mistral.ai and OpenAI key at platform.openai.com.
- 🔬 Semantic Distortion Engine: Uses Mistral to generate meaning-preserving paraphrases at 10 intensity levels
- 📊 Statistical Analysis: McNemar's tests, confidence intervals, significance testing
- 📈 Rich Visualizations: Heatmaps, accuracy plots, degradation analysis
- ⚡ Batch API Support: OpenAI & Mistral batch APIs for efficient large-scale evaluation
- 📄 Executive Reports: Auto-generated markdown reports with charts and insights
```bash
# Clone the repository
git clone https://github.com/stevesolun/Chameleon.git
cd Chameleon

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate    # Linux/Mac
# or: venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt
```

Or run it in Docker:

```bash
# Build the Docker image
docker build -t chameleon .

# Run the interactive CLI
docker run -it --rm \
  -v $(pwd)/Projects:/app/Projects \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e MISTRAL_API_KEY=$MISTRAL_API_KEY \
  chameleon python cli.py --help
```

You need API keys from:
- Mistral AI: console.mistral.ai - for distortion generation
- OpenAI: platform.openai.com - for target model evaluation
Set them as environment variables, or the CLI will prompt you:

```bash
export MISTRAL_API_KEY="your-mistral-key"
export OPENAI_API_KEY="your-openai-key"
```

Then initialize a project:

```bash
python cli.py init
```

Follow the interactive prompts to configure:
- Project name
- Target model (e.g., gpt-5.1, gpt-4o)
- Distortion settings (μ values, distortions per question)
- API keys
When prompted, provide CSV files with questions. See Data Formats below for details.
```bash
python cli.py distort --project MyProject
```

Uses Mistral to create semantic paraphrases at each μ level.
```bash
python cli.py evaluate --project MyProject
```

Sends distorted questions to your target model via the OpenAI Batch API.
```bash
python cli.py analyze --project MyProject
```

Generates statistical analysis, visualizations, and an executive report.
```bash
# Project Management
python cli.py init                 # Create new project (interactive)
python cli.py list                 # List all projects
python cli.py status -p PROJECT    # Show project status
python cli.py edit -p PROJECT      # Edit project configuration
python cli.py delete -p PROJECT    # Delete project (double confirmation)

# Distortion & Evaluation
python cli.py distort -p PROJECT   # Generate distortions (requires Mistral)
python cli.py evaluate -p PROJECT  # Evaluate target model (requires OpenAI)

# Analysis
python cli.py analyze -p PROJECT   # Run full analysis

# Help
python cli.py help                 # Show all commands
```

Repository layout:

```
Chameleon/
├── chameleon/             # Main package
│   ├── core/              # Config, project management
│   ├── distortion/        # Mistral-based distortion engine
│   ├── evaluation/        # OpenAI batch evaluation
│   └── analysis/          # Statistics and visualizations
├── Projects/              # Your evaluation projects
│   └── MyProject/
│       ├── original_data/   # Input CSV files
│       ├── distorted_data/  # Generated distortions
│       ├── results/         # Evaluation results & analysis
│       └── config.yaml      # Project settings
├── cli.py                 # CLI entry point
├── requirements.txt       # Dependencies
└── Dockerfile             # Docker support
```
| μ Level | Distortion Type | Description |
|---|---|---|
| 0.0 | None (baseline) | Original question unchanged |
| 0.1-0.2 | Minimal | 1-3 word synonyms |
| 0.3-0.4 | Moderate | Phrase restructuring |
| 0.5-0.6 | Mixed | Lexical + structural changes |
| 0.7-0.8 | Heavy | Major paraphrasing |
| 0.9 | Full | Complete reconstruction |
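For intuition, a paraphrase at a given μ could be requested from Mistral roughly like this (an illustrative sketch, not Chameleon's internal distortion engine; the prompt wording and model choice are assumptions):

```python
import os
from mistralai import Mistral  # official `mistralai` v1 SDK

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def paraphrase(question: str, miu: float) -> str:
    """Ask Mistral for a meaning-preserving paraphrase at intensity `miu`."""
    prompt = (
        f"Paraphrase the question below with distortion intensity {miu} "
        "(0.0 = unchanged, 0.9 = complete reconstruction). Preserve the "
        "exact meaning and the correct answer. Return only the paraphrase.\n\n"
        f"Question: {question}"
    )
    resp = client.chat.complete(
        model="mistral-large-latest",  # assumption: any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# e.g. paraphrase("What is the powerhouse of the cell?", 0.8)
# might return "Which organelle supplies most of a cell's energy?"
```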
After running `python cli.py analyze --project YourProject`, all outputs are saved to `Projects/YourProject/results/analysis/` (~23 files):
| File | Description | Key Insight |
|---|---|---|
| `01_accuracy_by_miu.csv` | Accuracy data by μ level | Raw numbers for each distortion level |
| `01_accuracy_by_miu.png` | 📈 Line chart: accuracy vs distortion | Visualize the degradation curve |
| `02_accuracy_by_subject_miu.csv` | Per-subject accuracy data | Which subjects are most vulnerable? |
| `02_subject_ranking.png` | 📊 Bar chart: subject performance | Rank subjects by baseline accuracy |
| `02_subject_miu_heatmap.png` | 🔥 Heatmap: absolute accuracy (Subject × μ) | See accuracy patterns |
| `02_degradation_heatmap.png` | 🔥 Heatmap: % degradation from baseline | Identify vulnerable subjects |
| `03_chameleon_robustness_index.csv` | CRI scores (global + per-subject) | Single metric for model ranking |
| `04_elasticity.csv` | Degradation slope data | Quantify fragility numerically |
| `04_elasticity.png` | 📉 Scatter + regression: degradation rate | Visualize the slope |
| `05_model_comparison.csv` | Head-to-head comparison table | Compare all metrics |
| `05_model_comparison.png` | 📉 Scatter plot: CRI vs accuracy | Compare models visually |
| File | Description | Key Insight |
|---|---|---|
| `06_error_taxonomy.json` | Classification: `blank`, `wrong_choice`, `invalid_format`, `multiple_options` | Where do failures come from? |
| `07_confusion_clusters.json` | TF-IDF + KMeans clustering of failures | Which linguistic patterns cause errors? |
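For illustration, here is roughly how an incorrect response could be bucketed into those four classes (the category names come from `06_error_taxonomy.json`; the parsing rules are assumptions rather than Chameleon's exact logic, and option letters A-D are assumed):

```python
import re

def classify_failure(model_answer: str, correct_answer: str) -> str:
    """Bucket an incorrect response: blank / invalid_format / multiple_options / wrong_choice."""
    if not model_answer or not model_answer.strip():
        return "blank"
    picked = set(re.findall(r"\b([A-D])\b", model_answer.upper()))
    if not picked:
        return "invalid_format"       # no recognizable option letter at all
    expected = {a.strip().upper() for a in correct_answer.split(",")}
    if len(picked) > len(expected):
        return "multiple_options"     # more letters than the answer key allows
    return "wrong_choice"             # valid letter(s), just not the right one(s)

print(classify_failure("The answer is B or C", "A"))  # -> "multiple_options"
```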
| File | Description | Key Insight |
|---|---|---|
| `08_bootstrap_intervals.csv` | 95% confidence intervals (500 samples) | Are differences statistically significant? |
| `11_mcnemar_distortion.csv` | McNemar's test: μ=0 vs each μ>0 | Paired significance testing |
| `11_mcnemar_distortion.png` | 📊 Bar chart: baseline vs distorted (* = p<0.05) | Visualize significant differences |
| `12_mcnemar_subject.csv` | Per-subject McNemar tests | Subject-specific significance |
| `12_mcnemar_subject.png` | 📊 Bar chart: per-subject significance | Which subjects show real degradation? |
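As a sketch of what the McNemar step computes, here is a paired test between μ=0 and a distorted level using `statsmodels` (column names follow the results CSV schema in Data Formats below; the file path is hypothetical, and multiple distortions per question are collapsed to the first for simplicity):

```python
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

df = pd.read_csv("Projects/MyProject/results/full_results.csv")  # hypothetical path

def mcnemar_vs_baseline(df: pd.DataFrame, miu: float) -> float:
    """Paired McNemar test: the same questions at mu=0 vs at `miu`."""
    base = df[df["miu"] == 0.0].groupby("question_id")["is_correct"].first().astype(bool)
    dist = df[df["miu"] == miu].groupby("question_id")["is_correct"].first().astype(bool)
    paired = pd.concat([base, dist], axis=1, keys=["base", "dist"]).dropna()
    degraded = int((paired["base"] & ~paired["dist"]).sum())   # right -> wrong
    improved = int((~paired["base"] & paired["dist"]).sum())   # wrong -> right
    # The exact test only uses the two discordant counts
    return mcnemar([[0, degraded], [improved, 0]], exact=True).pvalue

print(mcnemar_vs_baseline(df, 0.9))  # p < 0.05 -> significant degradation at mu=0.9
```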
| File | Description | Key Insight |
|---|---|---|
| `09_delta_accuracy_heatmap.csv` | Subject × μ degradation matrix (data) | Raw delta values |
| `09_delta_accuracy_heatmap.png` | 🔥 Heatmap: change from baseline | Red = high degradation |
| `10_question_difficulty_tiers.json` | Easy/Medium/Hard/Chameleon Breakers | Find pattern-matching evidence |
| `13_key_insights.png` | 📊 4-panel summary: curve + bars + pie + stats | Quick visual overview |
| `EXECUTIVE_REPORT.md` | 📖 START HERE - Full findings report | Comprehensive interpretation |
Weighted accuracy that emphasizes high-distortion performance:
CRI = Σ(accuracy(μ) × w(μ)), where w(μ) = exp(2.0 × μ) / Σ exp(2.0 × μ)
- CRI > 0.7: Highly robust
- CRI 0.5-0.7: Moderately robust
- CRI < 0.5: Fragile
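A direct transcription of that formula in Python (the accuracy values in the example are placeholders, chosen only to show the weighting effect):

```python
import numpy as np

def cri(acc_by_miu: dict[float, float], k: float = 2.0) -> float:
    """Chameleon Robustness Index: exponentially up-weight high-mu accuracy."""
    mius = np.array(sorted(acc_by_miu))
    acc = np.array([acc_by_miu[m] for m in mius])
    w = np.exp(k * mius)
    w /= w.sum()                      # w(mu) = exp(2*mu) / sum exp(2*mu)
    return float((acc * w).sum())

print(cri({0.0: 0.90, 0.3: 0.80, 0.6: 0.65, 0.9: 0.40}))  # ~0.57 -> moderately robust
```

Note how the μ=0.9 accuracy carries roughly six times the weight of the μ=0 accuracy, so a model that only performs well on unmodified questions scores poorly.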
Linear regression of accuracy vs μ:
- Slope ≈ 0: Robust (stable across distortions)
- Slope < -0.05: Fragile (>5% accuracy loss per 0.1 μ)
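The elasticity is a plain least-squares slope of accuracy against μ; a minimal sketch with numpy, using placeholder accuracies and assuming μ on its native 0.0-0.9 scale (Chameleon's exact normalization may differ):

```python
import numpy as np

mius = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9])
acc  = np.array([0.90, 0.88, 0.84, 0.78, 0.70, 0.62])  # placeholder accuracies

slope, intercept = np.polyfit(mius, acc, 1)  # degree-1 fit returns (slope, intercept)
print(f"slope = {slope:.3f}")  # about -0.31: ~31 points of accuracy lost per unit mu
```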
| Tier | Definition | Interpretation |
|---|---|---|
| 🟢 Easy | ≥80% at μ=0, ≥70% at μ=0.9 | True understanding |
| 🟡 Medium | Good at low μ, struggles at high μ | Partial understanding |
| 🔴 Hard | <50% even at μ=0 | Knowledge gap |
| 🦎 Chameleon Breaker | ≥70% at μ=0, <30% at μ=0.9 | Surface pattern matching |
Chameleon Breakers are the most important finding - they reveal questions where the model appears to understand at baseline but fails catastrophically under paraphrasing, indicating reliance on lexical patterns rather than semantic comprehension.
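As code, the tiering rule from the table might look like this (a sketch; per-question accuracies at μ=0 and μ=0.9 would come from the results CSV):

```python
def tier(acc_mu0: float, acc_mu9: float) -> str:
    """Assign a question to a difficulty tier from accuracy at mu=0 and mu=0.9."""
    if acc_mu0 >= 0.70 and acc_mu9 < 0.30:
        return "chameleon_breaker"  # looks understood, collapses under paraphrase
    if acc_mu0 < 0.50:
        return "hard"               # knowledge gap even on the original wording
    if acc_mu0 >= 0.80 and acc_mu9 >= 0.70:
        return "easy"               # robust: true understanding
    return "medium"                 # fine at low mu, struggles at high mu
```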
```bash
# Set your API keys in .env or export them
export MISTRAL_API_KEY="your-mistral-key"
export OPENAI_API_KEY="your-openai-key"

# Build and run
docker-compose build
docker-compose run chameleon python cli.py init
docker-compose run chameleon python cli.py distort -p MyProject
docker-compose run chameleon python cli.py evaluate -p MyProject
docker-compose run chameleon python cli.py analyze -p MyProject

# Or run analysis only (no API keys needed)
PROJECT=MyProject docker-compose run analyze
```

Without docker-compose:

```bash
# Build
docker build -t chameleon .

# Run with mounted projects and API keys
docker run -it --rm \
  -v $(pwd)/Projects:/app/Projects \
  -e MISTRAL_API_KEY=$MISTRAL_API_KEY \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  chameleon python cli.py init
```

Chameleon works with closed-answer multiple-choice questions in CSV format.
Your source data should have these columns:
| Column | Required | Description |
|---|---|---|
| `subject` | Optional | Category/topic (e.g., "Biology", "History") |
| `question` | Yes | The question text |
| `answer_options` | Yes | JSON object with options: `{"A": "...", "B": "...", "C": "...", "D": "..."}` |
| `correct_answer` | Yes | Correct answer letter(s): `"A"`, or `"A, D"` for multiple |
| `question_id` | Optional | Unique identifier (auto-generated if missing) |
Example:

```csv
subject,question,answer_options,correct_answer,question_id
Biology,"What is the powerhouse of the cell?","{""A"": ""Nucleus"", ""B"": ""Mitochondria"", ""C"": ""Ribosome"", ""D"": ""Golgi""}",B,BIO_001
```
The full results CSV includes all processing columns:

| Column | Description |
|---|---|
| `subject` | Category/topic |
| `question_id` | Unique question identifier |
| `question_text` | Original question |
| `options_json` | Answer options as JSON |
| `distorted_question` | Paraphrased version (or the original if μ=0) |
| `distortion_id` | Unique ID: `{question_id}_d{N}_m{miu}` |
| `miu` | Distortion level (0.0 - 0.9) |
| `answer` | Correct answer(s) |
| `target_model_name` | Model evaluated (e.g., "gpt-5.1") |
| `target_model_answer` | Model's response |
| `is_correct` | Whether the model answered correctly |
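With that schema, the headline accuracy-by-μ numbers can be reproduced directly with pandas (the results filename here is hypothetical; check your project's `results/` directory for the actual name):

```python
import pandas as pd

df = pd.read_csv("Projects/MyProject/results/full_results.csv")  # hypothetical filename
print(df.groupby("miu")["is_correct"].mean())  # accuracy at each distortion level
```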
By default, Chameleon uses the Mistral API for distortion generation (recommended). However, you can configure local models during project setup.
⚠️ Hardware Requirements for Local Models

Running local LLMs requires significant computational resources:
- GPU: NVIDIA GPU with 8GB+ VRAM recommended (16GB+ for larger models)
- RAM: 16GB+ system memory
- Storage: 10-50GB for model weights
- Time: Local inference is significantly slower than API calls
If you don't have a powerful workstation, stick with the API option. It's faster and more reliable for large datasets.
Chameleon supports questions with multiple correct answers. Use comma-separated letters:
- Single answer: `"B"`
- Multiple answers: `"A, D"` (order doesn't matter, case-insensitive)

The evaluation uses smart comparison: `"A, D"` equals `"D, A"` equals `"a,d"`.
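That comparison amounts to normalizing both sides into sets of letters, along these lines (a sketch of the idea, not Chameleon's exact implementation):

```python
def answers_match(model_answer: str, correct_answer: str) -> bool:
    """Order-insensitive, case-insensitive multi-answer comparison."""
    norm = lambda s: {a.strip().upper() for a in s.split(",") if a.strip()}
    return norm(model_answer) == norm(correct_answer)

assert answers_match("D, A", "A, D")
assert answers_match("a,d", "A, D")
assert not answers_match("A", "A, D")
```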
If you use Chameleon in your research, please cite:
```bibtex
@software{chameleon2025,
  title={Chameleon: LLM Robustness Testing Framework},
  author={Steve Solun},
  year={2025},
  url={https://github.com/stevesolun/Chameleon}
}
```

Foundational Work:
```bibtex
@article{cohen2025forget,
  title={Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon},
  author={Cohen-Inger, Nurit and Elisha, Yehonatan and Shapira, Bracha and Rokach, Lior and Cohen, Seffi},
  journal={arXiv preprint arXiv:2502.07445},
  year={2025},
  url={https://arxiv.org/abs/2502.07445}
}
```

MIT License - see LICENSE for details.
Built with ❤️ for the AI research community