Thank you for considering contributing to EvalScope! This guide covers everything you need to get started.
# 1. Fork & clone
git clone https://github.com/<your-username>/evalscope.git
cd evalscope
# 2. Install in editable mode with dev dependencies
make dev
# 3. Install pre-commit hooks
pre-commit install

EvalScope requires Python >= 3.10.
# Base install (editable)
pip install -e .
# With all dev tools
pip install -e '.[dev,perf,docs]'
# With the web service
pip install -e '.[service]'

Run the backend service:
evalscope service --host 0.0.0.0 --port 9000

Optional dependency groups (install via pip install -e '.[<group>]'):
| Group | Purpose | Key Packages |
|---|---|---|
| dev | Testing & linting | pytest, pytest-cov |
| service | Web dashboard & REST API | flask, plotly |
| perf | Performance benchmarking | - |
| docs | Documentation build | sphinx |
| rag | RAG evaluation | - |
| aigc | AIGC evaluation | - |
| sandbox | Sandboxed code execution | ms-enclave |
Some benchmarks have their own extra dependencies (e.g. pip install -e '.[bfcl]').
The dashboard is a React SPA located at evalscope/web/.
# Install dependencies
make web-install
# Start dev server (hot reload, proxies API to localhost:9000)
make web-dev
# Production build
make web-build

The dev server runs at http://localhost:5173 and automatically proxies /api/v1/* and /health to the backend at http://127.0.0.1:9000.
Tech stack: React 19 · TypeScript · Vite · Tailwind CSS 4 · React Router · Plotly.js
For the best development experience, run both servers simultaneously:
# Terminal 1: Backend
evalscope service --debug
# Terminal 2: Frontend (hot reload)
make web-dev

Open http://localhost:5173 in your browser. Changes to frontend code are reflected instantly.
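Before opening the browser, it can help to confirm that both servers respond. The following is a minimal sketch using only the standard library. It assumes the default addresses mentioned above; the helper name `is_reachable` is hypothetical and not part of EvalScope, and it only checks for a 2xx status because the response body of /health is not specified here.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


# Default addresses taken from this guide; adjust if you changed host or port.
ENDPOINTS = {
    "backend": "http://127.0.0.1:9000/health",
    "frontend": "http://localhost:5173",
}


def report() -> dict:
    """Map each server name to whether it currently responds."""
    return {name: is_reachable(url) for name, url in ENDPOINTS.items()}
```

Calling `report()` while both terminals are running should map both names to `True`; a `False` entry points at the server that is not up yet.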
evalscope/
├── api/                  # Core API: registry, benchmark base classes, dataset, metric, model
│   ├── benchmark/        # DataAdapter, BenchmarkMeta, adapter subclasses
│   ├── dataset/          # Dataset loading, Sample dataclass
│   ├── evaluator/        # TaskState, evaluation loop
│   ├── messages/         # Chat message types
│   ├── metric/           # Score, AggScore, metric registry
│   ├── model/            # Model abstraction (OpenAI-compatible)
│   └── registry.py       # register_benchmark(), BENCHMARK_REGISTRY
│
├── benchmarks/           # All benchmark adapters (auto-discovered)
│   └── <name>/
│       ├── __init__.py
│       └── <name>_adapter.py
│
├── cli/                  # CLI entry points (evalscope eval/perf/service/app)
├── constants.py          # Global constants & tags
├── perf/                 # Performance benchmarking subsystem
├── report/               # Report generation & visualization
├── service/              # Flask REST API + SPA serving
│   ├── app.py            # Flask app factory, run_service()
│   └── blueprints/       # API route handlers (eval, perf, reports)
├── utils/                # Shared utilities (logging, IO, etc.)
└── web/                  # React SPA (dashboard UI)
    ├── src/
    │   ├── api/          # API client & type definitions
    │   ├── components/   # UI components
    │   ├── pages/        # Route pages
    │   └── i18n/         # Internationalization
    └── vite.config.ts
EvalScope uses a decorator-based registry pattern. Adding a benchmark requires only two files.
evalscope/benchmarks/my_benchmark/
├── __init__.py # (empty)
└── my_benchmark_adapter.py
Adapters are auto-discovered: any *_adapter.py under evalscope/benchmarks/ is automatically imported at startup, which triggers the @register_benchmark decorator.
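To make the decorator-based registry pattern concrete, here is a self-contained toy version. It is only a sketch of the general technique: `ToyMeta`, `ToyEntry`, `TOY_REGISTRY`, and `register_toy_benchmark` are hypothetical names invented for illustration and do not reflect EvalScope's actual implementation. The key point is that the decorator runs at import time, which is why importing an adapter module is enough to register it.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyMeta:
    """Toy stand-in for benchmark metadata (not EvalScope's BenchmarkMeta)."""
    name: str
    pretty_name: str


@dataclass
class ToyEntry:
    """What the toy registry stores for each benchmark."""
    meta: ToyMeta
    adapter_cls: type


# Global name -> entry mapping, filled as adapter modules are imported.
TOY_REGISTRY: Dict[str, ToyEntry] = {}


def register_toy_benchmark(meta: ToyMeta) -> Callable[[type], type]:
    """Return a class decorator that records the class under meta.name."""

    def decorator(cls: type) -> type:
        if meta.name in TOY_REGISTRY:
            raise ValueError(f"duplicate benchmark name: {meta.name}")
        TOY_REGISTRY[meta.name] = ToyEntry(meta=meta, adapter_cls=cls)
        return cls  # the class itself is returned unchanged

    return decorator


# Defining (i.e. importing) the class is what triggers registration.
@register_toy_benchmark(ToyMeta(name="my_benchmark", pretty_name="MyBenchmark"))
class MyToyAdapter:
    pass
```

After this module is imported, `TOY_REGISTRY["my_benchmark"].adapter_cls` is `MyToyAdapter`, and a lookup by name no longer needs to know where the class lives.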
Choose a base class depending on your benchmark type:
| Base Class | Use When |
|---|---|
| DefaultDataAdapter | General text QA, math, coding |
| MultiChoiceAdapter | Multiple-choice questions |
| AgentAdapter | Function calling, tool use |
| VisionLanguageAdapter | Image + text (VQA, etc.) |
| MultiTurnAdapter | Multi-turn conversations |
| Text2ImageAdapter | Text-to-image generation |
| NERAdapter | Named entity recognition |
| ImageEditAdapter | Image editing |
Minimal example (text QA benchmark):
from typing import Any, Dict

from evalscope.api.benchmark import BenchmarkMeta, DefaultDataAdapter
from evalscope.api.dataset import Sample
from evalscope.api.registry import register_benchmark
from evalscope.constants import Tags

DESCRIPTION = """
## Overview
Brief description of what this benchmark evaluates.

## Task Description
- **Task Type**: ...
- **Input**: ...
- **Output**: ...

## Evaluation Notes
- Default configuration uses **0-shot** evaluation
"""


@register_benchmark(
    BenchmarkMeta(
        name='my_benchmark',            # unique identifier (snake_case)
        pretty_name='MyBenchmark',      # display name
        dataset_id='org/dataset-name',  # ModelScope / HuggingFace dataset ID
        tags=[Tags.REASONING],          # category tags
        description=DESCRIPTION,
        subset_list=['default'],        # dataset subsets
        metric_list=['acc'],            # evaluation metrics
        eval_split='test',              # split to evaluate on
        few_shot_num=0,                 # number of few-shot examples
        prompt_template='{question}',   # prompt template with placeholders
    )
)
class MyBenchmarkAdapter(DefaultDataAdapter):

    def record_to_sample(self, record: Dict[str, Any]) -> Sample:
        """Convert a dataset row to a Sample object."""
        return Sample(
            input=record['question'],
            target=record['answer'],
        )

| Method | Purpose | When to Override |
|---|---|---|
| record_to_sample() | Map dataset row → Sample | Always |
| extract_answer() | Extract structured answer from model output | When default extraction is insufficient |
| match_score() | Custom scoring logic | When acc / built-in metrics don't fit |
| sample_to_fewshot() | Format a sample as few-shot example | When using few-shot with custom format |
| _on_inference() | Custom model interaction | For agent/tool-use benchmarks |
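The sketch below illustrates the kind of logic these hooks typically hold, written as plain functions so it runs without EvalScope installed. Everything here is an assumption made for illustration: `ToySample`, `build_prompt`, and the function signatures are hypothetical and may differ from the real adapter methods, so check the base class before copying them into an adapter.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToySample:
    """Hypothetical stand-in for a sample: prompt input plus reference answer."""
    input: str
    target: str


def record_to_sample(record: Dict[str, Any]) -> ToySample:
    """Map one raw dataset row to a sample."""
    return ToySample(input=record["question"], target=str(record["answer"]))


def build_prompt(template: str, sample: ToySample) -> str:
    """Fill the {question} placeholder of a prompt template."""
    return template.format(question=sample.input)


# Matches "answer is 42" or "Answer: 3.5" (case-insensitive).
_ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(prediction: str) -> Optional[str]:
    """Return the last number stated as the answer, or None if there is none."""
    matches = _ANSWER_RE.findall(prediction)
    return matches[-1] if matches else None


def match_score(extracted: Optional[str], target: str) -> float:
    """Score 1.0 when the extracted number equals the target numerically."""
    if extracted is None:
        return 0.0
    try:
        return 1.0 if float(extracted) == float(target) else 0.0
    except ValueError:
        return 0.0  # non-numeric target: treat as a mismatch


sample = record_to_sample({"question": "What is 6 * 7?", "answer": 42})
prompt = build_prompt("Question: {question}\nAnswer:", sample)
score = match_score(extract_answer("6 * 7 = 42, so the answer is 42."), sample.target)
```

Taking the last match rather than the first is a deliberate choice in this sketch: models often restate intermediate values before committing to a final answer.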
BenchmarkMeta(
    name='...',               # Required: unique snake_case ID
    dataset_id='...',         # Required: remote dataset ID or local path
    pretty_name='...',        # Display name
    tags=[...],               # From evalscope.constants.Tags
    description='...',        # Markdown description for docs
    subset_list=['default'],  # Dataset subsets
    metric_list=['acc'],      # Metric names or dicts: [{'acc': {'numeric': True}}]
    aggregation='mean',       # 'mean', 'pass@k', 'f1', etc.
    eval_split='test',        # Evaluation split name
    train_split='train',      # Training split (for few-shot)
    few_shot_num=0,           # Few-shot count
    prompt_template='...',    # Prompt template with {placeholders}
    filters=OrderedDict(),    # Output filters
    extra_params={},          # Additional configurable parameters
    sandbox_config={},        # Sandboxed execution config (for code benchmarks)
    review_timeout=None,      # Per-sample timeout in seconds
)

If your benchmark needs additional packages, create a requirements.txt in the benchmark directory:
evalscope/benchmarks/my_benchmark/requirements.txt
Then register it in pyproject.toml:
[tool.setuptools.dynamic.optional-dependencies]
my_benchmark = {file = ["evalscope/benchmarks/my_benchmark/requirements.txt"]}

Users can install via pip install 'evalscope[my_benchmark]'.
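Because these packages are optional, an adapter may be imported on a machine where they are missing. One common way to handle this is to defer the import and fail with a message that names the extra to install. The helper below is a sketch of that idea; `require_optional` is a hypothetical name and not an EvalScope utility, and it expects a top-level module name.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional top-level module or explain how to install it.

    `extra` is the optional-dependency group declared in pyproject.toml.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this benchmark. "
            f"Install it with: pip install 'evalscope[{extra}]'"
        )
    return importlib.import_module(module_name)
```

Calling `require_optional('some_package', 'my_benchmark')` inside the method that needs the package, rather than at module level, keeps auto-discovery from failing for users who never run that benchmark.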
Run the doc pipeline to auto-generate benchmark documentation:
make docs-update BENCHMARK=my_benchmark
make docs-translate BENCHMARK=my_benchmark
make docs-generate

# Check it's registered
evalscope eval --benchmarks my_benchmark --model dummy --limit 5
# Run via service
evalscope service
# Then use the Web dashboard or API: POST /api/v1/eval/invoke

This project uses pre-commit with the following hooks:
- flake8 — Python style checker
- isort — Import sorting
- yapf — Code formatting
- Trailing whitespace, YAML checks, line ending fixes
# Run all checks
make lint
# or
pre-commit run --all-files

# Run all tests
pytest tests/
# Run a specific test
pytest tests/benchmark/test_xxx.py

- Create a branch with a descriptive name:
  git checkout -b feature/my-benchmark
- Make your changes and commit with clear messages:
  git commit -m "feat: add MyBenchmark adapter"
- Run quality checks before pushing:
  pre-commit run --all-files
  pytest tests/
- Push and open a Pull Request against the main branch. Provide a clear description of your changes.
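If you prefer a single command for the pre-push checks, they can be chained in a small script that stops at the first failure. This is a sketch only: `run_checks` and `CHECKS` are hypothetical names, and the script assumes `pre-commit` and `pytest` are on your PATH and that it is run from the repository root.

```python
import subprocess
import sys
from typing import List, Sequence


def run_checks(commands: Sequence[List[str]]) -> int:
    """Run commands in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        print("$", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode  # stop early so the failure stays visible
    return 0


# The same two commands listed in the steps above.
CHECKS = [
    ["pre-commit", "run", "--all-files"],
    ["pytest", "tests/"],
]

# To use it as a script, end the file with: sys.exit(run_checks(CHECKS))
```

A missing executable raises `FileNotFoundError` from `subprocess.run`, which is usually the clearest signal that the dev dependencies are not installed.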
Thank you for your contribution!