TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.
- [2026/06/10] 🧭 Added HY-World 1.5 pose exports to WBench-examples.
- [2026/06/01] WBench is now an official benchmark on Hugging Face 🤗 (navi & full tasks)!
- [2026/06/01] 📦 Released WBench-examples: ready-to-eval videos from HY-World 1.5 & Kling 3.0.
- [2026/06/01] 🎮 Added camera- & action-conditioned examples + web automation (Genie3, Happy Oyster).
- [2026/06/01] Added Claude Code skills 🤖 for generation, evaluation & submission.
- [2026/05/29] Paper ranked #2 🏅 on Hugging Face Daily Papers!
- [2026/05/28] Paper now available on arXiv 📄!
- [2026/05/28] Homepage with interactive leaderboard & dataset gallery is live! 🌐
- [2026/05/28] 🚀 Released the full WBench dataset, evaluation code & model weights.
- A comprehensive evaluation framework with 289 cases, 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
- A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
- 22 automatic metrics spanning 5 complementary dimensions, validated against human judgments, ensuring reliable automatic evaluation at scale.
- Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.
20 Models — Navigation Split (5 Dimensions, sorted by average)
9 Text-driven Models — Full Split (5 Dimensions, sorted by average)
20 Models — Navigation Split (19 metrics)
9 Text-driven Models — Full Split (22 metrics)
# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench
# If you already cloned without submodules
git submodule update --init --recursive
# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/
# Environment 1: wbench-main (all metrics except visual_plausibility)
# 2nd arg = PyTorch's CUDA build — match it to YOUR system (check via `nvcc --version`):
#   cu124 → CUDA 12.x   |   cu121 → CUDA 12.1   |   cu118 → CUDA 11.8
# Always pass it explicitly: if omitted, auto-detection falls back to cu118 when nvcc
# isn't on PATH, which makes the MegaSAM CUDA extensions fail to build on CUDA-12 machines.
bash tools/install.sh wbench-main cu124
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# Verify
conda activate wbench-main
python tools/verify_install.py
# Run evaluation (auto multi-GPU)
python main.py --model your_model

See docs/installation.md for detailed setup instructions.
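The install comments above ask you to pass the CUDA build tag explicitly. As a rough aid, the sketch below derives a tag from the text printed by `nvcc --version`, using only the three tags listed in those comments. The function name `cuda_build_tag` is hypothetical and not part of WBench, and the mapping for CUDA 12 releases other than 12.1 is an assumption read off the "cu124 → CUDA 12.x" comment.

```python
import re


def cuda_build_tag(nvcc_output):
    """Map `nvcc --version` text to one of the tags named in the install comments.

    Returns None when no known tag applies, so the caller can stop and ask
    instead of silently falling back to a default.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None  # nvcc missing or output not recognised
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) == (11, 8):
        return "cu118"
    if (major, minor) == (12, 1):
        return "cu121"
    if major == 12:
        return "cu124"  # assumption: any other CUDA 12.x uses the cu124 build
    return None


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 12.4, V12.4.131"
    print(cuda_build_tag(sample))
```

The returned tag would then be passed as the second argument to tools/install.sh; a None result means you should check your toolkit version by hand.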
Set environment variables for VLM metrics first (we use Doubao-Seed-2.0-lite via Volcengine ARK):
export VLM_API_KEY="<your-ark-api-key>"
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"

- Generate multi-turn videos → place in work_dirs/<model>/videos/case_{id}_combined.mp4
- Run the 4-phase pipeline:
# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7
# Or run phases independently:
python main.py --model my_model --phase precompute # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu # GPU metrics (per-metric)
python main.py --model my_model --phase vlm # VLM metrics (API)
python main.py --model my_model --phase report # Aggregate report

Note: the pipeline above covers 21 of the 22 metrics. The exception is visual_plausibility, which runs in the separate wbench-vp environment (set up in Quick Start):
conda activate wbench-vp
python tools/run_visual_plausibility.py --model my_model # uses all available GPUs

- Results: work_dirs/<model>/evaluation/{metric}/case_{id}.json + report.json
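To check how far an evaluation run got, it can help to count the per-case result files under each metric directory. The sketch below relies only on the `{metric}/case_{id}.json` layout stated above; it does not open the files, because their JSON schema is not described here. `count_case_results` is a hypothetical helper, not a WBench function.

```python
from collections import Counter
from pathlib import Path


def count_case_results(eval_dir):
    """Return {metric_name: number of case_*.json files} for one evaluation dir."""
    counts = Counter()
    for path in Path(eval_dir).glob("*/case_*.json"):
        counts[path.parent.name] += 1  # the parent directory is the metric name
    return dict(counts)


if __name__ == "__main__":
    # Example path following the layout above; "my_model" is a placeholder.
    for metric, n in sorted(count_case_results("work_dirs/my_model/evaluation").items()):
        print(f"{metric}: {n} cases")
```

A metric with far fewer files than its neighbours is a hint that a phase was interrupted, though some metrics may legitimately apply to only a subset of cases.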
# Run specific metrics (by name or dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics quality # all 6 video quality
python main.py --model my_model --phase gpu --metrics consistency # all consistency metrics
# Skip pre-computation if already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3
# Single video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json

Dimensions (--metrics supports these as shorthand):
| Dimension | Metrics |
|---|---|
| quality | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| consistency | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| interaction | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| setting | scene_adherence, subject_adherence |
| physical | visual_plausibility, causal_fidelity |
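The dimension shorthand can be read as a simple name expansion. The sketch below copies the table into a dictionary and expands a mixed list of dimension and metric names; it illustrates the idea only and is not how main.py is implemented. `DIMENSIONS` and `expand_metrics` are hypothetical names.

```python
# Dimension -> metric names, copied from the table above.
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}


def expand_metrics(names):
    """Expand dimension names to their metrics; pass plain metric names through."""
    expanded = []
    for name in names:
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:  # keep order, drop duplicates
                expanded.append(metric)
    return expanded


if __name__ == "__main__":
    print(expand_metrics(["setting", "hpsv3_quality"]))
```

Expanding all five dimensions yields the 22 metrics the benchmark reports.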
WBench supports 3 model types with different control interfaces:
| Type | Input | Cases | Status |
|---|---|---|---|
| Text-conditioned | Text prompt + first-frame image | 289 (all) | ✅ Implemented |
| Camera-conditioned | First-frame image + 6-DoF camera pose | 158 (navi) | ✅ Implemented |
| Action-conditioned | First-frame image + discrete action | 158 (navi) | ✅ Implemented |
from src.models import get_model
# Available: wan, kling, seedance (or register your own)
model = get_model("wan")
# Generate multi-turn video from a case
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)

Each turn: build prompt from interaction → call I2V API → extract last frame → next turn.
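The per-turn chaining described above can be pictured as a short loop. This is a minimal sketch, not the code behind generate_multi_turn: `run_multi_turn` and the `generate_clip` callable are hypothetical, frames are treated as opaque objects, and prompt building is assumed to have happened already.

```python
def run_multi_turn(first_frame, prompts, generate_clip):
    """Chain turns so that each clip starts from the last frame of the previous one.

    generate_clip(start_frame, prompt) is a hypothetical image-to-video call
    that returns a non-empty list of frames.
    """
    frames = []
    start = first_frame
    for prompt in prompts:
        clip = generate_clip(start, prompt)
        if not clip:
            raise ValueError(f"no frames returned for prompt: {prompt!r}")
        frames.extend(clip)
        start = clip[-1]  # the last frame seeds the next turn
    return frames


if __name__ == "__main__":
    # Stand-in generator so the loop can be exercised without a video API.
    def fake_generate(start, prompt):
        return [f"{start}>{prompt}{i}" for i in range(2)]

    print(run_multi_turn("f0", ["walk", "turn"], fake_generate))
```

Because every turn conditions on generated output, errors in an early clip carry into later turns, which is why the benchmark evaluates the combined video.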
Set API credentials:
export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"

The benchmark's navigation actions (W/A/S/D + arrows) are converted to per-turn
{move, yaw, pitch} intent and then to a 6-DoF camera trajectory. Subclass
CameraConditionedModel and implement one hook; case parsing, action→pose
conversion, and video writing are handled for you:
from src.models.camera import CameraConditionedModel
class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

The pose convention (axes, speeds, intrinsics) lives in src/models/camera/poses.py.
Copy and adapt it to your model; the navigation metric normalises scale, so what
matters is matching the per-action intent. Quick look at one case:
python -m src.models.camera.demo --case data/cases/case_1.json # prints poses + renders a preview

Note: Camera/action models only cover the 158 navigation cases (cases containing at least one W/A/S/D/arrow action). When generating at scale, pass only those cases, e.g. via
generate.py --model your_model --cases <navi_list>.
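To make the action → intent → pose conversion in this section concrete, here is a toy version in plain Python. Everything specific in it is an assumption for illustration: the key-to-intent table, the step sizes, OpenCV-style camera axes (+x right, +y down, +z forward), and camera-to-world matrices. The real convention is the one in src/models/camera/poses.py, and this sketch omits the intrinsics K.

```python
import math

MOVE_STEP = 0.1              # assumed translation per key press
TURN_STEP = math.radians(5)  # assumed rotation per key press

# Hypothetical key -> (strafe, forward, yaw, pitch) intent table.
INTENT = {
    "W": (0, 1, 0, 0), "S": (0, -1, 0, 0),
    "A": (-1, 0, 0, 0), "D": (1, 0, 0, 0),
    "LEFT": (0, 0, -1, 0), "RIGHT": (0, 0, 1, 0),
    "UP": (0, 0, 0, 1), "DOWN": (0, 0, 0, -1),
}


def pose_matrix(x, z, yaw, pitch):
    """4x4 camera-to-world matrix with R = R_y(yaw) @ R_x(pitch), t = (x, 0, z)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    return [
        [cy, sy * sp, sy * cp, x],
        [0.0, cp, -sp, 0.0],
        [-sy, cy * sp, cy * cp, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def keys_to_poses(keys):
    """Integrate key presses into a list of poses, starting at the identity."""
    x = z = yaw = pitch = 0.0
    poses = [pose_matrix(x, z, yaw, pitch)]
    for key in keys:
        strafe, forward, dyaw, dpitch = INTENT[key]
        yaw += dyaw * TURN_STEP
        pitch += dpitch * TURN_STEP
        # Move on the ground plane along the current heading (pitch is ignored).
        x += MOVE_STEP * (forward * math.sin(yaw) + strafe * math.cos(yaw))
        z += MOVE_STEP * (forward * math.cos(yaw) - strafe * math.sin(yaw))
        poses.append(pose_matrix(x, z, yaw, pitch))
    return poses


if __name__ == "__main__":
    for row in keys_to_poses(["W", "RIGHT", "W"])[-1]:
        print([round(v, 3) for v in row])
```

Since the navigation metric normalises scale, the exact value of MOVE_STEP matters less than keeping the direction of each action consistent with its intent.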
Two flavours, both fed from the same per-turn navigation plan:
Programmatic controllers (e.g. Matrix-Game-3). Subclass ActionConditionedModel
and implement generate_with_actions. Each action carries both raw key tokens
and an MG3-style {keyboard, mouse} tensor:
from src.models.action import ActionConditionedModel
class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

python -m src.models.action.demo --case data/cases/case_1.json # prints actions + renders a preview

Web products (e.g. Project Genie, Happy Oyster): no weights/API; driven by
browser automation + simulated keystrokes. See
src/models/action/web/.
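For programmatic controllers, each action entry pairs the raw key tokens with a numeric keyboard/mouse encoding. The sketch below shows one plausible shape for such an entry, using the five field names from the comment above. The encoding itself is a guess for illustration (a multi-hot vector over W/A/S/D and a two-value mouse delta from the arrow keys, as plain lists rather than tensors); `encode_turn` is hypothetical, and the real layout is whatever ActionConditionedModel passes to your hook.

```python
KEYS = ["W", "A", "S", "D"]  # assumed keyboard channel order

# Assumed arrow-key -> (yaw, pitch) mouse deltas.
MOUSE = {
    "LEFT": (-1.0, 0.0), "RIGHT": (1.0, 0.0),
    "UP": (0.0, 1.0), "DOWN": (0.0, -1.0),
}


def encode_turn(turn, tokens, duration):
    """Build one action entry carrying raw tokens and a numeric encoding."""
    keyboard = [1.0 if key in tokens else 0.0 for key in KEYS]
    mouse = [0.0, 0.0]
    for token in tokens:
        dx, dy = MOUSE.get(token, (0.0, 0.0))
        mouse[0] += dx
        mouse[1] += dy
    return {
        "turn": turn,
        "tokens": list(tokens),
        "keyboard": keyboard,
        "mouse": mouse,
        "duration": duration,
    }


if __name__ == "__main__":
    print(encode_turn(0, ["W", "RIGHT"], 2.0))
```

Keeping the raw tokens next to the numeric encoding lets a model with a different input layout re-encode the same plan without re-parsing the case.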
If you use Claude Code, this repo ships skills that drive the full workflow — just ask in natural language and Claude runs the right commands:
| Skill | Triggers on | What it does |
|---|---|---|
| wbench-generate | "generate kling videos" | Runs generate.py over the dataset → work_dirs/<model>/videos/ |
| wbench-evaluate | "evaluate kling3" | Runs the 4-phase main.py pipeline (precompute → gpu → vlm → report) |
| wbench-submit | "package my model for submission" | Builds the meta.json / turns.json bundle and uploads to HuggingFace |
| genie3 / happy | "run case_5 on genie3" | Browser automation for the web products (details) |
Skills live in .claude/skills/ (and src/models/action/web/.claude/skills/) and
are auto-discovered when you open the repo in Claude Code.
- Text-conditioned model generation (Wan, Kling, Seedance)
- Homepage with interactive leaderboard
- Dataset and weights release on HuggingFace
- Camera-conditioned model generation example
- Action-conditioned model generation example
- Hosted submission & evaluation service (submit videos, get scores)
- ArXiv paper release
If you find our work useful, please consider citing:
@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}

This project builds upon the following excellent works:
- WorldScore — World model evaluation framework
- VBench — Video quality metrics
- SAM2 — Segment Anything Model 2 for mask tracking
- Depth-Anything-V3 — Monocular depth estimation
- MegaSAM — Camera pose estimation
- DreamSim — Perceptual similarity metric
- HPSv3 — Human Preference Score
- AMT — Frame interpolation for motion smoothness
- RAFT — Optical flow estimation
- TransNetV2 — Scene boundary detection
- ... and many other excellent open-source projects
Feel free to open an Issue or Pull Request. You can also reach us directly:
- Kaining Ying: kaining.ying.cv@gmail.com
- Siyu Ren: rensiyu07@meituan.com
Code and data: MIT License. Model weights retain their original licenses.


