WBench: A Comprehensive Multi-turn Benchmark for
Interactive Video World Model Evaluation

Is Your World Model an All-Round Player?

TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.

📢 News

[2026/06/10] 🧭 Added HY-World 1.5 pose exports to WBench-examples.
[2026/06/01] WBench is now an official benchmark on Hugging Face 🤗 (navi & full tasks)!
[2026/06/01] 📦 Released WBench-examples: ready-to-eval videos from HY-World 1.5 & Kling 3.0.
[2026/06/01] 🎮 Added camera- & action-conditioned examples + web automation (Genie3, Happy Oyster).
[2026/06/01] Added Claude Code skills 🤖 for generation, evaluation & submission.
[2026/05/29] Paper ranked #2 🏅 on Hugging Face Daily Papers!
[2026/05/28] Paper now available on arXiv 📄!
[2026/05/28] Homepage with interactive leaderboard & dataset gallery is live! 🌐
[2026/05/28] 🚀 Released the full WBench dataset, evaluation code & model weights.

✨ Contributions

A comprehensive evaluation framework with 289 cases and 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
22 automatic metrics spanning 5 complementary dimensions, validated against human judgments, ensuring reliable automatic evaluation at scale.
Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.

🏆 Leaderboard

20 Models — Navigation Split (5 Dimensions, sorted by average)

#	Model	Average	Quality	Setting	Interaction	Consistency	Physical
1	Kling 3.0	79.2 🥇	83.0 🥈	91.0 🥈	70.3	82.5	69.3 🥉
2	LingBot-World	78.8 🥈	81.5	72.6	79.8	88.9 🥇	71.2 🥈
3	Wan 2.7	78.5 🥉	82.6 🥉	91.4 🥇	66.0	80.5	71.8 🥇
4	HY-World 1.5	78.4	80.2	72.2	87.5 🥇	86.0	66.3
5	HY-Video 1.5	78.2	79.7	85.6 🥉	71.8	86.7 🥉	67.4
6	Happy Oyster	77.1	79.3	74.2	85.1 🥈	83.3	63.5
7	Seedance 1.5	76.5	83.2 🥇	82.9	68.0	80.2	68.4
8	Cosmos 2.5	75.2	75.6	83.3	64.1	85.6	67.4
9	LTX 2.3	74.4	78.7	85.2	67.6	75.6	64.9
10	InSpatio-World	74.3	74.9	71.4	72.8	87.4 🥈	65.2
11	Fantasy-World	74.2	75.5	71.3	72.1	85.3	66.8
12	Genie 3	74.1	77.4	72.5	73.3	81.4	65.7
13	LongCat-Video	73.7	78.2	72.3	63.1	85.9	68.9
14	YUME 1.5	73.5	79.5	72.4	72.0	78.6	65.2
15	Infinite-World	72.9	78.7	69.3	75.9	78.7	62.1
16	MatrixGame3	71.2	76.9	63.6	83.5 🥉	72.9	59.3
17	Kairos 3.0	70.7	76.4	70.3	65.1	81.4	60.4
18	HY-GameCraft	68.5	74.9	66.6	67.8	70.6	62.4
19	MatrixGame2	68.5	75.7	67.1	80.6	62.0	57.2
20	Astra	64.0	69.7	59.6	67.7	71.6	51.4
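
The Average column in the table above is consistent with an unweighted mean of the five dimension scores, rounded to one decimal. The snippet below is only a sanity check on three rows copied from that table; it is not taken from the WBench report code, and other rows could differ by 0.1 because the published dimension scores are themselves rounded.

```python
# Rows copied from the Navigation Split table:
# (Quality, Setting, Interaction, Consistency, Physical), reported Average
rows = {
    "Kling 3.0":     ([83.0, 91.0, 70.3, 82.5, 69.3], 79.2),
    "LingBot-World": ([81.5, 72.6, 79.8, 88.9, 71.2], 78.8),
    "Astra":         ([69.7, 59.6, 67.7, 71.6, 51.4], 64.0),
}

def dimension_average(scores):
    """Unweighted mean of the dimension scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)

for model, (scores, reported) in rows.items():
    # Prints the recomputed average next to the value reported in the table.
    print(model, dimension_average(scores), reported)
```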

9 Text-driven Models — Full Split (5 Dimensions, sorted by average)

#	Model	Average	Quality	Setting	Interaction	Consistency	Physical
1	Kling 3.0	79.5 🥇	81.8 🥉	91.0 🥈	73.1 🥇	82.6	69.2 🥈
2	Wan 2.7	78.2 🥈	82.2 🥈	91.4 🥇	72.1 🥈	73.8	71.6 🥇
3	Seedance 1.5	76.2 🥉	83.0 🥇	82.9	68.3 🥉	78.5	68.2
4	HY-Video 1.5	74.6	78.9	85.6 🥉	54.7	86.8 🥇	67.1
5	LTX 2.3	71.0	78.8	85.2	49.4	76.4	65.1
6	Cosmos 2.5	70.8	74.6	83.3	43.5	85.4 🥉	67.0
7	LongCat-Video	70.2	79.7	72.3	45.1	85.5 🥈	68.4 🥉
8	YUME 1.5	69.0	79.7	72.4	48.4	79.3	65.4
9	Kairos 3.0	66.0	75.8	70.3	41.6	81.9	60.5

20 Models — Navigation Split (19 metrics)

Model	Aesthetic Quality	Imaging Quality	Background Consistency	Temporal Flickering	Dynamic Degree	Motion Smoothness	HPSv3 Quality	Scene Adherence	Subject Adherence	Navigation Trajectory	Spatial Consistency	Gated Spatial Consistency	Perspective Consistency	Segment Continuity	Geometric Consistency	Photometric Consistency	Subject Consistency Cross-Model	Visual Plausibility	Causal Fidelity
HY-Video 1.5	63.4	67.4	92.1	94.2	73.9	98.7	68.0	77.5	93.6	71.8	79.2	75.1	86.6	99.4	94.6	80.3	91.6	59.7	75.0
Kling 3.0	63.0	68.1	92.3	93.2	97.5	97.6	69.1	89.0	92.9	70.3	75.2	75.1	76.8	93.0	88.9	79.9	88.5	60.7	78.0
Cosmos 2.5	61.8	66.9	92.3	94.8	49.0	98.2	66.5	72.4	94.2	64.1	78.1	74.3	84.3	94.3	94.6	81.6	92.3	60.1	74.7
LTX 2.3	57.9	61.0	88.3	93.2	98.1	96.4	56.1	81.3	89.2	67.6	70.2	70.2	69.8	75.8	76.9	79.2	87.2	55.7	74.0
Seedance 1.5	61.0	69.3	89.6	92.4	99.4	97.5	73.0	71.6	94.2	68.0	72.7	72.4	70.5	96.2	82.4	76.8	90.1	60.7	76.0
Wan 2.7	61.4	68.0	89.4	92.2	100.0	96.3	71.1	88.3	94.6	66.0	71.0	71.0	78.2	92.4	83.7	76.4	90.7	60.3	83.3
Kairos 3.0	59.9	62.7	91.1	95.4	70.1	97.5	58.5	52.2	88.5	65.1	76.8	62.0	76.3	94.3	89.0	80.8	90.8	58.0	62.7
LongCat-Video	66.5	69.6	95.1	94.8	45.9	97.9	77.6	53.1	91.5	63.1	83.3	66.2	81.5	99.4	95.4	82.2	93.4	61.8	76.0
YUME 1.5	58.7	63.3	90.3	93.0	96.8	97.0	57.0	53.1	91.7	72.0	71.5	71.4	48.0	99.4	88.0	83.3	88.8	57.7	72.7
Astra	48.6	52.5	85.3	96.0	79.6	97.7	28.0	43.4	75.9	67.7	64.7	63.3	30.0	86.6	85.6	87.5	83.5	54.6	48.3
Fantasy-World	63.0	62.8	94.2	95.8	49.0	97.9	65.8	52.4	90.1	72.1	80.6	64.2	79.8	100.0	95.3	84.8	92.5	59.7	74.0
HY-GameCraft	52.6	58.7	86.5	93.7	96.8	97.6	38.3	50.6	82.5	67.8	60.5	60.5	17.9	99.4	88.3	85.0	82.6	56.5	68.3
Genie 3	51.6	59.3	90.7	95.0	92.4	97.8	55.2	61.1	83.8	73.3	79.9	78.4	54.5	93.6	88.6	84.5	90.4	59.7	71.7
Happy Oyster	56.6	63.9	91.4	94.0	94.2	97.0	58.3	57.4	91.1	85.1	77.7	75.8	75.0	96.2	87.2	79.8	91.5	57.6	69.3
HY-World 1.5	60.1	65.4	92.7	93.5	91.1	98.1	60.5	53.5	90.8	87.5	90.6	84.9	62.5	100.0	92.0	83.1	89.1	58.6	74.0
Infinite-World	58.7	66.1	88.8	94.1	82.8	98.0	62.3	54.0	84.5	75.9	74.9	74.4	33.8	100.0	94.3	85.1	88.4	57.2	67.0
InSpatio-World	64.4	67.6	95.0	96.0	26.1	98.8	76.1	51.7	91.1	72.8	93.8	66.5	72.5	100.0	97.3	87.4	94.4	63.1	67.3
LingBot-World	66.9	67.9	96.9	94.1	66.2	96.9	81.4	51.6	93.6	79.8	92.7	67.1	90.9	99.4	95.4	83.3	93.5	64.8	77.7
MatrixGame2	54.0	60.3	86.9	94.6	94.9	98.2	41.0	49.4	84.9	80.6	64.5	64.5	29.2	21.0	86.1	81.3	87.2	55.0	59.3
MatrixGame3	46.4	70.0	85.7	86.3	97.5	95.4	57.1	48.9	78.4	83.5	81.0	80.4	13.3	89.8	87.6	75.3	83.0	54.0	64.7

9 Text-driven Models — Full Split (22 metrics)

Model	Aesthetic Quality	Imaging Quality	Background Consistency	Temporal Flickering	Dynamic Degree	Motion Smoothness	HPSv3 Quality	Scene Adherence	Subject Adherence	Navigation Trajectory	Event Edit Adherence	Subject Action Adherence	Perspective Switch Adherence	Spatial Consistency	Gated Spatial Consistency	Perspective Consistency	Segment Continuity	Geometric Consistency	Photometric Consistency	Subject Consistency Cross-Model	Visual Plausibility	Causal Fidelity
HY-Video 1.5	61.9	67.4	92.4	95.5	68.8	98.8	67.5	77.5	93.6	71.8	63.8	55.6	27.6	79.2	75.1	86.6	99.3	94.4	81.4	91.5	59.3	75.0
Kling 3.0	61.3	67.7	92.7	94.5	89.9	97.9	68.8	89.0	92.9	70.3	81.4	85.6	55.0	75.2	75.1	76.8	92.7	89.4	80.4	88.5	60.4	78.0
Cosmos 2.5	60.1	67.2	92.3	96.0	42.4	98.3	65.9	72.4	94.2	64.1	48.2	41.6	20.0	78.1	74.3	84.3	93.1	94.2	82.1	91.8	59.3	74.7
LTX 2.3	56.9	62.3	89.3	94.1	94.4	96.8	57.7	81.3	89.2	67.6	53.0	51.8	25.0	70.2	70.2	69.8	77.8	81.1	79.4	86.7	56.2	74.0
Seedance 1.5	59.7	69.8	89.6	93.4	98.3	97.6	72.9	71.6	94.2	68.0	80.4	80.0	45.0	72.7	72.4	62.7	92.4	83.5	76.7	89.3	60.5	76.0
Wan 2.7	59.6	68.1	89.5	93.0	99.3	96.5	69.4	88.3	94.6	66.0	84.0	83.4	55.0	71.0	71.0	62.2	65.6	82.6	75.5	88.7	59.8	83.3
Kairos 3.0	58.4	63.6	91.8	96.3	63.5	97.9	58.8	52.2	88.5	65.1	46.8	41.4	13.3	76.8	62.0	76.3	94.1	91.5	82.1	90.7	58.2	62.7
LongCat-Video	64.7	69.8	94.7	94.9	59.7	97.7	76.3	53.1	91.5	63.1	50.4	48.4	18.3	83.3	66.2	81.5	98.6	94.7	81.5	92.4	60.8	76.0
YUME 1.5	59.3	65.7	92.0	94.8	86.1	97.7	62.0	53.1	91.7	72.0	57.8	47.0	16.7	71.5	71.4	48.0	99.3	91.1	84.1	89.4	58.1	72.7

🚀 Quick Start

# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench

# If you already cloned without submodules
git submodule update --init --recursive

# Download data and weights
pip install huggingface_hub
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/

# Environment 1: wbench-main (all metrics except visual_plausibility)
# 2nd arg = PyTorch's CUDA build — match it to YOUR system (check via `nvcc --version`):
#   cu124 → CUDA 12.x    cu121 → CUDA 12.1    cu118 → CUDA 11.8
# Always pass it explicitly: if omitted, auto-detection falls back to cu118 when nvcc
# isn't on PATH, which makes the MegaSAM CUDA extensions fail to build on CUDA-12 machines.
bash tools/install.sh wbench-main cu124
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH



# Verify
conda activate wbench-main
python tools/verify_install.py

# Run evaluation (auto multi-GPU)
python main.py --model your_model

See docs/installation.md for detailed setup instructions.
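
If you are unsure which tag to pass as the second argument of tools/install.sh, the mapping from the comments above can be sketched in a few lines of Python. This is a hypothetical helper, not part of WBench: `cuda_build_tag` is an invented name, and treating every CUDA 12 release other than 12.1 as `cu124` is an assumption based on the "cu124 → CUDA 12.x" note.

```python
import re

def cuda_build_tag(nvcc_output: str) -> str:
    """Map the text printed by `nvcc --version` to cu118 / cu121 / cu124."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found; pass the tag to install.sh by hand")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) == (11, 8):
        return "cu118"
    if (major, minor) == (12, 1):
        return "cu121"
    if major == 12:
        # Assumption: any other CUDA 12 release uses the cu124 build.
        return "cu124"
    raise ValueError(f"CUDA {major}.{minor} is not covered by the tags listed above")

# Sample line in the format nvcc prints; replace it with your own output.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(cuda_build_tag(sample))  # cu124
```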

🎮 Evaluate Your Model

Set environment variables for VLM metrics first (we use Doubao-Seed-2.0-lite via Volcengine ARK):

export VLM_API_KEY="<your-ark-api-key>"
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
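
A small preflight check can catch a missing key before the VLM phase starts. The sketch below only mirrors the three variables and defaults shown above; `read_vlm_config` is a hypothetical helper, and how WBench itself reads these variables is not shown here.

```python
import os

# Defaults copied from the commented-out exports above.
DEFAULT_API_URL = "https://ark.cn-beijing.volces.com/api/v3"
DEFAULT_MODEL_NAME = "doubao-seed-2-0-lite-260215"

def read_vlm_config(env=os.environ):
    """Return the VLM settings, failing early if the API key is missing."""
    api_key = env.get("VLM_API_KEY", "")
    if not api_key or api_key.startswith("<"):
        raise RuntimeError("VLM_API_KEY is not set (or is still the placeholder)")
    return {
        "api_key": api_key,
        "api_url": env.get("VLM_API_URL", DEFAULT_API_URL),
        "model_name": env.get("VLM_MODEL_NAME", DEFAULT_MODEL_NAME),
    }

# Demo with an explicit mapping; call read_vlm_config() to use the real environment.
print(read_vlm_config({"VLM_API_KEY": "demo-key"})["model_name"])
```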

Generate multi-turn videos → place in work_dirs/<model>/videos/case_{id}_combined.mp4
Run the 4-phase pipeline:

# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7

# Or run phases independently:
python main.py --model my_model --phase precompute    # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu           # GPU metrics (per-metric)
python main.py --model my_model --phase vlm           # VLM metrics (API)
python main.py --model my_model --phase report        # Aggregate report
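
Before launching the pipeline, it can help to confirm that every video sits at the expected path. The sketch below assumes only the `work_dirs/<model>/videos/case_{id}_combined.mp4` layout described above; `missing_videos` is a hypothetical helper, and the case ids you pass in are up to you.

```python
from pathlib import Path

def missing_videos(model, case_ids, work_dir="work_dirs"):
    """Return the case ids whose case_{id}_combined.mp4 file is absent."""
    video_dir = Path(work_dir) / model / "videos"
    return [
        cid for cid in case_ids
        if not (video_dir / f"case_{cid}_combined.mp4").is_file()
    ]

# Lists the ids that still need a video for this model.
print(missing_videos("my_model", ["1", "2"]))
```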

Note: the pipeline above covers 21 of the 22 metrics. visual_plausibility is the exception — it runs in the separate wbench-vp environment (set up in Quick Start):

conda activate wbench-vp
python tools/run_visual_plausibility.py --model my_model  # uses all available GPUs

Results: work_dirs/<model>/evaluation/{metric}/case_{id}.json + report.json
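
To see how far an evaluation has progressed, you can count the per-case files under each metric directory. This is a minimal sketch that relies only on the `evaluation/{metric}/case_{id}.json` layout above; `count_results` is a hypothetical name, and the JSON contents are deliberately not parsed because their schema is not documented here.

```python
from pathlib import Path

def count_results(model, work_dir="work_dirs"):
    """Map each metric directory name to the number of case_*.json files in it."""
    eval_dir = Path(work_dir) / model / "evaluation"
    counts = {}
    if eval_dir.is_dir():
        for metric_dir in sorted(p for p in eval_dir.iterdir() if p.is_dir()):
            counts[metric_dir.name] = len(list(metric_dir.glob("case_*.json")))
    return counts

# An empty dict means no evaluation directory exists yet for this model.
print(count_results("my_model"))
```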

# Run specific metrics (by name or dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics quality         # all 6 video quality metrics
python main.py --model my_model --phase gpu --metrics consistency     # all consistency metrics

# Skip pre-computation if already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3

# Single video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json

Dimensions (--metrics supports these as shorthand):

Dimension	Metrics
`quality`	aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality
`consistency`	background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency
`interaction`	navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence
`setting`	scene_adherence, subject_adherence
`physical`	visual_plausibility, causal_fidelity
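
The shorthand in the table above can be pictured as a simple lookup. The sketch below copies the dimension lists verbatim and expands a mixed list of dimension and metric names; `expand_metrics` is a hypothetical helper, not the parser main.py actually uses.

```python
# Dimension -> metric names, copied from the table above.
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}

def expand_metrics(names):
    """Expand dimension shorthands into metric names, keeping order, dropping duplicates."""
    expanded = []
    for name in names:
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:
                expanded.append(metric)
    return expanded

print(expand_metrics(["setting", "hpsv3_quality"]))
# ['scene_adherence', 'subject_adherence', 'hpsv3_quality']
print(sum(len(metrics) for metrics in DIMENSIONS.values()))  # 22
```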

🔌 Implement Your Model

WBench supports 3 model types with different control interfaces:

Type	Input	Cases	Status
Text-conditioned	Text prompt + first-frame image	289 (all)	✅ Implemented
Camera-conditioned	First-frame image + 6-DoF camera pose	158 (navi)	✅ Implemented
Action-conditioned	First-frame image + discrete action	158 (navi)	✅ Implemented

Text-conditioned models

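import json

from src.models import get_model

# Available: wan, kling, seedance (or register your own)
model = get_model("wan")

# Assumption: a case is the parsed JSON of one file under data/cases/
with open("data/cases/case_1.json") as f:
    case_dict = json.load(f)

# Generate multi-turn video from a case
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)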

Each turn: build prompt from interaction → call I2V API → extract last frame → next turn.
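
The per-turn chaining can be sketched as a plain loop. This is an illustration under stated assumptions, not the code behind `generate_multi_turn`: the three callables (`build_prompt`, `call_i2v`, `last_frame`) are hypothetical stand-ins, and the toy lambdas exist only so the loop runs without any API.

```python
def generate_multi_turn_sketch(first_frame, interactions, build_prompt, call_i2v, last_frame):
    """Chain turns: each clip starts from the last frame of the previous one."""
    clips = []
    frame = first_frame
    for interaction in interactions:
        prompt = build_prompt(interaction)  # 1. build the prompt from the interaction
        clip = call_i2v(prompt, frame)      # 2. image-to-video call
        clips.append(clip)
        frame = last_frame(clip)            # 3. the last frame seeds the next turn
    return clips

# Toy stand-ins: a "clip" is just a two-element list of frame labels.
clips = generate_multi_turn_sketch(
    first_frame="frame0",
    interactions=["walk forward", "turn left"],
    build_prompt=lambda interaction: f"The camera should {interaction}.",
    call_i2v=lambda prompt, frame: [frame, f"{frame}>{prompt}"],
    last_frame=lambda clip: clip[-1],
)
print(len(clips))  # 2
```

Because each turn conditions on the previous turn's final frame, any drift in that frame carries forward, which is one reason multi-turn consistency is measured separately from single-clip quality.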

Set API credentials:

export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"

Camera-conditioned models

The benchmark's navigation actions (W/A/S/D + arrows) are converted to per-turn {move, yaw, pitch} intent and then to a 6-DoF camera trajectory. Subclass CameraConditionedModel and implement one hook — case parsing, action→pose conversion, and video writing are handled for you:

from src.models.camera import CameraConditionedModel

class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

The pose convention (axes, speeds, intrinsics) lives in src/models/camera/poses.py — copy and adapt it to your model; the navigation metric normalises scale, so what matters is matching the per-action intent. Quick look at one case:

python -m src.models.camera.demo --case data/cases/case_1.json   # prints poses + renders a preview
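
For intuition, the key → intent → pose chain can be sketched without any dependencies. Everything numeric here is an assumption made for illustration: the key table, the OpenCV-style axes (x right, y down, z forward), camera-to-world extrinsics, the per-frame step and turn rate, the placeholder intrinsics, and the use of consecutive frame indices as keys. The authoritative convention is the one in src/models/camera/poses.py.

```python
import math

# Hypothetical key -> intent table; "move" is (right, forward).
INTENT = {
    "W": {"move": (0.0, 1.0)}, "S": {"move": (0.0, -1.0)},
    "A": {"move": (-1.0, 0.0)}, "D": {"move": (1.0, 0.0)},
    "LEFT": {"yaw": -1.0}, "RIGHT": {"yaw": 1.0},
    "UP": {"pitch": 1.0}, "DOWN": {"pitch": -1.0},
}
STEP, TURN = 0.1, math.radians(2.0)  # assumed per-frame speed and turn rate
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]  # placeholder intrinsics

def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def camera_to_world(pos, yaw, pitch):
    """Build a 4x4 pose from position, yaw (about y) and pitch (about x)."""
    cy, sy, cp, sp = math.cos(yaw), math.sin(yaw), math.cos(pitch), math.sin(pitch)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    r = matmul3(ry, rx)
    return [r[0] + [pos[0]], r[1] + [pos[1]], r[2] + [pos[2]], [0.0, 0.0, 0.0, 1.0]]

def keys_to_poses(keys, frames_per_key):
    """Integrate one key per turn into a dict of per-frame poses."""
    pos, yaw, pitch = [0.0, 0.0, 0.0], 0.0, 0.0
    poses = {"0": {"extrinsic": camera_to_world(pos, yaw, pitch), "K": K}}
    for key in keys:
        intent = INTENT[key]
        right, forward = intent.get("move", (0.0, 0.0))
        for _ in range(frames_per_key):
            yaw += TURN * intent.get("yaw", 0.0)
            pitch += TURN * intent.get("pitch", 0.0)
            # Translate in the horizontal plane along the current heading.
            pos[0] += STEP * (forward * math.sin(yaw) + right * math.cos(yaw))
            pos[2] += STEP * (forward * math.cos(yaw) - right * math.sin(yaw))
            poses[str(len(poses))] = {"extrinsic": camera_to_world(pos, yaw, pitch), "K": K}
    return poses

poses = keys_to_poses(["W", "RIGHT"], frames_per_key=4)
print(len(poses))  # 9
```

Since the navigation metric normalises scale, the exact value of `STEP` matters less than the direction each key moves or turns the camera.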

Note: Camera/action models only cover the 158 navigation cases (cases containing at least one W/A/S/D/arrow action). When generating at scale, pass only those cases — e.g. via generate.py --model your_model --cases <navi_list>.

Action-conditioned models

Two flavours, both fed from the same per-turn navigation plan:

Programmatic controllers (e.g. Matrix-Game-3). Subclass ActionConditionedModel and implement generate_with_actions. Each action carries both raw key tokens and an MG3-style {keyboard, mouse} tensor:

from src.models.action import ActionConditionedModel

class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

python -m src.models.action.demo --case data/cases/case_1.json   # prints actions + renders a preview

Web products (e.g. Project Genie, Happy Oyster) — no weights/API; driven by browser automation + simulated keystrokes. See src/models/action/web/.

🤖 Claude Code Skills

If you use Claude Code, this repo ships skills that drive the full workflow — just ask in natural language and Claude runs the right commands:

Skill	Triggers on	What it does
`wbench-generate`	"generate kling videos"	Runs `generate.py` over the dataset → `work_dirs/<model>/videos/`
`wbench-evaluate`	"evaluate kling3"	Runs the 4-phase `main.py` pipeline (precompute → gpu → vlm → report)
`wbench-submit`	"package my model for submission"	Builds the `meta.json` / `turns.json` bundle and uploads to HuggingFace
`genie3` / `happy`	"run case_5 on genie3"	Browser automation for the web products (see src/models/action/web/)

Skills live in .claude/skills/ (and src/models/action/web/.claude/skills/) and are auto-discovered when you open the repo in Claude Code.

📋 TODO

Text-conditioned model generation (Wan, Kling, Seedance)
Homepage with interactive leaderboard
Dataset and weights release on HuggingFace
Camera-conditioned model generation example
Action-conditioned model generation example
Hosted submission & evaluation service (submit videos, get scores)
ArXiv paper release

📝 Citation

If you find our work useful, please consider citing:

@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}

🙏 Acknowledgement

This project builds upon the following excellent works:

WorldScore — World model evaluation framework
VBench — Video quality metrics
SAM2 — Segment Anything Model 2 for mask tracking
Depth-Anything-V3 — Monocular depth estimation
MegaSAM — Camera pose estimation
DreamSim — Perceptual similarity metric
HPSv3 — Human Preference Score
AMT — Frame interpolation for motion smoothness
RAFT — Optical flow estimation
TransNetV2 — Scene boundary detection
... and many other excellent open-source projects

📧 Contact

Feel free to open an Issue or Pull Request. You can also reach us directly:

Kaining Ying: kaining.ying.cv@gmail.com
Siyu Ren: rensiyu07@meituan.com

📄 License

Code and data: MIT License. Model weights retain their original licenses.
