WBench: A Comprehensive Multi-turn Benchmark for
Interactive Video World Model Evaluation

Is Your World Model an All-Round Player?

TL;DR — WBench evaluates 20 video world models across 5 dimensions and 22 metrics.

📢 News

✨ Contributions

  • A comprehensive evaluation framework with 289 cases, 1,058 interaction turns, covering 4 interaction types (navigation, subject action, event editing, perspective switching) across diverse scenes and perspectives.
  • A unified navigation protocol that bridges text, 6-DoF camera pose, and discrete-action interfaces, enabling fair comparison across model families.
  • 22 automatic metrics spanning 5 complementary dimensions, validated against human judgments, ensuring reliable automatic evaluation at scale.
  • Systematic diagnosis of 20 models revealing that current world models have not yet unified high-fidelity rendering with reliable controllability, consistency, and physics compliance.

🏆 Leaderboard

20 Models — Navigation Split (5 Dimensions, sorted by average)

| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kling 3.0 | 79.2 🥇 | 83.0 🥈 | 91.0 🥈 | 70.3 | 82.5 | 69.3 🥉 |
| 2 | LingBot-World | 78.8 🥈 | 81.5 | 72.6 | 79.8 | 88.9 🥇 | 71.2 🥈 |
| 3 | Wan 2.7 | 78.5 🥉 | 82.6 🥉 | 91.4 🥇 | 66.0 | 80.5 | 71.8 🥇 |
| 4 | HY-World 1.5 | 78.4 | 80.2 | 72.2 | 87.5 🥇 | 86.0 | 66.3 |
| 5 | HY-Video 1.5 | 78.2 | 79.7 | 85.6 🥉 | 71.8 | 86.7 🥉 | 67.4 |
| 6 | Happy Oyster | 77.1 | 79.3 | 74.2 | 85.1 🥈 | 83.3 | 63.5 |
| 7 | Seedance 1.5 | 76.5 | 83.2 🥇 | 82.9 | 68.0 | 80.2 | 68.4 |
| 8 | Cosmos 2.5 | 75.2 | 75.6 | 83.3 | 64.1 | 85.6 | 67.4 |
| 9 | LTX 2.3 | 74.4 | 78.7 | 85.2 | 67.6 | 75.6 | 64.9 |
| 10 | InSpatio-World | 74.3 | 74.9 | 71.4 | 72.8 | 87.4 🥈 | 65.2 |
| 11 | Fantasy-World | 74.2 | 75.5 | 71.3 | 72.1 | 85.3 | 66.8 |
| 12 | Genie 3 | 74.1 | 77.4 | 72.5 | 73.3 | 81.4 | 65.7 |
| 13 | LongCat-Video | 73.7 | 78.2 | 72.3 | 63.1 | 85.9 | 68.9 |
| 14 | YUME 1.5 | 73.5 | 79.5 | 72.4 | 72.0 | 78.6 | 65.2 |
| 15 | Infinite-World | 72.9 | 78.7 | 69.3 | 75.9 | 78.7 | 62.1 |
| 16 | MatrixGame3 | 71.2 | 76.9 | 63.6 | 83.5 🥉 | 72.9 | 59.3 |
| 17 | Kairos 3.0 | 70.7 | 76.4 | 70.3 | 65.1 | 81.4 | 60.4 |
| 18 | HY-GameCraft | 68.5 | 74.9 | 66.6 | 67.8 | 70.6 | 62.4 |
| 19 | MatrixGame2 | 68.5 | 75.7 | 67.1 | 80.6 | 62.0 | 57.2 |
| 20 | Astra | 64.0 | 69.7 | 59.6 | 67.7 | 71.6 | 51.4 |
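
The Average column appears to be the unweighted mean of the five dimension scores: the top three rows reproduce under that reading. A minimal check, with the numbers copied from the table above (the dictionary layout and the `average` helper are illustrative, not part of the repo):

```python
# Dimension scores copied from the Navigation Split leaderboard, in the order
# (Quality, Setting, Interaction, Consistency, Physical).
rows = {
    "Kling 3.0": (83.0, 91.0, 70.3, 82.5, 69.3),
    "LingBot-World": (81.5, 72.6, 79.8, 88.9, 71.2),
    "Wan 2.7": (82.6, 91.4, 66.0, 80.5, 71.8),
}


def average(scores):
    """Unweighted mean of the dimension scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


averages = {model: average(scores) for model, scores in rows.items()}
print(averages)
```

Because the published dimension scores are already rounded, a row whose true mean sits near a rounding boundary could differ from the table by 0.1.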

9 Text-driven Models — Full Split (5 Dimensions, sorted by average)

| # | Model | Average | Quality | Setting | Interaction | Consistency | Physical |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Kling 3.0 | 79.5 🥇 | 81.8 🥉 | 91.0 🥈 | 73.1 🥇 | 82.6 | 69.2 🥈 |
| 2 | Wan 2.7 | 78.2 🥈 | 82.2 🥈 | 91.4 🥇 | 72.1 🥈 | 73.8 | 71.6 🥇 |
| 3 | Seedance 1.5 | 76.2 🥉 | 83.0 🥇 | 82.9 | 68.3 🥉 | 78.5 | 68.2 |
| 4 | HY-Video 1.5 | 74.6 | 78.9 | 85.6 🥉 | 54.7 | 86.8 🥇 | 67.1 |
| 5 | LTX 2.3 | 71.0 | 78.8 | 85.2 | 49.4 | 76.4 | 65.1 |
| 6 | Cosmos 2.5 | 70.8 | 74.6 | 83.3 | 43.5 | 85.4 🥉 | 67.0 |
| 7 | LongCat-Video | 70.2 | 79.7 | 72.3 | 45.1 | 85.5 🥈 | 68.4 🥉 |
| 8 | YUME 1.5 | 69.0 | 79.7 | 72.4 | 48.4 | 79.3 | 65.4 |
| 9 | Kairos 3.0 | 66.0 | 75.8 | 70.3 | 41.6 | 81.9 | 60.5 |

20 Models — Navigation Split (19 metrics)

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency | Cross-Model Visual Plausibility | Causal Fidelity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HY-Video 1.5 | 63.4 | 67.4 | 92.1 | 94.2 | 73.9 | 98.7 | 68.0 | 77.5 | 93.6 | 71.8 | 79.2 | 75.1 | 86.6 | 99.4 | 94.6 | 80.3 | 91.6 | 59.7 | 75.0 |
| Kling 3.0 | 63.0 | 68.1 | 92.3 | 93.2 | 97.5 | 97.6 | 69.1 | 89.0 | 92.9 | 70.3 | 75.2 | 75.1 | 76.8 | 93.0 | 88.9 | 79.9 | 88.5 | 60.7 | 78.0 |
| Cosmos 2.5 | 61.8 | 66.9 | 92.3 | 94.8 | 49.0 | 98.2 | 66.5 | 72.4 | 94.2 | 64.1 | 78.1 | 74.3 | 84.3 | 94.3 | 94.6 | 81.6 | 92.3 | 60.1 | 74.7 |
| LTX 2.3 | 57.9 | 61.0 | 88.3 | 93.2 | 98.1 | 96.4 | 56.1 | 81.3 | 89.2 | 67.6 | 70.2 | 70.2 | 69.8 | 75.8 | 76.9 | 79.2 | 87.2 | 55.7 | 74.0 |
| Seedance 1.5 | 61.0 | 69.3 | 89.6 | 92.4 | 99.4 | 97.5 | 73.0 | 71.6 | 94.2 | 68.0 | 72.7 | 72.4 | 70.5 | 96.2 | 82.4 | 76.8 | 90.1 | 60.7 | 76.0 |
| Wan 2.7 | 61.4 | 68.0 | 89.4 | 92.2 | 100.0 | 96.3 | 71.1 | 88.3 | 94.6 | 66.0 | 71.0 | 71.0 | 78.2 | 92.4 | 83.7 | 76.4 | 90.7 | 60.3 | 83.3 |
| Kairos 3.0 | 59.9 | 62.7 | 91.1 | 95.4 | 70.1 | 97.5 | 58.5 | 52.2 | 88.5 | 65.1 | 76.8 | 62.0 | 76.3 | 94.3 | 89.0 | 80.8 | 90.8 | 58.0 | 62.7 |
| LongCat-Video | 66.5 | 69.6 | 95.1 | 94.8 | 45.9 | 97.9 | 77.6 | 53.1 | 91.5 | 63.1 | 83.3 | 66.2 | 81.5 | 99.4 | 95.4 | 82.2 | 93.4 | 61.8 | 76.0 |
| YUME 1.5 | 58.7 | 63.3 | 90.3 | 93.0 | 96.8 | 97.0 | 57.0 | 53.1 | 91.7 | 72.0 | 71.5 | 71.4 | 48.0 | 99.4 | 88.0 | 83.3 | 88.8 | 57.7 | 72.7 |
| Astra | 48.6 | 52.5 | 85.3 | 96.0 | 79.6 | 97.7 | 28.0 | 43.4 | 75.9 | 67.7 | 64.7 | 63.3 | 30.0 | 86.6 | 85.6 | 87.5 | 83.5 | 54.6 | 48.3 |
| Fantasy-World | 63.0 | 62.8 | 94.2 | 95.8 | 49.0 | 97.9 | 65.8 | 52.4 | 90.1 | 72.1 | 80.6 | 64.2 | 79.8 | 100.0 | 95.3 | 84.8 | 92.5 | 59.7 | 74.0 |
| HY-GameCraft | 52.6 | 58.7 | 86.5 | 93.7 | 96.8 | 97.6 | 38.3 | 50.6 | 82.5 | 67.8 | 60.5 | 60.5 | 17.9 | 99.4 | 88.3 | 85.0 | 82.6 | 56.5 | 68.3 |
| Genie 3 | 51.6 | 59.3 | 90.7 | 95.0 | 92.4 | 97.8 | 55.2 | 61.1 | 83.8 | 73.3 | 79.9 | 78.4 | 54.5 | 93.6 | 88.6 | 84.5 | 90.4 | 59.7 | 71.7 |
| Happy Oyster | 56.6 | 63.9 | 91.4 | 94.0 | 94.2 | 97.0 | 58.3 | 57.4 | 91.1 | 85.1 | 77.7 | 75.8 | 75.0 | 96.2 | 87.2 | 79.8 | 91.5 | 57.6 | 69.3 |
| HY-World 1.5 | 60.1 | 65.4 | 92.7 | 93.5 | 91.1 | 98.1 | 60.5 | 53.5 | 90.8 | 87.5 | 90.6 | 84.9 | 62.5 | 100.0 | 92.0 | 83.1 | 89.1 | 58.6 | 74.0 |
| Infinite-World | 58.7 | 66.1 | 88.8 | 94.1 | 82.8 | 98.0 | 62.3 | 54.0 | 84.5 | 75.9 | 74.9 | 74.4 | 33.8 | 100.0 | 94.3 | 85.1 | 88.4 | 57.2 | 67.0 |
| InSpatio-World | 64.4 | 67.6 | 95.0 | 96.0 | 26.1 | 98.8 | 76.1 | 51.7 | 91.1 | 72.8 | 93.8 | 66.5 | 72.5 | 100.0 | 97.3 | 87.4 | 94.4 | 63.1 | 67.3 |
| LingBot-World | 66.9 | 67.9 | 96.9 | 94.1 | 66.2 | 96.9 | 81.4 | 51.6 | 93.6 | 79.8 | 92.7 | 67.1 | 90.9 | 99.4 | 95.4 | 83.3 | 93.5 | 64.8 | 77.7 |
| MatrixGame2 | 54.0 | 60.3 | 86.9 | 94.6 | 94.9 | 98.2 | 41.0 | 49.4 | 84.9 | 80.6 | 64.5 | 64.5 | 29.2 | 21.0 | 86.1 | 81.3 | 87.2 | 55.0 | 59.3 |
| MatrixGame3 | 46.4 | 70.0 | 85.7 | 86.3 | 97.5 | 95.4 | 57.1 | 48.9 | 78.4 | 83.5 | 81.0 | 80.4 | 13.3 | 89.8 | 87.6 | 75.3 | 83.0 | 54.0 | 64.7 |

9 Text-driven Models — Full Split (22 metrics)

| Model | Aesthetic Quality | Imaging Quality | Background Consistency | Temporal Flickering | Dynamic Degree | Motion Smoothness | HPSv3 Quality | Scene Adherence | Subject Adherence | Navigation Trajectory | Event Edit Adherence | Subject Action Adherence | Perspective Switch Adherence | Spatial Consistency | Gated Spatial Consistency | Perspective Consistency | Segment Continuity | Geometric Consistency | Photometric Consistency | Subject Consistency | Cross-Model Visual Plausibility | Causal Fidelity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HY-Video 1.5 | 61.9 | 67.4 | 92.4 | 95.5 | 68.8 | 98.8 | 67.5 | 77.5 | 93.6 | 71.8 | 63.8 | 55.6 | 27.6 | 79.2 | 75.1 | 86.6 | 99.3 | 94.4 | 81.4 | 91.5 | 59.3 | 75.0 |
| Kling 3.0 | 61.3 | 67.7 | 92.7 | 94.5 | 89.9 | 97.9 | 68.8 | 89.0 | 92.9 | 70.3 | 81.4 | 85.6 | 55.0 | 75.2 | 75.1 | 76.8 | 92.7 | 89.4 | 80.4 | 88.5 | 60.4 | 78.0 |
| Cosmos 2.5 | 60.1 | 67.2 | 92.3 | 96.0 | 42.4 | 98.3 | 65.9 | 72.4 | 94.2 | 64.1 | 48.2 | 41.6 | 20.0 | 78.1 | 74.3 | 84.3 | 93.1 | 94.2 | 82.1 | 91.8 | 59.3 | 74.7 |
| LTX 2.3 | 56.9 | 62.3 | 89.3 | 94.1 | 94.4 | 96.8 | 57.7 | 81.3 | 89.2 | 67.6 | 53.0 | 51.8 | 25.0 | 70.2 | 70.2 | 69.8 | 77.8 | 81.1 | 79.4 | 86.7 | 56.2 | 74.0 |
| Seedance 1.5 | 59.7 | 69.8 | 89.6 | 93.4 | 98.3 | 97.6 | 72.9 | 71.6 | 94.2 | 68.0 | 80.4 | 80.0 | 45.0 | 72.7 | 72.4 | 62.7 | 92.4 | 83.5 | 76.7 | 89.3 | 60.5 | 76.0 |
| Wan 2.7 | 59.6 | 68.1 | 89.5 | 93.0 | 99.3 | 96.5 | 69.4 | 88.3 | 94.6 | 66.0 | 84.0 | 83.4 | 55.0 | 71.0 | 71.0 | 62.2 | 65.6 | 82.6 | 75.5 | 88.7 | 59.8 | 83.3 |
| Kairos 3.0 | 58.4 | 63.6 | 91.8 | 96.3 | 63.5 | 97.9 | 58.8 | 52.2 | 88.5 | 65.1 | 46.8 | 41.4 | 13.3 | 76.8 | 62.0 | 76.3 | 94.1 | 91.5 | 82.1 | 90.7 | 58.2 | 62.7 |
| LongCat-Video | 64.7 | 69.8 | 94.7 | 94.9 | 59.7 | 97.7 | 76.3 | 53.1 | 91.5 | 63.1 | 50.4 | 48.4 | 18.3 | 83.3 | 66.2 | 81.5 | 98.6 | 94.7 | 81.5 | 92.4 | 60.8 | 76.0 |
| YUME 1.5 | 59.3 | 65.7 | 92.0 | 94.8 | 86.1 | 97.7 | 62.0 | 53.1 | 91.7 | 72.0 | 57.8 | 47.0 | 16.7 | 71.5 | 71.4 | 48.0 | 99.3 | 91.1 | 84.1 | 89.4 | 58.1 | 72.7 |

🚀 Quick Start

# Install
git clone --recursive https://github.com/meituan-longcat/WBench.git
cd WBench

# If you already cloned without submodules
git submodule update --init --recursive

# Download data and weights
pip install -U huggingface_hub   # the hf CLI used below ships with recent huggingface_hub releases
hf download meituan-longcat/WBench --repo-type dataset --local-dir data/ --exclude "splits/*"
hf download meituan-longcat/WBench-weights --local-dir weights/

# Environment 1: wbench-main (all metrics except visual_plausibility)
# 2nd arg = PyTorch's CUDA build — match it to YOUR system (check via `nvcc --version`):
#   cu124 → CUDA 12.x    cu121 → CUDA 12.1    cu118 → CUDA 11.8
# Always pass it explicitly: if omitted, auto-detection falls back to cu118 when nvcc
# isn't on PATH, which makes the MegaSAM CUDA extensions fail to build on CUDA-12 machines.
bash tools/install.sh wbench-main cu124
conda activate wbench-main
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH



# Verify
conda activate wbench-main
python tools/verify_install.py

# Run evaluation (auto multi-GPU)
python main.py --model your_model

See docs/installation.md for detailed setup instructions.
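
Since the install comments ask you to match the CUDA tag to your system by hand, a small helper can turn the `nvcc --version` output into one of the three tags listed there. This is only a sketch of that mapping as read from the comment (12.1 maps to cu121, any other 12.x to cu124, 11.8 to cu118); `suggest_cuda_tag` is a hypothetical name and is not part of `tools/install.sh`.

```python
import re


def suggest_cuda_tag(nvcc_version_output: str) -> str:
    """Map the 'release X.Y' string printed by `nvcc --version` to a CUDA tag."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_version_output)
    if match is None:
        raise ValueError("could not find a 'release X.Y' string in the nvcc output")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) == (12, 1):
        return "cu121"
    if major == 12:
        return "cu124"  # assumed reading of "cu124 -> CUDA 12.x"
    if (major, minor) == (11, 8):
        return "cu118"
    raise ValueError(f"no tag listed for CUDA {major}.{minor}")


sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(suggest_cuda_tag(sample))  # prints "cu124"
```

Raising on an unlisted version, instead of guessing, mirrors the advice above to always pass the tag explicitly.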

🎮 Evaluate Your Model

Set environment variables for VLM metrics first (we use Doubao-Seed-2.0-lite via Volcengine ARK):

export VLM_API_KEY="<your-ark-api-key>"
# Optional (defaults shown):
# export VLM_API_URL="https://ark.cn-beijing.volces.com/api/v3"
# export VLM_MODEL_NAME="doubao-seed-2-0-lite-260215"
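
A pre-flight check can catch a missing key before a long run reaches the VLM phase. The sketch below resolves the three variables with the defaults shown above; `resolve_vlm_config` is a hypothetical helper, and how `main.py` actually reads these variables is not shown in this README.

```python
import os

# Defaults copied from the commented-out exports above.
VLM_DEFAULTS = {
    "VLM_API_URL": "https://ark.cn-beijing.volces.com/api/v3",
    "VLM_MODEL_NAME": "doubao-seed-2-0-lite-260215",
}


def resolve_vlm_config(env=None):
    """Return the VLM settings, failing early if the API key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("VLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("VLM_API_KEY is not set; export it before running the vlm phase")
    # Fall back to the documented default when a variable is unset or empty.
    config = {name: env.get(name) or default for name, default in VLM_DEFAULTS.items()}
    config["VLM_API_KEY"] = api_key
    return config
```

Calling `resolve_vlm_config()` with no argument reads the real environment; passing a plain dict makes the check easy to test.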

  1. Generate multi-turn videos → place in work_dirs/<model>/videos/case_{id}_combined.mp4
  2. Run the 4-phase pipeline:

# Full pipeline (precompute → GPU metrics → VLM metrics → report)
python main.py --model my_model --gpus 0,1,2,3,4,5,6,7

# Or run phases independently:
python main.py --model my_model --phase precompute    # SAM2 + DA3 + MegaSAM
python main.py --model my_model --phase gpu           # GPU metrics (per-metric)
python main.py --model my_model --phase vlm           # VLM metrics (API)
python main.py --model my_model --phase report        # Aggregate report

Note: the pipeline above covers 21 of the 22 metrics. The exception is visual_plausibility, which runs in the separate wbench-vp environment (its setup is not shown in the Quick Start above; see docs/installation.md):

conda activate wbench-vp
python tools/run_visual_plausibility.py --model my_model  # uses all available GPUs

  3. Results: work_dirs/<model>/evaluation/{metric}/case_{id}.json + report.json

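
Given that layout, it is easy to check which per-case result files are still missing before trusting an aggregate report. The sketch below relies only on the path pattern stated above; `missing_results` is a hypothetical helper, and it deliberately does not open the JSON files because their schema is not described here.

```python
from pathlib import Path


def missing_results(model_dir, metrics, case_ids):
    """Return {metric: [case ids without a result file]} for one model directory."""
    eval_dir = Path(model_dir) / "evaluation"
    missing = {}
    for metric in metrics:
        absent = [
            case_id
            for case_id in case_ids
            if not (eval_dir / metric / f"case_{case_id}.json").is_file()
        ]
        if absent:
            missing[metric] = absent
    return missing
```

For example, `missing_results("work_dirs/my_model", ["hpsv3_quality"], range(1, 290))` would list the cases that still lack an hpsv3_quality result, assuming case ids run from 1 to 289 (the id range is an assumption, not stated in this README).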
# Run specific metrics (by name or dimension)
python main.py --model my_model --phase gpu --metrics hpsv3_quality
python main.py --model my_model --phase gpu --metrics quality         # all 6 video quality metrics
python main.py --model my_model --phase gpu --metrics consistency     # all consistency metrics

# Skip pre-computation if already done
python main.py --model my_model --phase gpu --skip_megasam --skip_sam2 --skip_da3

# Single video evaluation
python main.py --video video.mp4 --case data/cases/case_1.json

Dimensions (--metrics supports these as shorthand):

| Dimension | Metrics |
| --- | --- |
| quality | aesthetic_quality, imaging_quality, temporal_flickering, dynamic_degree, motion_smoothness, hpsv3_quality |
| consistency | background_consistency, segment_continuity, perspective_consistency, subject_consistency, geometric_consistency, photometric_consistency, spatial_consistency, gated_spatial_consistency |
| interaction | navigation_trajectory, event_edit_adherence, subject_action_adherence, perspective_switch_adherence |
| setting | scene_adherence, subject_adherence |
| physical | visual_plausibility, causal_fidelity |
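
The shorthand can be pictured as a lookup from dimension name to metric names. The sketch below copies the table above into a dictionary and expands a mixed list of dimensions and metric names; `expand_metrics` is illustrative only and is not claimed to be how `main.py` parses `--metrics`.

```python
# Dimension -> metric names, copied from the table above.
DIMENSIONS = {
    "quality": [
        "aesthetic_quality", "imaging_quality", "temporal_flickering",
        "dynamic_degree", "motion_smoothness", "hpsv3_quality",
    ],
    "consistency": [
        "background_consistency", "segment_continuity", "perspective_consistency",
        "subject_consistency", "geometric_consistency", "photometric_consistency",
        "spatial_consistency", "gated_spatial_consistency",
    ],
    "interaction": [
        "navigation_trajectory", "event_edit_adherence",
        "subject_action_adherence", "perspective_switch_adherence",
    ],
    "setting": ["scene_adherence", "subject_adherence"],
    "physical": ["visual_plausibility", "causal_fidelity"],
}


def expand_metrics(names):
    """Expand dimension shorthands to metric names, keeping order and dropping repeats."""
    expanded = []
    for name in names:
        for metric in DIMENSIONS.get(name, [name]):
            if metric not in expanded:
                expanded.append(metric)
    return expanded


print(len(expand_metrics(list(DIMENSIONS))))  # prints 22
```

The five lists add up to the 22 metrics quoted in the TL;DR (6 + 8 + 4 + 2 + 2).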

🔌 Implement Your Model

WBench supports 3 model types with different control interfaces:

| Type | Input | Cases | Status |
| --- | --- | --- | --- |
| Text-conditioned | Text prompt + first-frame image | 289 (all) | ✅ Implemented |
| Camera-conditioned | First-frame image + 6-DoF camera pose | 158 (navigation) | ✅ Implemented |
| Action-conditioned | First-frame image + discrete action | 158 (navigation) | ✅ Implemented |

Text-conditioned models

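import json

from src.models import get_model

# Load one benchmark case (same path as the single-video example above)
with open("data/cases/case_1.json") as f:
    case_dict = json.load(f)

# Available: wan, kling, seedance (or register your own)
model = get_model("wan")

# Generate multi-turn video from a case
result = model.generate_multi_turn(
    case=case_dict,
    output_path="work_dirs/wan/videos/case_1_combined.mp4",
    data_root="data/",
)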

Each turn: build prompt from interaction → call I2V API → extract last frame → next turn.
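
The per-turn loop described above can be sketched independently of any particular video API. In this sketch `i2v` is a hypothetical callable that takes a conditioning frame and a prompt and returns a list of frames; the function name, the frame representation, and the plain concatenation of clips are assumptions, not the repo's implementation.

```python
def generate_multi_turn(first_frame, prompts, i2v):
    """Chain image-to-video calls: each turn is conditioned on the previous clip's last frame."""
    frames = []
    conditioning = first_frame
    for prompt in prompts:
        clip = i2v(conditioning, prompt)  # one I2V call per interaction turn
        if not clip:
            raise ValueError(f"I2V returned no frames for prompt: {prompt!r}")
        frames.extend(clip)
        conditioning = clip[-1]  # the last frame seeds the next turn
    return frames
```

Whether the combined video should repeat or drop the seam frame between turns is not specified here, so the sketch simply concatenates the clips.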

Set API credentials:

export VIDEO_API_URL="https://your-video-api.com"
export VIDEO_API_KEY="your-key"

Camera-conditioned models

The benchmark's navigation actions (W/A/S/D + arrows) are converted to per-turn {move, yaw, pitch} intent and then to a 6-DoF camera trajectory. Subclass CameraConditionedModel and implement one hook — case parsing, action→pose conversion, and video writing are handled for you:

from src.models.camera import CameraConditionedModel

class MyWorldModel(CameraConditionedModel):
    def generate_with_poses(self, image, poses, video_length, **kw):
        # image: first-frame path; poses: {"<latent_idx>": {"extrinsic": 4x4, "K": 3x3}, ...}
        # return: list of `video_length` BGR uint8 frames
        return my_model.infer(image, poses, video_length)

MyWorldModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

The pose convention (axes, speeds, intrinsics) lives in src/models/camera/poses.py — copy and adapt it to your model; the navigation metric normalises scale, so what matters is matching the per-action intent. Quick look at one case:

python -m src.models.camera.demo --case data/cases/case_1.json   # prints poses + renders a preview
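
To make the action → intent → pose chain concrete, here is a self-contained sketch that turns one turn's keys into a {move, yaw, pitch} intent and integrates it into per-frame 4x4 camera-to-world matrices. Everything specific in it is an assumption for illustration: the key names, the axis convention (+z forward, +x right), the step size, and the turn rate. The benchmark's real convention lives in src/models/camera/poses.py, and this sketch produces only the extrinsic part, not the intrinsics K.

```python
import math

# Hypothetical key -> intent table; check src/models/camera/poses.py for the real one.
KEY_INTENT = {
    "W": {"move": (0.0, 0.0, 1.0)},
    "S": {"move": (0.0, 0.0, -1.0)},
    "A": {"move": (-1.0, 0.0, 0.0)},
    "D": {"move": (1.0, 0.0, 0.0)},
    "LEFT": {"yaw": -1.0},
    "RIGHT": {"yaw": 1.0},
    "UP": {"pitch": 1.0},
    "DOWN": {"pitch": -1.0},
}


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def turn_intent(keys):
    """Sum the per-key intents of one turn into a single {move, yaw, pitch}."""
    intent = {"move": [0.0, 0.0, 0.0], "yaw": 0.0, "pitch": 0.0}
    for key in keys:
        entry = KEY_INTENT[key]
        for axis, value in enumerate(entry.get("move", (0.0, 0.0, 0.0))):
            intent["move"][axis] += value
        intent["yaw"] += entry.get("yaw", 0.0)
        intent["pitch"] += entry.get("pitch", 0.0)
    return intent


def trajectory(keys, n_frames, step=0.05, turn_deg=1.0):
    """Integrate one turn's intent into n_frames camera-to-world 4x4 matrices."""
    intent = turn_intent(keys)
    yaw = math.radians(turn_deg * intent["yaw"])
    pitch = math.radians(turn_deg * intent["pitch"])
    cy, sy, cp, sp = math.cos(yaw), math.sin(yaw), math.cos(pitch), math.sin(pitch)
    rot = matmul([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]],   # yaw about y
                 [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])   # pitch about x
    tx, ty, tz = (step * m for m in intent["move"])
    delta = [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    poses = []
    for _ in range(n_frames):
        pose = matmul(pose, delta)  # apply the per-frame motion in the camera's own frame
        poses.append(pose)
    return poses
```

Because the navigation metric normalises scale, the absolute `step` matters less than keeping the direction of each action consistent with its intent.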

Note: Camera/action models only cover the 158 navigation cases (cases containing at least one W/A/S/D/arrow action). When generating at scale, pass only those cases — e.g. via generate.py --model your_model --cases <navi_list>.
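
Selecting the navigation subset amounts to keeping the cases that contain at least one navigation key. The sketch below assumes you have already flattened each case into a list of action tokens; the arrow-key token names and the `case_actions` mapping are hypothetical, since the case JSON schema is not shown in this README.

```python
# W/A/S/D come from the note above; the arrow-key token names are hypothetical.
NAV_TOKENS = {"W", "A", "S", "D", "UP", "DOWN", "LEFT", "RIGHT"}


def navigation_case_ids(case_actions):
    """Return the sorted ids of cases that contain at least one navigation token."""
    return sorted(
        case_id
        for case_id, tokens in case_actions.items()
        if NAV_TOKENS & set(tokens)
    )


example = {1: ["W", "open the door"], 2: ["wave at the camera"], 3: ["LEFT"]}
print(navigation_case_ids(example))  # prints [1, 3]
```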

Action-conditioned models

Two flavours, both fed from the same per-turn navigation plan:

Programmatic controllers (e.g. Matrix-Game-3). Subclass ActionConditionedModel and implement generate_with_actions. Each action carries both raw key tokens and an MG3-style {keyboard, mouse} tensor:

from src.models.action import ActionConditionedModel

class MyActionModel(ActionConditionedModel):
    def generate_with_actions(self, image, actions, video_length, **kw):
        # actions: [{"turn", "tokens", "keyboard", "mouse", "duration"}, ...]
        return my_model.infer(image, actions, video_length)

MyActionModel("mymodel").generate_multi_turn(case_dict,
    "work_dirs/mymodel/videos/case_1_combined.mp4", data_root="data/")

Quick look at one case:

python -m src.models.action.demo --case data/cases/case_1.json   # prints actions + renders a preview

Web products (e.g. Project Genie, Happy Oyster) — no weights/API; driven by browser automation + simulated keystrokes. See src/models/action/web/.

🤖 Claude Code Skills

If you use Claude Code, this repo ships skills that drive the full workflow — just ask in natural language and Claude runs the right commands:

| Skill | Triggers on | What it does |
| --- | --- | --- |
| wbench-generate | "generate kling videos" | Runs generate.py over the dataset → work_dirs/<model>/videos/ |
| wbench-evaluate | "evaluate kling3" | Runs the 4-phase main.py pipeline (precompute → gpu → vlm → report) |
| wbench-submit | "package my model for submission" | Builds the meta.json / turns.json bundle and uploads to HuggingFace |
| genie3 / happy | "run case_5 on genie3" | Browser automation for the web products (see src/models/action/web/) |

Skills live in .claude/skills/ (and src/models/action/web/.claude/skills/) and are auto-discovered when you open the repo in Claude Code.

📋 TODO

  • Text-conditioned model generation (Wan, Kling, Seedance)
  • Homepage with interactive leaderboard
  • Dataset and weights release on HuggingFace
  • Camera-conditioned model generation example
  • Action-conditioned model generation example
  • Hosted submission & evaluation service (submit videos, get scores)
  • ArXiv paper release

📝 Citation

If you find our work useful, please consider citing:

@article{ying2026wbenchcomprehensivemultiturnbenchmark,
  title={WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation},
  author={Ying, Kaining and Hu, Hengrui and Ren, Siyu and Li, Jiamu and Chen, Fengjiao and Wang, Ziwen and Cao, Xuezhi and Cai, Xunliang and Ding, Henghui},
  journal={arXiv preprint arXiv:2605.25874},
  year={2026}
}

🙏 Acknowledgement

This project builds upon the following excellent works:

  • WorldScore — World model evaluation framework
  • VBench — Video quality metrics
  • SAM2 — Segment Anything Model 2 for mask tracking
  • Depth-Anything-V3 — Monocular depth estimation
  • MegaSAM — Camera pose estimation
  • DreamSim — Perceptual similarity metric
  • HPSv3 — Human Preference Score
  • AMT — Frame interpolation for motion smoothness
  • RAFT — Optical flow estimation
  • TransNetV2 — Scene boundary detection
  • ... and many other excellent open-source projects

📧 Contact

Feel free to open an Issue or Pull Request. You can also reach us directly:

  • Kaining Ying: kaining.ying.cv@gmail.com
  • Siyu Ren: rensiyu07@meituan.com

📄 License

Code and data: MIT License. Model weights retain their original licenses.