This repository contains all supplementary assets submitted with the paper, including code, human annotation study materials, and links to data used in support of the experiments and findings reported in the main manuscript.
├── code/ # Source code
├── dataset/README-dataset.md # Dataset description (dataset itself hosted externally)
├── human-annotation-study/ # Human annotation study materials
├── LICENSE.txt # Licensing terms for included assets
└── README.md # This file
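As a quick sanity check after unpacking, the layout above can be verified with a short script. This is only a minimal sketch: the `EXPECTED` list and the `missing_entries` helper are hypothetical names introduced here for illustration, not part of the released codebase. It assumes the script is run from the repository root.

```python
from pathlib import Path

# Entries copied from the layout listed above; a trailing "/" marks a directory.
# This list is an illustrative assumption, not a file shipped with the repository.
EXPECTED = [
    "code/",
    "dataset/README-dataset.md",
    "human-annotation-study/",
    "LICENSE.txt",
    "README.md",
]


def missing_entries(root="."):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    for entry in EXPECTED:
        path = root / entry
        # Directories and files are checked with the matching predicate.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    gaps = missing_entries()
    print("Layout OK" if not gaps else f"Missing: {gaps}")
```

The script checks only that the listed entries exist. It does not inspect their contents or the externally hosted data.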
We release the full codebase used to run all experiments in the paper, including training and evaluation of UNIVERSE, as well as the code used to obtain baseline results (both zero-shot and fine-tuned) for PaliGemma, VideoLLaMA3, and CLIP. The codebase includes configuration files, data loaders, training and evaluation scripts, and supporting utilities.
For full usage instructions, refer to code/README-code.md.
We release a subset of our evaluation dataset, curated from realistic human gameplay in a complex, multi-agent game environment.
Details on file structure and data formats are provided in dataset/README-dataset.md.
Note: Due to file size limitations, the dataset is hosted externally on Google Drive using a burner email account: link.
This directory contains rollouts used in our human annotation study, designed to assess the fine-grained evaluation accuracy of UNIVERSE on rollouts generated by world models. We provide a total of 656 rollouts, generated by two world models across seven diverse environments.
The annotation scores are currently under internal review and will be released upon approval. For the generation protocol and data breakdown, refer to human-annotation-study/README-human-annotation-study.md.
Note: Due to file size limitations, the rollouts are hosted externally on Google Drive using a burner email account: link.
All materials are included in the supplementary ZIP file submitted with the paper and will be publicly released upon publication.