Act2Answer is an embodied evaluation protocol for testing whether Vision-Language-Action (VLA) models retain commonsense and world knowledge after robotics adaptation. Instead of asking a model to answer in text, each VLM-style question becomes a short tabletop episode: the agent reads a natural-language instruction and answers by placing a cube on the image tile it believes is correct.
The motor problem is kept deliberately simple, so that failures point to missing, forgotten, or action-inaccessible knowledge rather than to long-horizon control difficulty.
Act2Answer adapts established VLM benchmarks into an embodied binary-choice format. Each task has a short action-compatible instruction, two visual answer options, and a common selection action in simulation. The benchmark focuses on knowledge categories that matter for everyday embodied agents: social, physical, quantitative, temporal, normative, cultural, and biological knowledge.
The Act2Answer suite contains 1,720 unique binary questions and 3,440 evaluation episodes after including the original and swapped layouts. It covers 12 categories adapted from five source benchmarks.
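
The episode count follows directly from pairing each question with both layouts. The sketch below reproduces that arithmetic and illustrates one plausible reason for evaluating both layouts: assuming the swapped layout exchanges the positions of the two tiles, a policy that always picks the same side lands exactly at chance. The alternating left/right placement is a hypothetical stand-in, not the actual asset data.

```shell
#!/usr/bin/env bash
# Sketch: expand binary questions into noswap/swap episodes and score a
# position-biased "always pick left" policy. Placement here is hypothetical.
set -eu

num_questions=1720      # unique binary questions (from the text above)
episodes=0
left_policy_correct=0

q=0
while [ "$q" -lt "$num_questions" ]; do
  # Hypothetical placement: alternate which side holds the correct tile
  # in the original (noswap) layout.
  if [ $((q % 2)) -eq 0 ]; then correct_side=left; else correct_side=right; fi

  for layout in noswap swap; do
    side=$correct_side
    # Assumption: the swapped layout mirrors the two tiles.
    if [ "$layout" = swap ]; then
      if [ "$side" = left ]; then side=right; else side=left; fi
    fi
    episodes=$((episodes + 1))
    # The biased policy is right only when the correct tile is on the left.
    if [ "$side" = left ]; then
      left_policy_correct=$((left_policy_correct + 1))
    fi
  done
  q=$((q + 1))
done

echo "episodes: $episodes"
echo "always-left accuracy: $((100 * left_policy_correct / episodes))%"
```

Whatever the original placement, each question has the correct tile on the left in exactly one of its two layouts, so the biased policy scores 50% over the 3,440 episodes.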
- Current VLAs usually preserve simple perceptual distinctions such as Color and Shape.
- Richer semantic categories are much harder: Emotion, Attribute, State, Time, Counting, Symmetry, Traffic, Public Info, Celebrity, and Living World often remain near chance for many models.
- Strong VLM baselines can outperform their VLA counterparts by roughly 20-40 points on many knowledge-sensitive categories, suggesting a substantial VLM-to-VLA gap.
- Layerwise probing shows that answer-relevant information often remains recoverable in intermediate backbone layers, but weakens near the layers used for action prediction.
- VLA models trained with continued vision-language supervision tend to do better on knowledge-sensitive tasks than models trained mainly on robotics data.
- Downstream action fine-tuning can improve control while further weakening some forms of knowledge-sensitive behavior.
Prerequisites:
- Linux with an NVIDIA GPU and working CUDA driver.
- Conda or Miniconda available at /opt/conda, or configured with CONDA_ROOT.
- git, tmux, and enough disk for model weights.
- Optional but recommended: huggingface-cli login before the first model download.
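
A quick pre-flight check for the prerequisites above might look like the following. This is a minimal sketch, not part of the repository: the function name a2a_check_prereqs is hypothetical, nvidia-smi is used as a proxy for a working NVIDIA driver, and the check assumes the standard Conda layout with the conda executable under bin/.

```shell
#!/usr/bin/env bash
# Sketch: verify that the tools listed under Prerequisites are present.
# a2a_check_prereqs is a hypothetical helper name, not a repository script.

a2a_check_prereqs() {
  missing=0

  # git and tmux come from the prerequisites list; nvidia-smi ships with
  # the NVIDIA driver and serves here as a proxy for "driver installed".
  for tool in git tmux nvidia-smi; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done

  # Conda is expected at /opt/conda unless CONDA_ROOT points elsewhere.
  conda_root="${CONDA_ROOT:-/opt/conda}"
  if [ ! -x "$conda_root/bin/conda" ]; then
    echo "missing: conda under $conda_root" >&2
    missing=1
  fi

  return "$missing"
}
```

Calling a2a_check_prereqs before running any setup script returns a non-zero status and lists what is missing on stderr. It does not check free disk space or Hugging Face login state.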
Clone the repository and external model repos:
git clone <this-repo-url> Act2Answer
cd Act2Answer
bash scripts/setup/clone_external_repos.sh
Build the conda environment for the model you want to run. For example, SpatialVLA:
bash scripts/setup/setup_spatialvla_env.sh
Full setup instructions for all supported model stacks are in SETUP_README.md.
Every evaluation wrapper sources scripts/env.sh, activates the expected conda
environment, runs both noswap and swap, and writes FINAL_STATS to
$A2A_LOG_DIR/<model>_<asset>_eval.log.
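
To pull the summary back out of a log, something like the sketch below may be enough. The helper names are hypothetical, the fallback to logs/ mirrors the documented default for $A2A_LOG_DIR, and the exact model token used in the file name (for example spatialvla) is an assumption that should be checked against the wrapper in question. The format of the FINAL_STATS line itself is not assumed; the helper only matches the literal string.

```shell
#!/usr/bin/env bash
# Sketch: locate an evaluation log and print its FINAL_STATS lines.
# a2a_log_path and a2a_final_stats are hypothetical helper names.

# Build the path from the pattern $A2A_LOG_DIR/<model>_<asset>_eval.log.
a2a_log_path() {
  printf '%s/%s_%s_eval.log\n' "${A2A_LOG_DIR:-logs}" "$1" "$2"
}

# Print every line that mentions FINAL_STATS; return 1 if the log is missing.
a2a_final_stats() {
  log="$(a2a_log_path "$1" "$2")"
  if [ ! -f "$log" ]; then
    echo "no log at $log" >&2
    return 1
  fi
  grep -F 'FINAL_STATS' "$log"
}
```

For example, a2a_final_stats spatialvla test_colors would read logs/spatialvla_test_colors_eval.log when A2A_LOG_DIR is unset.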
In-process models:
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_pi0.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_magma.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_openvla.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_spatialvla.sh
Server-based models need their policy server first:
GPU=0 bash scripts/servers/run_xiaomi_policy_server.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_xiaomi.sh
Available evaluation wrappers:
| VLA | Script | Environment |
|---|---|---|
| pi0 | scripts/eval_pi0.sh | pi0_act2answer |
| Magma | scripts/eval_magma.sh | magma_act2answer |
| OpenVLA | scripts/eval_openvla.sh | openvla_rl4vla |
| SpatialVLA | scripts/eval_spatialvla.sh | spatialvla_act2answer |
| Xiaomi-Robotics-0 | scripts/eval_xiaomi.sh | server mibot, client act2ans |
| InternVLA-M1 | scripts/eval_internvla.sh | server internvla, client act2ans |
| MolmoAct2 | scripts/eval_molmoact.sh | server molmoact2, client act2ans |
You can also run the combined helper:
bash scripts/eval_all3.sh test_colors 6 3
Act2Answer assets live under:
ManiSkill/mani_skill/assets/carrot/<asset_name>/
Each asset set contains pairs.json, tile models/textures, and metadata. Use ASSETS=<asset_name>
and COUNT=<n> to select an evaluation slice. COUNT=0 means all tasks.
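
The asset layout and the ASSETS/COUNT variables can be combined into a small sweep, sketched below under stated assumptions. A2A_ASSET_ROOT, a2a_list_assets, a2a_sweep, and DRY_RUN are hypothetical names introduced for illustration. The sketch treats any directory that contains pairs.json as an asset set, defaults EVAL_GPU to 0 as a guess, and prints commands instead of running them unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: discover asset sets and run (or print) one wrapper per asset set.
# All helper and variable names below are hypothetical.

A2A_ASSET_ROOT="${A2A_ASSET_ROOT:-ManiSkill/mani_skill/assets/carrot}"

# Print the name of every directory under the asset root with a pairs.json.
a2a_list_assets() {
  for manifest in "$A2A_ASSET_ROOT"/*/pairs.json; do
    [ -f "$manifest" ] || continue   # the glob matched nothing
    basename "$(dirname "$manifest")"
  done
}

# Usage: a2a_sweep <wrapper-script> <asset_name>...
# COUNT defaults to 0 (all tasks); EVAL_GPU defaults to 0 (an assumption).
a2a_sweep() {
  wrapper="$1"
  shift
  for asset in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "ASSETS=$asset COUNT=${COUNT:-0} EVAL_GPU=${EVAL_GPU:-0} bash $wrapper"
    else
      ASSETS="$asset" COUNT="${COUNT:-0}" EVAL_GPU="${EVAL_GPU:-0}" bash "$wrapper"
    fi
  done
}
```

From the repository root, a2a_sweep scripts/eval_spatialvla.sh $(a2a_list_assets) prints one command per asset set; setting DRY_RUN=0 executes them sequentially on the same GPU.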
Evaluation creates local files only:
- Videos and per-run YAML: $A2A_OUTPUT_DIR/<run-name>/glob/ (default: outputs/).
- Logs: $A2A_LOG_DIR/ (default: logs/).
- No wandb initialization, runs, or artifacts are created by evaluation.
If you find Act2Answer useful, please cite our paper:
@misc{kachaev2026doesvlaknowbasics,
title={Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models},
author={Nikita Kachaev and Andrey Moskalenko and Matvey Skripkin and Nikita Kurlaev and Daria Pugacheva and Albina Burlova and Mikhail Kolosov and Denis Shepelev and Andrey Kuznetsov and Elena Tutubalina and Aleksandr I. Panov and Alexey K. Kovalev and Vlad Shakhuro},
year={2026},
eprint={2606.19297},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.19297}
}
Act2Answer builds on SimplerEnv and ManiSkill, with evaluation harness pieces derived from RL4VLA. The README structure follows the public BlindVLA project style.



