Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Act2Answer is an embodied evaluation protocol for testing whether Vision-Language-Action (VLA) models retain commonsense and world knowledge after robotics adaptation. Instead of asking a model to answer in text, each VLM-style question becomes a short tabletop episode: the agent reads a natural-language instruction and answers by placing a cube on the image tile it believes is correct.

The motor problem is kept deliberately simple, so that failures point to missing, forgotten, or action-inaccessible knowledge rather than to long-horizon control difficulty.

✨ Overview

Act2Answer adapts established VLM benchmarks into an embodied binary-choice format. Each task has a short action-compatible instruction, two visual answer options, and a common selection action in simulation. The benchmark focuses on knowledge categories that matter for everyday embodied agents: social, physical, quantitative, temporal, normative, cultural, and biological knowledge.

📚 Act2Answer

The Act2Answer suite contains 1,720 unique binary questions. Each question is evaluated in both its original and its swapped layout, giving 3,440 evaluation episodes. The suite covers 12 categories adapted from five source benchmarks.
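
The doubling from questions to episodes can be sketched as follows. This is an illustration only, not the repository's code: the `Episode` fields, the `expand_layouts` helper, and the example instruction and tile names are all hypothetical, and it assumes that "swapped" means the two answer tiles exchange positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # All field names here are hypothetical, chosen for illustration.
    question_id: str
    instruction: str
    left_tile: str      # image shown on the left tile
    right_tile: str     # image shown on the right tile
    correct_side: str   # "left" or "right"
    layout: str         # "noswap" or "swap"


_OTHER_SIDE = {"left": "right", "right": "left"}


def expand_layouts(question_id, instruction, left_tile, right_tile, correct_side):
    """Return the original and the swapped episode for one binary question."""
    if correct_side not in _OTHER_SIDE:
        raise ValueError("correct_side must be 'left' or 'right'")
    original = Episode(question_id, instruction, left_tile, right_tile,
                       correct_side, "noswap")
    # Swapping exchanges the two tiles, so the correct side flips as well.
    swapped = Episode(question_id, instruction, right_tile, left_tile,
                      _OTHER_SIDE[correct_side], "swap")
    return [original, swapped]


episodes = expand_layouts("q0001", "put the cube on the red tile",
                          "red.png", "blue.png", "left")
print([(e.layout, e.correct_side) for e in episodes])
# [('noswap', 'left'), ('swap', 'right')]
```

Applied to 1,720 questions, this expansion yields the 3,440 episodes quoted above.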

🔍 Key Findings

Current VLAs usually preserve simple perceptual distinctions such as Color and Shape.
Richer semantic categories are much harder: Emotion, Attribute, State, Time, Counting, Symmetry, Traffic, Public Info, Celebrity, and Living World often stay near chance (50% in this binary-choice format) for many models.
Strong VLM baselines can outperform their VLA counterparts by roughly 20-40 points on many knowledge-sensitive categories, suggesting a substantial VLM-to-VLA gap.
Layerwise probing shows that answer-relevant information often remains recoverable in intermediate backbone layers, but weakens near the layers used for action prediction.
VLA models trained with continued vision-language supervision tend to do better on knowledge-sensitive tasks than models trained mainly on robotics data.
Downstream action fine-tuning can improve control while further weakening some forms of knowledge-sensitive behavior.
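
To make "near chance" concrete, the sketch below computes per-category accuracy and shows that a policy which ignores the instruction and always picks the left tile scores exactly 50% once every question is run in both layouts. The record format and helper names are assumptions made for this example; they are not the benchmark's actual result schema.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, success) pairs, one pair per episode."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, success in records:
        totals[category] += 1
        hits[category] += int(success)
    return {c: hits[c] / totals[c] for c in totals}


def always_left_records(category, correct_sides):
    """Simulate a policy that always selects the left tile.

    correct_sides lists, per question, where the correct tile sits in the
    original layout; the swapped layout moves it to the other side.
    """
    records = []
    for side in correct_sides:
        swapped_side = "right" if side == "left" else "left"
        records.append((category, side == "left"))          # noswap episode
        records.append((category, swapped_side == "left"))  # swap episode
    return records


records = always_left_records("Emotion", ["left", "left", "right"])
print(accuracy_by_category(records))  # {'Emotion': 0.5}
```

Because each question contributes one success and one failure for such a policy, a position bias alone cannot lift accuracy above chance.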

⚙️ Installation

Prerequisites:

Linux with an NVIDIA GPU and working CUDA driver.
Conda or Miniconda, installed at /opt/conda or at the location given by CONDA_ROOT.
git, tmux, and enough disk space for model weights.
Optional but recommended: run huggingface-cli login before the first model download.

Clone the repository and external model repos:

git clone <this-repo-url> Act2Answer
cd Act2Answer

bash scripts/setup/clone_external_repos.sh

Build the conda environment for the model you want to run. For example, SpatialVLA:

bash scripts/setup/setup_spatialvla_env.sh

Full setup instructions for all supported model stacks are in SETUP_README.md.

📈 Evaluation

Every evaluation wrapper sources scripts/env.sh, activates the expected conda environment, runs both the noswap and swap layouts, and writes FINAL_STATS to $A2A_LOG_DIR/<model>_<asset>_eval.log.
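
A small helper for locating a log and pulling out its summary lines might look like the sketch below. It assumes the marker string FINAL_STATS appears literally on the relevant lines and that the `<model>` part of the file name matches the wrapper name (for example `pi0`); the rest of each line's format is not specified here, so the lines are returned unparsed. Both function names are hypothetical.

```python
import os
from pathlib import Path


def eval_log_path(model, asset, log_dir=None):
    """Build <log_dir>/<model>_<asset>_eval.log, defaulting to $A2A_LOG_DIR or logs/."""
    base = log_dir or os.environ.get("A2A_LOG_DIR", "logs")
    return Path(base) / f"{model}_{asset}_eval.log"


def final_stats_lines(log_path):
    """Return every line of the log that contains the FINAL_STATS marker."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if "FINAL_STATS" in line]


if __name__ == "__main__":
    for line in final_stats_lines(eval_log_path("pi0", "test_colors")):
        print(line)
```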

In-process models:

ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_pi0.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_magma.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_openvla.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_spatialvla.sh

Server-based models need their policy server running first:

GPU=0 bash scripts/servers/run_xiaomi_policy_server.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_xiaomi.sh

Available evaluation wrappers:

| VLA | Script | Environment |
| --- | --- | --- |
| pi0 | `scripts/eval_pi0.sh` | `pi0_act2answer` |
| Magma | `scripts/eval_magma.sh` | `magma_act2answer` |
| OpenVLA | `scripts/eval_openvla.sh` | `openvla_rl4vla` |
| SpatialVLA | `scripts/eval_spatialvla.sh` | `spatialvla_act2answer` |
| Xiaomi-Robotics-0 | `scripts/eval_xiaomi.sh` | server `mibot`, client `act2ans` |
| InternVLA-M1 | `scripts/eval_internvla.sh` | server `internvla`, client `act2ans` |
| MolmoAct2 | `scripts/eval_molmoact.sh` | server `molmoact2`, client `act2ans` |
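
If you prefer to launch the wrappers from Python, for instance to sweep several asset sets, one possible approach is sketched below. The script paths and the ASSETS, COUNT, and EVAL_GPU variables come from this README; the dictionary keys, the `SERVER_BASED` set, and the `build_eval_invocation` helper are hypothetical conveniences, not part of the repository.

```python
import os
import subprocess

# Wrapper scripts as listed in the table above; the short keys are our own.
EVAL_SCRIPTS = {
    "pi0": "scripts/eval_pi0.sh",
    "magma": "scripts/eval_magma.sh",
    "openvla": "scripts/eval_openvla.sh",
    "spatialvla": "scripts/eval_spatialvla.sh",
    "xiaomi": "scripts/eval_xiaomi.sh",
    "internvla": "scripts/eval_internvla.sh",
    "molmoact": "scripts/eval_molmoact.sh",
}

# These wrappers expect a policy server to be running already.
SERVER_BASED = {"xiaomi", "internvla", "molmoact"}


def build_eval_invocation(model, assets, count=0, gpu=0):
    """Return (argv, env) for one wrapper run without executing it."""
    if model not in EVAL_SCRIPTS:
        raise KeyError(f"unknown model: {model}")
    env = dict(os.environ, ASSETS=assets, COUNT=str(count), EVAL_GPU=str(gpu))
    return ["bash", EVAL_SCRIPTS[model]], env


def run_eval(model, assets, count=0, gpu=0):
    """Run one wrapper from the repository root and raise if it fails."""
    argv, env = build_eval_invocation(model, assets, count, gpu)
    if model in SERVER_BASED:
        print(f"note: {model} requires its policy server to be started first")
    return subprocess.run(argv, env=env, check=True)
```

Calling `run_eval("pi0", "test_colors", count=6, gpu=3)` would then be equivalent to the first in-process command shown earlier.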

You can also run the combined helper:

bash scripts/eval_all3.sh test_colors 6 3

Assets

Act2Answer assets live under:

ManiSkill/mani_skill/assets/carrot/<asset_name>/

Each asset set contains pairs.json, tile models/textures, and metadata. Use ASSETS=<asset_name> and COUNT=<n> to select an evaluation slice. COUNT=0 means all tasks.
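
The sketch below lists the asset sets that ship a pairs.json and mirrors the COUNT convention. It deliberately makes no assumption about the schema inside pairs.json and simply returns the parsed JSON. The helper names are hypothetical, and "keep the first n tasks" is an assumed reading of how COUNT slices the task list.

```python
import json
from pathlib import Path

ASSET_ROOT = Path("ManiSkill/mani_skill/assets/carrot")


def list_asset_sets(root=ASSET_ROOT):
    """Return names of sub-directories under root that contain a pairs.json."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if (p / "pairs.json").is_file())


def load_pairs(asset_name, root=ASSET_ROOT):
    """Parse pairs.json for one asset set; the structure is returned as-is."""
    with (Path(root) / asset_name / "pairs.json").open(encoding="utf-8") as fh:
        return json.load(fh)


def take_count(tasks, count):
    """COUNT=0 keeps all tasks; otherwise keep the first `count` (assumed order)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    tasks = list(tasks)
    return tasks if count == 0 else tasks[:count]


if __name__ == "__main__":
    print(list_asset_sets())
```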

Outputs

Evaluation creates local files only:

Videos and per-run YAML: $A2A_OUTPUT_DIR/<run-name>/glob/ (default: outputs/).
Logs: $A2A_LOG_DIR/ (default: logs/).
No wandb initialization, runs, or artifacts are created by evaluation.
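
To check what a run actually produced, a quick inventory of its output directory can help. The sketch below resolves the directory from A2A_OUTPUT_DIR (falling back to outputs/) and counts files by extension; it does not assume particular video or YAML file names, and both helper names are hypothetical.

```python
import os
from collections import Counter
from pathlib import Path


def run_output_dir(run_name, output_dir=None):
    """Build <output_dir>/<run_name>/glob, defaulting to $A2A_OUTPUT_DIR or outputs/."""
    base = output_dir or os.environ.get("A2A_OUTPUT_DIR", "outputs")
    return Path(base) / run_name / "glob"


def count_files_by_suffix(directory):
    """Count files under directory (recursively), grouped by lower-cased suffix."""
    directory = Path(directory)
    if not directory.is_dir():
        return Counter()
    return Counter(
        p.suffix.lower() or "<none>"
        for p in directory.rglob("*")
        if p.is_file()
    )
```

For example, `count_files_by_suffix(run_output_dir("<run-name>"))` returns a mapping from extension to file count for that run.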

❤️ Citation

If you find Act2Answer useful, please cite our paper:

@misc{kachaev2026doesvlaknowbasics,
  title={Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models},
  author={Nikita Kachaev and Andrey Moskalenko and Matvey Skripkin and Nikita Kurlaev and Daria Pugacheva and Albina Burlova and Mikhail Kolosov and Denis Shepelev and Andrey Kuznetsov and Elena Tutubalina and Aleksandr I. Panov and Alexey K. Kovalev and Vlad Shakhuro},
  year={2026},
  eprint={2606.19297},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19297}
}

Acknowledgements

Act2Answer builds on SimplerEnv and ManiSkill, with evaluation harness pieces derived from RL4VLA. The README structure follows the public BlindVLA project style.
