CognitiveAISystems/Act2Answer

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Act2Answer overview and evaluation results

Act2Answer is an embodied evaluation protocol for testing whether Vision-Language-Action (VLA) models retain commonsense and world knowledge after robotics adaptation. Instead of asking a model to answer in text, each VLM-style question becomes a short tabletop episode: the agent reads a natural-language instruction and answers by placing a cube on the image tile it believes is correct.

The motor problem is kept deliberately simple, so that failures point to missing, forgotten, or action-inaccessible knowledge rather than to long-horizon control difficulty.

✨ Overview

Act2Answer adapts established VLM benchmarks into an embodied binary-choice format. Each task has a short action-compatible instruction, two visual answer options, and a common selection action in simulation. The benchmark focuses on knowledge categories that matter for everyday embodied agents: social, physical, quantitative, temporal, normative, cultural, and biological knowledge.

Act2Answer task examples

📚 Act2Answer

The Act2Answer suite contains 1,720 unique binary questions. Each question is evaluated in both the original and the swapped layout, giving 3,440 evaluation episodes. The suite covers 12 categories adapted from five source benchmarks.
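The episode count, and one reason for running both layouts, can be checked with a few lines of shell arithmetic. This is only an illustrative sketch: it assumes that the swapped layout exchanges the positions of the two answer tiles, which the text above implies but does not state, and the variable names are made up for the example.

```shell
# Episode count: every unique question is run in two layouts.
unique_questions=1720
layouts=2   # original + swapped
episodes=$((unique_questions * layouts))
echo "episodes: $episodes"

# Assumption: the swapped layout exchanges the two tile positions, so the
# correct tile sits on a given side in exactly one of the two layouts.
# A policy that always picks the same side is then right once per question.
same_side_correct=$unique_questions
same_side_accuracy=$((100 * same_side_correct / episodes))
echo "always-same-side accuracy: ${same_side_accuracy}%"
```

Under that assumption, a purely positional strategy cannot score above chance across the paired episodes.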

Act2Answer data curation pipeline

🔍 Key Findings

Layerwise probing results

  • Current VLAs usually preserve simple perceptual distinctions such as Color and Shape.
  • Richer semantic categories are much harder: for many models, Emotion, Attribute, State, Time, Counting, Symmetry, Traffic, Public Info, Celebrity, and Living World remain near chance, which is 50% in this binary-choice format.
  • Strong VLM baselines can outperform their VLA counterparts by roughly 20-40 points on many knowledge-sensitive categories, suggesting a substantial VLM-to-VLA gap.
  • Layerwise probing shows that answer-relevant information often remains recoverable in intermediate backbone layers, but weakens near the layers used for action prediction.
  • VLA models trained with continued vision-language supervision tend to do better on knowledge-sensitive tasks than models trained mainly on robotics data.
  • Downstream action fine-tuning can improve control while further weakening some forms of knowledge-sensitive behavior.

⚙️ Installation

Prerequisites:

  • Linux with an NVIDIA GPU and working CUDA driver.
  • Conda or Miniconda available at /opt/conda, or configured with CONDA_ROOT.
  • git, tmux, and enough disk space for the model weights.
  • Optional but recommended: run huggingface-cli login before the first model download.
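A quick way to check these prerequisites before starting the setup scripts is sketched below. It is not part of the repository: the helper name a2a_missing_tools is hypothetical, nvidia-smi is used only as a proxy for a working NVIDIA driver, and the conda check assumes the usual <root>/bin/conda layout.

```shell
# Print every tool from the argument list that is not found on PATH.
# a2a_missing_tools is a hypothetical helper, not a script from this repo.
a2a_missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(a2a_missing_tools git tmux nvidia-smi)

# Conda is expected at /opt/conda unless CONDA_ROOT points elsewhere.
conda_root=${CONDA_ROOT:-/opt/conda}
[ -x "$conda_root/bin/conda" ] || missing="$missing conda"

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing prerequisites:" $missing
else
    echo "all prerequisites found"
fi
```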

Clone the repository and external model repos:

git clone <this-repo-url> Act2Answer
cd Act2Answer

bash scripts/setup/clone_external_repos.sh

Build the conda environment for the model you want to run. For example, SpatialVLA:

bash scripts/setup/setup_spatialvla_env.sh

Full setup instructions for all supported model stacks are in SETUP_README.md.

📈 Evaluation

Every evaluation wrapper sources scripts/env.sh, activates the expected conda environment, runs both noswap and swap, and writes FINAL_STATS to $A2A_LOG_DIR/<model>_<asset>_eval.log.
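To pull the summary out of a finished run, something like the following may be enough. It is a sketch under assumptions: the helper a2a_final_stats is hypothetical, the summary lines are assumed to contain the literal text FINAL_STATS, and the exact <model> token used in the log file name should be checked against the files in your log directory.

```shell
# Print the FINAL_STATS lines from one evaluation log.
# Usage: a2a_final_stats <model> <asset>
# Assumptions: summary lines contain the literal text FINAL_STATS, and the
# log directory defaults to logs/ when A2A_LOG_DIR is unset.
a2a_final_stats() {
    log_file="${A2A_LOG_DIR:-logs}/${1}_${2}_eval.log"
    if [ ! -f "$log_file" ]; then
        echo "no log found at $log_file" >&2
        return 1
    fi
    grep 'FINAL_STATS' "$log_file"
}
```

For example, a2a_final_stats spatialvla test_colors would look for spatialvla_test_colors_eval.log, assuming that is the model token the wrapper uses.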

In-process models:

ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_pi0.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_magma.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_openvla.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_spatialvla.sh
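Because every in-process wrapper takes the same three variables, the four commands above can be driven from a small loop. The function below is a hypothetical convenience sketch, not a script shipped with the repository; DRY_RUN is an illustrative variable that only prints the commands instead of running them.

```shell
# Run every in-process wrapper on one asset slice, one after another.
# Usage: a2a_run_in_process <asset_name> <count> <gpu>
# Set DRY_RUN=1 (hypothetical switch) to print the commands without running them.
a2a_run_in_process() {
    for model in pi0 magma openvla spatialvla; do
        script="scripts/eval_${model}.sh"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ASSETS=$1 COUNT=$2 EVAL_GPU=$3 bash $script"
        else
            # Keep going if one model fails, but report it on stderr.
            ASSETS="$1" COUNT="$2" EVAL_GPU="$3" bash "$script" \
                || echo "failed: $script" >&2
        fi
    done
}
```

Running the models sequentially keeps a single GPU from being shared between evaluations; each wrapper still activates its own conda environment.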

Server-based models need their policy server running before the evaluation wrapper starts:

GPU=0 bash scripts/servers/run_xiaomi_policy_server.sh
ASSETS=test_colors COUNT=6 EVAL_GPU=3 bash scripts/eval_xiaomi.sh

Available evaluation wrappers:

| VLA | Script | Environment |
| --- | --- | --- |
| pi0 | scripts/eval_pi0.sh | pi0_act2answer |
| Magma | scripts/eval_magma.sh | magma_act2answer |
| OpenVLA | scripts/eval_openvla.sh | openvla_rl4vla |
| SpatialVLA | scripts/eval_spatialvla.sh | spatialvla_act2answer |
| Xiaomi-Robotics-0 | scripts/eval_xiaomi.sh | server mibot, client act2ans |
| InternVLA-M1 | scripts/eval_internvla.sh | server internvla, client act2ans |
| MolmoAct2 | scripts/eval_molmoact.sh | server molmoact2, client act2ans |

You can also run the combined helper:

bash scripts/eval_all3.sh test_colors 6 3

Assets

Act2Answer assets live under:

ManiSkill/mani_skill/assets/carrot/<asset_name>/

Each asset set contains pairs.json, tile models/textures, and metadata. Use ASSETS=<asset_name> and COUNT=<n> to select an evaluation slice. COUNT=0 means all tasks.
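To see which values are valid for ASSETS, one option is to list the asset directories that actually contain a pairs.json file. The helper below is a hypothetical sketch and makes no assumption about what is inside pairs.json.

```shell
# List asset sets under the assets root that contain a pairs.json file.
# Usage: a2a_list_assets [assets_root]
# a2a_list_assets is a hypothetical helper, not a script from this repo.
a2a_list_assets() {
    root=${1:-ManiSkill/mani_skill/assets/carrot}
    for dir in "$root"/*/; do
        # Skip directories without pairs.json (and the unexpanded glob
        # when the root is empty or missing).
        [ -f "${dir}pairs.json" ] || continue
        basename "$dir"
    done
}
```

Each printed name can be passed as ASSETS=<asset_name> to an evaluation wrapper.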

Outputs

Evaluation creates local files only:

  • Videos and per-run YAML: $A2A_OUTPUT_DIR/<run-name>/glob/ (default: outputs/).
  • Logs: $A2A_LOG_DIR/ (default: logs/).
  • No wandb initialization, runs, or artifacts are created by evaluation.

❤️ Citation

If you find Act2Answer useful, please cite our paper:

@misc{kachaev2026doesvlaknowbasics,
  title={Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models},
  author={Nikita Kachaev and Andrey Moskalenko and Matvey Skripkin and Nikita Kurlaev and Daria Pugacheva and Albina Burlova and Mikhail Kolosov and Denis Shepelev and Andrey Kuznetsov and Elena Tutubalina and Aleksandr I. Panov and Alexey K. Kovalev and Vlad Shakhuro},
  year={2026},
  eprint={2606.19297},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19297}
}

Acknowledgements

Act2Answer builds on SimplerEnv and ManiSkill, with evaluation harness pieces derived from RL4VLA. The README structure follows the public BlindVLA project style.
