Benchmarking LLMs for Embodied Reasoning and Decision Making with Safety Constraints
🔗 Original repository:
https://github.com/embodied-agent-interface/embodied-agent-interface
This benchmark provides a comprehensive assessment of LLM performance across different subtasks, identifying their strengths and weaknesses in embodied decision-making contexts with safety constraints.
It includes:
- ✅ Tasks that can be achieved safely under adversarial and multi-constraint settings
- ⚠️ Adversarial instructions that the agent must avoid
We consider two planning strategies:
- 🚀 One-go Planning → ideal for direct planning
- 🔁 Stepwise Planning → ideal for iterative planning
In the one-go planning approach, the agent 🤖 generates a plan 📋 (a sequence of actions) in a single attempt for a given instruction or task.
📊 Upon successful execution, the environment graph 🗺️ 𝒢* (representing the updated state of the environment) is evaluated.
💡 Thanks to its straightforward design, this planning strategy is well-suited for direct planning scenarios.
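The one-go flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: `run_one_go`, the dictionary-based state, and the toy fridge environment are all hypothetical stand-ins for the real planner, simulator, and environment graph.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, str]  # toy stand-in for the environment graph (assumption)


def run_one_go(
    instruction: str,
    planner: Callable[[str], List[str]],
    apply_action: Callable[[State, str], Optional[State]],
    initial_state: State,
    goal: State,
) -> dict:
    """Generate a plan in a single attempt, execute it, then check the final state."""
    plan = planner(instruction)  # one attempt, no feedback from the environment
    state = dict(initial_state)
    for index, action in enumerate(plan):
        next_state = apply_action(state, action)
        if next_state is None:  # the action cannot be executed in this state
            return {"executable": False, "failed_at": index, "success": False}
        state = next_state
    # Only a fully executed plan yields a final state to evaluate.
    success = all(state.get(key) == value for key, value in goal.items())
    return {"executable": True, "failed_at": None, "success": success}


def toy_planner(instruction: str) -> List[str]:
    """Hypothetical planner; a real one would query the LLM once."""
    return ["open fridge", "take apple", "close fridge"]


def toy_apply(state: State, action: str) -> Optional[State]:
    """Toy transition function; returns None when the action cannot run."""
    new_state = dict(state)
    if action == "open fridge":
        new_state["fridge"] = "open"
    elif action == "close fridge":
        new_state["fridge"] = "closed"
    elif action == "take apple" and state["fridge"] == "open":
        new_state["apple"] = "held"
    else:
        return None
    return new_state


start = {"fridge": "closed", "apple": "in fridge"}
goal = {"fridge": "closed", "apple": "held"}
result = run_one_go("get the apple", toy_planner, toy_apply, start, goal)
print(result)
```

The key property is that the planner is called exactly once: if any action fails, there is no opportunity to recover within the same run.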
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode onego \
--llm_name LGAI-EXAONE/EXAONE-3.5-32B-Instruct \
--strategy direct
In stepwise planning, the agent 🤖 interacts with the environment 🌍 for n steps and m trials to finish the task.
At each step, the agent selects and executes an action.
This interaction continues for a pre-defined number of steps, and the resulting sequence of steps constitutes a trajectory.
At the end of each trial, a critic 🧠 evaluates the plan generated so far.
This feedback guides the agent in refining its strategy or decision-making process for future trials.
The process continues until the agent generates the “Done” action or exhausts all trials.
If the plan is executable, the updated environment state 𝒢* is evaluated.
✨ This strategy suits iterative planning, where the agent repeatedly interacts with the simulator/environment.
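The stepwise loop can be sketched as two nested loops: an outer loop over trials and an inner loop over steps, with critic feedback carried from one trial to the next. This is a rough sketch under stated assumptions, not the repository's code: `run_stepwise`, its callback signatures, and the toy agent are hypothetical, and the sketch assumes the environment is reset at the start of each trial.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation); a toy trajectory entry


def run_stepwise(
    choose_action: Callable[[List[Step], List[str]], str],
    execute: Callable[[str], str],
    critic: Callable[[List[Step]], str],
    reset: Callable[[], None],
    max_steps: int,
    max_trials: int,
) -> Tuple[bool, List[Step], List[str]]:
    """Run up to max_trials trials of up to max_steps steps each."""
    feedback: List[str] = []  # critic messages accumulated across trials
    trajectory: List[Step] = []
    for _ in range(max_trials):
        reset()  # assumption: each trial starts from a fresh environment
        trajectory = []
        for _ in range(max_steps):
            action = choose_action(trajectory, feedback)
            if action == "Done":  # the agent declares the task finished
                return True, trajectory, feedback
            trajectory.append((action, execute(action)))
        # End of trial: the critic reviews the plan generated so far.
        feedback.append(critic(trajectory))
    return False, trajectory, feedback


def toy_agent(trajectory: List[Step], feedback: List[str]) -> str:
    """Hypothetical agent: makes no progress until it has critic feedback."""
    if not feedback:
        return "wander"
    return "unplug toaster" if not trajectory else "Done"


finished, trajectory, feedback = run_stepwise(
    choose_action=toy_agent,
    execute=lambda action: "ok",
    critic=lambda steps: "No progress; unplug the toaster first.",
    reset=lambda: None,
    max_steps=3,
    max_trials=2,
)
print(finished, trajectory, feedback)
```

In this toy run the first trial exhausts its steps, the critic's message is stored, and the second trial uses that message to finish with "Done".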
⚙️ Key Arguments
- `--llm_name`: the model to evaluate, e.g. `Qwen/Qwen3-32B`, `gpt-4.1-mini`, etc.
- `--strategy`: `direct`, `react`, or `rej`
- `--rm_safety_instruction`: if set, the safety-enhancement instruction is removed from the prompt (for ablation studies).
- `--reflex`: if set, `reflexion` or `AI-critic from gpt-4.1` is applied.
- `--reflex_from_llm_as_judge`: if set, use `AI-critic from gpt-4.1`; otherwise, use `reflexion`.
- `--trial`: number of rounds for `reflexion` or `AI-critic`.
🧪 Example Experiments
🔁 Stepwise + ReAct
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode stepwise --llm_name ${llm} --strategy react
🧠 Stepwise + ReAct + Reflexion / AI-Critic
# reflexion for additional N rounds (N is 1,2,3,...)
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode stepwise --llm_name ${llm} --strategy react \
--reflex_from_llm_as_judge \
--trial 1
🧹 Ablation: Remove Safety Instruction
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode onego --strategy direct --llm_name ${llm} \
--rm_safety_instruction
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode stepwise --strategy direct --llm_name ${llm} \
--rm_safety_instruction
python src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py \
--mode stepwise --strategy react --llm_name ${llm} \
--reflex_from_llm_as_judge \
--trial 1 \
--risk_category 'electrical hazard'
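The example commands above all follow the same shape, so when running many configurations it may be convenient to assemble them programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository: `build_command` is an invented name, and it only emits the script path and flags that appear in this README.

```python
import shlex
from typing import List, Optional

# Script path as used in the example commands.
SCRIPT = "src/behavior_eval/evaluation/action_sequencing/scripts/evaluate_results.py"


def build_command(
    mode: str,
    llm_name: str,
    strategy: str,
    rm_safety_instruction: bool = False,
    reflex: bool = False,
    reflex_from_llm_as_judge: bool = False,
    trial: Optional[int] = None,
    risk_category: Optional[str] = None,
) -> List[str]:
    """Assemble an evaluation command as an argument list (hypothetical helper)."""
    if mode not in ("onego", "stepwise"):
        raise ValueError(f"unknown mode: {mode}")
    if strategy not in ("direct", "react", "rej"):
        raise ValueError(f"unknown strategy: {strategy}")
    command = ["python", SCRIPT, "--mode", mode,
               "--llm_name", llm_name, "--strategy", strategy]
    # Boolean switches are appended only when enabled.
    if rm_safety_instruction:
        command.append("--rm_safety_instruction")
    if reflex:
        command.append("--reflex")
    if reflex_from_llm_as_judge:
        command.append("--reflex_from_llm_as_judge")
    if trial is not None:
        command += ["--trial", str(trial)]
    if risk_category is not None:
        command += ["--risk_category", risk_category]
    return command


command = build_command(
    mode="stepwise",
    llm_name="gpt-4.1-mini",
    strategy="react",
    reflex_from_llm_as_judge=True,
    trial=1,
    risk_category="electrical hazard",
)
# shlex.join quotes arguments containing spaces, so the line is safe to paste into a shell.
print(shlex.join(command))
```

Returning a list rather than a string keeps the result usable with `subprocess.run(command)` directly, which avoids shell-quoting issues for values such as `'electrical hazard'`.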
If you find this work helpful, please consider citing it:
@inproceedings{sadhu2025vestabench,
title={VESTABENCH: An Embodied Benchmark for Safe Long-Horizon Planning Under Multi-Constraint and Adversarial Settings},
author={Sadhu, Tanmana and Chen, Yanan and Pesaranghader, Ali},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track},
year={2025}
}
This project is released under the MIT License.