Train flow matching action expert for VLA robotic manipulation in simulation environment

Key components

🔬 This repo contains
Training and evaluation examples for flow matching on the Robomimic, PushT, LIBERO and Franka Kitchen benchmarks. The code is adapted from the paper "Affordance-based Robot Manipulation with Flow Matching" (https://hri-eu.github.io/flow-matching-policy/), with bug fixes, additional features and more comprehensive experiments.

🌷 Getting Started

🚀 Install the Python dependencies:

pip install -r requirements.txt

🧪 Run training and evaluation on benchmarks

pusht: examples/flow_pusht.py
franka kitchen: examples/flow_kitchen.py
libero:
- 👁️ Vision + 🤖 Robot State → 🎯 Action: examples/flow_libero_unet_ablation_text.py
- 👁️ Vision + 📝 Textual Instruction + 🤖 Robot State → 🎯 Action: examples/flow_libero_unet_qwen.py
- 👁️ Vision + 📝 Textual Instruction + 🤖 Robot State → 🧠 VLM → 🎯 Action: examples/flow_libero_vlm.py
- 👁️ Vision + 📝 Textual Instruction + 🤖 Robot State → 🧠 VLM → CoT → 🎯 Action: examples/flow_libero_unet_qwen_cot.py

Example: the training procedure periodically saves checkpoints to the local folder checkpoints.

Kick off training:

python examples/flow_libero_unet_qwen.py \
--save_cp 	\
--cp_name unfrozen_text_model

Then evaluate a saved checkpoint on LIBERO or LIBERO-PLUS:

python examples/flow_libero_unet_qwen.py \
--eval_cp checkpoints/libero/unet_qwen/cp-frozen_text_model-100.pth

The --eval_cp argument alone decides the mode: when it is passed, the script runs evaluation on that checkpoint; otherwise it trains.
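The mode switch described above can be sketched with argparse. This is a minimal illustration, not the scripts' actual parser: the flag names come from the commands shown here, while the defaults, help text and the `select_mode` helper are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the commands above; defaults and help text are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval_cp", type=str, default=None,
                        help="checkpoint path; when given, run evaluation")
    parser.add_argument("--save_cp", action="store_true",
                        help="save checkpoints during training")
    parser.add_argument("--cp_name", type=str, default="default",
                        help="tag used in saved checkpoint file names")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    # Hypothetical helper: evaluation if and only if a checkpoint path was passed.
    return "eval" if args.eval_cp else "train"


train_args = build_parser().parse_args(
    ["--save_cp", "--cp_name", "unfrozen_text_model"])
eval_args = build_parser().parse_args(
    ["--eval_cp", "checkpoints/libero/unet_qwen/cp-frozen_text_model-100.pth"])
print(select_mode(train_args), select_mode(eval_args))
```

Keeping a single optional argument as the switch means the same script and the same model-construction code serve both training and evaluation.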

Train with CoT (parameters need to be tuned)

python examples/flow_libero_unet_qwen_cot.py \
--batchsize 256 \
--save_cp 	\
--frozen_text_model \
--cp_name cot_frozen_text_model  \
--lambda_fm_start  0.1 \
--lambda_fm_end  1.0 \
--lambda_depth_start  1.0 \
--lambda_depth_end  0.1 \
--lambda_eef_start  1.0 \
--lambda_eef_end  0.1 \
--lambda_plateau_min_epochs 100 \
--lambda_ramp_epochs  60 \
--lambda_force_ramp_epoch 240 \
--lambda_force_action_epoch  360

python examples/flow_libero_unet_qwen_cot.py \
--batchsize 128 \
--save_cp 	\
--frozen_text_model \
--cp_name cot_frozen_text_model_cot_first  \
--lambda_fm_start 0.0  \
--lambda_fm_end 1.0  \
--lambda_depth_start 1.0  \
--lambda_depth_end 0.1  \
--lambda_eef_start 1.0  \
--lambda_eef_end 0.1  \
--lambda_plateau_min_epochs 120  \
--lambda_ramp_epochs 80  \
--lambda_force_ramp_epoch 280  \
--lambda_force_action_epoch 420
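The lambda_* flags suggest that the loss weights for the flow-matching, depth and end-effector terms move from a start value to an end value during training. The scripts' exact schedule is not described here, so the following is only one plausible reading: a linear ramp that begins at a fixed epoch. The function name `ramp_weight`, the choice of ramp start, and the omission of any plateau detection are all assumptions.

```python
def ramp_weight(epoch: int, start: float, end: float,
                ramp_start: int, ramp_epochs: int) -> float:
    """Linearly move a loss weight from `start` to `end` over `ramp_epochs`.

    Hypothetical schedule: the weight stays at `start` until `ramp_start`,
    then changes linearly and stays at `end` once the ramp is complete.
    """
    if epoch <= ramp_start:
        return start
    progress = min(1.0, (epoch - ramp_start) / ramp_epochs)
    return start + (end - start) * progress


# Values taken from the first command above; using epoch 240 as the ramp
# start is an assumption made only for this illustration.
for epoch in (0, 240, 270, 300, 400):
    lambda_fm = ramp_weight(epoch, 0.1, 1.0, ramp_start=240, ramp_epochs=60)
    lambda_depth = ramp_weight(epoch, 1.0, 0.1, ramp_start=240, ramp_epochs=60)
    print(epoch, round(lambda_fm, 3), round(lambda_depth, 3))
```

Under this reading, the auxiliary depth and end-effector losses dominate early training and the action (flow-matching) loss takes over later.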

🧱 VLA Datasets on huggingface 🤗

On top of the vanilla LIBERO VLA training set (https://huggingface.co/datasets/physical-intelligence/libero), an enhanced version adds two columns, depth and eef_traj: https://huggingface.co/datasets/yananchen/libero_cot_contious

Note that it is no longer in LeRobot format; it is a plain Hugging Face datasets dataset.

The CoT annotations are extracted from the original MolmoAct dataset: https://huggingface.co/datasets/allenai/libero
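Because the enhanced dataset is a plain Hugging Face dataset, it can presumably be loaded with `datasets.load_dataset`. The sketch below assumes a split named "train" (check the dataset card) and adds a hypothetical helper, `missing_columns`, that verifies the two extra columns are present.

```python
from typing import Iterable, List

DATASET_ID = "yananchen/libero_cot_contious"  # repository id from the link above
EXTRA_COLUMNS = ("depth", "eef_traj")         # the two added columns


def missing_columns(column_names: Iterable[str],
                    required: Iterable[str] = EXTRA_COLUMNS) -> List[str]:
    """Return the required columns that are absent from `column_names`."""
    present = set(column_names)
    return [name for name in required if name not in present]


def load_cot_dataset(split: str = "train"):
    """Load the dataset with Hugging Face `datasets` (needs network access).

    The split name "train" is an assumption, not confirmed by this README.
    """
    # Imported lazily so that the helper above has no third-party dependency.
    from datasets import load_dataset

    dataset = load_dataset(DATASET_ID, split=split)
    missing = missing_columns(dataset.column_names)
    if missing:
        raise ValueError(f"dataset is missing expected columns: {missing}")
    return dataset
```

Calling `load_cot_dataset()` downloads the data on first use, so it is left uncalled here.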

⚙️ Benchmark: LIBERO

In the experiments, LIBERO serves as the benchmark for measuring the success rate of rollouts. For each rollout, a LIBERO environment is initialized with a random seed so that the initial conditions vary. The model checkpoint is then loaded and returns a 7-dimensional action given the state observations. A rollout ends either when the task is judged successful or when it reaches the maximum step limit, in which case it is flagged as a failure (the model did not finish the task within the allowed steps).
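The rollout procedure above can be sketched as a simple loop. `ToyEnv` is a stand-in written for this example and is not the LIBERO API; `zero_policy` is a placeholder for a loaded checkpoint. Only the control flow (seeded reset, 7-dimensional action, success or step-limit failure) reflects the description.

```python
import random
from typing import Callable, List, Tuple


class ToyEnv:
    """Stand-in simulator for illustration only; NOT the LIBERO interface."""

    def __init__(self, steps_to_success: int):
        self.steps_to_success = steps_to_success
        self.t = 0

    def reset(self, seed: int) -> List[float]:
        # The seed varies the initial observation between rollouts.
        self.t = 0
        rng = random.Random(seed)
        return [rng.random() for _ in range(3)]

    def step(self, action: List[float]) -> Tuple[List[float], bool]:
        # Reports success once enough steps have been taken.
        self.t += 1
        return [0.0, 0.0, 0.0], self.t >= self.steps_to_success


def zero_policy(obs: List[float]) -> List[float]:
    # Placeholder policy: a real one would run the trained model on `obs`.
    return [0.0] * 7


def rollout(env: ToyEnv, policy: Callable[[List[float]], List[float]],
            seed: int, max_steps: int) -> bool:
    """Return True on success, False if the step limit is reached first."""
    obs = env.reset(seed)
    for _ in range(max_steps):
        action = policy(obs)
        assert len(action) == 7, "actions are 7-dimensional"
        obs, success = env.step(action)
        if success:
            return True
    return False


print(rollout(ToyEnv(5), zero_policy, seed=0, max_steps=10))
print(rollout(ToyEnv(50), zero_policy, seed=0, max_steps=10))
```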

To reproduce the full Anaconda environment, see environment.yml (for example, conda env create -f environment.yml).

📈 Findings

Without a VLM, a vanilla flow-matching model with no pre-training or fine-tuning, paired with an off-the-shelf LLM text encoder (for example, Qwen/Qwen3-0.6B), reaches about 85% average success rate across the four benchmark suites libero_spatial, libero_object, libero_goal and libero_10 of the original LIBERO benchmark (https://libero-project.github.io/datasets), which is widely used in current VLA papers.

Epoch	Success Rate (%)
100	38.1
200	42.7
300	62.3
400	64.3
500	63.9
600	75.2
800	79.1
1000	78.6
1200	81.3
1400	84.65
1600	84.95
1800	84.5
2000	85.4
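The success rates above are averages over the four suites. A minimal sketch of such an average follows, assuming an unweighted mean of per-suite rates; the repository's calculate_sr.py may compute it differently, and the counts below are made up purely to exercise the function.

```python
from typing import Dict, Tuple


def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent for one suite."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 100.0 * successes / trials


def average_success_rate(per_suite: Dict[str, Tuple[int, int]]) -> float:
    """Unweighted mean of per-suite success rates, in percent."""
    rates = [success_rate(s, t) for s, t in per_suite.values()]
    return sum(rates) / len(rates)


# Hypothetical (successes, trials) counts for illustration only;
# these are NOT results from this repository.
example_counts = {
    "libero_spatial": (45, 50),
    "libero_object": (40, 50),
    "libero_goal": (35, 50),
    "libero_10": (30, 50),
}
print(average_success_rate(example_counts))
```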

To expose hidden vulnerabilities of VLA models, the same model checkpoint can also be tested on the enhanced benchmark LIBERO-PLUS ("In-depth Robustness Analysis for Vision-Language-Action Models", https://github.com/sylvestf/LIBERO-plus), which provides a comprehensive robustness evaluation across seven perturbation dimensions.

For vanilla LIBERO, set the following before evaluation:

export PYTHONPATH=$PYTHONPATH:~/robotics/LIBERO
ln -sf ~/.libero/config_libero.yaml ~/.libero/config.yaml

To use LIBERO-PLUS instead, install it from source and set:

export PYTHONPATH=$PYTHONPATH:~/robotics/LIBERO-plus
ln -sf ~/.libero/config_libero_plus.yaml ~/.libero/config.yaml

LIBERO-PLUS is a more challenging benchmark that adds tasks spanning seven dimensions: object layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise.

Evaluating the epoch-2000 checkpoint on LIBERO-PLUS gives an average success rate of around 23.4%, a large drop from the 85.4% it reaches on vanilla LIBERO.
