
# SimpleVLA-RL: Open RL Framework for Vision–Language–Action Models
[Paper](https://arxiv.org/abs/2509.09674) | [GitHub](https://github.com/PRIME-RL/SimpleVLA-RL) | [Models (Hugging Face)](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86) | [X Thread](https://x.com/stingning/status/1927770654385860804) | [WeChat Group](figs/wechat-group.png)
**SimpleVLA-RL** is an efficient RL framework for vision–language–action (VLA) models that improves long-horizon planning under data scarcity. Using reinforcement learning with simple outcome rewards, it substantially outperforms SFT in both simulation and real-world tasks, reveals a "pushcut" phenomenon in which new action patterns emerge during training, and strengthens spatial, object, and goal generalization.
# 🎉News
- **[2025-10-01]** **SimpleVLA-RL** now supports RoboTwin2.0 Benchmark. Feel free to experiment with it!
- **[2025-09-12]** Excited to release the **SimpleVLA-RL** paper! Check it out: [Paper](https://arxiv.org/abs/2509.09674).
- **[2025-05-27]** We release the code of **SimpleVLA-RL**.
# 📌Highlights
#### Efficient and Effective VLA Reinforcement Learning Framework
- End-to-end VLA RL pipeline built on [veRL](https://github.com/volcengine/verl) with VLA-specific optimizations
- Multi-environment parallel rendering significantly accelerates VLA trajectory sampling
- Leverages [veRL](https://github.com/volcengine/verl)'s state-of-the-art infrastructure: efficient distributed training (FSDP), hybrid communication patterns, and optimized memory management for fast training/inference
#### Model and Environment Support
- **VLA Models**: [OpenVLA](https://github.com/openvla/openvla), [OpenVLA-OFT](https://github.com/moojink/openvla-oft)
- **Benchmarks**: [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO), [RoboTwin 1.0/2.0](https://github.com/TianxingChen/RoboTwin)
- Modular architecture for easy integration of new VLA models, benchmarks, and RL algorithms (upcoming)
#### Minimal Reward Engineering and Exploration Strategies
- Binary (0/1) outcome rewards - no complex reward design needed
- Exploration strategies: dynamic sampling, adaptive clipping, temperature tuning
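As a concrete illustration of one of these strategies, the sketch below shows dynamic sampling over binary outcome rewards: rollout groups whose rewards are all 0 or all 1 carry no learning signal under group-relative advantages, so they are filtered out and resampled. Names and interfaces here are illustrative, not the repository's actual API.
```python
import numpy as np

def filter_uninformative_groups(rewards: np.ndarray) -> np.ndarray:
    """Dynamic sampling: keep only rollout groups with mixed 0/1 outcomes.

    rewards: (num_groups, group_size) array of binary outcome rewards,
    one group per task prompt. All-success or all-failure groups yield
    zero group-relative advantage, so they are dropped and resampled.
    """
    group_mean = rewards.mean(axis=1)
    keep = (group_mean > 0.0) & (group_mean < 1.0)
    return keep  # boolean mask over groups

# Example: 4 groups of 8 rollouts each; group 0 is all-success -> dropped
rng = np.random.default_rng(0)
rewards = (rng.random((4, 8)) > 0.5).astype(np.float32)
rewards[0] = 1.0
print(filter_uninformative_groups(rewards))
```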
# 🔧Key Implementations
SimpleVLA-RL extends veRL with VLA-specific components across the following modules:
**[`verl/trainer/main_ppo.py`](verl/trainer/main_ppo.py)**
- Main entry point with ray initialization
- `RobRewardManager` for reward distribution
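With binary outcomes, the reward manager's job is conceptually simple. As a hedged sketch of the idea (the class name and interface below are simplified assumptions, not the actual `RobRewardManager` signature), the episode's 0/1 success is placed on the last valid step of each trajectory:
```python
import torch

class BinaryOutcomeRewardManager:
    """Simplified stand-in for a binary outcome reward manager."""

    def __call__(self, successes: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
        # successes: (batch,) tensor of 0.0/1.0 episode outcomes
        # response_mask: (batch, T) with 1 for valid steps, 0 for padding
        rewards = torch.zeros_like(response_mask, dtype=torch.float32)
        last_idx = (response_mask.sum(dim=1).long() - 1).clamp(min=0)
        rewards[torch.arange(len(successes)), last_idx] = successes
        return rewards  # sparse per-step rewards, nonzero only at episode end
```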
**[`verl/trainer/ppo/ray_trainer.py`](verl/trainer/ppo/ray_trainer.py)**
- Main RL training loop: data loading, VLA rollout, model updates, evaluation, checkpointing
- RL algorithm-specific advantage computation
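With binary outcome rewards, a group-relative estimator (GRPO-style) is a natural choice for the advantage. The sketch below is one such estimator under that assumption, not a transcription of the repository's code:
```python
import torch

def group_relative_advantage(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each rollout's scalar reward by the mean/std of its group
    (the set of rollouts that share the same task prompt)."""
    grouped = rewards.view(-1, group_size)    # (num_groups, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + eps)).view(-1)
```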
**[`verl/workers/fsdp_workers.py`](verl/workers/fsdp_workers.py)**
- Source of core functions called in `ray_trainer.py`
- VLA model/optimizer initialization, `generate_sequences`, `compute_entropy`, `update_actor`
**[`verl/workers/actor/dp_rob.py`](verl/workers/actor/dp_rob.py)**
- Specific implementation of functions in `fsdp_workers.py`
- RL loss computation, policy updates, `compute_log_prob`, `compute_entropy`
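For the loss itself, a PPO-style clipped surrogate is the standard building block; the asymmetric clipping range below illustrates the "adaptive clipping" exploration strategy mentioned earlier (a looser upper bound encourages exploration). Clip values and signatures are illustrative assumptions:
```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_low: float = 0.2, clip_high: float = 0.28) -> torch.Tensor:
    """Token-level PPO surrogate with an asymmetric clipping range.

    logp_new/logp_old: log-probs of sampled action tokens under the current
    and rollout policies; advantages: per-token advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```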
**[`verl/workers/rollout/rob_rollout.py`](verl/workers/rollout/rob_rollout.py)**
- VLA rollout implementation: environment creation, multi-environment parallel rendering, VLA action generation, environment interaction, video saving, trajectory and 0/1 reward collection
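The rollout loop's key efficiency trick is stepping all environments in lockstep so the VLA runs one batched forward pass per step. A minimal sketch of that pattern, assuming hypothetical `policy` and `env` interfaces rather than the repository's actual classes:
```python
import numpy as np

def rollout_parallel(policy, envs, max_steps: int) -> np.ndarray:
    """Step all environments together; collect binary 0/1 outcome rewards."""
    obs = [env.reset() for env in envs]
    done = np.zeros(len(envs), dtype=bool)
    rewards = np.zeros(len(envs), dtype=np.float32)
    for _ in range(max_steps):
        actions = policy.predict_batch(obs)   # one batched forward pass for all envs
        for i, env in enumerate(envs):
            if done[i]:
                continue
            obs[i], success, done[i] = env.step(actions[i])
            if done[i]:
                rewards[i] = float(success)   # 0/1 outcome reward at episode end
        if done.all():
            break
    return rewards
```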
**[`verl/utils/dataset/rob_dataset.py`](verl/utils/dataset/rob_dataset.py)**
- Dataset construction for training/testing across benchmarks
**[`verl/utils/vla_utils/`](verl/utils/vla_utils/)**
- VLA model implementations (OpenVLA-OFT/OpenVLA from official code)
# ✨Getting Started
#### 1. Set Up the Environment
See [SETUP.md](SETUP.md) for detailed instructions on setting up the conda environment.
#### 2. Prepare the SFT Model
An **SFT (Supervised Fine-Tuning)** VLA model is required for RL training. Below are the available options:
* **OpenVLA-OFT SFT Models**
  Download from the [SimpleVLA-RL Collection](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86); a download snippet follows this list. Available models include:
  - `libero-10 traj1/trajall SFT`
  - `libero-goal/object/spatial traj1 SFT`
  - `Robotwin2.0 tasks traj1000 SFT`
* **OpenVLA SFT Models**
  Download from the [OpenVLA Hugging Face page](https://huggingface.co/openvla).
* **Other Models**
  For other models, you may need to fine-tune them yourself.
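For the Hugging Face checkpoints above, a download sketch using `huggingface_hub` (the repo id is a placeholder; substitute the actual model id from the collection):
```python
from huggingface_hub import snapshot_download

# Placeholder repo id: pick the actual checkpoint from the SimpleVLA-RL collection.
local_path = snapshot_download(repo_id="Haozhan72/<sft-model-name>", local_dir="./sft_model")
print("SFT model downloaded to:", local_path)
```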
#### 3. Train with SimpleVLA-RL
Before running the training script, ensure the following configurations are properly set:
- **Set Your Weights and Biases (WandB) API Key**
  Replace the `WANDB_API_KEY` field in `SimpleVLA-RL/align.json` with your own WandB API key.
- **Modify Key Variables**
  Update the following variables in `examples/run_openvla_oft_rl_libero.sh` or `examples/run_openvla_oft_rl_twin2.sh` as needed:
  - `WANDB_API_KEY`: Your WandB API key.
  - `EXPERIMENT_NAME`: The name of your experiment; choose any name.
  - `SFT_MODEL_PATH`: Path to your SFT model.
  - `CKPT_PATH`: Path where your checkpoints will be saved.
  - `DATASET_NAME`: See the script you are running for the available options.
  - `ALIGN_PATH`: Path to the `SimpleVLA-RL/align.json` file.
  - `NUM_GPUS`: Number of GPUs available per node (e.g., `8`).
  - `NUM_NODES`: Number of nodes used for RL training (e.g., `1`).
> [!NOTE]
>
> - The script has been tested on the following configurations:
> - Single-node setup: `NUM_NODES=1`, `NUM_GPUS=8` (1 node with 8 NVIDIA A800 GPUs, each having 80GB memory).
> - Multi-node setup: `NUM_NODES=2`, `NUM_GPUS=8` (2 nodes with 16 NVIDIA A800 GPUs, each having 80GB memory).
> - The driver version used was `470.161.03` with CUDA `12.4`; matching these exact versions is not required.
- **Run RL Training**
  Use one of the following commands to start RL training for OpenVLA-OFT on the LIBERO or RoboTwin2.0 benchmark:
```bash
bash examples/run_openvla_oft_rl_libero.sh
# or
bash examples/run_openvla_oft_rl_twin2.sh
```
#### 4. Run Evaluation
To evaluate a trained model, enable evaluation mode by setting `trainer.val_only=True` in `examples/run_openvla_oft_rl_libero.sh` or `examples/run_openvla_oft_rl_twin2.sh`, then execute the same script:
```bash
bash examples/run_openvla_oft_rl_libero.sh
# or
bash examples/run_openvla_oft_rl_twin2.sh
```
# 📃 Main Results
We evaluate SimpleVLA-RL on the LIBERO benchmark using OpenVLA-OFT. SimpleVLA-RL improves OpenVLA-OFT to 97.6 points on LIBERO-Long, setting a new state of the art. Remarkably, with only one trajectory per task for cold-start SFT, SimpleVLA-RL raises OpenVLA-OFT's score from 17.3 to 91.7, an improvement of 74.4 points (430.1% relative).
# 🌻Acknowledgement
We develop this preview version of the code based on [veRL](https://github.com/volcengine/verl), [OpenVLA-OFT](https://github.com/moojink/openvla-oft), [RoboTwin2.0](https://github.com/RoboTwin-Platform/RoboTwin.git), and [PRIME](https://github.com/PRIME-RL/PRIME). We acknowledge their significant contributions!
For further details and updates, please refer to the official documentation and repositories of the respective projects.
# 📝Roadmap
### Expanding Model Support
- [ ] Support advanced diffusion-based RL: [pi0](https://github.com/Physical-Intelligence/openpi) and [pi0.5](https://github.com/Physical-Intelligence/openpi) with flow-matching RL
- [ ] Support more VLA models: especially for lightweight ones (e.g. [VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter), [SmolVLA](https://huggingface.co/docs/lerobot/smolvla))
### Expanding Environment Support
- [ ] Support more benchmarks: e.g. [SimplerEnv](https://github.com/simpler-env/SimplerEnv), [BEHAVIOR](https://github.com/behavior-robot-suite/brs-algo), [Calvin](https://github.com/mees/calvin)
- [ ] Support real-world RL.
### Expanding Framework
- [ ] Additional online and offline RL algorithms
- [ ] Modular environment and VLA interface for easy adaptation
- [ ] Further optimize the RL framework to achieve more efficient training
# 📨Contact
- Haozhan Li: zhan72426@gmail.com
- Ning Ding: dingning@mail.tsinghua.edu.cn
# 🎈Citation
If you find SimpleVLA-RL helpful, please cite us:
```bibtex
@article{li2025simplevla,
  title={SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning},
  author={Li, Haozhan and Zuo, Yuxin and Yu, Jiale and Zhang, Yuhao and Yang, Zhaohui and Zhang, Kaiyan and Zhu, Xuekai and Zhang, Yuchen and Chen, Tianxing and Cui, Ganqu and others},
  journal={arXiv preprint arXiv:2509.09674},
  year={2025}
}
```
# 🌟Star History
[Star History Chart](https://www.star-history.com/#PRIME-RL/SimpleVLA-RL&Date)