# VLA-OS **Repository Path**: hhyocc/VLA-OS ## Basic Information - **Project Name**: VLA-OS - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-07-20 - **Last Updated**: 2025-07-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models ### ๐Ÿ“[Paper](https://arxiv.org/abs/2506.17561) | ๐ŸŒ[Project Page](https://nus-lins-lab.github.io/vlaos/) | ๐Ÿค—[Model](https://huggingface.co/Linslab/VLA-OS) | ๐Ÿ›ข๏ธ[Data](https://huggingface.co/datasets/Linslab/VLA-OS-Dataset) ![](./imgs/overview.png) VLA-OS is a unified framework for planning representations and paradigms research in vision-language-action (VLA) models. Specifically, VLA-OS offers the following features: - ๐Ÿ—๏ธ **Advanced VLA Designs** VLA-OS integrates multiple cutting-edge VLA design elements, including support for [multi-view historical inputs](https://robovlms.github.io/), [action chunking](https://tonyzhaozh.github.io/aloha/), a separate [action head](https://www.physicalintelligence.company/blog/pi0), [block-wise causal attention for extracting visual-language model (VLM) features](https://www.physicalintelligence.company/blog/pi0), and support for both [L1 loss](https://openvla-oft.github.io/) and [flow-matching loss](https://www.physicalintelligence.company/blog/pi0) within a single network architecture. - ๐Ÿ”— **Modular, Scalable VLM Backbone** VLA-OS is agnostic to the choice of large-language or visual-language models: any Hugging Face LLM/VLM can be employed. [Our paper](https://arxiv.org/abs/2506.17561) presents model-scalability experiments on the same LLM model architecture ([Qwen2.5](https://github.com/QwenLM/Qwen2.5-VL)) with only different number of parameters. - ๐Ÿ› ๏ธ **Composable Planning Heads for Different Planning Representations** A suite of composable planning heads is provided for different task planning representations: language reasoning, visual reasoning, and image foresight reasoning. Each of them can be seamlessly attached to the VLM backbone. - ๐Ÿ”„ **Different Planning Paradigms** Using a unified codebase, VLA-OS implements three planning paradigms: Action-Only VLA, Integrated VLA, and Hierarchical VLA, enabling flexible exploration of planning strategies. This repo is an official PyTorch implementation of VLA-OS, containing: - ๐Ÿ› ๏ธVLA-OS model [implementation](https://github.com/HeegerGao/VLA-OS/tree/main/vlaos/models). - ๐Ÿค—[Dataset](https://huggingface.co/datasets/Linslab/VLA-OS-Dataset) of VLA-OS of [LIBERO](https://libero-project.github.io/intro.html), [The Colosseum](https://robot-colosseum.github.io/), [FurnitureBench](https://clvrai.github.io/furniture-bench/), [DexArt](https://www.chenbao.tech/dexart/), [PerAct2](https://bimanual.github.io/), and Real-World Deformable Object Manipulation tasks. - ๐Ÿค—[Checkpoint](https://huggingface.co/Linslab/VLA-OS) of VLA-OS. - ๐Ÿ“ˆTraining [scripts](https://github.com/HeegerGao/VLA-OS/tree/main/scripts) (with DeepSpeed Accelerator for VLA and FSDP for VLM). - ๐Ÿค–Data transformation [scripts](https://github.com/Panda-Shawn/VLA-OS-Language-Planning-Labeling) for your own dataset. - ๐Ÿ•น๏ธPlanning Data Labeling [scripts](https://github.com/Panda-Shawn/VLA-OS-Language-Planning-Labeling) for your custom dataset. The following guides include the [installation](#installation), [VLM Pretraining](#vlm-pretraining), [VLA Training](#vla-training), and [Training on your own dataset](#training-on-your-own-dataset). ## ๐Ÿ“ฐ News - [2025/06/24] ๐Ÿ”ฅ Training Code released! ## TODO [] Add training code for continual learning [] Add evaluation code ## Installation This installation is for NVIDIA A100 80G with cuda 12.6. ```bash # Clone this repo git clone git@github.com:HeegerGao/VLA-OS.git # Create a Conda environment conda create -n vla python=3.10 conda activate vla # Install PyTorch pip3 install torch torchvision torchaudio # Install LIBERO git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git cd LIBERO pip install -r requirements.txt pip install -e . # Install dlimp git clone https://github.com/kvablack/dlimp cd dlimp pip install -e . # Install Flash Attention pip install flash-attn --no-build-isolation # Install other prequisites pip install -r requirements.txt ``` ## VLM Pretraining 1. Download the llava-v1.5-instruct Dataset You can download it according to the Prismatic-VLMs [instruction](https://github.com/TRI-ML/prismatic-vlms/tree/main?tab=readme-ov-file#pretraining-datasets). Then move the unzipped dataset folder: ```bash cd VLA-OS mkdir dataset mv YOUR_llava-v1.5-instruct_FOLDER dataset ``` 2. Train the VLM ```bash bash commands/pertrain_vlm.sh --config=config/train_vlm.yaml ``` If you do not want to train the VLM by yourself, you can directly download the pretrained VLM checkpoint [here](https://huggingface.co/Linslab/VLA-OS/tree/main/Qwen2.5-VLM). ## VLA Training 1. Put your pretrained VLM checkpoint under `runs/qwen25-dinosiglip-224px+0_5b+stage-finetune+x42/checkpoints/latest-checkpoint.pt`. 2. Download pretrained VAE (infinity_vae_d32reg.pth) for Image Foresight Planning from [here](https://huggingface.co/Linslab/VLA-OS/blob/main/auxiliary/infinity_vae_d32reg.pth). 3. Download the training datasets from [our huggingface dataset repo](https://huggingface.co/datasets/Linslab/VLA-OS-Dataset). 4. Run the training script with the corresponding config. For example, train Integrated VLA on LIBERO-10: ```bash bash commands/train_vla.sh --config=config/libero/libero_10/train_integrated_vla.yaml ``` Note if you want to train the Hierarchical VLA, you should first train a Integrated VLA to get the high-level checkpoint, and then train the low-level action head with it. Please refer to our paper for more details. ## Training on Your Own Dataset Please refer to this [repo](https://github.com/Panda-Shawn/VLA-OS-Language-Planning-Labeling) for more instructions. ## Citation If you find our work helpful, please cite us: ```bibtex @article{gao2025vlaos, title = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models}, author = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin}, journal = {arXiv preprint arXiv:2506.17561}, year = {2025}, url = {https://arxiv.org/abs/2506.17561} } ``` Thank you! ## License All the code, model weights, and data are licensed under [MIT license](./LICENSE).