# [CVPR 2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
  ⭐️ Project   |   🤗 Hugging Face   |   🤖 ModelScope   |   🌎 Dataset   |   📑 Paper   |   💬 WeChat
  🎯 RoboOS (Coming Soon): An Efficient Open-Source Multi-Robot Coordination System for RoboBrain.
  🎯 ReasonRFT: Exploring a New RFT Paradigm to Enhance RoboBrain's Visual Reasoning Capabilities.
## 🔥 Overview
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: **(1) Planning Capability**, which involves decomposing complex manipulation instructions into manageable sub-tasks; **(2) Affordance Perception**, the ability to recognize and interpret the affordances of interactive objects; and **(3) Trajectory Prediction**, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
## 🚀 Features
This repository supports:
- **`Data Preparation`**: Please refer to [Dataset Preparation](https://github.com/FlagOpen/ShareRobot) for how to prepare the dataset.
- **`Training for RoboBrain`**: Please refer to the [Training Section](#Training) for the usage of the training scripts.
- **`Support HF/vLLM Inference`**: Please see the [Inference Section](#Inference); we now support inference with [vLLM](https://github.com/vllm-project/vllm).
- **`Evaluation for RoboBrain`**: Please refer to the [Evaluation Section](#Evaluation) for how to prepare the benchmarks.
- **`ShareRobot Generation`**: Please refer to [ShareRobot](https://github.com/FlagOpen/ShareRobot) for details.
## 🗞️ News
- **`2025-03-27`**: 🤗 We have released the [Planning Checkpoint](https://huggingface.co/BAAI/RoboBrain/) on Hugging Face.
- **`2025-03-26`**: 🔥 We have released the [RoboBrain](https://superrobobrain.github.io/) repository.
- **`2025-02-27`**: 🌍 Our [RoboBrain](https://superrobobrain.github.io/) was accepted to CVPR 2025.
## 📆 Todo
- [x] Release scripts for model training and inference.
- [x] Release Planning checkpoint.
- [ ] Release Affordance and Trajectory checkpoints.
- [ ] Release ShareRobot dataset.
- [ ] Release evaluation scripts for benchmarks.
- [ ] Train a more powerful RoboBrain (v2).
## 🤗 Models
- **[`Base Planning Model`](https://huggingface.co/BAAI/RoboBrain/)**: The model was trained on general datasets in Stages 1–2 and on the Robotic Planning dataset in Stage 3, which is designed for Planning prediction.
- **[`A-LoRA for Affordance`](https://github.com/FlagOpen/RoboBrain/)**: Based on the Base Planning Model, Stage 4 involves LoRA-based training with our Affordance dataset to predict affordance. *(Coming Soon)*
- **[`T-LoRA for Trajectory`](https://github.com/FlagOpen/RoboBrain/)**: Based on the Base Planning Model, Stage 4 involves LoRA-based training with our Trajectory dataset to predict trajectory. *(Coming Soon)*
| Models | Checkpoint | Description |
|----------------------|----------------------------------------------------------------|------------------------------------------------------------|
| Planning Model | [🤗 Planning CKPTs](https://huggingface.co/BAAI/RoboBrain/) | Used for Planning prediction in our paper |
| Affordance (A-LoRA) | [🤗 Affordance CKPTs](https://superrobobrain.github.io/) | Used for Affordance prediction in our paper *(Coming Soon)* |
| Trajectory (T-LoRA) | [🤗 Trajectory CKPTs](https://superrobobrain.github.io/) | Used for Trajectory prediction in our paper *(Coming Soon)* |
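Once the A-LoRA and T-LoRA checkpoints are released, attaching an adapter to the base planning model will presumably follow the standard PEFT pattern. A hedged sketch, assuming the adapters ship in PEFT format; `path/to/a_lora` is a placeholder until the release:

```python
# Hypothetical sketch: attach a LoRA adapter (e.g. A-LoRA) to the base planning
# model. Assumes PEFT-format adapters; "path/to/a_lora" is a placeholder.
import torch
from transformers import AutoModelForPreTraining
from peft import PeftModel

base = AutoModelForPreTraining.from_pretrained(
    "BAAI/RoboBrain", torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda:0")

# Wrap the base model with the adapter weights.
model = PeftModel.from_pretrained(base, "path/to/a_lora")
```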
## 🛠️ Setup
```bash
# clone repo.
git clone https://github.com/FlagOpen/RoboBrain.git
cd RoboBrain
# build conda env.
conda create -n robobrain python=3.10
conda activate robobrain
pip install -r requirements.txt
```
## 🤖 Training
### 1. Data Preparation
```bash
# Modify datasets for Stage 1, please refer to:
- yaml_path: scripts/train/yaml/stage_1_0.yaml
# Modify datasets for Stage 1.5, please refer to:
- yaml_path: scripts/train/yaml/stage_1_5.yaml
# Modify datasets for Stage 2_si, please refer to:
- yaml_path: scripts/train/yaml/stage_2_si.yaml
# Modify datasets for Stage 2_ov, please refer to:
- yaml_path: scripts/train/yaml/stage_2_ov.yaml
# Modify datasets for Stage 3_plan, please refer to:
- yaml_path: scripts/train/yaml/stage_3_planning.yaml
# Modify datasets for Stage 4_aff, please refer to:
- yaml_path: scripts/train/yaml/stage_4_affordance.yaml
# Modify datasets for Stage 4_traj, please refer to:
- yaml_path: scripts/train/yaml/stage_4_trajectory.yaml
```
**Note:** Each sample in the JSON files should follow this format:
```json
{
    "id": "xxxx",
    "image": [
        "image1.png",
        "image2.png"
    ],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nAre there numerous dials near the bottom left of the tv?"
        },
        {
            "from": "gpt",
            "value": "Yes. The sun casts shadows ... a serene, clear sky."
        }
    ]
}
```
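Since a single malformed sample can break a training run, it can help to sanity-check the JSON files before launching a stage. A minimal sketch, assuming the LLaVA-style convention that each listed image is referenced by an `<image>` token in the first human turn; `path/to/train.json` is a placeholder:

```python
# Minimal sanity check for a training JSON file; the path is a placeholder.
import json

with open("path/to/train.json") as f:
    samples = json.load(f)

for s in samples:
    assert "id" in s and "conversations" in s, f"missing keys: {s}"
    images = s.get("image", [])
    # Count <image> tokens in the first human turn and compare to the image list.
    n_tokens = s["conversations"][0]["value"].count("<image>")
    assert n_tokens == len(images), f"{s['id']}: {n_tokens} <image> tokens vs {len(images)} images"

print(f"Checked {len(samples)} samples.")
```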
### 2. Training
```bash
# Training on Stage 1:
bash scripts/train/stage_1_0_pretrain.sh
# Training on Stage 1.5:
bash scripts/train/stage_1_5_direct_finetune.sh
# Training on Stage 2_si:
bash scripts/train/stage_2_0_resume_finetune_si.sh
# Training on Stage 2_ov:
bash scripts/train/stage_2_0_resume_finetune_ov.sh
# Training on Stage 3_plan:
bash scripts/train/stage_3_0_resume_finetune_robo.sh
# Training on Stage 4_aff:
bash scripts/train/stage_4_0_resume_finetune_lora_a.sh
# Training on Stage 4_traj:
bash scripts/train/stage_4_0_resume_finetune_lora_t.sh
```
**Note:** Please update the environment variables (e.g. *DATA_PATH*, *IMAGE_FOLDER*, *PREV_STAGE_CHECKPOINT*) in each script to match your own paths.
### 3. Convert original weights to HF weights
```bash
python scripts/infer/convert_robobrain_to_hf.py --model_dir /path/to/original/checkpoint/ --dump_path /path/to/output/
```
## 🤖 Inference
### Option 1: HF inference
#### Example Python script:
```python
import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "BAAI/RoboBrain"

print("Loading Checkpoint ...")
model = AutoModelForPreTraining.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda:0")
processor = AutoProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each value in "content" has to be a list of dicts with types ("text", "image").
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        ],
    },
]

print("Processing input...")
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}

print("Generating output...")
output = model.generate(**inputs, max_new_tokens=250)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```
### Option 2: vLLM inference
#### Install and launch vLLM
```bash
# Install the vLLM package.
pip install vllm==0.6.6.post1

# Launch RoboBrain with vLLM.
python -m vllm.entrypoints.openai.api_server --model BAAI/RoboBrain --served-model-name robobrain --max_model_len 16384 --limit_mm_per_prompt image=8
```
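Before sending requests, a quick way to confirm the server is up is to list the served models. A minimal check, assuming the default port and the same placeholder API key as the script below (vLLM does not enforce the key unless configured to):

```python
# Quick liveness check against the local vLLM server.
from openai import OpenAI

client = OpenAI(api_key="robobrain-123123", base_url="http://127.0.0.1:8000/v1")
print([m.id for m in client.models.list()])  # expected: ['robobrain']
```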
#### Example Python script:
```python
from openai import OpenAI

openai_api_key = "robobrain-123123"
openai_api_base = "http://127.0.0.1:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

response = client.chat.completions.create(
    model="robobrain",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                    },
                },
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
    ],
)

content = response.choices[0].message.content
print(content)
```
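For images stored on disk rather than at a URL, the same endpoint accepts base64-encoded data URLs. A minimal sketch, assuming the server above is running; `robot_scene.jpg` is a placeholder path:

```python
# Send a local image as a base64 data URL; "robot_scene.jpg" is a placeholder.
import base64
from openai import OpenAI

client = OpenAI(api_key="robobrain-123123", base_url="http://127.0.0.1:8000/v1")

with open("robot_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="robobrain",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```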
## 🤖 Evaluation
*Coming Soon ...*
## 😊 Acknowledgement
We would like to express our sincere gratitude to the developers and contributors of the following projects:
1. [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT): The comprehensive codebase for training Vision-Language Models (VLMs).
2. [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval): A powerful evaluation tool for Vision-Language Models (VLMs).
3. [vllm](https://github.com/vllm-project/vllm): A high-throughput and memory-efficient inference engine for LLMs/VLMs.
4. [OpenEQA](https://github.com/facebookresearch/open-eqa): A wonderful benchmark for Embodied Question Answering.
5. [RoboVQA](https://github.com/google-deepmind/robovqa): Provides high-level reasoning models and datasets for robotics applications.
Their outstanding contributions have played a pivotal role in advancing our research and development initiatives.
## 📑 Citation
If you find this project useful, please consider citing our work.
```bib
@article{ji2025robobrain,
  title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
  author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
  journal={arXiv preprint arXiv:2502.21257},
  year={2025}
}
```