# Xiaomi-Robotics-0 **Repository Path**: fengcha007/Xiaomi-Robotics-0 ## Basic Information - **Project Name**: Xiaomi-Robotics-0 - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-02-12 - **Last Updated**: 2026-02-12 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README
# Xiaomi-Robotics-0 **An Open-Sourced Vision-Language-Action Model with Real-Time Inference** [![Paper](https://img.shields.io/badge/๐Ÿ“„-Paper-red)](https://xiaomi-robotics-0.github.io/assets/paper.pdf) [![Project Page](https://img.shields.io/badge/๐ŸŒ-Project_Page-blue)](https://xiaomi-robotics-0.github.io/) [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97-Hugging%20Face-yellow)](https://huggingface.co/collections/XiaomiRobotics/xiaomi-robotics-0) [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
--- ## ๐Ÿ’ก About Xiaomi-Robotics-0 **Xiaomi-Robotics-0** is a state-of-the-art **Vision-Language-Action (VLA) model** with 4.7B parameters, specifically engineered for high-performance robotic reasoning and seamless real-time execution. ### Key Features: * **๐Ÿง  Strong Generalization**: Pre-trained on diverse cross-embodiment trajectories and VL data to handle complex, unseen tasks. * **๐Ÿš€ Real-Time Ready**: Optimized with asynchronous execution to minimize inference latency. * **๐Ÿ› ๏ธ Flexible Deployment**: Fully compatible with the Hugging Face `transformers` ecosystem and optimized for consumer GPUs. ## ๐Ÿ“… Updates - **[Feb 2026]** ๐ŸŽ‰ Released the **Technical Report**. - **[Feb 2026]** ๐Ÿ”ฅ Released **Pre-trained weights** and **Fine-tuned weights** for LIBERO, CALVIN, and SimplerEnv. - **[Feb 2026]** ๐Ÿ’ป Inference code and evaluation scripts are now live! --- ## ๐Ÿ† Benchmark We evaluate **Xiaomi-Robotics-0** on three standard simulation benchmarks: **CALVIN**, **LIBERO**, and **SimplerEnv**. The table below summarizes the performance results across different embodiments and datasets. For each setting, we provide the corresponding fine-tuned checkpoint and a guide for running the evaluation. | | ๐Ÿค— Name on Hugging Face | Description | Performance | Evaluation Guide | | :------------- | :----------------------------------------------------------- | :-------------------------------- | :--------------------------------- | :------------------------------------------- | | **LIBERO** | [**Xiaomi-Robotics-0-LIBERO**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0-LIBERO) | Fine-tuned on four LIBERO suites. | **98.7%** (Avg Success) | [LIBERO Eval](eval_libero/README.md) | | **CALVIN** | [**Xiaomi-Robotics-0-Calvin-ABCD_D**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0-Calvin-ABCD_D) | Fine-tuned on ABCDโ†’D Split. | **4.80** (Avg Length) | [CALVIN Eval](eval_calvin/README.md) | | | [**Xiaomi-Robotics-0-Calvin-ABC_D**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0-Calvin-ABC_D) | Fine-tuned on ABCโ†’D Split. | **4.75** (Avg Length) | [CALVIN Eval](eval_calvin/README.md) | | **SimplerEnv** | [**Xiaomi-Robotics-0-SimplerEnv-Google-Robot**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0-SimplerEnv-Google-Robot) | Fine-tuned on Fractal dataset. | **85.5%** (VM)
**74.7%** (VA) | [SimplerEnv Eval](eval_simplerenv/README.md) | | | [**Xiaomi-Robotics-0-SimplerEnv-WidowX**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0-SimplerEnv-WidowX) | Fine-tuned on Bridge dataset. | **79.2%** | [SimplerEnv Eval](eval_simplerenv/README.md) | | **Base** | [**Xiaomi-Robotics-0**](https://huggingface.co/XiaomiRobotics/Xiaomi-Robotics-0) | Pre-trained model. | - | - | ## ๐Ÿš€ Quick Start: Installation & Deployment Our project relies primarily on HuggingFace Transformers ๐Ÿค—, making deployment extremely easy. If your environment supports transformers >= 4.57.1, you can use our project seamlesslyโ€”we recommend PyTorch 2.8.0 (paired with torchvision 0.23.0 and torchaudio 2.8.0), as this combination has been fully tested by our team and ensures optimal compatibility. ### 1๏ธโƒฃ Installation Guides Hereโ€™s a simple installation guide to get you started: ```bash git clone https://github.com/XiaomiRobotics/Xiaomi-Robotics-0 cd Xiaomi-Robotics-0 # Create a Conda environment with Python 3.12 conda create -n mibot python=3.12 -y conda activate mibot # Install PyTorch pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128 # Install transformers pip install transformers==4.57.1 # Install flash-attn pip uninstall -y ninja && pip install ninja pip install flash-attn==2.8.3 --no-build-isolation # or pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl sudo apt-get install -y libegl1 libgl1 libgles2 ``` ### 2๏ธโƒฃ Deployment Guides **Xiaomi-Robotics-0** is deployed on top of the HuggingFace Transformers ๐Ÿค— ecosystem, enabling straightforward deployment for robotic manipulation tasks. By leveraging Flash Attention 2 and bfloat16 precision, the model can be loaded and run efficiently on consumer-grade GPUs. ``` python import torch from transformers import AutoModel, AutoProcessor # 1. Load model and processor model_path = "XiaomiRobotics/Xiaomi-Robotics-0-LIBERO" model = AutoModel.from_pretrained( model_path, trust_remote_code=True, attn_implementation="flash_attention_2", dtype=torch.bfloat16 ).cuda().eval() processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=False) # 2. Construct the prompt with multi-view inputs language_instruction = "Pick up the red block." instruction = ( f"<|im_start|>user\nThe following observations are captured from multiple views.\n" f"# Base View\n<|vision_start|><|image_pad|><|vision_end|>\n" f"# Left-Wrist View\n<|vision_start|><|image_pad|><|vision_end|>\n" f"Generate robot actions for the task:\n{language_instruction} /no_cot<|im_end|>\n" f"<|im_start|>assistant\n<|im_end|>\n" ) # 3. Prepare inputs # Assuming `image_base`, `image_wrist`, and `proprio_state` are already loaded inputs = processor( text=[instruction], images=[image_base, image_wrist], # [PIL.Image, PIL.Image] videos=None, padding=True, return_tensors="pt", ).to(model.device) # Add proprioceptive state and action mask robot_type = "libero" inputs["state"] = torch.from_numpy(proprio_state).to(model.device, model.dtype).view(1, 1, -1) inputs["action_mask"] = processor.get_action_mask(robot_type).to(model.device, model.dtype) # 4. Generate action with torch.no_grad(): outputs = model(**inputs) # Decode raw outputs into actionable control commands action_chunk = processor.decode_action(outputs.actions, robot_type=robot_type) print(f"Generated Action Chunk Shape: {action_chunk.shape}") ``` ## ๐Ÿ“š Citation If you find this project useful, please consider citing: ```bibtex @misc{robotics2026xiaomi, title = {Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution}, author = {Xiaomi Robotics}, howpublished={\url{https://xiaomi-robotics-0.github.io}}, year = {2026}, note={Project Website} } ``` ## ๐Ÿ“„ License This project is licensed under the [Apache License 2.0](LICENSE).