# 🤖 X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model

### 🏅 Champion @ AgiBot World Challenge @ IROS 2025

The **X-VLA** (Cross-Embodiment Vision-Language-Action) model introduces a unified **soft-prompted Transformer** architecture that achieves **scalable and generalizable control** across heterogeneous robotic embodiments. By **decoupling the core policy model from embodiment-specific details**, X-VLA enables robust, high-performance deployment in both simulation and real-world robotic systems.

| 📄 **Paper** | 🌐 **Project Page** | 🤗 **Hugging Face** |
| :---: | :---: | :---: |
| [Read the Full Research](https://arxiv.org/pdf/2510.10274) | [Explore the Demos](https://thu-air-dream.github.io/X-VLA/) | [Access Models & Datasets](https://huggingface.co/collections/2toINF/x-vla) |

---

## 🧩 Overview

Successful generalist **Vision–Language–Action (VLA)** models depend on scalable, cross-platform training across diverse robotic embodiments. To leverage the heterogeneity of large-scale robot datasets, **X-VLA** introduces a **soft prompt** mechanism — embodiment-specific learnable embeddings that guide a unified Transformer backbone toward effective multi-domain policy learning.

The resulting architecture — **X-VLA-0.9B** — achieves **state-of-the-art generalization** across six simulation platforms and three real-world robots, surpassing prior VLA approaches in dexterity, adaptability, and efficiency.

https://github.com/user-attachments/assets/c047bac4-17c3-4d66-8036-badfab2b8c41

---

## 🚀 Quick Start: Installation & Deployment

### 1️⃣ Installation

```bash
# Clone the repository
git clone https://github.com/2toinf/X-VLA.git
cd X-VLA

# Create and activate the Conda environment
conda create -n XVLA python=3.10 -y
conda activate XVLA

# Install dependencies
pip install -r requirements.txt
```

---

### 2️⃣ Deploying X-VLA for Inference

X-VLA adopts a **Server–Client** architecture to separate the model environment from simulation or robot-specific dependencies. This design avoids package conflicts and supports distributed inference across GPUs, SLURM clusters, and edge devices.

#### 🧠 Available Pre-trained Models

- [ ] We observed a slight performance drop (around 1% across different datasets) after converting our models to the HF format, and we’re actively investigating the cause.

| Model ID | Embodiment | Description | Performance | Evaluation Guidance |
| :--- | :--- | :--- | :---: | :---: |
| [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) | Foundation | Pretrained on large-scale heterogeneous robot–vision–language datasets for general transfer. | — | — |
| [`2toINF/X-VLA-AgiWorld-Challenge`](https://huggingface.co/2toINF/X-VLA-AgiWorld-Challenge) | Agibot-G1 | Fine-tuned for the AgiWorld Challenge. | **Champion 🥇** | — |
| [`2toINF/X-VLA-Calvin-ABC_D`](https://huggingface.co/2toINF/X-VLA-Calvin-ABC_D) | Franka | Fine-tuned on the CALVIN benchmark (ABC_D split). | **4.41** | [CALVIN Eval](evaluation/calvin/README.md) |
| [`2toINF/X-VLA-Google-Robot`](https://huggingface.co/2toINF/X-VLA-Google-Robot) | Google Robot | Fine-tuned on the large-scale Google Robot dataset. | **80.4% (VM), 75.7% (VA)** | [Simpler Eval](evaluation/simpler/README.md) |
| [`2toINF/X-VLA-Libero`](https://huggingface.co/2toINF/X-VLA-Libero) | Franka | Fine-tuned on the LIBERO benchmark. | **98.1%** | [LIBERO Eval](evaluation/libero/README.md) |
| [`2toINF/X-VLA-VLABench`](https://huggingface.co/2toINF/X-VLA-VLABench) | Franka | Fine-tuned on the VLABench benchmark. | **51.1 (score)** | To be updated |
| [`2toINF/X-VLA-RoboTwin2`](https://huggingface.co/2toINF/X-VLA-RoboTwin2) | Agilex | Trained on the RoboTwin2 dataset for dual-arm coordinated manipulation (50 demos per task). | **70%** | [RoboTwin 2.0 Eval](evaluation/robotwin-2.0/README.md) |
| [`2toINF/X-VLA-Simpler-WidowX`](https://huggingface.co/2toINF/X-VLA-WidowX) | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | **95.8%** | [Simpler Eval](evaluation/simpler/README.md) |
| [`2toINF/X-VLA-SoftFold`](https://huggingface.co/2toINF/X-VLA-SoftFold) | Agilex | Fine-tuned on the Soft-Fold dataset; specialized in deformable-object manipulation (e.g., folding and cloth control). | Cloth folding with a 100% success rate in 2 hours. | [SoftFold-Agilex](evaluation/SoftFold-Agilex/readme.md) |

---

## 🧩 Notes

- All models share a consistent architecture: `configuration_xvla.py`, `modeling_xvla.py`, and a unified tokenizer (`tokenizer.json`).
- The **X-VLA-Pt** model is the *foundation checkpoint*, trained across multiple robot domains.
- Each embodiment is fine-tuned for its respective environment while retaining cross-embodiment alignment.
- Evaluation scripts (in `evaluation/`) follow a standardized format for reproducible benchmarking.

---

> 📊 Performance metrics follow the standard evaluation protocols detailed in the [paper](https://arxiv.org/pdf/2510.10274).
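Before launching the server (next section), you may want to fetch a checkpoint to local storage, which is useful for offline or SLURM-node deployments without runtime Hub access. A minimal sketch using `huggingface_hub` (the local directory path is illustrative):

```python
from huggingface_hub import snapshot_download

# Download a released checkpoint (any Model ID from the table above)
# so that offline servers can load it from a local path instead of the Hub.
local_path = snapshot_download(
    repo_id="2toINF/X-VLA-WidowX",           # same checkpoint as the server example below
    local_dir="./checkpoints/X-VLA-WidowX",  # illustrative location
)
print(f"Checkpoint downloaded to: {local_path}")
```

The downloaded folder can then be passed to `AutoModel.from_pretrained` in place of the Hub ID.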
---

### 3️⃣ Launching the Inference Server

```python
from transformers import AutoModel, AutoProcessor
import json_numpy

# Load the model and processor
model = AutoModel.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)

# Start the inference server
print("🚀 Starting X-VLA inference server...")
model.run(processor, host="0.0.0.0", port=8000)
```

Once launched, the API endpoint is available at:

```
POST http://<server-host>:8000/infer
```

---

### 4️⃣ Client Interaction & Action Prediction

The client communicates via HTTP POST, sending multimodal data (vision + language + proprioception) as a JSON payload.

#### Payload Structure

| Key | Type | Description |
| :--- | :--- | :--- |
| `proprio` | `json_numpy.dumps(array)` | Current proprioceptive state (e.g., joint positions). |
| `language_instruction` | `str` | Task instruction (e.g., "Pick up the red block"). |
| `image0` | `json_numpy.dumps(array)` | Primary camera image (RGB). |
| `image1`, `image2` | *optional* | Additional camera views, if applicable. |
| `domain_id` | `int` | Identifier for the current robotic embodiment/domain. |
| `steps` | `int` | Number of denoising steps for flow-matching-based generation (e.g., 10). |

#### Example Client Code

```python
import requests
import numpy as np
import json_numpy

server_url = "http://localhost:8000/infer"
timeout = 5

# Prepare inputs
proprio = np.zeros(7, dtype=np.float32)
image = np.zeros((256, 256, 3), dtype=np.uint8)
instruction = "Move the gripper to the target position"

payload = {
    "proprio": json_numpy.dumps(proprio),
    "language_instruction": instruction,
    "image0": json_numpy.dumps(image),
    "domain_id": 0,
    "steps": 10
}

try:
    response = requests.post(server_url, json=payload, timeout=timeout)
    response.raise_for_status()
    result = response.json()
    actions = np.array(result["action"], dtype=np.float32)
    print(f"✅ Received {actions.shape[0]} predicted actions.")
except Exception as e:
    print(f"⚠️ Request failed: {e}")
    # Fallback: zero action chunk (10 steps × 20-D)
    actions = np.zeros((10, 20), dtype=np.float32)
```

#### Expected Output

```
[Server] Model loaded successfully on cuda:0
[Server] Listening on 0.0.0.0:8000
[Client] Sending observation to server...
✅ Received 10 predicted actions.
```

---

### 5️⃣ Standardized Control Interface: EE6D

To ensure consistency across embodiments, **X-VLA** adopts a unified **EE6D (End-Effector 6D)** control space.

| Component | Specification | Notes |
| :--- | :--- | :--- |
| **Proprio Input** | Current EE6D pose (position + orientation) | Must align with training-space normalization. |
| **Action Output** | Predicted target delta/absolute pose (EE6D) | Executed by the downstream controller. |
| **Dimensionality** | 20-D vector = 3 (EE position) + 6 (rotation in 6D) + 1 (gripper) + 10 (padding) | |
| **Single-arm Case** | If only one arm exists, pad with zeros to maintain the 20-D vector. | |

> ⚙️ **Reference Post-processing:**
>
> ```python
> import numpy as np
> from datasets.utils import rotate6d_to_xyz
>
> # action_pred: one predicted 20-D action in the EE6D layout above
> action_final = np.concatenate([
>     action_pred[:3],
>     rotate6d_to_xyz(action_pred[3:9]),
>     np.array([1.0 if action_pred[9] > 0.5 else 0.0])
> ])
> ```
>
> When feeding proprioception to the model, apply the **inverse transformation** accordingly.
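As a concrete illustration of that inverse step, the sketch below converts an end-effector pose given as position plus xyz Euler angles into the 3 + 6 + 1 (+ padding) layout described above. The helpers `xyz_to_rotate6d` and `build_proprio` are not part of this repository, SciPy is assumed, and the exact 6D flattening order and per-embodiment proprio layout should be verified against `datasets.utils.rotate6d_to_xyz` and the reference client for your robot:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def xyz_to_rotate6d(euler_xyz: np.ndarray) -> np.ndarray:
    """Euler xyz angles -> 6D rotation representation (first two columns of the
    rotation matrix, flattened). Verify the column/row convention against
    datasets.utils.rotate6d_to_xyz before relying on this."""
    rot = R.from_euler("xyz", euler_xyz).as_matrix()                  # (3, 3)
    return rot[:, :2].reshape(-1, order="F").astype(np.float32)       # assumed ordering

def build_proprio(ee_pos, ee_euler_xyz, gripper) -> np.ndarray:
    """Assemble a single-arm proprio vector in the 20-D EE6D layout:
    3 (position) + 6 (rotation 6D) + 1 (gripper) + 10 (zero padding)."""
    return np.concatenate([
        np.asarray(ee_pos, dtype=np.float32),        # 3: EE position
        xyz_to_rotate6d(np.asarray(ee_euler_xyz)),   # 6: rotation (6D)
        np.asarray([gripper], dtype=np.float32),     # 1: gripper state
        np.zeros(10, dtype=np.float32),              # 10: padding (single-arm case)
    ])
```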
---

### 6️⃣ Reference Client Implementations

Each released model ships with a corresponding **reference client** (`client.py`) under its benchmark directory in [`evaluation/`](evaluation/) for reproducing the exact deployment behavior. We strongly recommend adapting these clients when connecting to physical or simulated robots.

---

### 7️⃣ SLURM & Cluster Deployment

For large-scale or distributed training/deployment (e.g., HPC clusters, AgiBot nodes):

```bash
python -m deploy --model_path /path/to/your/model
```

This script automatically detects SLURM environment variables, launches distributed servers, and writes connection metadata to `info.json`.

---

## ⚙️ Training / Fine-tuning on Custom Data

X-VLA supports fine-tuning on new demonstrations via a modular and extensible dataset interface.

### Data Preparation Workflow

1. **Prepare Meta JSONs** — each domain has a `meta.json` listing its trajectory file paths.
2. **Implement a Custom Handler** — write a domain loader class that exposes an `iter_episode(traj_idx)` generator (see the sketch below the handler table).
3. **Register the Domain** — update:
   * `datasets/domain_handler/registry.py`
   * `datasets/domain_config.py`

### Example Handlers

| Handler | Dataset | Description |
| :--- | :--- | :--- |
| `"lerobot"` | Agibot-Beta | Optimized for the LeRobot format |
| `"h5py"` | RoboMind / Simulation | Efficient loading from `.h5` trajectories |
| `"scattered"` | AGIWorld | Handles scattered trajectory storage |
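The sketch below illustrates steps 1 and 2 for a hypothetical `h5py`-backed domain. The class name, file layout, and per-step keys are illustrative only; copy the actual base class and sample format from an existing handler in `datasets/domain_handler/` before registering the new domain in `registry.py` and `domain_config.py`:

```python
# Hypothetical handler, e.g. datasets/domain_handler/my_domain.py
import json
import h5py
import numpy as np

class MyDomainHandler:
    """Minimal sketch of a custom domain loader (keys and layout are illustrative)."""

    def __init__(self, meta_path: str):
        # meta.json lists the trajectory files belonging to this domain (step 1).
        with open(meta_path) as f:
            self.traj_files = json.load(f)

    def __len__(self) -> int:
        return len(self.traj_files)

    def iter_episode(self, traj_idx: int):
        """Yield per-step training samples for one trajectory (step 2)."""
        with h5py.File(self.traj_files[traj_idx], "r") as traj:
            num_steps = traj["action"].shape[0]
            instruction = traj.attrs.get("language_instruction", "")
            for t in range(num_steps):
                yield {
                    "image0": np.asarray(traj["observation/image0"][t]),
                    "proprio": np.asarray(traj["observation/proprio"][t]),
                    "action": np.asarray(traj["action"][t]),
                    "language_instruction": instruction,
                }
```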
---

### Launch Training with Accelerate

```bash
accelerate launch \
    --mixed_precision bf16 \
    train.py \
    --models '2toINF/X-VLA-Pt' \
    --train_metas_path /path/to/meta_files.json \
    --learning_rate 1e-4 \
    --learning_coef 0.1 \
    --iters 50000 \
    --freeze_steps 1000 \
    --warmup_steps 2000
```

| Argument | Description |
| :--- | :--- |
| `--models` | Base model (e.g., `'2toINF/X-VLA-Pt'`) |
| `--train_metas_path` | Path to meta JSON file(s) |
| `--batch_size` | Batch size |
| `--learning_rate` | Base LR |
| `--learning_coef` | LR multiplier for soft prompts |
| `--iters` | Total training iterations |
| `--freeze_steps` | Steps to freeze backbone |
| `--warmup_steps` | Warmup iterations |

---

## 📚 Citation

If you use X-VLA in your research, please cite:

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

---

## 🪪 License

This repository is licensed under the **Apache License 2.0**. You may freely use, modify, and distribute the code under the terms of the license.

```
Copyright 2025 2toINF (https://github.com/2toinf)
Licensed under the Apache License, Version 2.0.
```

---

**Maintained by [2toINF](https://github.com/2toinf)**

💬 Feedback, issues, and contributions are welcome via GitHub Discussions or Pull Requests.