# SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation

[Cheng-Chun Hsu](https://chengchunhsu.github.io/), [Bowen Wen](https://research.nvidia.com/person/bowen-wen), [Jie Xu](https://research.nvidia.com/person/jie-xu), [Yashraj Narang](https://research.nvidia.com/person/yashraj-narang), [Xiaolong Wang](https://research.nvidia.com/labs/lpr/author/xiaolong-wang/), [Yuke Zhu](https://research.nvidia.com/person/yuke-zhu), [Joydeep Biswas](https://www.joydeepb.com/), [Stan Birchfield](https://research.nvidia.com/person/stan-birchfield)

ICRA 2025

[Project](https://nvlabs.github.io/object_centric_diffusion/) | [Paper](https://arxiv.org/abs/2411.00965)

## Abstract

We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We systematically evaluate our method on simulation and real-world tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints.
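To make the object-centric representation concrete, here is a minimal sketch (not part of the released code) of how an object pose trajectory can be expressed relative to the target object, assuming poses are given as 4x4 homogeneous SE(3) matrices. The helper `pose_in_target_frame` and the variable names are illustrative only.

```
import numpy as np

def pose_in_target_frame(T_obj_world, T_target_world):
    """Re-express an object pose (4x4 SE(3) matrix) in the target object's frame."""
    return np.linalg.inv(T_target_world) @ T_obj_world

# Illustrative inputs: a short trajectory of object poses and one target pose, all in the world frame.
T_target_world = np.eye(4)
object_trajectory_world = [np.eye(4) for _ in range(16)]

# The object-centric representation: the object pose trajectory expressed relative to the target,
# which is what conditions the diffusion policy.
object_trajectory_rel = [pose_in_target_frame(T, T_target_world) for T in object_trajectory_world]
```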
## Installation

The codebase has been thoroughly tested on a desktop running Ubuntu 22 with an RTX 4090 GPU.

### Environment Setup

Create the conda environment
```
conda create -n spot python=3.8
conda activate spot
```

Install dependencies (for FoundationPose compilation)
```
# Install eigen library
conda install conda-forge::eigen=3.4.0

# Install gcc and cuda
conda install gcc_linux-64 gxx_linux-64
conda install cuda -c nvidia/label/cuda-12.1.0
conda install nvidia/label/cuda-12.1.0::cuda-cudart
conda install cmake

# Install boost library
sudo apt install libboost-all-dev
conda install conda-forge::boost
```

Install PyTorch and PyTorch3D
```
conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install pytorch3d -c pytorch3d
```

### RLBench Installation

Install CoppeliaSim v4.1.0 (see [here](https://github.com/stepjam/PyRep#install) for details)
```
# set env variables
export COPPELIASIM_ROOT=${HOME}/CoppeliaSim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT

wget https://downloads.coppeliarobotics.com/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
mkdir -p $COPPELIASIM_ROOT && tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz -C $COPPELIASIM_ROOT --strip-components 1
rm -rf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
```

Install PyRep, YARR, and RLBench (PerAct's branch)
```
git clone https://github.com/MohitShridhar/PyRep.git
cd PyRep
pip3 install -r requirements.txt
pip3 install .
cd ..

git clone https://github.com/stepjam/YARR.git
cd YARR
pip3 install -r requirements.txt
pip3 install .
cd ..

git clone https://github.com/MohitShridhar/RLBench.git -b peract
cd RLBench
pip3 install -r requirements.txt
pip3 install .
cd ..
```

### FoundationPose and Dependencies Setup

Install Python dependencies
```
pip install -r fp_requirements.txt
python -m pip install --quiet --no-cache-dir git+https://github.com/NVlabs/nvdiffrast.git
pip install -r dp3_requirements.txt
```

Compile FoundationPose's extensions
```
cd foundation_pose
CMAKE_PREFIX_PATH=$CONDA_PREFIX/lib/python3.8/site-packages/pybind11/share/cmake/pybind11 bash build_all_conda.sh
cd ..
```

Download the model weights from [here](https://drive.google.com/drive/folders/1DFezOAD0oD1BblsXVxqDsl8fj0qzB82i?usp=sharing) or from the [FoundationPose repo](https://github.com/NVlabs/FoundationPose?tab=readme-ov-file), and place them under `data/model_weight/foundation_pose`.

## Training and Evaluation on RLBench

Our dataset generation is based on PerAct's pre-generated datasets. We replay the demonstrations to collect object pose information for policy training.

### Requirements

Download [PerAct's pre-generated datasets](https://drive.google.com/drive/folders/0B2LlLwoO3nfZfkFqMEhXWkxBdjJNNndGYl9uUDQwS1pfNkNHSzFDNGwzd1NnTmlpZXR1bVE?resourcekey=0-jRw5RaXEYRLe2W6aNrNFEQ) for the train (100 episodes), validation (25 episodes), and test (25 episodes) splits (check [PerAct's repo](https://github.com/peract/peract?tab=readme-ov-file#pre-generated-datasets) for details). The task list can be found in our paper.

For reference, I stored the dataset as:
```
[PERACT_DATASET_PATH]
└─ raw
   └─ train
      └─ [TASK_1]
      └─ [TASK_2]
      └─ ...
   └─ val
      └─ [TASK_1]
      └─ [TASK_2]
      └─ ...
   └─ test
      └─ [TASK_1]
      └─ [TASK_2]
      └─ ...
```

Download [RLBench's object meshes](https://drive.google.com/drive/folders/1bupiLa2akr2sytb7jcULnU_6ed4OOgkw?usp=sharing) or manually export the object meshes from CoppeliaSim.

### Dataset Generation

Set up the arguments in `scripts/gen_demonstration_rlbench.sh`.
- `--peract_demo_dir` specifies the path where PerAct's demos are stored, e.g., `[PERACT_DATASET_PATH]/raw`.
- `--save_path` specifies the path to store the generated dataset.

Run the script to collect demonstrations from RLBench for all tasks.
```
bash scripts/gen_demonstration_rlbench.sh
```

### Policy Training

- Set `dataset.root_dir` in `config/task/rlbench_multi.yaml` to the path of the generated demonstrations, i.e., the `--save_path` used in the script `scripts/gen_demonstration_rlbench.sh`.
- (Optional) Modify `self.task_list` in `diffusion_policy_3d/dataset/rlbench_dataset_list.py` if you want to select your own task suite.
- (Optional) For single-task training, set `dataset.root_dir` in `config/task/rlbench/[TASK_NAME].yaml` instead.

Run the script for training:
```
# Train on all tasks
bash scripts/train_policy.sh rlbench_multi

# Train on single task
bash scripts/train_policy.sh rlbench/[TASK_NAME]
```

### Policy Evaluation

- Set `pose_estimation.mesh_dir` in `config/simple_dp3.yaml`, ensuring the path leads to the downloaded RLBench mesh files.
- (Optional) Modify the task list in `scripts/eval_policy_multi.sh` if you want to select your own task suite.
- (Optional) Set `env_runner.root_dir` in `config/task/rlbench/[TASK_NAME].yaml` to the path of the generated demonstrations, i.e., the `--save_path` used in the script `scripts/gen_demonstration_rlbench.sh`.

Run the script for evaluation:
```
# Evaluate on all tasks
bash scripts/eval_policy_multi.sh

# Evaluate on single task
bash scripts/eval_policy.sh rlbench/[TASK_NAME]
```

- Note: The paper's results are based on an internal version of FoundationPose that cannot be publicly released due to legal restrictions. Instead, we reference the public version of FoundationPose. Our testing showed no performance degradation on the RLBench benchmark (see [here](https://github.com/NVlabs/FoundationPose?tab=readme-ov-file#notes) for more information).

---

## Training and Deployment on Real Robot

In this section, I describe my workflow for real-world experiments. It should serve only as a reference, and I recommend that readers use whatever tools they are familiar with. I used only one iPhone 12 Pro for the entire data collection process.

### Environment Setup

- This guide assumes that the conda environment `spot` has been configured according to the instructions in [Installation](https://github.com/NVlabs/object_centric_diffusion#Installation).
- (Optional) Set up another conda environment named `yolo_world` following [YOLO-World Installation](https://github.com/AILab-CVC/YOLO-World#1-installation).
  - YOLO-World is used to obtain object masks during training (see the script `env_real/data/prepare_mask.py`) and deployment. If you use a different object detection/segmentation model, you can skip this step.

### Dataset Collection

For policy training and deployment, we need the following:
- **Object mesh** for pose tracking
- **Task demonstration video** for policy training

For each task, the object meshes are the reconstructed meshes of the graspable object (e.g., pitcher) and the target object (e.g., cup). The task demonstration is an RGBD video in which a human hand performs the task (e.g., pouring water).

1. Object mesh scanning

Use [AR Code](https://apps.apple.com/us/app/ar-code-object-capture-3d-scan) to scan both the graspable and target objects. Export the mesh in `.usdz` format. Uncompress the `.usdz` file to obtain the `.usdc` mesh file, then convert the `.usdc` file to `.obj` format. I personally use Blender for this conversion process.
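If you prefer to script the conversion, the snippet below is a minimal sketch using Blender's Python API in background mode. It assumes Blender 3.2 or newer (where the `wm.usd_import` and `wm.obj_export` operators are available); the file paths are placeholders, and any other conversion tool works just as well.

```
# convert_usdc_to_obj.py -- run with: blender --background --python convert_usdc_to_obj.py
import bpy

# Start from an empty scene so only the imported object ends up in the export.
bpy.ops.wm.read_factory_settings(use_empty=True)

# Import the uncompressed USD mesh and export it as OBJ (placeholder file names).
bpy.ops.wm.usd_import(filepath="pitcher.usdc")
bpy.ops.wm.obj_export(filepath="pitcher.obj")
```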
2. Task demonstration collection

Use [Record3D](https://apps.apple.com/us/app/record3d-3d-videos) to shoot a demonstration video of a single human hand performing the task. Use the export option "EXR + JPG sequence" to get the `.r3d` file.

After completing steps 1 and 2, place the mesh files in the `mesh` directory and the `.r3d` files in the `r3d` directory. The files should be stored as follows:
```
[TASK_DATASET_PATH]
└─ mesh
   └─ pitcher
      └─ pitcher.obj
   └─ cup
      └─ cup.obj
└─ r3d
   └─ 2024-09-07--01-11-23.r3d
   └─ 2024-09-07--01-11-41.r3d
   └─ ...
```

### Dataset Post-Processing

Set up the task name and object names using `real_task_object_dict` in `env_real/utils/realworld_objects.py` (an illustrative entry is sketched after this list).
- The `grasp_object_name` and `target_object_name` should be consistent with the folder names under `[TASK_DATASET_PATH]/mesh`.
- The `grasp_object_prompt` and `target_object_prompt` are the prompts for the object detection/segmentation model (in this case, YOLO-World) to obtain the object bounding box/mask for tracking.
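For illustration, an entry for a pouring task might look like the sketch below. The exact structure of `real_task_object_dict` in the repository may differ; the task key and prompt strings are placeholders to adapt to your own setup.

```
# env_real/utils/realworld_objects.py -- illustrative entry only; adapt to your task and objects.
real_task_object_dict = {
    "pour_water": {
        "grasp_object_name": "pitcher",    # must match the folder name under [TASK_DATASET_PATH]/mesh
        "target_object_name": "cup",       # must match the folder name under [TASK_DATASET_PATH]/mesh
        "grasp_object_prompt": "pitcher",  # text prompt for YOLO-World detection/segmentation
        "target_object_prompt": "cup",     # text prompt for YOLO-World detection/segmentation
    },
}
```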
Run the script for dataset generation:
```
bash scripts/gen_demonstration_real.sh
```

The generated dataset will be saved under `[TASK_DATASET_PATH]/zarr`.

### Policy Training

To train the policy, set `dataset.root_dir` to `[TASK_DATASET_PATH]` in the config file (see `config/task/_real_world_task_template.yaml` for details).

Run the script for training:
```
bash scripts/train_policy.sh [TASK_NAME]
```

## Troubleshooting

`ModuleNotFoundError: No module named 'rlbench.action_modes'`
- Edit `setup.py` in the RLBench library and add `'rlbench.action_modes'` to the list of packages. See [here](https://github.com/stepjam/RLBench/issues/160) for more details.

## Acknowledgements

The policy learning is based on [3D Diffusion Policy](https://github.com/YanjieZe/3D-Diffusion-Policy). The RLBench data collection and evaluation are based on [RLBench](https://github.com/stepjam/RLBench) and [PerAct](https://github.com/peract/peract). The object pose tracking is based on [FoundationPose](https://github.com/NVlabs/FoundationPose). Thanks for their wonderful work.

## License

The code and data are released under the NVIDIA Source Code License. Copyright © 2025, NVIDIA Corporation. All rights reserved.