# ABot-PhysWorld
**Repository Path**: robotdna/ABot-PhysWorld
## Basic Information
- **Project Name**: ABot-PhysWorld
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-19
- **Last Updated**: 2026-04-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
> **ABot-PhysWorld** is a physically consistent, action-controllable video world model for robotic manipulation, built on a 14-billion-parameter Diffusion Transformer. It integrates physics-aware training, memory-efficient preference optimization, and precise spatial action injection to generate realistic and physically plausible robot-object interactions, even in zero-shot settings.
## News
- **[2026-04]** **1st Place on [WorldArena Leaderboard](https://huggingface.co/spaces/WorldArena/WorldArena)!** ABot-PhysWorld achieves the top rank on the WorldArena benchmark.
- **[2026-04]** **2nd Place on [GigaBrain Challenge CVPR 2026 (World Model Track)](https://huggingface.co/spaces/open-gigaai/CVPR-2026-WorldModel-Track-LeaderBoard)!** ABot-PhysWorld secures the runner-up position in the CVPR 2026 GigaBrain Challenge World Model Track.
- **[2026-03]** **Training code released!** Full-parameter SFT training scripts for fine-tuning on custom robot manipulation datasets. See [`training/`](training/).
- **[2026-03]** **SFT training data released!** The v1 SFT training dataset is available on [ModelScope](https://www.modelscope.cn/datasets/amap_cvlab/ABot-PhysWorld_SFT_Training_Data_v1).
- **[2026-03]** **Benchmark released!** EZS-Bench evaluation toolkit and data are open-sourced. See [`EZS-Bench/`](EZS-Bench/).
- **[2026-03]** **Inference code released!** Generate robot manipulation videos with the pre-trained model. See [`inference/`](inference/).
### Competition Results
#### WorldArena Leaderboard: 1st Place
View the live leaderboard on [Hugging Face](https://huggingface.co/spaces/WorldArena/WorldArena).
#### GigaBrain Challenge CVPR 2026 (World Model Track): 2nd Place
View the live leaderboard on [Hugging Face](https://huggingface.co/spaces/open-gigaai/CVPR-2026-WorldModel-Track-LeaderBoard).
## Table of Contents
- [Key Contributions](#key-contributions)
- [EZS-Bench](#ezs-bench)
- [Evaluation](#evaluation)
- [Qualitative Results](#qualitative-results)
- [Usage](#usage)
- [Training](#training)
- [Citing](#citing)
## Key Contributions
1. **Industrial-Grade Data Pipeline**
Curated ~3M real-world manipulation clips from five datasets (`AgiBot`, `RoboCoin`, `RoboMind`, `Galaxea`, `OXE`) with motion, semantic, and action consistency filtering, plus hierarchical sampling for balanced generalization.
2. **Physics-Aware DPO Training**
Introduces a decoupled VLM-based discriminator: Qwen3-VL generates task-specific physics checklists, and Gemini 3 Pro scores videos against them via Chain-of-Thought reasoning. The resulting preferences drive LoRA-augmented DPO on the 14B DiT to enforce physical plausibility.
3. **Parallel Context Blocks for Action Control**
Enables precise action-conditioned generation by residually injecting spatial action maps into cloned DiT blocks, preserving physical priors while supporting cross-embodiment control.
4. **EZS-Bench: First True Zero-Shot Benchmark**
Fully training-independent evaluation covering unseen robot, scene, and task combinations, with dual-model scoring to eliminate self-evaluation bias.
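
For readers unfamiliar with preference optimization, the following is a minimal sketch of the standard DPO loss for a single preferred/rejected pair. It is not the project's training code: the function name `dpo_loss`, the scalar log-likelihood inputs, and the value of `beta` are illustrative assumptions, and DPO on a diffusion model typically replaces the log-likelihoods with a denoising-error surrogate.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All inputs are scalar log-likelihoods: two from the trainable policy and
    two from the frozen reference model. `beta` is an assumed temperature.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # winner over the loser, relative to the reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no preference margin the loss equals log(2); it falls as the policy
# favors the preferred sample more than the reference does.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a LoRA setup only the adapter weights would receive gradients from this loss, while the reference log-likelihoods come from the frozen base model.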
---
## EZS-Bench
**Embodied-ZeroShot Benchmark for Physically Consistent Video Generation**

EZS-Bench is a zero-shot evaluation benchmark designed to rigorously assess **physically plausible video generation** in robotic manipulation. It evaluates models on **physical consistency**, **action controllability**, and **cross-embodiment generalization**, with *no training-test data overlap*.
### Key Features
**True Zero-Shot Evaluation**
Unseen combinations of:
- Robot morphologies (e.g., single-arm, bimanual, custom kinematics)
- Scenes & backgrounds
- Manipulation tasks (pick-and-place, wiping, assembly, etc.)

**Dual-Source Data Construction**
- *Synthetic branch*: Text-to-image generation with controlled variation
- *Real-world editing*: VLM-driven scene augmentation preserving physical interactions

**Physics-Aware Evaluation**
- Dynamic physical checklists generated by VLMs (e.g., *"Does the gripper penetrate the object?"*, *"Is gravity respected?"*)
- 30–50% negative questions to prevent guessing
- Decoupled scorer architecture to eliminate self-evaluation bias

**Comprehensive Metrics**
Evaluates:
- Physical fidelity (penetration, contact, deformation)
- Temporal coherence
- Spatial alignment & trajectory consistency
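
The role of negative questions in the physics checklist can be illustrated with a small sketch. How the toolkit actually aggregates answers is not described here, so this assumes each question carries an expected yes/no answer and the score is the fraction of judge answers that match; `ChecklistItem` and `checklist_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str
    expected: bool  # True for a positive question, False for a negative one


def checklist_score(items, answers):
    """Fraction of judge answers that match the expected answer."""
    if len(items) != len(answers):
        raise ValueError("one answer is required per checklist item")
    if not items:
        return 0.0
    correct = sum(item.expected == answer for item, answer in zip(items, answers))
    return correct / len(items)


items = [
    ChecklistItem("Is gravity respected?", expected=True),
    ChecklistItem("Does the gripper stay in contact while lifting?", expected=True),
    ChecklistItem("Does the gripper penetrate the object?", expected=False),
]

# A judge that answers "yes" to everything loses credit on the negative
# question, which is why mixing in negative questions discourages guessing.
always_yes = [True, True, True]
print(checklist_score(items, always_yes))
```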
### Getting Started
**Download evaluation data** from ModelScope:
```bash
git lfs install
git clone https://www.modelscope.cn/datasets/amap_cvlab/EZS-Bench_data.git
```
**Install and run** the evaluation toolkit:
```bash
cd EZS-Bench
pip install -e .
# Full evaluation (Video Quality + Domain Score)
torchrun --standalone --nproc_per_node=4 evaluate_ezsbench.py \
    --data_file /path/to/EZS-Bench_data/video_prompt_question_196_ezs0.jsonl \
    --method_name "YourMethod" \
    --method_dir /path/to/generated_videos \
    --output_dir ./results
```
> The VLM judge model (Qwen2.5-VL-72B-Instruct, ~150 GB) is automatically downloaded on first run.

*See [EZS-Bench/README.md](EZS-Bench/README.md) for full documentation.*
---
## Evaluation
We evaluate ABot-PhysWorld on three key aspects:
- **Physical Consistency** (via **PBench** and **EZS-Bench**)
- **Zero-Shot Generalization** (via **EZS-Bench**)
- **Action-Conditioned Controllability** (via a custom A2V benchmark)
### Summary of Advancements
| Capability | Benchmark | Ours | Best Baseline | Gain (absolute, percentage points) |
|----------|-----------|------|---------------|------|
| Physical Fidelity | PBench (Domain Score) | **0.9306** | 0.8644 (Wan2.5) | +6.62 |
| Zero-Shot Generalization | EZS-Bench (Domain Score) | **0.8366** | 0.7951 (WoW) | +4.15 |
| Action Control | Trajectory Consistency | **0.8522** | 0.8157 (Enerverse) | +3.65 |

✅ ABot-PhysWorld establishes a new standard for **physically grounded**, **controllable**, and **generalizable** world models in robotic manipulation.
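
The Gain column in the summary table matches the absolute difference between the two scores expressed in percentage points (for example, 0.9306 − 0.8644 = 0.0662), not a relative improvement. The short sketch below recomputes both quantities from the table values; the `gains` helper is illustrative and not part of the toolkit.

```python
# (ours, best baseline) pairs copied from the summary table.
rows = {
    "PBench (Domain Score)": (0.9306, 0.8644),
    "EZS-Bench (Domain Score)": (0.8366, 0.7951),
    "Trajectory Consistency": (0.8522, 0.8157),
}


def gains(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (ours - baseline) * 100
    relative_pct = (ours - baseline) / baseline * 100
    return round(absolute_pp, 2), round(relative_pct, 2)


for name, (ours, baseline) in rows.items():
    absolute_pp, relative_pct = gains(ours, baseline)
    print(f"{name}: +{absolute_pp} points absolute, +{relative_pct}% relative")
```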
---
## Qualitative Results
The following representative zero-shot generations illustrate ABot-PhysWorld's generalization and physical plausibility.
### Zero-Shot Capabilities
#### Scene 1: Deformable Object (Dual-Arm Towel Folding)
- **Task**: Fold a towel using dual robotic arms
- **Challenge**: Complex cloth dynamics and bimanual coordination
- **Ours**:
  - ✅ Physically realistic deformation
  - ✅ Smooth, collision-free arm motion
  - ✅ Natural folding sequence with consistent contact
#### Scene 2: Fine Manipulation (Diverse Object Handling)
- **Task**: Stack cups, build blocks, place a knife
- **Challenge**: Varying shapes, weights, and friction
- **Ours**:
  - ✅ Accurate grasp pose prediction
  - ✅ Adaptive gripper control
  - ✅ Stable pick-and-place without slippage or penetration
#### Scene 3: Articulated Object (Opening a Cabinet Door)
- **Task**: Open a hinged cabinet or door
- **Challenge**: Enforce rotational constraints and correct force direction
- **Ours**:
  - ✅ Proper handle grasping
  - ✅ Realistic hinge rotation
  - ✅ Motion follows the physical pivot axis
#### Scene 4: Fluid Interaction (Pouring Water)
- **Task**: Pour water from a cup into a bowl using dual arms
- **Challenge**: Bimanual coordination, tilt control, liquid dynamics
- **Ours**:
  - ✅ Collision-free trajectory planning
  - ✅ Accurate pour timing and angle
  - ✅ Visual consistency in fluid transfer (simulated proxy)
#### Scene 5: Cleaning Task (Wiping a Stain)
> Note: The Gemini watermark (bottom-right) indicates that the initial frame was generated by Gemini, which ensures the scene is completely unseen; all other frames are generated by ABot-PhysWorld.
- **Task**: Wipe a stain off a table
- **Challenge**: Maintain contact, uniform pressure, full coverage
- **Ours**:
  - ✅ Continuous tool-surface contact
  - ✅ Systematic wiping motion
  - ✅ Gradual removal of the stain in the video output
#### Scene 6: Multi-Scene Generalization (Fruit Sorting)
> Note: The Gemini watermark (bottom-right) indicates that the initial frame was generated by Gemini, which ensures the scene is completely unseen; all other frames are generated by ABot-PhysWorld.
- **Task**: Place fruits onto a plate across diverse scenes
- **Challenge**: Background, lighting, and fruit variation
- **Ours**:
  - ✅ Robust object recognition under domain shifts
  - ✅ Consistent performance across unseen environments
  - ✅ Fast and stable manipulation regardless of setup
### PBench Results Demonstration
We conducted systematic qualitative comparisons on the **PAI-Bench** benchmark. Below are results from several typical scenarios.

| Task | Baselines | **Ours** |
|------|-----------------------------|--------|
| Grasping | Frequent penetration, floating objects | ✅ Firm contact, no violation |
| Long-horizon Planning | Inconsistent state transitions | ✅ Coherent multi-step reasoning |
| Rigid-body Dynamics | Unphysical deformations | ✅ Preserved geometry and mass behavior |
| Contact Modeling | Non-contact attraction | ✅ Realistic interaction onset |

> Our model consistently generates physically valid trajectories even in complex, unseen scenarios, which supports its use as a simulator for embodied AI.
---
## Usage
### Quick Start: Video Generation Inference
Generate physically plausible robot manipulation videos using the **ABot-PhysWorld** fine-tuned model.
#### Environment Setup
```bash
# Create conda environment
conda create -n abot-physworld python=3.10
conda activate abot-physworld
# Install PyTorch with CUDA support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
```
**Hardware Requirements:**
| Configuration | VRAM | Notes |
|---|---|---|
| **Recommended** | >= 60GB | Best performance, no tiling needed |
| **Minimum** | >= 24GB | Uses tiled VAE (enabled by default) |
#### Demo: Generate Video from Image + Text Prompt
```bash
cd inference
# Download demo data and run inference
python inference.py \
--jsonl_path assets/demo.jsonl \
--output_dir ./outputs/demo \
--save_first_frames
```
This generates videos for 2 Franka robot manipulation samples. The model checkpoint is auto-downloaded from [ModelScope](https://www.modelscope.cn/models/amap_cvlab/Abot-PhysWorld) on first run.
#### Single Image Inference
```bash
python inference.py \
--input_image /path/to/image.jpg \
--prompt "robot arm picks up the red cube from the table" \
--output_dir ./outputs
```
#### Batch Inference from JSONL
Prepare a JSONL file (each line is a sample):
```json
{"video": "path/to/image.jpg", "prompt": "robot grasps the object"}
{"video": "path/to/image2.jpg", "prompt": "robot places object on table"}
```
Then run:
```bash
python inference.py \
--jsonl_path data.jsonl \
--output_dir ./outputs \
--num_samples 100 # Process max 100 samples
```
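
If the samples live in a Python list rather than a hand-written file, the JSONL can be produced with the standard library. This is a sketch based only on the two-key format shown above; the helper name `write_batch_jsonl` is hypothetical, and any additional keys that `inference.py` may accept are not covered.

```python
import json
from pathlib import Path


def write_batch_jsonl(samples, out_path):
    """Write (image_path, prompt) pairs as one JSON object per line."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for image_path, prompt in samples:
            # The "video" key holds the input image path, matching the
            # JSONL example in this README.
            record = {"video": str(image_path), "prompt": prompt}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


samples = [
    ("path/to/image.jpg", "robot grasps the object"),
    ("path/to/image2.jpg", "robot places object on table"),
]
write_batch_jsonl(samples, "data.jsonl")
```

The resulting `data.jsonl` can then be passed to `--jsonl_path`.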
#### Full Parameter Reference
```bash
python inference.py --help
```
Key parameters:
- `--checkpoint_path`: Local path to model weights (auto-downloads if not provided)
- `--cache_dir`: Directory to store downloaded weights (default: `./checkpoints`)
- `--height`, `--width`: Video resolution (default: 480×832)
- `--num_frames`: Number of output frames (default: 81, i.e. 5.4 s at 15 fps)
- `--num_inference_steps`: Denoising steps, higher = better quality but slower (default: 50)
- `--cfg_scale`: Classifier-free guidance scale (default: 5.0)
- `--seed`: Random seed for reproducibility
- `--gpu_id`: GPU device index
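
The relation between `--num_frames` and clip length is simple arithmetic: duration equals frame count divided by frame rate. The helpers below are illustrative only and assume the 15 fps output rate stated for the default; they are not part of `inference.py`.

```python
def clip_duration_seconds(num_frames, fps=15):
    """Length of a clip in seconds for a given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def frames_for_duration(seconds, fps=15):
    """Nearest whole number of frames needed to cover `seconds` at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(seconds * fps)


# Default setting: 81 frames at 15 fps.
print(clip_duration_seconds(81))
print(frames_for_duration(5.4))
```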
#### Output
- **Single image**: `{output_dir}/{image_name}_generated.mp4`
- **Batch mode**: `{output_dir}/{unique_id}_generated.mp4` + `results.json` (with status for each sample)
---
### Model Weights
**Auto-Download:** The fine-tuned checkpoint is automatically downloaded from [ModelScope](https://www.modelscope.cn/models/amap_cvlab/Abot-PhysWorld) on first inference run.
**Manual Download (Optional):**
```bash
pip install modelscope
modelscope download --model amap_cvlab/Abot-PhysWorld --local_dir ./inference/checkpoints
```
**Base Model:** Wan2.1-I2V-14B-480P is also auto-downloaded by DiffSynth-Studio.
---
### More Details
For detailed setup instructions, examples, and troubleshooting, see [`inference/README.md`](inference/README.md).
---
## Training
We provide full-parameter SFT training scripts to fine-tune Wan2.1-I2V-14B-480P on your own robot manipulation datasets.
### Training Data
The v1 SFT training dataset is available on ModelScope:
```bash
git lfs install
git clone https://www.modelscope.cn/datasets/amap_cvlab/ABot-PhysWorld_SFT_Training_Data_v1.git
```
### Quick Start
```bash
cd training
# Prepare your dataset (JSONL format, see training/assets/demo_train.jsonl)
# Then launch 8-GPU training:
bash run_train.sh
```
### Key Features
- **Full-parameter SFT** on the 14B DiT model (LoRA also supported)
- **DeepSpeed ZeRO-2** distributed training via Accelerate
- **Encoded feature caching**: Save VAE/T5/CLIP encodings to disk, skip re-encoding in subsequent runs
- **Resume from checkpoint**: Continue training from any saved step
- **Real-time text encoding**: Re-train with new captions while reusing cached video features
### Resume from Checkpoint
```bash
RESUME_CHECKPOINT=./outputs/sft_training/step-800.safetensors \
bash run_train_resume.sh
```
### Training with Encoded Cache
```bash
# First run: train + save encoded features
ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh
# Subsequent runs: reuse cached features (much faster)
ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh
```
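
The idea behind the encoded cache can be sketched as a load-or-encode lookup. This is not the repository's implementation: the cache layout, the `cache_key` scheme, the `encode_fn` callback, and the use of `pickle` are all assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path


def cache_key(sample_path):
    """Stable file name derived from the sample path (hypothetical scheme)."""
    return hashlib.sha256(str(sample_path).encode("utf-8")).hexdigest()


def load_or_encode(sample_path, encode_fn, cache_dir):
    """Return cached features if present, otherwise encode and store them."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{cache_key(sample_path)}.pkl"
    if cache_file.exists():
        # Cache hit: skip the expensive encoder entirely.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    # Cache miss: run the encoder once and persist the result.
    features = encode_fn(sample_path)
    with cache_file.open("wb") as f:
        pickle.dump(features, f)
    return features
```

On the first pass every sample is a miss and pays the encoding cost; later runs that point at the same directory hit the cache, which is why the second `run_train.sh` invocation above is faster.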
For detailed training instructions, data preparation, and parameter reference, see [`training/README.md`](training/README.md).
---
## Citing
If you find **ABot-PhysWorld** useful in your research or applications, please consider giving us a **star** and **citing** it with the following BibTeX entry:
```
@article{chen2026abotphysworld,
title={ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment},
author={Yuzhi Chen and Ronghan Chen and Dongjie Huo and Yandan Yang and Dekang Qi and Haoyun Liu and Tong Lin and Shuang Zeng and Junjin Xiao and Xinyuan Chang and Feng Xiong and Xing Wei and Zhiheng Ma and Mu Xu},
journal={arXiv preprint arXiv:2603.23376},
year={2026}
}
```
---
## Acknowledgement
This project builds upon the following open-source projects. We thank these teams for their contributions:
- [Wan2.1](https://github.com/Wan-Video/Wan2.1)
- [VACE](https://github.com/ali-vilab/VACE)
- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio)
- [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun)
- [Qwen3](https://github.com/QwenLM/Qwen3)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [Physical AI Bench](https://github.com/SHI-Labs/physical-ai-bench)
- [FantasyTalking2](https://github.com/Fantasy-AMAP/fantasy-talking2)
---