# ABot-PhysWorld **Repository Path**: robotdna/ABot-PhysWorld ## Basic Information - **Project Name**: ABot-PhysWorld - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-19 - **Last Updated**: 2026-04-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

πŸ€– ABot-PhysWorld

AMAP CV Lab

> **ABot-PhysWorld** is a physically consistent, action-controllable video world model for robotic manipulation, built on a 14-billion-parameter Diffusion Transformer. It integrates physics-aware training, memory-efficient preference optimization, and precise spatial action injection to generate realistic and physically plausible robot-object interactions β€” even in zero-shot settings. ## πŸ—žοΈ News - **[2026-04]** πŸ† **1st Place on [WorldArena Leaderboard](https://huggingface.co/spaces/WorldArena/WorldArena)!** ABot-PhysWorld achieves the top rank on the WorldArena benchmark. - **[2026-04]** πŸ₯ˆ **2nd Place on [GigaBrain Challenge CVPR 2026 – World Model Track](https://huggingface.co/spaces/open-gigaai/CVPR-2026-WorldModel-Track-LeaderBoard)!** ABot-PhysWorld secures the runner-up position in the CVPR 2026 GigaBrain Challenge World Model Track. - **[2026-03]** πŸŽ‰ **Training code released!** Full-parameter SFT training scripts for fine-tuning on custom robot manipulation datasets. See [`training/`](training/). - **[2026-03]** πŸ“¦ **SFT training data released!** The v1 SFT training dataset is available on [ModelScope](https://www.modelscope.cn/datasets/amap_cvlab/ABot-PhysWorld_SFT_Training_Data_v1). - **[2026-03]** πŸ”¬ **Benchmark released!** EZS-Bench evaluation toolkit and data are open-sourced. See [`EZS-Bench/`](EZS-Bench/). - **[2026-03]** πŸš€ **Inference code released!** Generate robot manipulation videos with the pre-trained model. See [`inference/`](inference/). ### πŸ† Competition Results #### WorldArena Leaderboard – πŸ₯‡ 1st Place
WorldArena Leaderboard

πŸ‘† Click the image to view the live leaderboard on HuggingFace

#### GigaBrain Challenge CVPR 2026 – World Model Track – πŸ₯ˆ 2nd Place
GigaBrain Challenge CVPR 2026 World Model Track

πŸ‘† Click the image to view the live leaderboard on HuggingFace

## Table of Contents - [πŸ“š Key Contributions](#-key-contributions) - [πŸš€ EZS-Bench](#-ezs-bench) - [πŸ“Š Evaluation](#-Evaluation) - [πŸ–ΌοΈ Qualitative Results](#️-qualitative-results) - [πŸ› οΈ Usage](#️-usage) - [πŸ‹οΈ Training](#️-training) - [πŸ“œ Citing](#-Citing) ## πŸ“š Key Contributions 1. **Industrial-Grade Data Pipeline** Curated ~3M real-world manipulation clips from five datasets (`AgiBot`, `RoboCoin`, `RoboMind`, `Galaxea`, `OXE`) with motion, semantic, and action consistency filtering, plus hierarchical sampling for balanced generalization.
EZS-Bench
2. **Physics-Aware DPO Training** Introduces a decoupled VLM-based discriminator: Qwen3-VL generates task-specific physics checklists, Gemini 3 Pro scores videos via Chain-of-Thought; combined with LoRA-augmented DPO on a 14B DiT to enforce physical plausibility.
EZS-Bench
3. **Parallel Context Blocks for Action Control** Enables precise action-conditioned generation by residually injecting spatial action maps into cloned DiT blocks, preserving physical priors while supporting cross-embodiment control.
EZS-Bench
4. **EZSbench – First True Zero-Shot Benchmark** Fully training-independent evaluation covering unseen robot, scene, and task combinations, with dual-model scoring to eliminate self-evaluation bias.
EZS-Bench
--- ## πŸš€ EZS-Bench **Embodied-ZeroShot Benchmark for Physically Consistent Video Generation** πŸ€–βœ¨ EZS-Bench is a zero-shot evaluation benchmark designed to rigorously assess **physically plausible video generation** in robotic manipulation. It evaluates models on **physical consistency**, **action controllability**, and **cross-embodiment generalization**β€”with *no training-test data overlap*. πŸ”πŸ”¬ ### ✨ Key Features βœ… **True Zero-Shot Evaluation** Unseen combinations of: - πŸ€– Robot morphologies (e.g., single-arm, bimanual, custom kinematics) - 🌍 Scenes & backgrounds - 🎯 Manipulation tasks (pick-and-place, wiping, assembly, etc.) 🎨 **Dual-Source Data Construction** - 🧬 *Synthetic branch*: Text-to-image generation with controlled variation - πŸ–ΌοΈ *Real-world editing*: VLM-driven scene augmentation preserving physical interactions 🧠 **Physics-Aware Evaluation** - Dynamic physical checklists generated by VLMs (e.g., *"Does the gripper penetrate the object?"*, *"Is gravity respected?"*) - 30–50% negative questions to prevent guessing 🚫 - Decoupled scorer architecture to eliminate self-evaluation bias βš–οΈ πŸ“Š **Comprehensive Metrics** Evaluates: - Physical fidelity (penetration, contact, deformation) πŸ’₯ - Temporal coherence πŸ•’ - Spatial alignment & trajectory consistency 🎯 ### πŸ“¦ Getting Started **Download evaluation data** from ModelScope: git lfs install git clone https://www.modelscope.cn/datasets/amap_cvlab/EZS-Bench_data.git **Install and run** the evaluation toolkit: cd EZS-Bench pip install -e . # Full evaluation (Video Quality + Domain Score) torchrun --standalone --nproc_per_node=4 evaluate_ezsbench.py \ --data_file /path/to/EZS-Bench_data/video_prompt_question_196_ezs0.jsonl \ --method_name "YourMethod" \ --method_dir /path/to/generated_videos \ --output_dir ./results > The VLM judge model (Qwen2.5-VL-72B-Instruct, ~150 GB) is automatically downloaded on first run. πŸ”— *See [EZS-Bench/README.md](EZS-Bench/README.md) for full documentation.* --- ## πŸ“Š Evaluation We evaluate ABot-PhysWorld on three key aspects: - **Physical Consistency** (via **PBench** and **EZSbench**) - **Zero-Shot Generalization** (via **EZSbench**) - **Action-Conditioned Controllability** (via custom A2V benchmark) ### πŸ“ˆ Summary of Advancements πŸŽ‰πŸŽ‰ | Capability | Benchmark | Ours | Best Baseline | Gain | |----------|-----------|------|---------------|------| | Physical Fidelity | PBench (Domain Score) | **0.9306** | 0.8644 (Wan2.5) | +6.62% | | Zero-Shot Generalization | EZSbench (Domain Score) | **0.8366** | 0.7951 (WoW) | +4.15% | | Action Control | Trajectory Consistency | **0.8522** | 0.8157 (Enerverse) | +3.65% | βœ… ABot-PhysWorld establishes a new standard for **physically grounded**, **controllable**, and **generalizable** world models in robotic manipulation. --- ## πŸ–ΌοΈ Qualitative Results Selected representative zero-shot generation results demonstrating ABot-PhysWorld's strong generalization and physical plausibility. ### 🎯 Zero-Shot Capabilities #### πŸ”§ Scene 1: Deformable Object – Dual-Arm Towel Folding
- **Task**: Fold a towel using dual robotic arms - **Challenge**: Complex cloth dynamics and bimanual coordination - **Ours**: βœ… Physically realistic deformation βœ… Smooth, collision-free arm motion βœ… Natural folding sequence with consistent contact #### πŸ₯€ Scene 2: Fine Manipulation – Diverse Object Handling
- **Task**: Stack cups, build blocks, place a knife - **Challenge**: Varying shapes, weights, and friction - **Ours**: βœ… Accurate grasp pose prediction βœ… Adaptive gripper control βœ… Stable pick-and-place without slippage or penetration #### πŸšͺ Scene 3: Articulated Object – Opening a Cabinet Door
- **Task**: Open a hinged cabinet or door - **Challenge**: Enforce rotational constraints and correct force direction - **Ours**: βœ… Proper handle grasping βœ… Realistic hinge rotation βœ… Motion follows physical pivot axis #### πŸ«— Scene 4: Fluid Interaction – Pouring Water
- **Task**: Pour water from a cup into a bowl using dual arms - **Challenge**: Bimanual coordination, tilt control, liquid dynamics - **Ours**: βœ… Collision-free trajectory planning βœ… Accurate pour timing and angle βœ… Visual consistency in fluid transfer (simulated proxy) #### 🧽 Scene 5: Cleaning Task – Wiping a Stain > Note: The Gemini watermark (bottom-right) indicates the initial frame generated by Gemini (ensuring it is completely unseen); all other frames are generated by ABot-PhysWorld.
- **Task**: Wipe a stain off a table - **Challenge**: Maintain contact, uniform pressure, full coverage - **Ours**: βœ… Continuous tool-surface contact βœ… Systematic wiping motion βœ… Gradual removal of the stain in video output #### πŸ“ Scene 6: Multi-Scene Generalization – Fruit Sorting > Note: The Gemini watermark (bottom-right) indicates the initial frame generated by Gemini (ensuring it is completely unseen); all other frames are generated by ABot-PhysWorld.
- **Task**: Place fruits into a plate across diverse scenes - **Challenge**: Background, lighting, and fruit variation - **Ours**: βœ… Robust object recognition under domain shifts βœ… Consistent performance across unseen environments βœ… Fast and stable manipulation regardless of setup ### πŸ” Pbench Results Demonstration We conducted systematic qualitative comparative experiments on the **PAI-Bench** benchmark dataset. Below are the generated results from several typical scenarios.
| Task | Baselines | **Ours** | |------|-----------------------------|--------| | Grasping | Frequent penetration, floatation | βœ… Firm contact, no violation | | Long-horizon Planning | Inconsistent state transitions | βœ… Coherent multi-step reasoning | | Rigid-body Dynamics | Unphysical deformations | βœ… Preserved geometry and mass behavior | | Contact Modeling | Non-contact attraction | βœ… Realistic interaction onset | > Our model consistently generates physically valid trajectories even in complex, unseen scenarios β€” proving its utility as a reliable simulator for embodied AI. --- ## πŸ› οΈ Usage ### Quick Start: Video Generation Inference Generate physically plausible robot manipulation videos using the **ABot-PhysWorld** fine-tuned model. #### Environment Setup ```bash # Create conda environment conda create -n abot-physworld python=3.10 conda activate abot-physworld # Install PyTorch with CUDA support pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121 # Install dependencies pip install -r requirements.txt ``` **Hardware Requirements:** | Configuration | VRAM | Notes | |---|---|---| | **Recommended** | >= 60GB | Best performance, no tiling needed | | **Minimum** | >= 24GB | Uses tiled VAE (enabled by default) | #### Demo: Generate Video from Image + Text Prompt ```bash cd inference # Download demo data and run inference python inference.py \ --jsonl_path assets/demo.jsonl \ --output_dir ./outputs/demo \ --save_first_frames ``` This generates videos for 2 Franka robot manipulation samples. The model checkpoint is auto-downloaded from [ModelScope](https://www.modelscope.cn/models/amap_cvlab/Abot-PhysWorld) on first run. #### Single Image Inference ```bash python inference.py \ --input_image /path/to/image.jpg \ --prompt "robot arm picks up the red cube from the table" \ --output_dir ./outputs ``` #### Batch Inference from JSONL Prepare a JSONL file (each line is a sample): ```json {"video": "path/to/image.jpg", "prompt": "robot grasps the object"} {"video": "path/to/image2.jpg", "prompt": "robot places object on table"} ``` Then run: ```bash python inference.py \ --jsonl_path data.jsonl \ --output_dir ./outputs \ --num_samples 100 # Process max 100 samples ``` #### Full Parameter Reference ```bash python inference.py --help ``` Key parameters: - `--checkpoint_path`: Local path to model weights (auto-downloads if not provided) - `--cache_dir`: Directory to store downloaded weights (default: `./checkpoints`) - `--height`, `--width`: Video resolution (default: 480Γ—832) - `--num_frames`: Number of output frames (default: 81 β‰ˆ 5.4s at 15fps) - `--num_inference_steps`: Denoising steps, higher = better quality but slower (default: 50) - `--cfg_scale`: Classifier-free guidance scale (default: 5.0) - `--seed`: Random seed for reproducibility - `--gpu_id`: GPU device index #### Output - **Single image**: `{output_dir}/{image_name}_generated.mp4` - **Batch mode**: `{output_dir}/{unique_id}_generated.mp4` + `results.json` (with status for each sample) --- ### Model Weights **Auto-Download:** The fine-tuned checkpoint is automatically downloaded from [ModelScope](https://www.modelscope.cn/models/amap_cvlab/Abot-PhysWorld) on first inference run. **Manual Download (Optional):** ```bash pip install modelscope modelscope download --model amap_cvlab/Abot-PhysWorld --local_dir ./inference/checkpoints ``` **Base Model:** Wan2.1-I2V-14B-480P is also auto-downloaded by DiffSynth-Studio. --- ### More Details For detailed setup instructions, examples, and troubleshooting, see [`inference/README.md`](inference/README.md). --- ## πŸ‹οΈ Training We provide full-parameter SFT training scripts to fine-tune Wan2.1-I2V-14B-480P on your own robot manipulation datasets. ### Training Data The v1 SFT training dataset is available on ModelScope: ```bash git lfs install git clone https://www.modelscope.cn/datasets/amap_cvlab/ABot-PhysWorld_SFT_Training_Data_v1.git ``` ### Quick Start ```bash cd training # Prepare your dataset (JSONL format, see training/assets/demo_train.jsonl) # Then launch 8-GPU training: bash run_train.sh ``` ### Key Features - **Full-parameter SFT** on the 14B DiT model (LoRA also supported) - **DeepSpeed ZeRO-2** distributed training via Accelerate - **Encoded feature caching**: Save VAE/T5/CLIP encodings to disk, skip re-encoding in subsequent runs - **Resume from checkpoint**: Continue training from any saved step - **Real-time text encoding**: Re-train with new captions while reusing cached video features ### Resume from Checkpoint ```bash RESUME_CHECKPOINT=./outputs/sft_training/step-800.safetensors \ bash run_train_resume.sh ``` ### Training with Encoded Cache ```bash # First run: train + save encoded features ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh # Subsequent runs: reuse cached features (much faster) ENCODED_CACHE_DIR=./encoded_cache bash run_train.sh ``` For detailed training instructions, data preparation, and parameter reference, see [`training/README.md`](training/README.md). --- ## πŸ“œ Citing If you find **ABot-PhysWorld** is useful in your research or applications, please consider giving us a **star** 🌟 and **citing** it by the following BibTeX entry: ``` @article{chen2026abotphysworld, title={ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment}, author={Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu}, journal={arXiv preprint arXiv:2603.23376}, year={2026} } ``` --- ## πŸ™ Acknowledgement This project builds upon the following open-source projects. We thank these teams for their contributions: - [Wan2.1](https://github.com/Wan-Video/Wan2.1) - [VACE](https://github.com/ali-vilab/VACE) - [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) - [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) - [Qwen3](https://github.com/QwenLM/Qwen3) - [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) - [Physical AI Bench](https://github.com/SHI-Labs/physical-ai-bench) - [FantasyTalking2](https://github.com/Fantasy-AMAP/fantasy-talking2) ---