# Tora
This is the official repository for the paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation".

## Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.

## Updates

- `2024/10/23` Our [ModelScope Demo](https://www.modelscope.cn/studios/xiaoche/Tora) is launched. Welcome to try it out! We have also uploaded the model weights to [ModelScope](https://www.modelscope.cn/models/xiaoche/Tora).
- `2024/10/21` Thanks to [@kijai](https://github.com/kijai) for supporting Tora in ComfyUI! [Link](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
- `2024/10/15` We released our inference code and model weights. **Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only.
Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.**
- `2024/08/27` We released our v2 paper, including the appendix.
- `2024/07/31` We submitted our paper to arXiv and released our project page.

## Table of Contents

- [Showcases](#showcases)
- [TODO List](#todo-list)
- [Installation](#installation)
- [Model Weights](#model-weights)
- [Inference](#inference)
- [Gradio Demo](#gradio-demo)
- [Troubleshooting](#troubleshooting)
- [Acknowledgements](#acknowledgements)
- [Our previous work](#our-previous-work)
- [Citation](#citation)

## Showcases

https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1

https://github.com/user-attachments/assets/7e7dbe87-a8ba-4710-afd0-9ef528ec329b

https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

All videos are available at this [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip).

## TODO List

- [x] Release our inference code and model weights
- [x] Provide a ModelScope demo
- [ ] Release our training code
- [ ] Release the complete version of Tora

## Installation

```bash
# Clone this repository.
git clone https://github.com/alibaba/Tora.git
cd Tora

# Install PyTorch (we use PyTorch 2.4.0) and torchvision following the official
# instructions: https://pytorch.org/get-started/previous-versions/. For example:
conda create -n tora python=3.10
conda activate tora
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install requirements
cd modules/SwissArmyTransformer
pip install -e .
cd ../../sat
pip install -r requirements.txt
cd ..
```

## Model Weights

### Folder Structure

```
Tora
└── sat
    └── ckpts
        ├── t5-v1_1-xxl
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae
        │   └── 3d-vae.pt
        └── tora
            └── t2v
                └── mp_rank_00_model_states.pt
```

### Download Links

Download the VAE and T5 models following [CogVideo](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#2-download-model-weights):

- VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
- T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)

Tora t2v model weights: [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading these weights requires following the [CogVideoX License](CogVideoX_LICENSE).

## Inference

Inference requires around 30 GiB of GPU memory (tested on an NVIDIA A100).

```bash
cd sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py \
    --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml \
    --load ckpts/tora/t2v \
    --output-dir samples \
    --point_path trajs/coaster.txt \
    --input-file assets/text/t2v/examples.txt
```

You can point `--input-file` and `--point_path` at your own prompt and trajectory-point files. Please note that trajectories are drawn on a 256x256 canvas. Replace `$N_GPU` with the number of GPUs you want to use.

### Recommendations for Text Prompts

For text prompts, we highly recommend using GPT-4 to enhance the details; simple prompts may negatively impact both visual quality and motion control effectiveness. You can refer to the following resources for guidance:

- [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)

## Gradio Demo

Usage:

```bash
cd sat
python app.py --load ckpts/tora/t2v
```

## Troubleshooting

### 1. ValueError: Non-consecutive added token...

Upgrade the `transformers` package to 4.44.2 (`pip install transformers==4.44.2`).
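As a convenience for the inference command's `--point_path` argument, here is a minimal sketch of generating a trajectory file. The authoritative format is whatever the sample files in `trajs/` (e.g. `trajs/coaster.txt`) use; this sketch *assumes* one comma-separated `x,y` point per line on the 256x256 canvas, and the output filename `my_traj.txt` is just an example:

```python
import math

def make_circle_trajectory(path, num_points=49, center=(128.0, 128.0), radius=80.0):
    """Write a circular trajectory on the 256x256 canvas.

    NOTE: the file format assumed here (one "x,y" pair per line) is an
    illustration only -- check trajs/coaster.txt for the actual format
    expected by sample_video.py.
    """
    cx, cy = center
    with open(path, "w") as f:
        for i in range(num_points):
            theta = 2 * math.pi * i / num_points
            x = cx + radius * math.cos(theta)
            y = cy + radius * math.sin(theta)
            # Clamp to the 256x256 canvas the trajectory is drawn on.
            x = min(max(x, 0.0), 255.0)
            y = min(max(y, 0.0), 255.0)
            f.write(f"{x:.1f},{y:.1f}\n")

make_circle_trajectory("my_traj.txt")
```

You would then pass the file via `--point_path my_traj.txt`, after adapting the format to match the shipped examples.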
See [this](https://github.com/THUDM/CogVideo/issues/213) issue for details.

## Acknowledgements

We would like to express our gratitude to the following open-source projects, which have been instrumental in the development of our project:

- [CogVideo](https://github.com/THUDM/CogVideo): an open-source video generation framework by THUKEG.
- [Open-Sora](https://github.com/hpcaitech/Open-Sora): an open-source video generation framework by HPC-AI Tech.
- [MotionCtrl](https://github.com/TencentARC/MotionCtrl): a video generation model supporting motion control, by ARC Lab, Tencent PCG.
- [ComfyUI-DragNUWA](https://github.com/chaojie/ComfyUI-DragNUWA): an implementation of DragNUWA for ComfyUI.

Special thanks to the contributors of these libraries for their hard work and dedication!

## Our previous work

- [AnimateAnything: Fine-Grained Open-Domain Image Animation with Motion Guidance](https://github.com/alibaba/animate-anything)

## Citation

```bibtex
@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
      title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2024},
      eprint={2407.21705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21705},
}
```