# AvatarCLIP

AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars

Fangzhou Hong¹* Mingyuan Zhang¹* Liang Pan¹ Zhongang Cai¹,²,³ Lei Yang² Ziwei Liu¹⁺

¹S-Lab, Nanyang Technological University ²SenseTime Research ³Shanghai AI Laboratory

*equal contribution ⁺corresponding author

Accepted to SIGGRAPH 2022 (Journal Track)

TL;DR

AvatarCLIP generates and animates avatars given descriptions of body shapes, appearances and motions.


- A tall and skinny female soldier that is arguing.
- A skinny ninja that is raising both arms.
- An overweight sumo wrestler that is sitting.
- A tall and fat Iron Man that is running.

This repository contains the official implementation of _AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars_.

---

[Project Page] • [arXiv] • [High-Res PDF (166M)] • [Supplementary Video] • [Colab Demo]

## Updates

[09/2022] :fire::fire::fire:**If you are looking for a higher-quality 3D human generation method, go check out our new work [EVA3D](https://hongfz16.github.io/projects/EVA3D.html)!**:fire::fire::fire:

[09/2022] :fire::fire::fire:**If you are looking for a higher-quality text2motion method, go check out our new work [MotionDiffuse](https://mingyuan-zhang.github.io/projects/MotionDiffuse.html)!**:fire::fire::fire:

[07/2022] Code release for the motion generation part!

[05/2022] [Paper](https://arxiv.org/abs/2205.08535) uploaded to arXiv. [![arXiv](https://img.shields.io/badge/arXiv-2205.08535-b31b1b.svg)](https://arxiv.org/abs/2205.08535)

[05/2022] Added a [Colab Demo](https://colab.research.google.com/drive/1dfaecX7xF3nP6fyXc8XBljV5QY1lc1TR?usp=sharing) for avatar generation! [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dfaecX7xF3nP6fyXc8XBljV5QY1lc1TR?usp=sharing)

[05/2022] Support for converting the generated avatar to the **animatable FBX format**! Check out [how to use the FBX models](#use-generated-fbx-models), or see the [instructions](./Avatar2FBX/README.md) for the conversion code.

[05/2022] Code release for the avatar generation part!

[04/2022] AvatarCLIP is accepted to SIGGRAPH 2022 (Journal Track):partying_face:!

## Citation

If you find our work useful for your research, please consider citing the paper:

```
@article{hong2022avatarclip,
    title={AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars},
    author={Hong, Fangzhou and Zhang, Mingyuan and Pan, Liang and Cai, Zhongang and Yang, Lei and Liu, Ziwei},
    journal={ACM Transactions on Graphics (TOG)},
    volume={41},
    number={4},
    articleno={161},
    pages={1--19},
    year={2022},
    publisher={ACM New York, NY, USA},
    doi={10.1145/3528223.3530094},
}
```

## Use Generated FBX Models

### Download

1. Visit our [project page](https://hongfz16.github.io/projects/AvatarCLIP.html).
2. Go to the section 'Avatar Gallery'.
3. Pick a model you like.
4. Click 'Load Model' below it.
5. Click the 'Download FBX' link at the bottom of the pop-up viewer.

### Import to Your Favourite 3D Software (e.g. Blender, Unity3D)

The FBX models are already rigged. Use your motion library to animate them!

### Upload to Mixamo

To make use of the rich motion library provided by [Mixamo](https://www.mixamo.com), you can also upload the FBX model to Mixamo. The rigging process is completely automatic!

## Installation

We recommend using Anaconda to manage the Python environment. The setup commands below are provided for your reference.

```bash
git clone https://github.com/hongfz16/AvatarCLIP.git
cd AvatarCLIP
conda create -n AvatarCLIP python=3.7
conda activate AvatarCLIP
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
```

In addition to the steps above, you should also install [neural_renderer](https://github.com/daniilidis-group/neural_renderer) following its instructions. Before compiling neural_renderer (doing it after compiling should also be fine), remember to add the following three lines to `neural_renderer/perspective.py` after line 19.

```python
x[z<=0] = 0
y[z<=0] = 0
z[z<=0] = 0
```

This quick fix addresses a rendering issue where objects behind the camera are also rendered. Be careful when using this patched version of neural_renderer in your other projects, because the fix makes the rendering process non-differentiable.

To support offscreen rendering for motion visualization, you should install the OSMesa library.

```bash
conda install -c menpo osmesa
```

## Data Preparation

### Download SMPL Models

Register and download SMPL models [here](https://smpl.is.tue.mpg.de/). Put the downloaded models in the folder `smpl_models`. The folder structure should look like

```
./
├── ...
└── smpl_models/
    ├── smpl/
        ├── SMPL_FEMALE.pkl
        ├── SMPL_MALE.pkl
        └── SMPL_NEUTRAL.pkl
```

### Download Pretrained Models & Other Data

This download is only needed for coarse shape generation and motion generation. You can skip it if you only want to use the other parts.

Download the pretrained weights and other required data [here](https://1drv.ms/u/s!AjLpFg-f48ljgZl9qpU7_6ZA9B7qwA?e=pPcHIG). Put them in the folder `AvatarGen` so that the folder structure looks like

```
./
├── ...
└── AvatarGen/
    └── ShapeGen/
        └── data/
            ├── codebook.pth
            ├── model_VAE_16.pth
            ├── nongrey_male_0110.jpg
            ├── smpl_uv.mtl
            └── smpl_uv.obj
```

Pretrained weights and the human texture for motion generation can be downloaded [here](https://drive.google.com/drive/folders/1TSyeT8MwH5EVQRbNGRVkWsA4Y9Y6dRbk?usp=sharing). Note that the human texture we used to render poses is from the [SURREAL dataset](https://www.di.ens.fr/willow/research/surreal/data/). In addition, you should download the pretrained weights of [VPoser v2.0](https://smpl-x.is.tue.mpg.de/download.php). Put them in the folder `AvatarAnimate` so that the folder structure looks like

```
├── ...
└── AvatarAnimate/
    └── data/
        ├── codebook.pth
        ├── motion_vae.pth
        ├── pose_realnvp.pth
        ├── nongrey_male_0110.jpg
        ├── smpl_uv.mtl
        ├── smpl_uv.obj
        └── vposer
            ├── V02_05.log
            ├── V02_05.yaml
            └── snapshots
                ├── V02_05_epoch=08_val_loss=0.03.ckpt
                └── V02_05_epoch=13_val_loss=0.03.ckpt
```

## Avatar Generation

### Coarse Shape Generation

The folder `AvatarGen/ShapeGen` contains the code for this part. Run the following command to generate the coarse shape corresponding to the shape description 'a strong man'. We recommend using the prompt augmentation 'a 3d rendering of xxx in unreal engine' for better results. The generated coarse body mesh will be stored under `AvatarGen/ShapeGen/output/coarse_shape`.

```bash
python main.py --target_txt 'a 3d rendering of a strong man in unreal engine'
```

Next, the mesh needs to be rendered to initialize the implicit avatar representation. Use the following command for rendering.

```bash
python render.py --coarse_shape_obj output/coarse_shape/a_3d_rendering_of_a_strong_man_in_unreal_engine.obj --output_folder ${RENDER_FOLDER}
```

### Shape Sculpting and Texture Generation

Note that all the code was tested on an NVIDIA V100 (32GB memory). To run on GPUs with less memory, try scaling down the network or reducing `max_ray_num` in the config files.
You can refer to `confs/examples_small/example.conf` or our [colab demo](https://colab.research.google.com/drive/1dfaecX7xF3nP6fyXc8XBljV5QY1lc1TR?usp=sharing) for a scaled-down version of AvatarCLIP.

The folder `AvatarGen/AppearanceGen` contains the code for this part. We provide data, a pretrained model and scripts to perform shape sculpting and texture generation on a zero-beta body (the mean shape defined by SMPL). We provide many example config files under `AvatarGen/AppearanceGen/confs/examples`. For example, to generate 'Abraham Lincoln', which is defined in the config file `confs/examples/abrahamlincoln.conf`, use the following command.

```bash
python main.py --mode train_clip --conf confs/examples/abrahamlincoln.conf
```

Results will be stored in `AvatarCLIP/AvatarGen/AppearanceGen/exp/smpl/examples/abrahamlincoln`.

If you wish to perform shape sculpting and texture generation on the previously generated coarse shape, we also provide example config files: `confs/base_models/astrongman.conf` and `confs/astrongman/*.conf`. Two optimization steps are required, as follows.

```bash
# Initialization of the implicit avatar
python main.py --mode train --conf confs/base_models/astrongman.conf

# Shape sculpting and texture generation on the initialized implicit avatar
python main.py --mode train_clip --conf confs/astrongman/hulk.conf
```

### Marching Cube

To extract meshes from the generated implicit avatar, use the following command.

```bash
python main.py --mode validate_mesh --conf confs/examples/abrahamlincoln.conf
```

The final high-resolution mesh will be stored as `AvatarCLIP/AvatarGen/AppearanceGen/exp/smpl/examples/abrahamlincoln/meshes/00030000.ply`.

## Convert Avatar to FBX Format

To make the generated avatar convenient to use with modern graphics pipelines, we also provide scripts to rig the avatar and convert it to FBX format. See the instructions [here](./Avatar2FBX/README.md).
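Before rigging or converting, it can be handy to sanity-check the mesh produced by the Marching Cube step. The sketch below is not part of the AvatarCLIP code base: `read_ply_counts` is a hypothetical helper that reads only the PLY header (which is ASCII text even in binary PLY files) and reports the declared vertex and face counts. The mesh path is the example path quoted above, taken relative to `AvatarGen/AppearanceGen`, and is assumed to exist only after the `validate_mesh` command has run.

```python
from pathlib import Path


def read_ply_counts(path):
    """Return (vertex_count, face_count) as declared in a PLY file header.

    Hypothetical helper for illustration; it does not parse the mesh body.
    """
    counts = {"vertex": 0, "face": 0}
    with open(path, "rb") as f:
        # Every PLY file starts with the magic line "ply".
        if f.readline().strip() != b"ply":
            raise ValueError(f"{path} is not a PLY file")
        for raw in f:
            parts = raw.decode("ascii", errors="replace").split()
            if parts[:1] == ["end_header"]:
                break
            # Element declarations look like: element <name> <count>
            if len(parts) == 3 and parts[0] == "element" and parts[1] in counts:
                counts[parts[1]] = int(parts[2])
        else:
            # The loop ended without reaching end_header.
            raise ValueError("PLY header has no end_header line")
    return counts["vertex"], counts["face"]


if __name__ == "__main__":
    # Assumed location: the example output path from the Marching Cube step.
    mesh = Path("exp/smpl/examples/abrahamlincoln/meshes/00030000.ply")
    if mesh.exists():
        n_vertices, n_faces = read_ply_counts(mesh)
        print(f"{mesh}: {n_vertices} vertices, {n_faces} faces")
    else:
        print(f"{mesh} not found; run the validate_mesh step first")
```

A mesh that reports zero faces, or a file that fails the header check, most likely means the extraction did not finish, so it is worth re-running `validate_mesh` before moving on to the FBX conversion.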
## Motion Generation

### Candidate Poses Generation

Here we provide four different methods for pose generation.

1. PoseOptimizer: directly optimize the SMPL theta parameters
2. VPoserOptimizer: optimize in the latent space of VPoser
3. VPoserRealNVP: get latent codes of VPoser from a pretrained conditional RealNVP
4. VPoserCodebook: select the poses most similar to the given text feature

We provide configurations to compare these methods. Here are some examples:

```bash
# Suppose your current location is `AvatarCLIP/AvatarAnimate`

# Use PoseOptimizer method to generate poses for "arguing"
python main.py --conf confs/pose_ablation/pose_optimizer/argue.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/pose_ablation/pose_optimizer/argue` directory
# candidate_0.jpg, candidate_1.jpg, ..., candidate_4.jpg are the top-5 poses
# candidate_0.npy, candidate_1.npy, ..., candidate_4.npy are corresponding parameters

# Use VPoserOptimizer method to generate poses for "praying"
python main.py --conf confs/pose_ablation/vposer_optimizer/pray.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/pose_ablation/vposer_optimizer/pray` directory

# Use VPoserRealNVP method to generate poses for "shooting a basketball"
python main.py --conf confs/pose_ablation/vposer_realnvp/shoot_basketball.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/pose_ablation/vposer_realnvp/shoot_basketball` directory

# Use VPoserCodebook method to generate poses for "running"
python main.py --conf confs/pose_ablation/vposer_codebook/run.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/pose_ablation/vposer_codebook/run` directory
```

### Motion Generation

Here we provide three different methods for motion generation.

1. MotionInterpolation: directly interpolate between the given poses
2. MotionOptimizer (baseline): optimize the latent code of a pretrained VAE with a simple reconstruction loss
3. MotionOptimizer (ours): optimize the latent code of a pretrained VAE with a weighted reconstruction loss, a delta loss, and a CLIP loss

We provide configurations to compare these methods. Here are some examples:

```bash
# Suppose your current location is `AvatarCLIP/AvatarAnimate`

# Use MotionInterpolation method to generate motion for "arguing"
python main.py --conf confs/motion_ablation/interpolation/argue.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/motion_ablation/interpolation/argue` directory
# candidate_0.jpg, candidate_1.jpg, ..., candidate_4.jpg are the top-5 poses
# candidate_0.npy, candidate_1.npy, ..., candidate_4.npy are corresponding parameters
# motion.mp4 is the generated motion
# motion.npy is corresponding parameters

# Use MotionOptimizer (baseline) method to generate motion for "praying"
python main.py --conf confs/motion_ablation/baseline/pray.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/motion_ablation/baseline/pray` directory

# Use MotionOptimizer (ours) method to generate motion for "shooting a basketball"
python main.py --conf confs/motion_ablation/motion_optimizer/shoot_basketball.conf
# Results are stored in `AvatarCLIP/AvatarAnimate/exp/motion_ablation/motion_optimizer/shoot_basketball` directory
```

### Make Your Own Configuration

Each configuration contains three independent parts: general setting, pose generator, and motion generator.

```text
# General Setting
general {
    # describe the results path
    base_exp_dir = ./exp/motion_ablation/motion_optimizer/raise_arms

    # if you only want to generate poses, then you can set "mode = pose".
    mode = motion

    # define your prompt. We highly recommend using the format "a rendered 3d man is xxx"
    text = a rendered 3d man is raising both arms
}

# Pose Generator
pose_generator {
    type = VPoserCodebook
    # you can change the number of candidate poses by setting "topk = 10"
    # for PoseOptimizer and VPoserOptimizer, you can further define the number of iterations and the optimizer type
}

# Motion Generator
# if "mode = pose", you can ignore this part
motion_generator {
    type = MotionOptimizer
    # you can further modify the coefficient of each loss.
    # for example, if you find the generated motion is very intensive, you can reduce the coefficient of delta loss.
}
```

## License

Distributed under the S-Lab License. See `LICENSE` for more information.

## Related Works

There are lots of wonderful works that inspired our work or came around the same time as ours.

Dream Fields enables zero-shot text-driven general 3D object generation using CLIP and NeRF.

Text2Mesh proposes to edit a template mesh by predicting offsets and colors per vertex using CLIP and differentiable rendering.

CLIP-NeRF can manipulate 3D objects represented by NeRF with natural language or exemplar images by leveraging CLIP.

Text to Mesh facilitates zero-shot text-driven general mesh generation by deforming from a sphere mesh guided by CLIP.

MotionCLIP establishes a projection from the CLIP text space to the motion space through supervised training, which leads to amazing text-driven motion generation results.

## Acknowledgements

This study is supported by NTU NAP, MOE AcRF Tier 2 (T2EP20221-0033), and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

We thank the following repositories for their contributions to our implementation: [NeuS](https://github.com/Totoro97/NeuS), [smplx](https://github.com/vchoutas/smplx), [vposer](https://github.com/nghorbani/human_body_prior), [Smplx2FBX](https://github.com/mrhaiyiwang/Smplx2FBX).