# V-Express
**Repository Path**: jonker_m/V-Express
## Basic Information
- **Project Name**: V-Express
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-06-03
- **Last Updated**: 2024-06-03
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# **_V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation_**
---
## Introduction
In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent.
A common approach involves leveraging generative models to enhance adapters for controlled generation.
However, control signals such as text, audio, reference image, pose, and depth map vary in strength.
Weaker conditions often struggle to take effect because of interference from stronger ones, which makes balancing them a challenge.
In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as pose and the original image.
Training directly with weak signals, in turn, often has difficulty converging.
To address this, we propose V-Express, a simple method that balances different control signals through a series of progressive drop operations.
Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account pose, input image, and audio.
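
The idea of progressive drop operations can be illustrated with a minimal sketch. This is not the released training code: the stage names, the drop probabilities, and the `drop_conditions` helper are all assumptions chosen for illustration. It only shows the general pattern of removing the strong conditions (pose and reference image) more often in later stages, so that the weak audio condition has to carry more of the control.

```python
import random

# Hypothetical per-stage drop probabilities for the strong conditions.
# These numbers are illustrative assumptions, not values from V-Express.
DROP_SCHEDULE = {
    "stage_1": {"pose": 0.0, "reference": 0.0},
    "stage_2": {"pose": 0.3, "reference": 0.3},
    "stage_3": {"pose": 0.5, "reference": 0.5},
}


def drop_conditions(conditions, stage, rng):
    """Return a copy of `conditions` with strong signals randomly removed.

    A dropped condition is replaced by None, standing in for the zeroed or
    masked features a real model would receive. Conditions that are not in
    the schedule (audio, in this sketch) are always kept.
    """
    probs = DROP_SCHEDULE[stage]
    kept = {}
    for name, value in conditions.items():
        p = probs.get(name, 0.0)
        kept[name] = None if rng.random() < p else value
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"pose": "kps", "reference": "image", "audio": "waveform"}
    for stage in DROP_SCHEDULE:
        print(stage, drop_conditions(sample, stage, rng))
```

In this sketch the audio entry is never dropped, while the chance of losing pose or reference grows from stage to stage.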

## Release
- [2024/05/29] 🔥 We have added video post-processing that can effectively mitigate the flicker problem.
- [2024/05/23] 🔥 We release the code and models.
## Installation
```shell
# install requirements
pip install diffusers==0.24.0
pip install imageio-ffmpeg==0.4.9
pip install insightface==0.7.3
pip install omegaconf==2.2.3
pip install onnxruntime==1.16.3
pip install safetensors==0.4.2
pip install torch==2.0.1
pip install torchaudio==2.0.2
pip install torchvision==0.15.2
pip install transformers==4.30.2
pip install einops==0.4.1
pip install tqdm==4.66.1
pip install xformers==0.0.22
pip install av==11.0.0
# download the code
git clone https://github.com/tencent-ailab/V-Express
# download the models
cd V-Express
git lfs install
git clone https://huggingface.co/tk93/V-Express
mv V-Express/model_ckpts model_ckpts
# then you can use the scripts
```
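Because the requirements are pinned to exact versions, it can help to confirm what is actually installed before running the scripts. The following is a minimal sketch that uses only the standard library. The `PINNED` mapping copies a few pins from the list above, and `check_versions` and `version_matches` are hypothetical helper names, not part of the repository.

```python
from importlib import metadata

# A subset of the pins listed in the installation steps above.
PINNED = {
    "diffusers": "0.24.0",
    "torch": "2.0.1",
    "transformers": "4.30.2",
    "xformers": "0.0.22",
}


def version_matches(expected, installed):
    """True if installed equals expected, ignoring a local suffix like +cu117."""
    if installed is None:
        return False
    return installed == expected or installed.startswith(expected + "+")


def check_versions(pinned, get_version=metadata.version):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        status = "ok" if version_matches(expected, installed) else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

The suffix handling is there because some builds report a local version tag after the release number.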
## Download Models
You can download the models from [here](https://huggingface.co/tk93/V-Express). All the required models are included in the model card. You can also download the models separately from their original repositories.
- [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse).
- [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5). Only the UNet model configuration file is needed here.
- [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h).
- [insightface/buffalo_l](https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip).
## How to Use
### Important Reminder
**_Important! Important!! Important!!!_**
In the talking-face generation task, when the target video does not show the same person as the reference image, retargeting the face becomes a very important step. Choosing a target video whose pose is close to that of the reference face also gives better results. In addition, the model currently performs better on English; other languages have not yet been tested in detail.
### Run the demo (step1, _optional_)
If you have a target talking video, you can follow the script below to extract the audio and face V-kps sequences from the video. You can also skip this step and run the script in Step 2 directly to try the example we provided.
```shell
python scripts/extract_kps_sequence_and_audio.py \
--video_path "./test_samples/short_case/AOC/gt.mp4" \
--kps_sequence_save_path "./test_samples/short_case/AOC/kps.pth" \
--audio_save_path "./test_samples/short_case/AOC/aud.mp3"
```
We recommend cropping a clear square face image as in the example below and making sure the resolution is no lower than 512x512. The green to red boxes in the image below are the recommended cropping ranges.
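The cropping recommendation can be checked before inference with a few lines of Python. This is a sketch under stated assumptions: `center_square_box` and `meets_recommendation` are hypothetical helper names, and the centered crop is only a stand-in, since in practice the box should be placed around the face. The 512 threshold comes from the recommendation above. The helpers work on plain width and height values, so they can be paired with any image library.

```python
MIN_SIDE = 512  # minimum recommended resolution from the text above


def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def meets_recommendation(width, height, min_side=MIN_SIDE):
    """True if the image is square and at least min_side pixels per side."""
    return width == height and width >= min_side


if __name__ == "__main__":
    box = center_square_box(1280, 720)
    side = box[2] - box[0]
    print(box, meets_recommendation(side, side))
```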
### Run the demo (step2, _core_)
**Scenario 1 (A's picture and A's talking video.) (Best Practice)**
If you have a picture of A and a talking video of A in another scene, run the following script. The model can generate talking videos that are consistent with the given video. _You can see more examples on our [project page](https://tenvence.github.io/p/v-express/)._
```shell
python inference.py \
--reference_image_path "./test_samples/short_case/AOC/ref.jpg" \
--audio_path "./test_samples/short_case/AOC/aud.mp3" \
--kps_path "./test_samples/short_case/AOC/kps.pth" \
--output_path "./output/short_case/talk_AOC_no_retarget.mp4" \
--retarget_strategy "no_retarget" \
--num_inference_steps 25
```
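To run the same command over several cases, the invocation can be assembled from Python. This is a sketch, not part of the repository: `build_inference_command` is a hypothetical helper, and it assumes each case folder follows the `ref.jpg`, `aud.mp3`, and `kps.pth` layout of the example above. Only flags that appear in the command above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(case_dir, output_path, steps=25):
    """Build the argv list for inference.py using the example's file layout."""
    case = Path(case_dir)
    return [
        sys.executable, "inference.py",
        "--reference_image_path", str(case / "ref.jpg"),
        "--audio_path", str(case / "aud.mp3"),
        "--kps_path", str(case / "kps.pth"),
        "--output_path", str(output_path),
        "--retarget_strategy", "no_retarget",
        "--num_inference_steps", str(steps),
    ]


if __name__ == "__main__":
    cmd = build_inference_command(
        "./test_samples/short_case/AOC",
        "./output/short_case/talk_AOC_no_retarget.mp4",
    )
    print(" ".join(cmd))
    # Uncomment to actually run inference from the repository root:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.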