DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

[Teaser figure]

DreamTalk is a diffusion-based audio-driven expressive talking head generation framework that can produce high-quality talking head videos across diverse speaking styles. DreamTalk exhibits robust performance with a diverse array of inputs, including songs, speech in multiple languages, noisy audio, and out-of-domain portraits.

News

  • [2023.12] Released inference code and pretrained checkpoints.

Installation

conda create -n dreamtalk python=3.7.0
conda activate dreamtalk
pip install -r requirements.txt
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
conda update ffmpeg

pip install urllib3==1.26.6
pip install transformers==4.28.1
pip install dlib
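
To verify the environment before moving on, a quick sanity check like the following (illustrative only, not part of the repo; the expected versions match the pinned packages above) confirms that PyTorch sees the GPU:

import torch
import torchvision

print("torch:", torch.__version__)              # expected: 1.8.0
print("torchvision:", torchvision.__version__)  # expected: 0.9.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))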

Download Checkpoints

In light of the social impact, we have ceased public download access to the checkpoints. If you want to obtain them, please request access by emailing mayf18@mails.tsinghua.edu.cn. Note that sending this email implies your consent to use the provided method solely for academic research purposes.

Put the downloaded checkpoints into the checkpoints folder.
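
The exact checkpoint file names depend on what you receive by email, so the sketch below (illustrative only) just confirms that the checkpoints folder contains .pth files:

import pathlib

ckpt_dir = pathlib.Path("checkpoints")
found = sorted(ckpt_dir.glob("*.pth"))
assert found, f"no .pth files found in {ckpt_dir.resolve()}"
for p in found:
    print(p.name, f"{p.stat().st_size / 1e6:.1f} MB")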

Inference

Run the script:

python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_english.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_neutral_level1_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/uncropped/male_face.png \
--cfg_scale 1.0 \
--max_gen_len 30 \
--output_name acknowledgement_english@M030_front_neutral_level1_001@male_face

wav_path specifies the input audio. Common audio formats such as wav, mp3, m4a, and mp4 (video with sound) should all be compatible.

style_clip_path specifies the reference speaking style and pose_path specifies the head pose. Both are 3DMM parameter sequences extracted from reference videos. You can follow PIRenderer to extract 3DMM parameters from your own videos. Note that the video frame rate should be 25 FPS. In addition, videos used for head pose reference should first be cropped to $256\times256$ using the scripts in FOMM video preprocessing.
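
When extracting your own 3DMM sequences, it helps to inspect the resulting .mat file before running inference. This sketch only prints what is present; the key names and array shapes depend on the extraction script, so none of them are assumed here:

import scipy.io

mat = scipy.io.loadmat("data/style_clip/3DMM/M030_front_neutral_level1_001.mat")
for key, value in mat.items():
    if not key.startswith("__"):  # skip MATLAB metadata entries
        print(key, getattr(value, "shape", type(value)))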

image_path specifies the input portrait. Its resolution should be larger than $256\times256$. Frontal portraits, with the face directly facing forward and not tilted to one side, usually achieve satisfactory results. The input portrait will be cropped to $256\times256$. If your portrait is already cropped to $256\times256$ and you want to disable cropping, use option --disable_img_crop like this:

python inference_for_demo_video.py \
--wav_path data/audio/acknowledgement_chinese.m4a \
--style_clip_path data/style_clip/3DMM/M030_front_surprised_level3_001.mat \
--pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
--image_path data/src_img/cropped/zp1.png \
--disable_img_crop \
--cfg_scale 1.0 \
--max_gen_len 30 \
--output_name acknowledgement_chinese@M030_front_surprised_level3_001@zp1
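
Before using --disable_img_crop, it is worth confirming that the portrait really is $256\times256$; a minimal check with Pillow (illustrative, not part of the repo):

from PIL import Image

img = Image.open("data/src_img/cropped/zp1.png")
w, h = img.size
print(f"portrait is {w}x{h}")
# with --disable_img_crop the portrait must already be exactly 256x256;
# without it, the portrait only needs to be larger than 256x256
assert (w, h) == (256, 256), "resize the portrait or re-enable cropping"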

cfg_scale controls the scale of classifier-free guidance, which adjusts the intensity of the speaking style.

max_gen_len is the maximum video generation duration, measured in seconds. If the input audio exceeds this length, it will be truncated.
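
To see ahead of time whether a clip will be truncated, you can query its duration with ffprobe (a small sketch; it assumes the ffprobe binary from the ffmpeg installation above is on your PATH):

import subprocess

def audio_duration_seconds(path):
    # ffprobe prints the container duration as a bare number
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ], text=True)
    return float(out)

max_gen_len = 30
dur = audio_duration_seconds("data/audio/acknowledgement_english.m4a")
if dur > max_gen_len:
    print(f"audio is {dur:.1f}s; only the first {max_gen_len}s will be used")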

The generated video will be named $(output_name).mp4 and placed in the output_video folder. Intermediate results, including the cropped portrait, will be stored in the tmp/$(output_name) folder.
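
The sample commands follow an audio-stem@style-stem@image-stem convention for output_name; a small helper (illustrative, not part of the repo) reproduces it, which keeps batch runs tidy:

from pathlib import Path

def make_output_name(wav_path, style_clip_path, image_path):
    # mirrors the audio@style@image convention used in the sample commands
    return "@".join(Path(p).stem for p in (wav_path, style_clip_path, image_path))

print(make_output_name(
    "data/audio/acknowledgement_english.m4a",
    "data/style_clip/3DMM/M030_front_neutral_level1_001.mat",
    "data/src_img/uncropped/male_face.png",
))
# -> acknowledgement_english@M030_front_neutral_level1_001@male_face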

Sample inputs are provided in the data folder. Due to copyright issues, we are unable to include the songs we used in this folder.

If you want to run this program on CPU, add --device=cpu to the command-line arguments. (Thanks to lukevs for adding CPU support.)

Ad-hoc solutions to improve resolution

The main goal of this method is to achieve accurate lip-sync and produce vivid expressions across diverse speaking styles; resolution was not a consideration in the initial design. There are two ad-hoc solutions to improve resolution. The first is to use CodeFormer, which can reach $1024\times1024$; however, it is relatively slow, processing only one frame per second on an A100 GPU, and suffers from temporal inconsistency. The second is to employ the Temporal Super-Resolution Model from MetaPortrait, which reaches $512\times512$, runs faster at 10 frames per second, and maintains temporal coherence. Note that these super-resolution modules may reduce the intensity of facial emotions.

The sample results after super-resolution processing are in the output_video folder.

Acknowledgements

We extend our heartfelt thanks for the invaluable contributions made by preceding works to the development of DreamTalk. These include, but are not limited to: PIRenderer, AVCT, StyleTalk, Deep3DFaceRecon_pytorch, Wav2vec2.0, diffusion-point-cloud, and FOMM video preprocessing. We are dedicated to building upon these foundational works with the utmost respect for their original contributions.

Citation

If you find this codebase useful for your research, please use the following entry.

@article{ma2023dreamtalk,
  title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
  author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
  journal={arXiv preprint arXiv:2312.09767},
  year={2023}
}

Disclaimer

This method is intended for RESEARCH/NON-COMMERCIAL USE ONLY.

MIT License

Copyright (c) 2023 Alibaba TongYi Vision Intelligence Lab

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
