
MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Yue Zhang*, Minhao Liu*, Zhaokang Chen, Bin Wu, Yingjie He, Chao Zhan, Wenjiang Zhou (*Equal Contribution, Corresponding Author, benbinwu@tencent.com)

GitHub | Hugging Face | Gradio | Project (coming soon) | Technical report (coming soon)

We introduce MuseTalk, a real-time high quality lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by MuseV, as a complete virtual human solution.

Overview

MuseTalk is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of ft-mse-vae. It

  1. modifies an unseen face according to the input audio, using a face region of 256 x 256.
  2. supports audio in various languages, such as Chinese, English, and Japanese.
  3. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
  4. supports modifying the center point of the proposed face region, which SIGNIFICANTLY affects generation results.
  5. provides a checkpoint trained on the HDTF dataset.
  6. will provide training code (coming soon).

News

  • [04/02/2024] Release MuseTalk project and pretrained models.
  • [04/16/2024] Release Gradio demo on HuggingFace Spaces (thanks to HF team for their community grant)

Model

[Figure: Model Structure]

MuseTalk was trained in latent space, where the images were encoded by a frozen VAE and the audio was encoded by a frozen whisper-tiny model. The architecture of the generation network was borrowed from the UNet of stable-diffusion-v1-4, where the audio embeddings are fused with the image embeddings by cross-attention.

Note that although we use an architecture very similar to Stable Diffusion, MuseTalk is not a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
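For readers unfamiliar with cross-attention, the fusion step can be written in its standard form. This is the generic formulation used in Stable Diffusion style UNets, not a transcription of MuseTalk's code. The symbols are notation introduced here for illustration: $z$ for the image latent features, $a$ for the audio embeddings, $W_Q$, $W_K$, $W_V$ for learned projection matrices, and $d$ for the key dimension.

```latex
\begin{align}
Q &= z\,W_Q, \qquad K = a\,W_K, \qquad V = a\,W_V \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\end{align}
```

The queries come from the image side and the keys and values from the audio side, so each spatial location in the latent can attend to the audio features that are relevant to it.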

Cases

MuseV + MuseTalk bring human photos to life!

Image | MuseV + MuseTalk
  • The character in the last two rows, Xinying Sun, is a supermodel KOL. You can follow her on Douyin.

Video dubbing

MuseTalk | Original videos (Link)
  • For video dubbing, we applied a self-developed tool that identifies the talking person.

Some interesting videos!

Image | MuseV + MuseTalk

TODO:

  • trained models and inference codes.
  • Huggingface Gradio demo.
  • codes for real-time inference.
  • technical report.
  • training codes.
  • a better model (may take longer).

Getting Started

We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:

Third party integration

Thanks to the third-party integrations, which make installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party projects; refer to each project for specifics.

ComfyUI

Installation

To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

pip install -r requirements.txt

mmlab packages

pip install --no-cache-dir -U openmim 
mim install mmengine 
mim install "mmcv>=2.0.1" 
mim install "mmdet>=3.1.0" 
mim install "mmpose>=1.1.0" 
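After installing, a quick check can confirm that the interpreter and packages are in place. The sketch below is not part of the repository; the helper names are hypothetical, and the module list assumes the usual import names (`cv2` for opencv, and the package names from the commands above).

```python
import importlib.util
import sys

# Import names are assumptions: "cv2" is the module provided by opencv;
# the rest mirror the package names used in the install commands.
REQUIRED_MODULES = ["cv2", "diffusers", "mmengine", "mmcv", "mmdet", "mmpose"]


def python_version_ok(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the interpreter meets the recommended minimum version."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list means every package was found; it does not verify that the installed versions satisfy the constraints given to mim.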

Download ffmpeg-static

Download ffmpeg-static and set the FFMPEG_PATH environment variable to its directory:

export FFMPEG_PATH=/path/to/ffmpeg

for example:

export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
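If you call ffmpeg from your own Python scripts, the exported variable can be picked up as sketched below. This is a minimal illustration assuming FFMPEG_PATH holds the directory containing the ffmpeg binary; `add_ffmpeg_to_path` is a hypothetical helper, not a function from the repository.

```python
import os


def add_ffmpeg_to_path(env=os.environ):
    """Prepend the directory in FFMPEG_PATH to PATH if it is not already there.

    Returns True when FFMPEG_PATH is set, False otherwise.
    """
    ffmpeg_dir = env.get("FFMPEG_PATH")
    if not ffmpeg_dir:
        return False
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if ffmpeg_dir not in entries:
        env["PATH"] = os.pathsep.join([ffmpeg_dir] + entries)
    return True
```

After a successful call, `shutil.which("ffmpeg")` should resolve to the static build, provided the binary in that directory is executable.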

Download weights

You can download weights manually as follows:

  1. Download our trained weights.

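  2. Download the weights of the other components (sd-vae-ft-mse, whisper, dwpose, face-parse-bisent, and resnet18, as listed in the layout below).

Finally, organize these weights in the models directory as follows: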

./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
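Before running inference, it can help to confirm that every file in the layout above is present. The following is a small sketch, not a script shipped with the project; the file list is copied from the tree above, and `missing_weights` is a hypothetical helper name.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
EXPECTED_FILES = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "whisper/tiny.pt",
]


def missing_weights(models_dir="./models"):
    """Return the expected weight files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files found.")
```

The check only tests for existence; it does not validate file contents or checksums.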

Quickstart

Inference

Here, we provide the inference script.

python -m scripts.inference --inference_config configs/inference/test.yaml 

configs/inference/test.yaml is the path to the inference configuration file, which specifies video_path and audio_path. video_path should be either a video file or a directory of images.
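The two accepted forms of video_path can be checked before launching a run. The sketch below is illustrative only: `classify_video_path` is a hypothetical helper, and the extension sets are assumptions rather than the lists the inference script actually uses.

```python
from pathlib import Path

# Extension sets are illustrative assumptions, not the project's own lists.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def classify_video_path(video_path):
    """Return "video" for a video file or "images" for a directory of images.

    Raises ValueError if the path matches neither form.
    """
    path = Path(video_path)
    if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
        return "video"
    if path.is_dir() and any(
        p.suffix.lower() in IMAGE_EXTS for p in path.iterdir()
    ):
        return "images"
    raise ValueError(
        f"{video_path!r} is neither a video file nor a directory of images"
    )
```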

We recommend input video at 25fps, the frame rate used when training the model. If your video is well below 25fps, apply frame interpolation or convert it to 25fps directly with ffmpeg.
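A direct conversion to 25fps can be scripted as sketched below. It assumes an ffmpeg binary is reachable on PATH and uses ffmpeg's `-r` output option, which duplicates or drops frames to reach the target rate rather than interpolating new ones. The function names are hypothetical.

```python
import subprocess


def build_fps_command(src, dst, fps=25, ffmpeg="ffmpeg"):
    """Build an ffmpeg command that re-encodes src to dst at a constant fps."""
    # -y overwrites dst if it exists; -r after -i sets the output frame rate.
    return [ffmpeg, "-y", "-i", str(src), "-r", str(fps), str(dst)]


def convert_to_fps(src, dst, fps=25, ffmpeg="ffmpeg"):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_fps_command(src, dst, fps, ffmpeg), check=True)
```

For example, `convert_to_fps("input.mp4", "input_25fps.mp4")` writes a 25fps copy next to the source. For footage far below 25fps, a dedicated frame interpolation tool is likely to look smoother than this frame duplication.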

Using bbox_shift to adjust results

We have found that the upper bound of the mask has an important impact on mouth openness. To control the mask region, we suggest using the bbox_shift parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease it.

You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.

For example, in the case of Xinying Sun, running the default configuration reports an adjustable value range of [-9, 9]. To decrease the mouth openness, we then set the value to -7.

python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7 

More technical details can be found in bbox_shift.
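The idea of shifting the upper bound can be pictured with a simplified sketch. This is not the repository's implementation (see the bbox_shift document for what the project actually does); it only assumes a box in image coordinates, where y grows downward, and clamps the requested shift to the reported range. Both function names are hypothetical.

```python
def clamp_shift(shift, value_range):
    """Clamp shift to the inclusive (low, high) adjustable range."""
    low, high = value_range
    return max(low, min(high, shift))


def shift_mask_top(bbox, shift, value_range):
    """Move the upper bound of bbox = (x1, y1, x2, y2) by a clamped shift.

    Image coordinates are assumed, so a positive shift moves the upper
    bound towards the lower half and a negative shift moves it upward.
    """
    x1, y1, x2, y2 = bbox
    return (x1, y1 + clamp_shift(shift, value_range), x2, y2)
```

With the range [-9, 9] from the example above, a shift of -7 raises the upper bound by 7 pixels, while a request of 20 would be limited to 9.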

Combining MuseV and MuseTalk

As a complete virtual human solution, we suggest first using MuseV to generate a video (text-to-video, image-to-video, or pose-to-video) as described in the MuseV documentation. Frame interpolation is suggested to increase the frame rate. Then use MuseTalk to generate a lip-synced video by following the Inference instructions above.

Note

If you want to run online video chats, we suggest generating the videos with MuseV and applying the necessary pre-processing, such as face detection and face parsing, in advance. During online chatting, only the UNet and the VAE decoder are involved, which makes MuseTalk real-time.
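The split between offline pre-processing and the online loop can be outlined as below. This is a structural sketch under stated assumptions, not the project's real-time code: every callable (`detect_face`, `parse_face`, `encode_latent`, `unet`, `decode`) is a hypothetical stand-in supplied by the caller, and caching the encoded latents ahead of time is an inference from the statement that only the UNet and VAE decoder run online.

```python
from itertools import cycle


def preprocess(frames, detect_face, parse_face, encode_latent):
    """Run the per-frame steps once, ahead of time, and cache the results."""
    cache = []
    for frame in frames:
        bbox = detect_face(frame)
        cache.append({
            "frame": frame,
            "bbox": bbox,
            "mask": parse_face(frame, bbox),
            "latent": encode_latent(frame, bbox),
        })
    return cache


def stream(audio_features, cache, unet, decode):
    """Yield one generated face per audio feature, looping over cached frames.

    Only the UNet and the decoder are called here; detection, parsing, and
    encoding were already done in preprocess().
    """
    for audio_feat, item in zip(audio_features, cycle(cache)):
        latent = unet(item["latent"], audio_feat)  # single forward pass
        yield decode(latent), item
```

Each yielded pair carries the decoded face plus the cached bbox and mask, which is the information a caller would need to blend the result back into the original frame.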

Acknowledgement

  1. We thank open-source components such as whisper, dwpose, face-alignment, face-parsing, and S3FD.
  2. MuseTalk draws heavily on diffusers and isaacOnline/whisper.
  3. MuseTalk was built on the HDTF dataset.

Thanks for open-sourcing!

Limitations

  • Resolution: Though MuseTalk uses a face region size of 256 x 256, which makes it better than other open-source methods, it has not yet reached the theoretical resolution bound. We will continue to work on this problem.
    If you need higher resolution, you could apply super resolution models such as GFPGAN in combination with MuseTalk.

  • Identity preservation: Some details of the original face are not well preserved, such as mustache, lip shape and color.

  • Jitter: Some jitter remains because the current pipeline generates each frame independently.

Citation

@article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting},
  author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and He, Yingjie and Zhan, Chao and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
}

Disclaimer/License

  1. code: The code of MuseTalk is released under the MIT License. There is no limitation on either academic or commercial usage.
  2. model: The trained models are available for any purpose, even commercially.
  3. other open-source models: Other open-source models used, such as whisper, ft-mse-vae, dwpose, and S3FD, must be used in compliance with their own licenses.
  4. test data: The test data are collected from the internet and are available for non-commercial research purposes only.
  5. AIGC: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.

MIT License

Copyright (c) 2024 TMElyralab

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
