# MMAudio

**Repository Path**: MufcLiuKai/MMAudio

## Basic Information

- **Project Name**: MMAudio
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-12-17
- **Last Updated**: 2024-12-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

MMAudio

Paper (Soon) | Webpage | Models | Huggingface Demo | Colab Demo | Replicate Demo

## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)

[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

**Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**

## Highlight

MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames.

## Results

(All audio from our algorithm MMAudio)

Videos from Sora:

https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330

Videos from MovieGen/Hunyuan Video/VGGSound:

https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca

For more results, visit https://hkchengrex.com/MMAudio/video_main.html.

## Update Logs

- 2024-12-14: Removed the `ffmpeg<7` requirement for the demos by replacing `torio.io.StreamingMediaDecoder` with `pyav` for reading frames. The read frames are also cached, so we are not reading the same frames again during reconstruction. This should speed things up and make installation less of a hassle.
- 2024-12-13: Improved for-loop processing in CLIP/Sync feature extraction by introducing a batch size multiplier. We can approximately use 40x batch size for CLIP/Sync without using more memory, thereby speeding up processing.
  Removed VAE encoder during inference -- we don't need it.
- 2024-12-11: Replaced `torio.io.StreamingMediaDecoder` with `pyav` for reading framerate when reconstructing the input video. `torio.io.StreamingMediaDecoder` does not work reliably in huggingface ZeroGPU's environment, and I suspect that it might not work in some other environments as well.

## Installation

We have only tested this on Ubuntu.

### Prerequisites

We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.

- Python 3.9+
- PyTorch **2.5.1+** and corresponding torchvision/torchaudio (pick your CUDA version https://pytorch.org/, pip install recommended)

**1. Install prerequisite if not yet met:**

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```

(Or any other CUDA versions that your GPUs/driver support)

**2. Clone our repository:**

```bash
git clone https://github.com/hkchengrex/MMAudio.git
```

**3. Install with pip (install pytorch first before attempting this!):**

```bash
cd MMAudio
pip install -e .
```

(If you encounter the `File "setup.py" not found` error, upgrade your pip with `pip install --upgrade pip`)

**Pretrained models:**

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
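For a manually downloaded file, the MD5 digest can be checked by hand with Python's standard `hashlib`. The sketch below is not the repository's own download utility: the helper names `md5_of_file` and `matches_md5` are hypothetical, and the expected digest has to be copied from `mmaudio/utils/download_utils.py`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as f:
        # Read in chunks so multi-gigabyte weight files are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `matches_md5("weights/mmaudio_large_44k_v2.pth", expected)` returns `True` when the file is intact, where `expected` is the digest listed for that file.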
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main

| Model | Download link | File size |
| -------- | ------- | ------- |
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 **(recommended)** | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |

To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will be downloaded automatically.

The expected directory structure (full):

```bash
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```

The expected directory structure (minimal, for the recommended model only):

```bash
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```

## Demo

By default, these scripts use the `large_44k_v2` model. In our experiments, inference takes only around 6GB of GPU memory (in 16-bit mode), which should fit in most modern GPUs.
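Before running the demo, it can be useful to see which files from the minimal directory structure are already in place. The following is a small sketch using only the standard library; the `missing_weights` helper and the `MINIMAL_FILES` list are hypothetical names, not part of the repository, and the list simply mirrors the minimal layout for the recommended model. Missing files are not necessarily an error, since the models are downloaded automatically when the demo script runs.

```python
from pathlib import Path

# Files from the minimal directory structure (recommended model only).
MINIMAL_FILES = [
    "ext_weights/synchformer_state_dict.pth",
    "ext_weights/v1-44.pth",
    "weights/mmaudio_large_44k_v2.pth",
]


def missing_weights(repo_root, expected=MINIMAL_FILES):
    """Return the expected weight files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the repository root (the MMAudio directory).
    missing = missing_weights(".")
    if missing:
        print("Not found locally:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All minimal weight files are present.")
```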
### Command-line interface

With `demo.py`

```bash
python demo.py --duration=8 --video= --prompt "your prompt"
```

The output (audio in `.flac` format, and video in `.mp4` format) will be saved in `./output`. See the file for more options. Simply omit the `--video` option for text-to-audio synthesis. The default output (and training) duration is 8 seconds. Longer or shorter durations could also work, but a large deviation from the training duration may result in lower quality.

### Gradio interface

Supports video-to-audio and text-to-audio synthesis. Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) if necessary. Our default port is `7860`, which you can change in `gradio_demo.py`.

```
python gradio_demo.py
```

### Known limitations

1. The model sometimes generates undesired, unintelligible, human speech-like sounds
2. The model sometimes generates undesired background music
3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing".

We believe all three of these limitations can be addressed with more high-quality training data.

## Training

Work in progress.

## Evaluation

Work in progress.

## Datasets

MMAudio was trained on several datasets, including [AudioSet](https://research.google.com/audioset/), [Freesound](https://github.com/LAION-AI/audio-dataset/blob/main/laion-audio-630k/README.md), [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [AudioCaps](https://audiocaps.github.io/), and [WavCaps](https://github.com/XinhaoMei/WavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.
## Acknowledgement

Many thanks to:

- [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [Synchformer](https://github.com/v-iashin/Synchformer)
- [EDM2](https://github.com/NVlabs/edm2) for the magnitude-preserving network architecture