# MMAudio
## [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)

[Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

**Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added later. Code is subject to non-backward-compatible changes.**

## Highlight

MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames.

## Results

(All audio from our algorithm MMAudio)

Videos from Sora:

https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330

Videos from MovieGen/Hunyuan Video/VGGSound:

https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca

For more results, visit https://hkchengrex.com/MMAudio/video_main.html.

## Update Logs

- 2024-12-14: Removed the `ffmpeg<7` requirement for the demos by replacing `torio.io.StreamingMediaDecoder` with `pyav` for reading frames. The read frames are also cached, so we do not read the same frames again during reconstruction. This should speed things up and make installation less of a hassle.
- 2024-12-13: Improved for-loop processing in CLIP/Sync feature extraction by introducing a batch size multiplier. We can use approximately 40x the batch size for CLIP/Sync without using more memory, thereby speeding up processing. Removed the VAE encoder during inference -- we don't need it.
- 2024-12-11: Replaced `torio.io.StreamingMediaDecoder` with `pyav` for reading the framerate when reconstructing the input video. `torio.io.StreamingMediaDecoder` does not work reliably in Hugging Face ZeroGPU's environment, and I suspect that it might not work in some other environments as well.

## Installation

We have only tested this on Ubuntu.

### Prerequisites

We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.

- Python 3.9+
- PyTorch **2.5.1+** and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/; pip install recommended)

**1. Install the prerequisites if not yet met:**

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```

(Or any other CUDA version that your GPUs/driver support.)

**2. Clone our repository:**

```bash
git clone https://github.com/hkchengrex/MMAudio.git
```

**3. Install with pip (install PyTorch first before attempting this!):**

```bash
cd MMAudio
pip install -e .
```

(If you encounter a `File "setup.py" not found` error, upgrade your pip with `pip install --upgrade pip`.)

**Pretrained models:**

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
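If you download any checkpoint manually instead, you can check it against the checksums in `mmaudio/utils/download_utils.py`. A minimal sketch (the file path is only an example; compare the printed hash with the listed value yourself):

```bash
# Compute the MD5 of a manually downloaded checkpoint and compare it
# with the corresponding value in mmaudio/utils/download_utils.py.
md5sum weights/mmaudio_large_44k_v2.pth
```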
The models are also available at https://huggingface.co/hkchengrex/MMAudio/tree/main

| Model | Download link | File size |
| -------- | ------- | ------- |
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| Flow prediction network, large 44.1kHz, v2 **(recommended)** | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |

To run the model, you need four components: a flow prediction network, visual feature extractors (Synchformer and CLIP; CLIP will be downloaded automatically), a VAE, and a vocoder. VAEs and vocoders are specific to the sampling rate (16kHz or 44.1kHz), not to the model size. The 44.1kHz vocoder will be downloaded automatically.

The expected directory structure (full):

```bash
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
```

The expected directory structure (minimal, for the recommended model only):

```bash
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k_v2.pth
└── ...
```

## Demo

By default, these scripts use the `large_44k_v2` model. In our experiments, inference takes only around 6GB of GPU memory (in 16-bit mode), which should fit most modern GPUs.

### Command-line interface

With `demo.py`:

```bash
python demo.py --duration=8 --video=
```
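To generate audio for several clips in one go, a simple shell loop can reuse the same flags. This is only a sketch with a hypothetical input folder; it assumes `demo.py` accepts the `--duration` and `--video` flags shown above:

```bash
# Hypothetical batch run: generate audio for every .mp4 in a folder,
# reusing the --duration and --video flags from the single-example command.
for f in /path/to/videos/*.mp4; do
    python demo.py --duration=8 --video="$f"
done
```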