# MoChaBench

This repo contains the `Benchmark`, **`Standard Evaluation Codebase`**, and `MoCha's Generation Results` for [MoCha: Towards Movie-Grade Talking Character Synthesis](https://arxiv.org/pdf/2503.23307).

Many thanks to the community for sharing. [An emotional narrative](https://x.com/CongWei1230/status/1907879690959732878), created with light manual editing of clips generated by **MoCha**, has surpassed **1 million views** on X.


## 📑 Table of Contents

- [🏆 MoChaBench Leaderboard](#-mochabench-leaderboard)
- ▶️ Evaluating Lip Sync Scores
  - [Overview](#overview)
  - [Benchmark files](#benchmark-files)
  - [How to Use](#how-to-use)
    - [Download this repo](#download-this-repo)
    - [Dependencies](#dependencies)
    - [Example Script to run SyncNetPipeline on single pair of (video, speech)](#example-script-to-run-syncnetpipeline-on-single-pair-of-video-speech)
    - [Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation](#running-syncnetpipeline-on-mocha-generated-videos-for-mochabench-evaluation)
    - [Running SyncNetPipeline on Your Model’s Outputs for MoChaBench](#running-syncnetpipeline-on-your-models-outputs-for-mochabench)
  - [🧩 Custom Benchmark Evaluation](#-custom-benchmark-evaluation)
- ▶️ Evaluating VIEScore
  - [Evaluating with GPT-4o](#evaluating-with-gpt-4o)
  - [Evaluating Alignment with Human Ratings](#evaluating-alignment-with-human-ratings)
- [📚 Citation](#-citation)
# 🏆 MoChaBench Leaderboard

### 🧑 Single-Character Monologue (English)

Including categories: 1p_camera_movement, 1p_closeup_facingcamera, 1p_emotion, 1p_mediumshot_actioncontrol, 1p_portrait, 2p_1clip_1talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|-------------|--------------|--------------|
| MoCha       | **6.333**    | **8.185**    |
| Hallo3      | 4.866        | 8.963        |
| SadTalker   | 4.727        | 9.239        |
| AniPortrait | 1.740        | 11.383       |

---

### 👥 Multi-Character Turn-based Dialogue (English)

Including categories: 2p_2clip_2talk

| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|--------|--------------|--------------|
| MoCha  | **4.951**    | **8.601**    |

---

### Per-Category Averages

| Category                    | Model | Sync-Dist. ↓ | Sync-Conf. ↑ | Examples (n) |
|-----------------------------|-------|--------------|--------------|--------------|
| 1p_camera_movement          | MoCha | 8.455        | 5.432        | 18           |
| 1p_closeup_facingcamera     | MoCha | 7.958        | 6.298        | 27           |
| 1p_emotion                  | MoCha | 8.073        | 6.214        | 34           |
| 1p_generalize_chinese       | MoCha | 8.273        | 4.398        | 4            |
| 1p_mediumshot_actioncontrol | MoCha | 8.386        | 6.241        | 52           |
| 1p_protrait                 | MoCha | 8.125        | 6.892        | 38           |
| 2p_1clip_1talk              | MoCha | 8.082        | 6.493        | 30           |
| 2p_2clip_2talk              | MoCha | 8.601        | 4.951        | 15           |

# ▶️ Evaluating Lip Sync Scores

## Overview

We use SyncNet for evaluation. The codebase is adapted from [joonson/syncnet_python](https://github.com/joonson/syncnet_python) with **improved code structure and a unified API** to facilitate evaluation for the community. The implementation follows a Hugging Face Diffusers-style structure.

We provide a `SyncNetPipeline` class, located at `eval-lipsync/script/syncnet_pipeline.py`. You can initialize `SyncNetPipeline` by providing the weights and configs:

```python
pipe = SyncNetPipeline(
    {
        "s3fd_weights": "path to sfd_face.pth",
        "syncnet_weights": "path to syncnet_v2.model",
    },
    device="cuda",  # or "cpu"
)
```

The pipeline offers an `inference` function to score a single pair of video and speech. For a fair comparison, the input speech should be a denoised vocal source extracted from your audio; you can use a separator such as [Kim_Vocal_2](https://huggingface.co/huangjackson/Kim_Vocal_2) for general noise removal and [Demucs_mdx_extra](https://github.com/facebookresearch/demucs) for music removal (see the sketch after the example below).

```python
av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",   # RGB video
    audio_path="path to speech.wav",  # speech track (must be denoised from audio, ffmpeg-readable format)
    cache_dir="path to store intermediate output",  # optional; omit to auto-cleanup intermediates
)
```
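The benchmark already ships pre-extracted speech tracks (see `benchmark/speeches` in the next section). If you need to denoise your own audio, the following is a minimal sketch of driving the Demucs CLI from Python; the flags and output layout are assumptions based on recent Demucs releases and may need adjusting for your installed version.

```python
# Sketch: extract a denoised vocal stem with the Demucs CLI.
# Assumes `pip install demucs` and ffmpeg are available; paths/flags may differ by Demucs version.
import subprocess
from pathlib import Path

audio_path = Path("benchmark/audios/1p_camera_movement/10_man_basketball_camera_push_in.wav")
out_dir = Path("separated")

# "--two-stems=vocals" keeps only vocals vs. accompaniment; "-n mdx_extra" selects the mdx_extra model.
subprocess.run(
    ["demucs", "--two-stems=vocals", "-n", "mdx_extra", "-o", str(out_dir), str(audio_path)],
    check=True,
)

# Demucs typically writes <out_dir>/<model_name>/<track_name>/vocals.wav (verify for your version).
vocal_path = out_dir / "mdx_extra" / audio_path.stem / "vocals.wav"
print("denoised speech track:", vocal_path)
```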
## Benchmark files

We provide the benchmark files in the `benchmark/` directory, organized by data type and category. Each file follows the structure `benchmark/<data_type>/<category>/<filename>.<ext>`.

Directory structure:

```cmd
├─benchmark
│  ├─audios
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.wav
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  ├─first-frames-from-mocha-generation
│  │  ├─1p_camera_movement
│  │  │  ├─ 10_man_basketball_camera_push_in.png
│  │  │  ...
│  │  ├─1p_closeup_facingcamera
│  │  ├─1p_emotion
│  │  ├─1p_generalize_chinese
│  │  ├─1p_mediumshot_actioncontrol
│  │  ├─1p_protrait
│  │  ├─2p_1clip_1talk
│  │  └─2p_2clip_2talk
│  └─speeches
│     ├─1p_camera_movement
│     │  ├─ 10_man_basketball_camera_push_in_speech.wav
│     │  ...
│     ├─1p_closeup_facingcamera
│     ├─1p_emotion
│     ├─1p_generalize_chinese
│     ├─1p_mediumshot_actioncontrol
│     ├─1p_protrait
│     ├─2p_1clip_1talk
│     └─2p_2clip_2talk
└──benchmark.csv
```

- **`benchmark.csv`** contains metadata for each sample; each row specifies `idx_in_category`, `category`, `context_id`, and `prompt`. We use `benchmark.csv` to connect files: any file in the benchmark can be located via the combination `benchmark/<data_type>/<category>/<filename>.<ext>` (see the snippet below).
- `speeches` files are generated from `audios` files using [Demucs_mdx_extra](https://github.com/facebookresearch/demucs). For a fair comparison, `speeches` (not `audios`) should also be used as the input to your own model.
- We also provide `first-frames-from-mocha-generation` to facilitate fair comparison for (image + text + audio → video) models.
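As an illustration of how the benchmark files line up on disk, the sketch below (not one of the repo's scripts) walks `benchmark/speeches`, strips the `_speech` suffix shown in the tree above, and pairs each speech track with its original audio and first frame. Adjust the suffix and extension handling if your checkout differs.

```python
# Sketch: enumerate (speech, audio, first frame) triplets per category.
# Assumes the directory layout and the "_speech" filename suffix shown in the tree above.
from pathlib import Path

benchmark_root = Path("benchmark")

for speech_path in sorted(benchmark_root.glob("speeches/*/*_speech.wav")):
    category = speech_path.parent.name               # e.g. "1p_camera_movement"
    stem = speech_path.stem[: -len("_speech")]       # e.g. "10_man_basketball_camera_push_in"

    audio_path = benchmark_root / "audios" / category / f"{stem}.wav"
    frame_path = benchmark_root / "first-frames-from-mocha-generation" / category / f"{stem}.png"

    print(category, stem, audio_path.exists(), frame_path.exists())
```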
## How to Use

### Download this repo

The [SyncNet weights](https://github.com/congwei1230/MoChaBench/tree/main/eval-lipsync/weights), the [Benchmark](https://github.com/congwei1230/MoChaBench/tree/main/benchmark), and [MoCha's Generation Results](https://github.com/congwei1230/MoChaBench/tree/main/mocha-generation) are embedded in this git repo:

```sh
git clone https://github.com/congwei1230/MoChaBench.git
```

### Dependencies

```sh
conda create -n mochabench_eval python=3.8
conda activate mochabench_eval
cd eval-lipsync
pip install -r requirements.txt
# requires ffmpeg to be installed
```

### Example Script to run SyncNetPipeline on single pair of (video, speech)

```sh
cd script
python run_syncnet_pipeline_on_1example.py
```

You are expected to get values close to the following (within roughly ±0.1, depending on your ffmpeg version; these numbers were produced with `ffmpeg version 7.1.1-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers`):

```
AV offset:        1
Min dist:         9.255
Confidence:       4.497
best-confidence : 4.4973907470703125
lowest distance : 9.255396842956543
per-crop offsets: [1]
```

### Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation

We provide the MoCha-generated videos in the `mocha-generation/` directory. Each file follows the structure `mocha-generation/<category>/<filename>.mp4`:

```cmd
mocha-generation
├─1p_camera_movement
│  ├─ 10_man_basketball_camera_push_in.mp4
│  ...
├─1p_closeup_facingcamera
├─1p_emotion
├─1p_generalize_chinese
├─1p_mediumshot_actioncontrol
├─1p_protrait
├─2p_1clip_1talk
└─2p_2clip_2talk
```

To evaluate the results, simply run the pipeline below. The script prints the score for each category, as well as the average scores for Monologue and Dialogue, and writes a CSV file to `eval-lipsync/mocha-eval-results/sync_scores.csv` recording each example's score.

```sh
cd eval-lipsync/script
python run_syncnet_pipeline_on_mocha_generation_on_mocha_bench.py
```

### Running SyncNetPipeline on Your Model’s Outputs for MoChaBench

To evaluate your own model's outputs with MoChaBench, first use the following inputs to generate videos:

- **Speech input:** `benchmark/speeches`
- **Text input:** `prompt` from `benchmark.csv`
- **Image input:** `benchmark/first-frames-from-mocha-generation` (if your model requires an image condition)

You can also use our HF version to generate videos.

Then organize your generated videos in a folder that matches the structure of `mocha-generation/`:

```cmd
<your_model_results>/
├─ 1p_camera_movement/
│  ├─ 10_man_basketball_camera_push_in.mp4
│  ...
├─ 1p_closeup_facingcamera/
├─ 1p_emotion/
├─ 1p_generalize_chinese/
├─ 1p_mediumshot_actioncontrol/
├─ 1p_protrait/
├─ 2p_1clip_1talk/
└─ 2p_2clip_2talk/
```

Each video should be named `<filename>.mp4` within the corresponding category folder, matching the naming used in `mocha-generation/`.
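Before running the evaluation, you may want to sanity-check that your folder mirrors `mocha-generation/`. The sketch below is not part of the repo's scripts; it compares the two trees and lists any videos missing from your results folder (`your_model_results` is a placeholder path).

```python
# Sketch: report which benchmark videos are missing from your results folder,
# using mocha-generation/ as the reference layout. "your_model_results" is a placeholder.
from pathlib import Path

reference_root = Path("mocha-generation")
results_root = Path("your_model_results")

missing = []
for ref_video in sorted(reference_root.glob("*/*.mp4")):
    candidate = results_root / ref_video.parent.name / ref_video.name
    if not candidate.exists():
        missing.append(candidate)

print(f"{len(missing)} video(s) missing from {results_root}/")
for path in missing:
    print("  -", path)
```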
You don’t need to provide an mp4 for every category; the script will skip any missing videos and report scores for the rest.

Next, modify the script `run_syncnet_pipeline_on_your_own_model_results.py` to point to your video folder. Then run:

```sh
cd eval-lipsync/script
python run_syncnet_pipeline_on_your_own_model_results.py
```

The script will output a CSV file at `eval-lipsync/your own model-eval-results/sync_scores.csv` with the evaluation scores for each example.

## 🧩 Custom Benchmark Evaluation

Since our pipeline provides an API to score a pair of (video, audio), you can easily adapt it for other benchmark datasets by looping through your examples:

```python
# Loop through your dataset
for example in dataset:
    video_fp = example["video_path"]
    audio_fp = example["audio_path"]
    context_id = example["context_id"]

    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=str(video_fp),
        audio_path=str(audio_fp),
        cache_dir="YOUR INPUT",
    )
    # Store or process the results as needed

# After processing all samples, compute average results.
```

# ▶️ Evaluating VIEScore

## Evaluating with GPT-4o

We provide example scripts for running **GPT-4o-based evaluation** on 20 examples from MoChaBench, covering 4 models and 4 evaluation aspects.

```sh
conda activate mochabench
pip install openai opencv-python
cd eval-viescore
python eval_gpt_viescore.py
```

## Evaluating Alignment with Human Ratings

We also provide a script to compute the agreement between GPT-4o scores and human majority-vote ratings:

```sh
conda activate mochabench
pip install scikit-learn
cd eval-viescore
python compute_alignment.py
```

This script outputs alignment metrics (QWK, Spearman ρ, Footrule, MAE) for each aspect and overall.

# 📚 Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

```bibtex
@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}
```