# MoChaBench
This repo contains the `Benchmark`, **`Standard Evaluation Codebase`**, and `MoCha's Generation Results` for [MoCha: Towards Movie-Grade Talking Character Synthesis](https://arxiv.org/pdf/2503.23307).
Many thanks to the community for sharing!
[An emotional narrative](https://x.com/CongWei1230/status/1907879690959732878), created with light manual editing on clips generated by **MoCha**, has surpassed **1 million views** on X.
## Table of Contents
- [MoChaBench Leaderboard](#mochabench-leaderboard)
- Evaluating Lip Sync Scores
  - [Overview](#overview)
  - [Benchmark files](#benchmark-files)
  - [How to Use](#how-to-use)
    - [Download this repo](#download-this-repo)
    - [Dependencies](#dependencies)
    - [Example Script to run SyncNetPipeline on a single pair of (video, speech)](#example-script-to-run-syncnetpipeline-on-a-single-pair-of-video-speech)
    - [Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation](#running-syncnetpipeline-on-mocha-generated-videos-for-mochabench-evaluation)
    - [Running SyncNetPipeline on Your Model's Outputs for MoChaBench](#running-syncnetpipeline-on-your-models-outputs-for-mochabench)
  - [Custom Benchmark Evaluation](#custom-benchmark-evaluation)
- Evaluating VIEScore
  - [Evaluating with GPT-4o](#evaluating-with-gpt-4o)
  - [Evaluating Alignment with Human Ratings](#evaluating-alignment-with-human-ratings)
- [Citation](#citation)
# MoChaBench Leaderboard
### Single-Character Monologue (English)
Included categories: 1p_camera_movement, 1p_closeup_facingcamera, 1p_emotion, 1p_mediumshot_actioncontrol, 1p_protrait, 2p_1clip_1talk
| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|--------|--------------|--------------|
| MoCha | **6.333** | **8.185** |
| Hallo3 | 4.866 | 8.963 |
| SadTalker | 4.727 | 9.239 |
| AniPortrait | 1.740 | 11.383 |
---
### Multi-Character Turn-based Dialogue (English)
Included categories: 2p_2clip_2talk
| Method | Sync-Conf. ↑ | Sync-Dist. ↓ |
|--------|--------------|--------------|
| MoCha | **4.951** | **8.601** |
---
### Per-Category Averages
| Category | Model | Sync-Dist. ↓ | Sync-Conf. ↑ | Examples (n) |
|-----------------------------|--------|----------------|-----------------|----------------|
| 1p_camera_movement | MoCha | 8.455 | 5.432 | 18 |
| 1p_closeup_facingcamera | MoCha | 7.958 | 6.298 | 27 |
| 1p_emotion | MoCha | 8.073 | 6.214 | 34 |
| 1p_generalize_chinese | MoCha | 8.273 | 4.398 | 4 |
| 1p_mediumshot_actioncontrol | MoCha | 8.386 | 6.241 | 52 |
| 1p_protrait | MoCha | 8.125 | 6.892 | 38 |
| 2p_1clip_1talk | MoCha | 8.082 | 6.493 | 30 |
| 2p_2clip_2talk | MoCha | 8.601 | 4.951 | 15 |
# Evaluating Lip Sync Scores
## Overview
We use SyncNet for evaluation. The codebase is adapted from [joonson/syncnet_python](https://github.com/joonson/syncnet_python) with **improved code structure and a unified API** to facilitate evaluation for the community.
The implementation follows a Hugging Face Diffusers-style structure.
We provide a `SyncNetPipeline` class, located at `eval-lipsync/script/syncnet_pipeline.py`.
You can initialize `SyncNetPipeline` by providing the weights and configs:
```python
# The class lives in eval-lipsync/script/syncnet_pipeline.py
from syncnet_pipeline import SyncNetPipeline

pipe = SyncNetPipeline(
    {
        "s3fd_weights": "path/to/sfd_face.pth",
        "syncnet_weights": "path/to/syncnet_v2.model",
    },
    device="cuda",  # or "cpu"
)
```
The pipeline offers an `inference` function to score a single pair of video and speech. For a fair comparison, the input speech should be a denoised vocal track extracted from your audio. You can use a separator such as [Kim_Vocal_2](https://huggingface.co/huangjackson/Kim_Vocal_2) for general noise removal and [Demucs_mdx_extra](https://github.com/facebookresearch/demucs) for music removal.
```python
av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path/to/video.mp4",   # RGB video
    audio_path="path/to/speech.wav",  # speech track (denoised from the audio, any ffmpeg-readable format)
    cache_dir="path/to/intermediate/output",  # optional; omit to auto-clean intermediates
)
```
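If you still need to extract the vocal stem from a mixed audio track before scoring, the sketch below shows one way to do it. It is only an illustration: it assumes the `demucs` command-line tool is installed and on your `PATH`, and that Demucs writes its two-stem output to `separated/<model>/<track>/vocals.wav` (the default layout in recent versions; adjust if yours differs).

```python
import subprocess
from pathlib import Path

def extract_vocals(audio_path: str, model: str = "mdx_extra") -> str:
    """Run Demucs two-stem separation and return the path of the vocal stem."""
    subprocess.run(
        ["demucs", "--two-stems", "vocals", "-n", model, audio_path],
        check=True,
    )
    # Default Demucs output layout: separated/<model>/<track name>/vocals.wav
    return str(Path("separated") / model / Path(audio_path).stem / "vocals.wav")

speech_wav = extract_vocals("path/to/audio.wav")
scores = pipe.inference(video_path="path/to/video.mp4", audio_path=speech_wav)
```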
## Benchmark files
We provide the benchmark files in the `benchmark/` directory, organized by data type and category.
Each file follows the structure `benchmark/<data_type>/<category>/<file_name>.<ext>`.
Directory structure:
```cmd
├──benchmark
│  ├──audios
│  │  ├──1p_camera_movement
│  │  │  ├── 10_man_basketball_camera_push_in.wav
│  │  │  ...
│  │  ├──1p_closeup_facingcamera
│  │  ├──1p_emotion
│  │  ├──1p_generalize_chinese
│  │  ├──1p_mediumshot_actioncontrol
│  │  ├──1p_protrait
│  │  ├──2p_1clip_1talk
│  │  └──2p_2clip_2talk
│  ├──first-frames-from-mocha-generation
│  │  ├──1p_camera_movement
│  │  │  ├── 10_man_basketball_camera_push_in.png
│  │  │  ...
│  │  ├──1p_closeup_facingcamera
│  │  ├──1p_emotion
│  │  ├──1p_generalize_chinese
│  │  ├──1p_mediumshot_actioncontrol
│  │  ├──1p_protrait
│  │  ├──2p_1clip_1talk
│  │  └──2p_2clip_2talk
│  └──speeches
│     ├──1p_camera_movement
│     │  ├── 10_man_basketball_camera_push_in_speech.wav
│     │  ...
│     ├──1p_closeup_facingcamera
│     ├──1p_emotion
│     ├──1p_generalize_chinese
│     ├──1p_mediumshot_actioncontrol
│     ├──1p_protrait
│     ├──2p_1clip_1talk
│     └──2p_2clip_2talk
└──benchmark.csv
```
- **`benchmark.csv`** contains metadata for each sample; each row specifies
  `idx_in_category`, `category`, `context_id`, `prompt`.
  We use `benchmark.csv` to connect files: any file in the benchmark can be located via
  `benchmark/<data_type>/<category>/<context_id>.<ext>` (see the sketch after this list).
- `speeches` files are generated from the `audios` files using [Demucs_mdx_extra](https://github.com/facebookresearch/demucs). For a fair comparison, use the `speeches` files (not `audios`) as the input to your own model.
- We also provide `first-frames-from-mocha-generation` to facilitate fair comparison for (image + text + audio → video) models.
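To illustrate how `benchmark.csv` connects the folders, here is a minimal sketch that iterates over the metadata and builds the input paths for each sample. The filename convention is an assumption (it treats the on-disk stem as `context_id`, with a `_speech` suffix for the speech files), and the tree above shows `benchmark.csv` next to the `benchmark/` folder; verify both against your checkout before relying on them.

```python
import csv
from pathlib import Path

repo_root = Path("MoChaBench")          # adjust to your clone location
benchmark_dir = repo_root / "benchmark"

with open(repo_root / "benchmark.csv", newline="") as f:
    for row in csv.DictReader(f):
        category = row["category"]
        context_id = row["context_id"]
        prompt = row["prompt"]

        # Assumed layout: benchmark/<data_type>/<category>/<context_id>.<ext>
        speech = benchmark_dir / "speeches" / category / f"{context_id}_speech.wav"
        frame = benchmark_dir / "first-frames-from-mocha-generation" / category / f"{context_id}.png"

        # Feed (frame, prompt, speech) to your model here.
```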
## How to Use
### Download this repo
The [SyncNet Weights](https://github.com/congwei1230/MoChaBench/tree/main/eval-lipsync/weights), the [Benchmark](https://github.com/congwei1230/MoChaBench/tree/main/benchmark), and [MoCha's Generation Results](https://github.com/congwei1230/MoChaBench/tree/main/mocha-generation) are included in this git repo, so cloning it is all you need:
```sh
git clone https://github.com/congwei1230/MoChaBench.git
```
### Dependencies
```sh
conda create -n mochabench_eval python=3.8
conda activate mochabench_eval
cd eval-lipsync
pip install -r requirements.txt
# ffmpeg must also be installed and available on PATH
```
### Example Script to run SyncNetPipeline on a single pair of (video, speech)
```sh
cd script
python run_syncnet_pipeline_on_1example.py
```
You should get values close to the following (within ±0.1, depending on your ffmpeg build; the reference numbers were produced with `ffmpeg version 7.1.1-essentials_build-www.gyan.dev`):
```
AV offset: 1
Min dist: 9.255
Confidence: 4.497
best-confidence : 4.4973907470703125
lowest distance : 9.255396842956543
per-crop offsets : [1]
```
### Running SyncNetPipeline on MoCha-Generated Videos for MoChaBench Evaluation
We provide the MoCha-generated videos in the `mocha-generation/` directory.
Each file follows the structure:
`mocha-generation/<category>/<context_id>.mp4`
```cmd
mocha-generation
├──1p_camera_movement
│  ├── 10_man_basketball_camera_push_in.mp4
│  ...
├──1p_closeup_facingcamera
├──1p_emotion
├──1p_generalize_chinese
├──1p_mediumshot_actioncontrol
├──1p_protrait
├──2p_1clip_1talk
└──2p_2clip_2talk
```
To evaluate the results, simply run the pipeline below.
This script will print the score for each category, as well as the average scores for Monologue and Dialogue.
It will also output a CSV file at `eval-lipsync/mocha-eval-results/sync_scores.csv`, recording each example's score.
```sh
cd eval-lipsync/script
python run_syncnet_pipeline_on_mocha_generation_on_mocha_bench.py
```
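If you want to inspect the per-example scores yourself, a short sketch with pandas (assuming pandas is available; the column names below are assumptions, so open `sync_scores.csv` once to confirm the actual header):

```python
import pandas as pd

df = pd.read_csv("eval-lipsync/mocha-eval-results/sync_scores.csv")
print(df.head())

# Hypothetical column names -- adjust to match the actual CSV header.
print(df.groupby("category")[["sync_conf", "sync_dist"]].mean())
```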
### Running SyncNetPipeline on Your Model's Outputs for MoChaBench
To evaluate your own model's outputs with MoChaBench, first use the following inputs to generate videos:
- **Speech input:** `benchmark/speeches`
- **Text input:** `prompt` from `benchmark.csv`
- **Image input:** `benchmark/first-frames-from-mocha-generation` (if your model requires an image condition)
You can also use our Hugging Face version to generate videos.
Then organize your generated videos in a folder that matches the structure of `mocha-generation/`:
```cmd
<your_model_outputs>/
├── 1p_camera_movement/
│   ├── 10_man_basketball_camera_push_in.mp4
│   ...
├── 1p_closeup_facingcamera/
├── 1p_emotion/
├── 1p_generalize_chinese/
├── 1p_mediumshot_actioncontrol/
├── 1p_protrait/
├── 2p_1clip_1talk/
└── 2p_2clip_2talk/
```
Each video should be named `<context_id>.mp4` within the corresponding category folder. You don't need to provide an mp4 for every category; the script will skip any missing videos and report scores for the rest.
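Before running the evaluation, you can sanity-check your folder against the reference layout with a short sketch like this (a hypothetical helper; it only compares filenames between `mocha-generation/` and your output directory):

```python
from pathlib import Path

reference_dir = Path("mocha-generation")
your_dir = Path("path/to/your_model_outputs")  # adjust to your folder

for category_dir in sorted(reference_dir.iterdir()):
    if not category_dir.is_dir():
        continue
    expected = {p.name for p in category_dir.glob("*.mp4")}
    provided = {p.name for p in (your_dir / category_dir.name).glob("*.mp4")}
    missing = sorted(expected - provided)
    print(f"{category_dir.name}: {len(provided & expected)}/{len(expected)} videos found")
    if missing:
        print("  missing (will be skipped):", ", ".join(missing))
```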
Next, modify the script `run_syncnet_pipeline_on_your_own_model_results.py` to point to your video folder.
Then, run:
```sh
cd eval-lipsync/script
python run_syncnet_pipeline_on_your_own_model_results.py
```
The script will output a CSV file at `eval-lipsync/<your_own_model>-eval-results/sync_scores.csv` with the evaluation scores for each example.
## Custom Benchmark Evaluation
Since our pipeline provides an API to score a pair of (video, audio), you can easily adapt it for other benchmark datasets by looping through your examples:
```python
# Loop through your dataset
for example in dataset:
    video_fp = example["video_path"]
    audio_fp = example["audio_path"]
    example_id = example["context_id"]

    av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
        video_path=str(video_fp),
        audio_path=str(audio_fp),
        cache_dir="YOUR INPUT",
    )
    # Store or process the results as needed

# After processing all samples, compute average results.
```
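To turn the per-example outputs into benchmark-level numbers, one simple way to aggregate is sketched below. It assumes you appended each example's `best_conf` and `min_dist` to a `results` list inside the loop above (a hypothetical name); per-category averages as in the leaderboard would additionally group by `category`.

```python
import statistics

# Each entry in `results` is assumed to look like:
# results.append({"context_id": example_id, "sync_conf": best_conf, "sync_dist": min_dist})

avg_conf = statistics.mean(r["sync_conf"] for r in results)
avg_dist = statistics.mean(r["sync_dist"] for r in results)
print(f"Sync-Conf. (higher is better): {avg_conf:.3f}")
print(f"Sync-Dist. (lower is better):  {avg_dist:.3f}")
```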
# Evaluating VIEScore
## Evaluating with GPT-4o
We provide example scripts for running **GPT-4o-based evaluation** on 20 examples from MoChaBench, covering 4 models and 4 evaluation aspects.
```sh
conda activate mochabench
pip install openai opencv-python
cd eval-viescore
python eval_gpt_viescore.py
```
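`eval_gpt_viescore.py` calls the OpenAI API through the `openai` package, which by default reads credentials from the `OPENAI_API_KEY` environment variable (how the key is actually passed is up to the script, so treat this as an assumption). A quick pre-flight check:

```python
import os

# Fail early with a clear message if no API key is configured.
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("Set the OPENAI_API_KEY environment variable before running eval_gpt_viescore.py")
```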
## Evaluating Alignment with Human Ratings
We also provide a script to compute the agreement between GPT-4o scores and human majority vote ratings:
```sh
conda activate mochabench
pip install scikit-learn
cd eval-viescore
python compute_alignment.py
```
This script outputs alignment metrics (QWK, Spearman ρ, Footrule, MAE) for each aspect and overall.
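For reference, the four metrics can be computed on a pair of rating vectors roughly as follows. This is a sketch, not the repository's exact implementation; in particular, the Footrule here is the mean absolute difference between rank vectors, and `compute_alignment.py` may normalize it differently.

```python
import numpy as np
from scipy.stats import spearmanr, rankdata
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

gpt_scores = np.array([4, 3, 5, 2, 4])    # toy GPT-4o ratings for one aspect
human_scores = np.array([4, 2, 5, 3, 4])  # toy human majority-vote ratings

qwk = cohen_kappa_score(human_scores, gpt_scores, weights="quadratic")
rho, _ = spearmanr(human_scores, gpt_scores)
footrule = np.abs(rankdata(human_scores) - rankdata(gpt_scores)).mean()
mae = mean_absolute_error(human_scores, gpt_scores)

print(f"QWK={qwk:.3f}  Spearman rho={rho:.3f}  Footrule={footrule:.3f}  MAE={mae:.3f}")
```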
# Citation
If you find our work helpful, please leave us a star and cite our paper.
```bibtex
@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}
```