# CV3-Eval

To address the challenges of diversity and generalizability in unconstrained real-world speech synthesis scenarios, CosyVoice 3 releases the CV3-Eval benchmark for zero-shot speech synthesis in the wild. The benchmark is built on authentic in-the-wild reference speech from Common Voice, FLEURS, EmoBox, and web-crawled real-world audio data, and spans a broad range of languages and dialects, domains and environments, and emotions and styles.

## Dataset

CV3-Eval includes subsets for both objective and subjective evaluation. The objective evaluation subset is further split into three subsets: multilingual voice cloning, cross-lingual voice cloning, and emotion cloning. Three subjective subsets are prepared for expressive voice cloning, expressive voice continuation, and Chinese accent voice cloning. You can find more details in the [CosyVoice 3](https://arxiv.org/pdf/2505.17589) paper.

## Metrics

We focus on three key aspects: content consistency, speaker similarity, and audio quality.

* For content consistency, we measure the Character Error Rate (CER) or Word Error Rate (WER) of the ASR transcription against the given text, using Whisper-large V3 for English ASR and Paraformer for Chinese ASR.
* To assess speaker similarity, we extract speaker embeddings from the generated speech with the ERes2Net speaker verification model and compute the cosine similarity with the embedding of the reference speech.
* For audio quality, we score the generated speech with the DNSMOS network, whose scores correlate strongly with human auditory perception.

Minimal sketches of the WER and speaker-similarity computations are included at the end of this README.

## Requirements

```bash
conda create -n cv3-eval python=3.10
conda activate cv3-eval
# Install PyTorch matching your CUDA version, e.g.
pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

## Evaluation Code

We release the evaluation code for all metrics:

```bash
# Remember to set the path to your own generated audio in the scripts.
# WER, speaker similarity, and DNSMOS
bash run_infer_cv3_eval.sh
# Emotion score
bash run_infer_cv3_eval_emo.sh
```

## Citation

If this benchmark is useful to you, please cite:

```
@article{gao2025differentiablerewardoptimizationllm,
  title={Differentiable Reward Optimization for LLM based TTS system},
  author={Changfeng Gao and Zhihao Du and Shiliang Zhang},
  year={2025},
  eprint={2507.05911},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2507.05911},
}

@article{du2025cosyvoice,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}
```
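## Metric Sketches

Below is a minimal sketch of the word-level edit-distance computation behind the WER metric, not the released implementation. It assumes transcripts have already been produced by the ASR models listed above (Whisper-large V3 for English, Paraformer for Chinese), and the lowercasing/tokenization shown is a simplified assumption.

```python
# Minimal sketch of WER scoring for content consistency.
# Assumes transcripts are already produced by an ASR model;
# the text normalization here is simplified and hypothetical.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (rolling 1-D DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (cost 0 on a match)
            )
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate; swap .split() for list(text) to get CER instead."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

# 1 substitution + 1 deletion over 6 reference words -> ~0.333
print(wer("the cat sat on the mat", "the cat sit on mat"))
```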
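Speaker similarity is the cosine similarity between speaker embeddings of the reference and generated speech. The sketch below assumes the embeddings have already been extracted with the ERes2Net model and saved as NumPy vectors; the file names `ref_emb.npy` and `gen_emb.npy` are hypothetical placeholders, not files shipped with this repository.

```python
# Minimal sketch of the speaker-similarity metric: cosine similarity
# between ERes2Net speaker embeddings of reference and generated speech.
# Embedding extraction itself is not shown; the .npy paths below are
# hypothetical placeholders for previously saved embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref_emb = np.load("ref_emb.npy")  # embedding of the reference speech
gen_emb = np.load("gen_emb.npy")  # embedding of the generated speech
print(f"speaker similarity: {cosine_similarity(ref_emb, gen_emb):.4f}")
```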