This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing Voice Synthesis) and DiffSpeech (for Text-to-Speech).
In addition, a more detailed and improved code framework, which contains implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work PortaSpeech, is coming soon.
Figures: DiffSinger/DiffSpeech at training (left) and DiffSinger/DiffSpeech at inference (right).
PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021.

```sh
conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt  # GPU 2080Ti, CUDA 10.2
# or
pip install -r requirements_3090.txt  # GPU 3090, CUDA 11.4
```
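As a quick sanity check (a minimal sketch of ours, not part of the repo), you can confirm that the installed PyTorch build actually sees your GPU:

```python
# Minimal environment check: prints the installed PyTorch version and
# whether CUDA is visible. Convenience snippet only, not a repo script.
import torch

print(torch.__version__)          # should match the pinned requirements
print(torch.cuda.is_available())  # expect True on a 2080Ti / 3090 machine
```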
a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: `ln -s /xxx/LJSpeech-1.1/ data/raw/`
b) Download and extract the ground-truth durations extracted by MFA: `tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/`
c) Run the following scripts to pack the dataset for training/inference.

```sh
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml
# `data/binary/ljspeech` will be generated.
```
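If you want to verify that binarization succeeded, a small sketch like the one below just lists the generated artifacts (the exact file names inside the folder are repo-specific, so treat this as illustrative):

```python
# Illustrative check only: list whatever files binarize.py wrote.
from pathlib import Path

out_dir = Path("data/binary/ljspeech")
assert out_dir.is_dir(), "run binarize.py first"
for f in sorted(out_dir.iterdir()):
    print(f.name, f.stat().st_size, "bytes")
```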
```sh
# Training
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_exp1 --reset
# Inference
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_exp1 --reset --infer
```
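Training checkpoints are written under the `checkpoints` directory. A hedged sketch for locating the newest one (the `checkpoints/<exp_name>` layout and the `*.ckpt` extension are assumptions about the trainer's naming, so adjust the glob if needed):

```python
# Assumes checkpoints/<exp_name>/*.ckpt; adjust if the trainer saves
# under a different name or extension.
from pathlib import Path

ckpts = sorted(Path("checkpoints/lj_exp1").glob("*.ckpt"))
print(ckpts[-1] if ckpts else "no checkpoint found yet")
```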
We also provide pre-trained models; remember to put them in the `checkpoints` directory.
About determining `k` in shallow diffusion: we recommend the trick introduced in Appendix B of the paper. We have already provided a proper `k` for the LJSpeech dataset in the config files.
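For intuition, here is a hedged sketch of that boundary-search idea: diffuse both the ground-truth mel and the simple decoder's mel with the forward process q(x_t | x_0), and take the smallest step at which the two become hard to distinguish. `gt_mel`, `fs2_mel`, `alphas_cumprod`, and the distance threshold are illustrative assumptions of ours, not the repo's API:

```python
import torch

def find_k(gt_mel, fs2_mel, alphas_cumprod, tol=1e-2):
    """Smallest t where the diffused ground-truth mel and the diffused
    decoder mel are nearly indistinguishable (crude L1 proxy for the
    paper's criterion). `alphas_cumprod` is a 1-D tensor of cumulative
    alphas over diffusion steps."""
    noise = torch.randn_like(gt_mel)
    for t, a in enumerate(alphas_cumprod):  # a = cumulative alpha at step t
        x_gt = a.sqrt() * gt_mel + (1.0 - a).sqrt() * noise
        x_fs = a.sqrt() * fs2_mel + (1.0 - a).sqrt() * noise
        if (x_gt - x_fs).abs().mean() < tol:
            return t
    return len(alphas_cumprod) - 1
```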
a) Download and extract PopCS, then create a link to the dataset folder: `ln -s /xxx/popcs/ data/processed/`
b) Run the following scripts to pack the dataset for training/inference.

```sh
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
# `data/binary/popcs-pmf0` will be generated.
```
```sh
# Training
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_exp2 --reset
# Inference, step 1: run the FastSpeech 2 inference first
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer
# Inference, step 2: then run the DiffSinger inference
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_exp2 --reset --infer
```
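If the two-stage inference is easy to mistype, a minimal convenience wrapper (an assumption of ours, not a repo script) can run both stages in order and stop on failure:

```python
# Runs the two PopCS inference stages sequentially via subprocess.
import os
import subprocess

ENV = {**os.environ, "PYTHONPATH": ".", "CUDA_VISIBLE_DEVICES": "0"}
STAGES = [
    ["python", "tasks/run.py", "--config", "usr/configs/popcs_fs2.yaml",
     "--exp_name", "popcs_fs2_pmf0_1230", "--reset", "--infer"],
    ["python", "tasks/run.py", "--config", "usr/configs/popcs_ds_beta6_offline.yaml",
     "--exp_name", "popcs_exp2", "--reset", "--infer"],
]
for cmd in STAGES:
    subprocess.run(cmd, check=True, env=ENV)  # raises if a stage fails
```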
We also provide pre-trained models; as above, put them in the `checkpoints` directory.

Note that:
[1] Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. Interspeech 2020.
[2] Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. ICASSP 2020.
[3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.
You can monitor the training process with TensorBoard:

```sh
tensorboard --logdir_spec exp_name
```
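If you prefer reading the logged scalars programmatically, TensorBoard's `EventAccumulator` can load the event files directly (the log directory below is a guess; point it at wherever your experiment writes events):

```python
# Read scalar tags from TensorBoard event files without launching the UI.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("checkpoints/lj_exp1")  # hypothetical event-file directory
acc.Reload()
print(acc.Tags()["scalars"])  # names of all logged scalar curves
```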
Figure: DiffSpeech vs. FastSpeech 2 (mel-spectrogram comparisons). Along the vertical axis, DiffSpeech occupies mel bins [0-80] and FastSpeech 2 occupies bins [80-160].
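For reference, a comparison image like the one described above can be composed by stacking the two 80-bin mel-spectrograms vertically; a hedged matplotlib sketch with placeholder arrays:

```python
# Placeholder arrays stand in for real mel-spectrograms (bins x frames).
import numpy as np
import matplotlib.pyplot as plt

mel_ds = np.random.rand(80, 400)   # placeholder DiffSpeech mel
mel_fs2 = np.random.rand(80, 400)  # placeholder FastSpeech 2 mel
plt.imshow(np.vstack([mel_ds, mel_fs2]), origin="lower", aspect="auto")
plt.ylabel("mel bin (DiffSpeech: 0-80, FastSpeech 2: 80-160)")
plt.xlabel("frame")
plt.show()
```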
Audio samples can be found on our demo page. We also put some test-set audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GT mel+HifiGAN (marked as [G]) in `resources/demos_1213` (corresponding to the pre-trained DiffSpeech model). New singing samples can be found in `resources/demos_0112`.
```bibtex
@misc{liu2021diffsinger,
  title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism},
  author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
  year={2021},
  eprint={2105.02446},
  archivePrefix={arXiv},
}
```
Our code is based on the following repos:

Thanks also to Keon Lee for his fast implementation of our work.