# NeuralSVB

**Repository Path**: evan-xia-mew/NeuralSVB

## Basic Information
- **Project Name**: NeuralSVB
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-04-23
- **Last Updated**: 2024-04-23

## Categories & Tags
**Categories**: Uncategorized
**Tags**: None

## README

# Learning the Beauty in Songs: Neural Singing Voice Beautifier

---

[![arXiv](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2202.13277)
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/NeuralSVB)](https://github.com/MoonInTheRiver/NeuralSVB)
![visitors](https://visitor-badge.glitch.me/badge?page_id=moonintheriver/NeuralSVB)
Demo Page
This repository is the official PyTorch implementation of our ACL 2022 [paper](https://arxiv.org/abs/2202.13277).

## 0. Dataset (PopBuTFy) Acquisition

### Audio samples
- You can download the dataset from [here](https://drive.google.com/file/d/1IKFp7y1WeYGrwXgJ0HC3rdPj54WoqIsU/view?usp=sharing). Please send us an email to register (see the [apply_form](resources/apply_form.md)).
- Dataset [preview](https://github.com/MoonInTheRiver/NeuralSVB/releases/download/pre-release/PopBuTFy-preview.zip).

### Text labels
NeuralSVB does not need text as input, but the ASR model used to extract PPGs does. We therefore also provide the [text labels](https://github.com/MoonInTheRiver/NeuralSVB/releases/download/pre-release/text_labels.zip) of PopBuTFy.

## 1. Preparation

### Environment Preparation
Most of the required packages are listed in https://github.com/NATSpeech/NATSpeech/blob/main/requirements.txt. Alternatively, you can install the dependencies from the `Requirements.txt` file in the repository root:

```sh
pip install -r Requirements.txt
```

### Data Preparation
1. Extract embeddings of vocal timbre (a toy illustration of such an embedding appears in the Illustrative Sketches section below):

```sh
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/save_emb.yaml
```

2. Pack the dataset:

```sh
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/para_bin.yaml
```

### Vocoder Preparation
We provide a pre-trained [HifiGAN-Singing](https://github.com/MoonInTheRiver/NeuralSVB/releases/download/pre-release/1012_hifigan_all_songs_nsf.zip) model, specially designed for singing voice synthesis with the NSF mechanism (see the sine-excitation sketch in the Illustrative Sketches section below). Please unzip the pre-trained vocoder into `checkpoints` before training your acoustic model. This singing vocoder is trained on 100+ hours of singing data, including Chinese and English songs.

### PPG Extractor Preparation
We provide a pre-trained [PPG Extractor](https://github.com/MoonInTheRiver/NeuralSVB/releases/download/pre-release/1009_pretrain_asr_english.zip) (see the PPG sketch in the Illustrative Sketches section below). Please unzip the pre-trained PPG extractor into `checkpoints` before training your acoustic model.

After following the instructions above, the directory structure should be:

```
.
|--data
    |--processed
        |--PopBuTFy (unzip PopBuTFy.zip)
            |--data
                |--directories containing wavs
    |--binary
        |--PopBuTFyENSpkEM
|--checkpoints
    |--1009_pretrain_asr_english
        |--
        |--config.yaml
    |--1012_hifigan_all_songs_nsf
        |--
        |--config.yaml
```

## 2. Training Example

```sh
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset
```

## 3. Inference

### Inference from packed test set

```sh
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset --infer
```

Inference results will be saved in `./checkpoints/EXP_NAME/generated_` by default.

We provide:
- the [pre-trained model](https://github.com/MoonInTheRiver/NeuralSVB/releases/download/pre-release/1030_vae_mle.zip) of NSVB (English version).

Remember to put the pre-trained models in the `checkpoints` directory.

### Inference from raw inputs
WIP.

## Limitations
See Appendix D, "Limitations and Solutions", in our [paper](https://aclanthology.org/2022.acl-long.549.pdf).
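## Illustrative Sketches

The sketches below are not from this repository; they are minimal, hypothetical illustrations of concepts the instructions above rely on. First, the "embeddings of vocal timbre" extracted during data preparation summarize each singer's voice as a fixed-size vector. A toy stand-in might look like the following (the function name, the mean-pooled log-mel features, and all sizes are assumptions, not the binarizer's actual method):

```python
# Toy sketch only: an utterance-level "vocal timbre" embedding.
# The repository's real embeddings come from the save_emb.yaml binarizer;
# mean-pooling log-mel frames is just a hypothetical stand-in.
import numpy as np
import librosa

def toy_timbre_embedding(wav_path, n_mels=80, sample_rate=24000):
    """Summarize a singer's timbre as one fixed-size vector."""
    wav, _ = librosa.load(wav_path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(y=wav, sr=sample_rate, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)      # (n_mels, T) log-mel spectrogram
    return log_mel.mean(axis=1)       # (n_mels,) utterance-level embedding

# Usage (hypothetical path):
# emb = toy_timbre_embedding("data/processed/PopBuTFy/some_song.wav")
```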
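Second, a PPG (phonetic posteriorgram) extractor maps mel frames to per-frame posterior probabilities over phoneme classes, which is why the ASR model above needs text labels while NeuralSVB itself does not. The toy `torch` module below shows only the input/output contract of such an extractor; its layer sizes and 72-class phoneme inventory are made up, not those of the released model:

```python
# Toy sketch only: the input/output contract of a PPG extractor.
# The repository's real extractor is the pre-trained ASR model above.
import torch
import torch.nn as nn

class ToyPPGExtractor(nn.Module):
    """Maps mel frames to per-frame phoneme posteriors (a 'PPG')."""
    def __init__(self, n_mels=80, hidden=256, n_phones=72):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_phones)

    def forward(self, mels):  # mels: (batch, frames, n_mels)
        h, _ = self.encoder(mels)
        return torch.softmax(self.proj(h), dim=-1)  # (batch, frames, n_phones)

mels = torch.randn(1, 100, 80)      # a fake 100-frame mel input
ppg = ToyPPGExtractor()(mels)
print(ppg.shape)                    # torch.Size([1, 100, 72])
```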
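Third, the NSF (neural source-filter) mechanism behind HifiGAN-Singing drives the vocoder with an explicit harmonic source built from the F0 contour instead of pure noise, which helps keep pitch coherent over long singing notes. Below is a minimal sine-excitation sketch, assuming a 24 kHz sample rate and a hop size of 256 (both are assumptions, not the released vocoder's actual settings):

```python
# Toy sketch only: an NSF-style sine excitation derived from frame-level F0.
# This is not the HifiGAN-Singing code; all settings here are assumptions.
import math
import torch

def sine_excitation(f0, hop_size=256, sample_rate=24000, amp=0.1):
    """Upsample a frame-level F0 contour and render a harmonic source.

    Voiced samples get a sine locked to the F0 trajectory; unvoiced
    frames (f0 == 0) fall back to low-amplitude noise.
    """
    f0_up = f0.repeat_interleave(hop_size)   # frame level -> sample level
    voiced = (f0_up > 0).float()
    # Integrate instantaneous frequency to obtain phase, then take the sine.
    phase = 2 * math.pi * torch.cumsum(f0_up / sample_rate, dim=0)
    sine = amp * torch.sin(phase) * voiced
    noise = (amp / 3) * torch.randn_like(sine) * (1 - voiced)
    return sine + noise

# Usage: ~1 second of a 220 Hz tone at the assumed settings.
f0 = torch.full((94,), 220.0)   # 94 frames * 256 hop = 24064 samples
wav = sine_excitation(f0)
print(wav.shape)                # torch.Size([24064])
```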
## Citation
If this repository helps your research, please cite:

```bib
@inproceedings{liu-etal-2022-learning-beauty,
    title = "Learning the Beauty in Songs: Neural Singing Voice Beautifier",
    author = "Liu, Jinglin and Li, Chengxi and Ren, Yi and Zhu, Zhiying and Zhao, Zhou",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.549",
    pages = "7970--7983",
}
```

## Issues
- Before raising an issue, please check this README and the existing issues for possible solutions.
- We will try to handle your problem in time, but we cannot guarantee a satisfying solution.
- Please be friendly.

## Acknowledgements
* r9y9's [wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
* Po-Hsun-Su's [ssim](https://github.com/Po-Hsun-Su/pytorch-ssim)
* descriptinc's [melgan](https://github.com/descriptinc/melgan-neurips)
* Official [espnet](https://github.com/espnet/espnet)
* Official [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)

The framework of this repository is based on [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger) and is a predecessor of [NATSpeech](https://github.com/NATSpeech/NATSpeech/).