# 语音大作业 (Speech Course Project)

**Repository Path**: lin-yuying/speech-homework

## Basic Information

- **Project Name**: 语音大作业 (Speech Course Project)
- **Description**: Nankai University speech course project, Spring 2023
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-05-11
- **Last Updated**: 2023-06-11

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# lyy: Runtime Environment

Use Python 3.8:

```
pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```

# lyy: Data Preprocessing

1. Fix the paths in config/LJSpeech/*.yaml.

2. Run

   ```
   python3 prepare_align.py config/LJSpeech/preprocess.yaml
   ```

   to generate the aligned raw_data (about 3 minutes).

3. Run

   ```
   python3 preprocess.py config/LJSpeech/preprocess.yaml
   ```

   to generate the training data under FastSpeech2/preprocessed_data/LJSpeech, including energy, pitch, duration, etc. (The progress bar advances per speaker, but this dataset has only one speaker. This step takes a while, about 20 minutes.)

## Error: UnboundLocalError: local variable 'pitch' referenced before assignment

This is caused by a wrong path in FastSpeech2/config/*/preprocess.yaml; fix it by checking against FastSpeech2/preprocessor/preprocessor.py.

# lyy: Training

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
tensorboard --logdir output/log/LJSpeech --port 6606
```

One epoch takes about 5 minutes; a total_step of 10000 takes about 1 hour (RTX 3090).

Note: LJSpeech has only one speaker, "LJSpeech"; mind the folder naming.

- Model: model/fastspeech2.py
- Vocoder: the Generator in hifigan/models.py
- Loss: total_loss = mel_loss + postnet_mel_loss + duration_loss + pitch_loss + energy_loss, computed in model/loss.py

To run in the background:

```
nohup python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml > rnnout.txt 2>&1 &
```

# lyy: Model Structure

train.py is the main entry point. FastSpeech2/config/LJSpeech/*.yaml contain the training hyperparameters.

Preprocessing writes the training index to FastSpeech2/preprocessed_data/LJSpeech/train.txt. Example line:

`LJ027-0080|LJSpeech|{T UW1 T OW1 Z sp SH IY1 P sp F AO1 R T OW1 Z sp spn sp AE1 N D F AY1 V T OW1 Z}|two toes (sheep), four toes (hog) and five toes (dog)`
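The train.txt format above is easy to handle: four `|`-separated fields (utterance ID, speaker, brace-wrapped phoneme string, raw text). A minimal parsing sketch; the variable names are mine, not identifiers from the repo:

```python
# Parse one line of preprocessed_data/LJSpeech/train.txt.
# Variable names here are descriptive labels, not identifiers from the repo.
line = ("LJ027-0080|LJSpeech|"
        "{T UW1 T OW1 Z sp SH IY1 P sp F AO1 R T OW1 Z sp spn sp AE1 N D F AY1 V T OW1 Z}|"
        "two toes (sheep), four toes (hog) and five toes (dog)")

basename, speaker, phoneme_str, raw_text = line.strip().split("|")
# Strip the surrounding braces and split the phoneme string on spaces;
# "sp" (pause) and "spn" (unknown/noise) are silence-like tokens from MFA.
phones = phoneme_str.strip("{}").split(" ")

print(basename, speaker)   # LJ027-0080 LJSpeech
print(phones[:5])          # ['T', 'UW1', 'T', 'OW1', 'Z']
```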
The files are first read through Dataset (FastSpeech2/dataset.py). Dataset yields a dict per sample:

`{"id": basename, "speaker": speaker_id, "text": phone, "raw_text": raw_text, "mel": mel, "pitch": pitch, "energy": energy, "duration": duration}`

The key field is "text", produced by the text_to_sequence function in FastSpeech2/text/\_\_init\_\_.py. text_to_sequence converts a phoneme-annotated string into a one-dimensional list whose elements are the IDs of phonemes or words (lengths vary, roughly 50-200). A regular expression distinguishes words from the brace-wrapped phoneme runs; note that the two are converted to IDs separately and then concatenated.

A DataLoader then serves the data in batches of batch_size=64, and the collate_fn in FastSpeech2/dataset.py regroups each batch by self.batch_size. collate_fn takes one batch of 64 samples, reprocesses it with the reprocess function, and outputs one item of shape: batchs = 4 (the number of sub-batches) × 12 data fields (e.g. energy, speaker, text) × self.batch_size.

Next, for each sub-batch in batchs, a subset of the fields is fed into the model (see netframe.txt for the architecture), the loss is computed, and one optimization step is taken. Input shape: one tensor of [self.batch_size, seq_len] per field. Note that self.batch_size is in fact split again into smaller mini-batches before entering the model, so the first model dimension is not self.batch_size; the exact size does not matter, and it is simply called batch_size below.

- After the encoder: [batch_size, seq_len, emb_size=256]
- After the variance_adaptor: [batch_size, Regulator_seq_len, emb_size=256], where Regulator_seq_len is the sequence length after the LengthRegulator
- After the decoder: [batch_size, Regulator_seq_len, emb_size=256]
- Finally, after mel_linear: [batch_size, Regulator_seq_len, n_mel_channels=80]

# lyy: Improvements

Experiments showed that the VarianceAdaptor predictors performed poorly: validation results indicated that they overfit very early. To probe whether their representational capacity could be increased, and given the sequential nature of the data, the two CNN layers in the VarianceAdaptor's VariancePredictor were replaced with two LSTM layers. After 20,000 steps, the results did improve slightly. It was also observed that the model had already converged by step 10,000 and overfit afterwards.

# FastSpeech 2 - PyTorch Implementation

This is a PyTorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558v1). This project is based on [xcmyz's implementation](https://github.com/xcmyz/FastSpeech) of FastSpeech. Feel free to use/modify the code.

There are several versions of FastSpeech 2. This implementation is more similar to [version 1](https://arxiv.org/abs/2006.04558v1), which uses F0 values as the pitch features.
On the other hand, pitch spectrograms extracted by continuous wavelet transform are used as the pitch features in the [later versions](https://arxiv.org/abs/2006.04558).

![](./img/model.png)

# Updates

- 2021/7/8: Release the checkpoint and audio samples of a multi-speaker English TTS model trained on LibriTTS
- 2021/2/26: Support English and Mandarin TTS
- 2021/2/26: Support multi-speaker TTS (AISHELL-3 and LibriTTS)
- 2021/2/26: Support MelGAN and HiFi-GAN vocoder

# Audio Samples

Audio samples generated by this implementation can be found [here](https://ming024.github.io/FastSpeech2/).

# Quickstart

## Dependencies

You can install the Python dependencies with

```
pip3 install -r requirements.txt
```

## Inference

You have to download the [pretrained models](https://drive.google.com/drive/folders/1DOhZGlTLMbbAAFZmZGDdc77kz1PloS7F?usp=sharing) and put them in ``output/ckpt/LJSpeech/``, ``output/ckpt/AISHELL3``, or ``output/ckpt/LibriTTS/``.

For English single-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

For Mandarin multi-speaker TTS, try

```
python3 synthesize.py --text "大家好" --speaker_id 0 --restore_step 600000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
```

For English multi-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step 800000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
```

The generated utterances will be put in ``output/result/``.

Here is an example of a synthesized mel-spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition", with the English single-speaker TTS model.
![](./img/synthesized_melspectrogram.png)

## Batch Inference

Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

to synthesize all utterances in ``preprocessed_data/LJSpeech/val.txt``.

## Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
```

# Training

## Datasets

The supported datasets are

- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- [AISHELL-3](http://www.aishelltech.com/aishell_3): a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
- [LibriTTS](https://research.google/tools/datasets/libri-tts/): a multi-speaker English dataset containing 585 hours of speech by 2,456 speakers.

We take LJSpeech as an example hereafter.

## Preprocessing

First, run

```
python3 prepare_align.py config/LJSpeech/preprocess.yaml
```

for some preparations.

As described in the paper, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments of the supported datasets are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing). You have to unzip the files in ``preprocessed_data/LJSpeech/TextGrid/``.
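The TextGrid files store per-phone start/end times in seconds; during preprocessing these are converted into integer frame durations using the STFT hop size. A rough sketch of that conversion, assuming the LJSpeech defaults of sampling_rate=22050 and hop_length=256 (the interval values below are made up for illustration):

```python
# Convert aligned phone intervals (start, end in seconds) into
# frame-level durations, as done conceptually by the preprocessor.
sampling_rate = 22050  # audio sampling rate (assumed, per preprocess.yaml)
hop_length = 256       # STFT hop size in samples (assumed, per preprocess.yaml)

# (phone, start_sec, end_sec) triples; values are illustrative only.
intervals = [("HH", 0.00, 0.12), ("AH0", 0.12, 0.21), ("L", 0.21, 0.35)]

def to_frame(t):
    # Map a time in seconds to a mel-frame index.
    return int(round(t * sampling_rate / hop_length))

# Duration of each phone in frames; rounding the boundaries (rather than
# each interval length) keeps the durations summing to the total frame count.
durations = [to_frame(end) - to_frame(start) for _, start, end in intervals]
print(durations)       # [10, 8, 12]
print(sum(durations))  # 30
```

Summing these durations gives the expected mel-spectrogram length, which is how the duration targets line up with the extracted mel frames.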
After that, run the preprocessing script by

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

Alternatively, you can align the corpus by yourself. Download the official MFA package and run

```
./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
```

or

```
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
```

to align the corpus, and then run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

## Training

Train your model with

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The model takes less than 10k steps (less than 1 hour on my GTX 1080 Ti GPU) of training to generate audio samples with acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.

# TensorBoard

Use

```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

![](./img/tensorboard_loss.png)
![](./img/tensorboard_spec.png)
![](./img/tensorboard_audio.png)

# Implementation Issues

- Following [xcmyz's implementation](https://github.com/xcmyz/FastSpeech), I use an additional Tacotron-2-styled Post-Net after the decoder, which is not used in the original FastSpeech 2.
- Gradient clipping is used in training.
- In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps. Please refer to ``config/README.md`` for more details.

Please inform me if you find any mistakes in this repo, or any useful tips to train the FastSpeech 2 model.

# References

- [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren, *et al*.
- [xcmyz's FastSpeech implementation](https://github.com/xcmyz/FastSpeech)
- [TensorSpeech's FastSpeech 2 implementation](https://github.com/TensorSpeech/TensorflowTTS)
- [rishikksh20's FastSpeech 2 implementation](https://github.com/rishikksh20/FastSpeech2)

# Citation

```
@INPROCEEDINGS{chien2021investigating,
  author={Chien, Chung-Ming and Lin, Jheng-Hao and Huang, Chien-yu and Hsu, Po-chun and Lee, Hung-yi},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech},
  year={2021},
  volume={},
  number={},
  pages={8588-8592},
  doi={10.1109/ICASSP39728.2021.9413880}}
```