# tacotron2-vae

**Repository Path**: jeton/tacotron2-vae

## Basic Information

- **Project Name**: tacotron2-vae
- **Description**: Implementation of "Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis"
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-06-02
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# tacotron2-vae

## Overview

![overview](./res/overview.png)

- Generates emotional voices from text and an emotion label
- Korean version of ["Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis"](https://arxiv.org/pdf/1812.04342.pdf)

## Data

1. Dataset
    * Korean Speech Emotion Dataset ([more info](http://aicompanion.or.kr/kor/main/))
    * A single female voice actor recorded six different emotions (neutral, happy, sad, angry, disgust, fearful), 3,000 sentences each.
    * For training, I used four emotions (neutral, happy, sad, angry), 21.25 hours in total.

2. Text
    * test: `python -m text.cleaners`
    * examples

      ```
      감정있는 한국어 목소리 생성
      ==> ['ᄀ', 'ㅏ', 'ㅁ', 'ᄌ', 'ㅓ', 'ㅇ', 'ᄋ', 'ㅣ', 'ㅆ', 'ᄂ', 'ㅡ', 'ㄴ', ' ', 'ᄒ', 'ㅏ', 'ㄴ', 'ᄀ', 'ㅜ', 'ㄱ', 'ᄋ', 'ㅓ', ' ', 'ᄆ', 'ㅗ', 'ㄱ', 'ᄉ', 'ㅗ', 'ᄅ', 'ㅣ', ' ', 'ᄉ', 'ㅐ', 'ㅇ', 'ᄉ', 'ㅓ', 'ㅇ', '~']
      ==> [2, 21, 57, 14, 25, 62, 13, 41, 61, 4, 39, 45, 79, 20, 21, 45, 2, 34, 42, 13, 25, 79, 8, 29, 42, 11, 29, 7, 41, 79, 11, 22, 62, 11, 25, 62, 1]
      ```

3. Audio
    * sampling rate: 16000
    * filter length: 1024
    * hop length: 256
    * win length: 1024
    * n_mel: 80
    * mel_fmin: 0
    * mel_fmax: 8000

4. Training files
    * `./filelists/*.txt`
    * format: path | text | speaker | emotion
    * examples

      ```
      /KoreanEmotionSpeech/wav/neu/neu_00002289.wav|선생님이 초록색으로 머리를 염색하고 나타나서 모두들 깜짝 놀랐다.|0|0
      /KoreanEmotionSpeech/wav/sad/sad_00002266.wav|과외 선생님이 열심히 지도해준 덕택에 수학실력이 점점 늘고 있다.|0|1
      /KoreanEmotionSpeech/wav/ang/ang_00000019.wav|명백한 것은 각 당이 투사하고 있는 실상과 허상이 있다면 이제 허상은 걷어들여야 한다는 것이다.|0|2
      /KoreanEmotionSpeech/wav/hap/hap_00001920.wav|강력한 스크럽으로 상쾌한 양치효과를 주네요.|0|3
      ```

## Training

1. Prepare datasets
2. Clone this repo: `git clone https://github.com/jinhan/tacotron2-vae.git`
3. CD into this repo: `cd tacotron2-vae`
4. Initialize submodules: `git submodule init; git submodule update`
5. Update .wav paths: `sed -i -- 's,DUMMY,ljs_dataset_folder/wavs,g' filelists/*.txt`
6. Install requirements: `pip install -r requirements.txt`
7. Training: `python train.py --output_directory=outdir --log_directory=logdir --hparams=training_files='filelists/koemo_spk_emo_all_train.txt',validation_files='filelists/koemo_spk_emo_all_valid.txt',anneal_function='constant',batch_size=6`
8. Monitoring: `tensorboard --logdir=outdir/logdir --host=127.0.0.1`
9. Training results (~250,000 steps)

![alignment](./res/alignment.gif)

## Visualization

source: `inference.ipynb`

1. Load Models

tacotron2-vae model

```python
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.eval()
```

WaveGlow vocoder model

```python
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda()
```
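The snippets below also call a `load_mel` helper (defined in `inference.ipynb`) that turns a wav file into a mel spectrogram using the audio settings listed above. A minimal sketch of such a helper, assuming the repository keeps NVIDIA Tacotron 2's `layers.TacotronSTFT` and `utils.load_wav_to_torch`; the exact helper in the notebook may differ:

```python
import torch
from layers import TacotronSTFT       # assumption: same module layout as NVIDIA Tacotron 2
from utils import load_wav_to_torch   # assumption: helper from the base repo

# STFT configured with the audio parameters from hparams
stft = TacotronSTFT(
    hparams.filter_length, hparams.hop_length, hparams.win_length,
    hparams.n_mel_channels, hparams.sampling_rate,
    hparams.mel_fmin, hparams.mel_fmax)

def load_mel(path):
    # read the wav, normalize to [-1, 1], and compute the mel spectrogram
    audio, sampling_rate = load_wav_to_torch(path)
    if sampling_rate != hparams.sampling_rate:
        raise ValueError("sample rate mismatch: {} != {}".format(
            sampling_rate, hparams.sampling_rate))
    audio_norm = (audio / hparams.max_wav_value).unsqueeze(0)
    melspec = stft.mel_spectrogram(audio_norm)
    return melspec.cuda()
```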
2. Load Data

'Prosody' is the output of the fully connected layer that matches the dimension of z to the text-encoder output.

```python
path = './filelists/koemo_spk_emo_all_test.txt'
with open(path, encoding='utf-8') as f:
    filepaths_and_text = [line.strip().split("|") for line in f]

model.eval()
prosody_outputs = []
emotions = []
mus = []
zs = []

for audio_path, _, _, emotion in tqdm(filepaths_and_text):
    melspec = load_mel(audio_path)
    prosody, mu, _, z = model.vae_gst(melspec)
    prosody_outputs.append(prosody.squeeze(1).cpu().data)
    mus.append(mu.cpu().data)
    zs.append(z.cpu().data)
    emotions.append(int(emotion))

prosody_outputs = torch.cat(prosody_outputs, dim=0)
emotions = np.array(emotions)
mus = torch.cat(mus, dim=0)
zs = torch.cat(zs, dim=0)
```

3. Scatter plot

```python
colors = 'r', 'b', 'g', 'y'
labels = 'neu', 'sad', 'ang', 'hap'

data_x = mus.data.numpy()
data_y = emotions

plt.figure(figsize=(10, 10))
for i, (c, label) in enumerate(zip(colors, labels)):
    plt.scatter(data_x[data_y == i, 0], data_x[data_y == i, 1], c=c, label=label, alpha=0.5)
axes = plt.gca()
plt.grid(True)
plt.legend(loc='upper left')
```

![scatter plot](./res/scatter.png)

4. t-SNE plot

```python
colors = 'r', 'b', 'g', 'y'
labels = 'neu', 'sad', 'ang', 'hap'

data_x = mus
data_y = emotions

tsne_model = TSNE(n_components=2, random_state=0, init='random')
tsne_all_data = tsne_model.fit_transform(data_x)
tsne_all_y_data = data_y

plt.figure(figsize=(10, 10))
for i, (c, label) in enumerate(zip(colors, labels)):
    plt.scatter(tsne_all_data[tsne_all_y_data == i, 0], tsne_all_data[tsne_all_y_data == i, 1], c=c, label=label, alpha=0.5)
plt.grid(True)
plt.legend(loc='upper left')
```

![t-SNE plot](./res/tsne.png)

## Inference

source: `inference.ipynb`

### Reference Audio

Generate a voice that follows the style of a reference audio clip.

Reference audio

```python
def generate_audio_vae_by_ref(text, ref_audio):
    transcript_outputs = TextEncoder(text)

    print("reference audio")
    ipd.display(ipd.Audio(ref_audio, rate=hparams.sampling_rate))

    ref_audio_mel = load_mel(ref_audio)
    latent_vector, _, _, _ = model.vae_gst(ref_audio_mel)
    latent_vector = latent_vector.unsqueeze(1).expand_as(transcript_outputs)

    encoder_outputs = transcript_outputs + latent_vector

    synth, mel_outputs = Decoder(encoder_outputs)
    ipd.display(ipd.Audio(synth[0].data.cpu().numpy(), rate=hparams.sampling_rate))
    ipd.display(plot_data(mel_outputs.data.cpu().numpy()[0]))
```

Generate voice

```python
text = "이 모델을 이용하면 같은 문장을 여러가지 스타일로 말할 수 있습니다."
ref_wav = "/KoreanEmotionSpeech/wav/ang/ang_00000100.wav"
generate_audio_vae_by_ref(text, ref_wav)
```
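`TextEncoder` and `Decoder` above (and in the interpolation and mixer examples that follow) are helpers defined in `inference.ipynb`: the first maps text to Tacotron 2 encoder outputs, the second runs the decoder and postnet and vocodes the mel spectrogram with WaveGlow. A rough sketch, assuming the fork keeps NVIDIA Tacotron 2's module layout (`model.embedding`, `model.encoder.inference`, `model.decoder.inference`, `model.postnet`); the cleaner name `korean_cleaners` is also an assumption:

```python
import numpy as np
import torch
from text import text_to_sequence  # assumption: text frontend from the base Tacotron 2 repo

def TextEncoder(text):
    # text -> symbol ids -> character embedding -> encoder memory
    sequence = np.array(text_to_sequence(text, ['korean_cleaners']))[None, :]  # cleaner name assumed
    sequence = torch.from_numpy(sequence).cuda().long()
    embedded_inputs = model.embedding(sequence).transpose(1, 2)
    return model.encoder.inference(embedded_inputs)

def Decoder(encoder_outputs):
    # decoder + postnet produce the mel spectrogram; WaveGlow turns it into a waveform
    mel_outputs, gate_outputs, alignments = model.decoder.inference(encoder_outputs)
    mel_outputs_postnet = mel_outputs + model.postnet(mel_outputs)
    with torch.no_grad():
        audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
    return audio, mel_outputs_postnet
```

`sigma=0.666` follows the value used in NVIDIA's Tacotron 2 inference notebook; lower values reduce vocoder noise at some cost in brightness.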
### Interpolation

Create a new z by interpolating between two emotion centroids with a set of ratios.

Interpolation

```python
def generate_audio_vae(text, ref_audio, trg_audio, ratios):
    transcript_outputs = TextEncoder(text)

    for ratio in ratios:
        latent_vector = ref_audio * ratio + trg_audio * (1.0 - ratio)
        latent_vector = torch.FloatTensor(latent_vector).cuda()
        latent_vector = model.vae_gst.fc3(latent_vector)

        encoder_outputs = transcript_outputs + latent_vector

        synth, mel_outputs_postnet = Decoder(encoder_outputs)
        ipd.display(ipd.Audio(synth[0].data.cpu().numpy(), rate=hparams.sampling_rate))
        ipd.display(plot_data(mel_outputs_postnet.data.cpu().numpy()[0]))
```

Get Centroids

```python
encoded = zs.data.numpy()
neu = np.mean(encoded[emotions == 0, :], axis=0)
sad = np.mean(encoded[emotions == 1, :], axis=0)
ang = np.mean(encoded[emotions == 2, :], axis=0)
hap = np.mean(encoded[emotions == 3, :], axis=0)
```

Generate voice

```python
text = "이 모델을 이용하면 같은 문장을 여러가지 스타일로 말할 수 있습니다."
ref_audio = hap
trg_audio = sad
ratios = [1.0, 0.64, 0.34, 0.0]
generate_audio_vae(text, ref_audio, trg_audio, ratios)
```

### Mixer

Mix two or more emotion labels at a desired ratio.

Mixer

```python
def generate_audio_vae_mix(text, ratios):
    transcript_outputs = TextEncoder(text)

    latent_vector = ratios[0] * neu + ratios[1] * hap + ratios[2] * sad + ratios[3] * ang
    latent_vector = torch.FloatTensor(latent_vector).cuda()
    latent_vector = model.vae_gst.fc3(latent_vector)

    encoder_outputs = transcript_outputs + latent_vector

    synth, mel_outputs = Decoder(encoder_outputs)
    ipd.display(ipd.Audio(synth[0].data.cpu().numpy(), rate=hparams.sampling_rate))
    ipd.display(plot_data(mel_outputs.data.cpu().numpy()[0]))
```

Generate voice

```python
text = "이 모델을 이용하면 같은 문장을 여러가지 스타일로 말할 수 있습니다."
ratios = [0.0, 0.25, 0.0, 0.75]  # neu, hap, sad, ang
generate_audio_vae_mix(text, ratios)
```

## Demo page

1. Run: `python app.py --checkpoint_path="./models/032902_vae_250000" --waveglow_path="./models/waveglow_130000"`
2. Mix: generate voices by adjusting the ratio of neutral, sad, happy, and angry
3. Ref Audio: generate voices using a test-set audio clip as the reference audio

![demo page](./res/demo.png)

## Samples

- Interpolation: result of interpolating between two labels at ratios 1.0, 0.66, 0.33, and 0.0
- refs: result of using recorded audio as the reference audio
- mix: result of mixing two or more labels at a desired ratio

## References

- Tacotron2: https://github.com/NVIDIA/tacotron2
- Prosody Encoder: https://github.com/KinglittleQ/GST-Tacotron/blob/master/GST.py
- WaveGlow: https://github.com/NVIDIA/waveglow
- Export Images from tfevents: https://github.com/anderskm/exportTensorFlowLog