Real-Time State-of-the-Art Speech Synthesis for TensorFlow 2
:zany_face: TensorflowTTS provides real-time, state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize further with fake-quantization-aware training and pruning, make TTS models run faster than real time, and deploy them on mobile devices or embedded systems.
This repository is tested on Ubuntu 18.04 with:
Other TensorFlow versions should work but are not tested yet. This repo aims to track the latest stable TensorFlow version.
$ git clone https://github.com/dathudeptrai/TensorflowTTS.git
$ cd TensorflowTTS
$ pip install .
If you want to upgrade the repository and its dependencies:
$ git pull
$ pip install --upgrade .
TensorflowTTS currently provides the following architectures:
We also implement several techniques to improve quality and convergence speed from the following papers:
Here are audio samples on the validation set: tacotron-2, fastspeech, melgan, melgan.stft, fastspeech2.
Prepare a dataset in the following format:
|- datasets/
| |- metadata.csv
| |- wav/
| |- file1.wav
| |- ...
where metadata.csv has the format id|transcription. This is an LJSpeech-like format; if your dataset is in another format, you can skip the preprocessing step.
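For illustration, a minimal sketch of parsing this metadata format (the ids and transcriptions below are made up, not from a real dataset):

```python
# Parse an LJSpeech-like metadata.csv: each line is "id|transcription".
# The example lines are illustrative only.
lines = [
    "file1|Hello world.",
    "file2|Speech synthesis is fun.",
]

def parse_metadata(lines):
    """Return a list of (utterance_id, transcription) pairs."""
    pairs = []
    for line in lines:
        # Split only on the first "|" so transcriptions may contain "|".
        utt_id, transcription = line.strip().split("|", 1)
        pairs.append((utt_id, transcription))
    return pairs

pairs = parse_metadata(lines)
print(pairs[0])  # ('file1', 'Hello world.')
```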
Preprocessing has three steps; the following command lines run them in order:
tensorflow-tts-preprocess --rootdir ./datasets/ --outdir ./dump/ --conf preprocess/ljspeech_preprocess.yaml
tensorflow-tts-compute-statistics --rootdir ./dump/train/ --outdir ./dump --config preprocess/ljspeech_preprocess.yaml
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --stats ./dump/stats.npy --config preprocess/ljspeech_preprocess.yaml
After preprocessing, the project structure will be:
|- datasets/
| |- metadata.csv
| |- wav/
| |- file1.wav
| |- ...
|- dump/
| |- train/
| |- ids/
| |- LJ001-0001-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0001-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0001-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0001-wave.npy
| |- ...
| |- valid/
| |- ids/
| |- LJ001-0009-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0009-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0009-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0009-wave.npy
| |- ...
| |- stats.npy
| |- stats_f0.npy
| |- stats_energy.npy
| |- train_utt_ids.npy
| |- valid_utt_ids.npy
|- examples/
| |- melgan/
| |- fastspeech/
| |- tacotron2/
| ...
Where stats.npy contains the mean/variance of the training mel-spectrograms (we can use the mean/variance to de-normalize predictions back to raw mel-spectrograms), stats_energy.npy contains the min/max of energy values over the training dataset, stats_f0.npy contains the min/max of F0 values, and train_utt_ids.npy/valid_utt_ids.npy contain the training and validation utterance ids, respectively. We use a suffix (ids, raw-feats, norm-feats, wave) for each type of input.
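The de-normalization step can be sketched as follows. This is a minimal sketch assuming stats.npy stores a per-dimension mean and scale (standard deviation) of the training mel-spectrograms; the arrays here are fabricated for illustration rather than loaded from a real dump directory:

```python
import numpy as np

# Illustrative stand-ins for the per-dimension statistics that
# stats.npy would hold for a real training set.
mean = np.array([0.5, -1.0, 2.0], dtype=np.float32)
scale = np.array([1.5, 0.5, 2.0], dtype=np.float32)

def denormalize(norm_mel, mean, scale):
    """Invert z-score normalization: raw = norm * scale + mean."""
    return norm_mel * scale + mean

# A fake normalized mel-spectrogram of shape [frames, mel-dims].
norm_mel = np.zeros((4, 3), dtype=np.float32)
raw_mel = denormalize(norm_mel, mean, scale)
# Every frame of an all-zero normalized mel recovers the per-dim mean.
```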
IMPORTANT NOTES:
To learn how to train a model from scratch or fine-tune on other datasets/languages, please see the details in the example directory.
A detailed implementation of the abstract dataset class is in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand:
IMPORTANT NOTES:
Some examples using this abstract_dataset are tacotron_dataset.py, fastspeech_dataset.py, and melgan_dataset.py.
A detailed implementation of the base trainer is in tensorflow_tts/trainer/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. There are some functions you MUST override when implementing a new trainer:
All models in this repo are trained with GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) or Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py). In the near future, we will implement multi-GPU support for the BasedTrainer class.
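A new trainer typically overrides the per-batch optimization step while the training loop stays in the base class. A minimal sketch of that pattern (method and class names are illustrative, not the exact tensorflow_tts API):

```python
import abc

class BasedTrainer(abc.ABC):
    """Sketch of a base trainer: the loop is fixed, the step is overridden."""

    def __init__(self):
        self.steps = 0

    @abc.abstractmethod
    def _train_step(self, batch):
        """Run one optimization step on a batch; must be overridden."""

    def fit(self, batches):
        for batch in batches:
            self._train_step(batch)
            self.steps += 1

class SumTrainer(BasedTrainer):
    """Toy trainer that just accumulates batch values in place of a
    real gradient update."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def _train_step(self, batch):
        self.total += batch

trainer = SumTrainer()
trainer.fit([1, 2, 3])
```

In the real classes, `_train_step` is where a Seq2Seq trainer computes a reconstruction loss and a GAN trainer alternates generator/discriminator updates; everything else (step counting, checkpointing, logging) can live in the base.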
You can learn how to run inference for each model in the notebooks or in a colab. Here is example code for end-to-end inference with FastSpeech and MelGAN.
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf
from tensorflow_tts.processor import LJSpeechProcessor
from tensorflow_tts.configs import FastSpeechConfig
from tensorflow_tts.configs import MelGANGeneratorConfig
from tensorflow_tts.models import TFFastSpeech
from tensorflow_tts.models import TFMelGANGenerator
# initialize fastspeech model.
with open('./examples/fastspeech/conf/fastspeech.v1.yaml') as f:
    fs_config = yaml.load(f, Loader=yaml.Loader)
fs_config = FastSpeechConfig(**fs_config["fastspeech_params"])
fastspeech = TFFastSpeech(config=fs_config, name="fastspeech")
fastspeech._build()
fastspeech.load_weights("./examples/fastspeech/pretrained/model-195000.h5")
# initialize melgan model
with open('./examples/melgan/conf/melgan.v1.yaml') as f:
    melgan_config = yaml.load(f, Loader=yaml.Loader)
melgan_config = MelGANGeneratorConfig(**melgan_config["generator_params"])
melgan = TFMelGANGenerator(config=melgan_config, name='melgan_generator')
melgan._build()
melgan.load_weights("./examples/melgan/pretrained/generator-1500000.h5")
# inference
processor = LJSpeechProcessor(None, cleaner_names="english_cleaners")
ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
ids = tf.expand_dims(ids, 0)
# fastspeech inference
masked_mel_before, masked_mel_after, duration_outputs = fastspeech.inference(
    ids,
    attention_mask=tf.math.not_equal(ids, 0),
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]]),
    speed_ratios=tf.constant([1.0], dtype=tf.float32)
)
# melgan inference
audio_before = melgan(masked_mel_before)[0, :, 0]
audio_after = melgan(masked_mel_after)[0, :, 0]
# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
dathudeptrai: nguyenquananhminh@gmail.com, erogol: erengolge@gmail.com
Overall, most models here are licensed under Apache 2.0 for all countries in the world, except in Vietnam, where this framework cannot be used for production in any way without permission from TensorflowTTS's authors. There is one exception: Tacotron-2 can be used for any purpose. So, if you are Vietnamese and want to use this framework for production, you MUST contact us in advance.
We would like to thank Tomoki Hayashi, who discussed much with us about MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron. This framework is based on his great open-source ParallelWaveGAN project.