---
layout: hub_detail
background-class: hub-background
body-class: hub
title: Tacotron 2
summary: The Tacotron 2 model for generating mel spectrograms from text
category: researchers
image: nvidia_logo.png
author: NVIDIA
tags:
github-link: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2
github-id: NVIDIA/DeepLearningExamples
featured_image_1: tacotron2_diagram.png
featured_image_2: no-image
accelerator: cuda
order: 10
---
```python
import torch
tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
```
will load the Tacotron 2 model pre-trained on the LJ Speech dataset.
The Tacotron 2 and WaveGlow models form a text-to-speech system that lets users synthesize natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow (also available via torch.hub) is a flow-based model that consumes the mel spectrograms to generate speech.
This implementation of Tacotron 2 differs from the model described in the paper: it uses Dropout instead of Zoneout to regularize the LSTM layers.
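As a rough illustration of that difference (a minimal sketch, not the actual Tacotron 2 code; the class name and sizes are made up), dropout zeroes random activations in the LSTM output, whereas zoneout would instead keep randomly chosen hidden units frozen at their value from the previous time step:

```python
import torch
import torch.nn as nn

class DropoutLSTM(nn.Module):
    """Illustrative only: an LSTM regularized with dropout on its outputs."""
    def __init__(self, input_size, hidden_size, p_dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p_dropout)

    def forward(self, x):
        out, _ = self.lstm(x)
        # Dropout zeroes random activations; zoneout would instead freeze
        # random hidden/cell units at their previous-step values.
        return self.dropout(out)

x = torch.randn(2, 50, 32)   # (batch, time, features)
y = DropoutLSTM(32, 64)(x)   # (2, 50, 64)
```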
To run the example below, you need a few extra Python packages installed. These are needed for preprocessing the text and audio, as well as for display and input/output.
```bash
pip install numpy scipy librosa unidecode inflect
```
```python
import numpy as np
from scipy.io.wavfile import write
```
Prepare the tacotron2 model for inference:
```python
# move the model to the GPU and switch to evaluation mode
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()
```
Load WaveGlow from PyTorch Hub and prepare it for inference:
```python
waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')
# fold the weight-normalization reparametrization into plain weights
# for faster inference
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()
```
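`remove_weightnorm` folds the weight-normalization reparametrization (magnitude and direction stored as separate tensors) back into ordinary weight tensors, which saves a little work at inference time. The same idea in plain PyTorch (a generic sketch, unrelated to the WaveGlow internals):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm, remove_weight_norm

conv = weight_norm(nn.Conv1d(80, 256, kernel_size=3))  # trains with weight_g and weight_v
remove_weight_norm(conv)  # recomputes conv.weight = g * v / ||v|| and drops the hook
```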
Now, let's make the model say "hello world, I missed you"
```python
text = "hello world, I missed you"
```
Now chain pre-processing -> Tacotron 2 -> WaveGlow:
```python
# preprocessing: convert the text to a sequence of character IDs
# and add a batch dimension
sequence = np.array(tacotron2.text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)

# run the models
with torch.no_grad():
    _, mel, _, _ = tacotron2.infer(sequence)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050  # sampling rate (Hz) the models were trained with
```
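As an optional sanity check (a sketch using the variables from the snippet above; for this model the mel tensor is typically shaped (batch, mel channels, frames)), you can inspect the outputs before saving:

```python
print(mel.shape)                                  # e.g. torch.Size([1, 80, ...])
print(audio_numpy.shape[0] / rate, "seconds of audio")
```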
You can write it to a file and listen to it:
write("audio.wav", rate, audio_numpy)
Alternatively, play it right away in a notebook with IPython widgets:
```python
from IPython.display import Audio
Audio(audio_numpy, rate=rate)
```
For detailed information on model input and output, training recipes, inference, and performance, visit [github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) and/or NGC.