This model converts text into speech audio. It accepts a string of UTF-8 text as input and returns a WAV audio file containing a spoken version of that text. It can be used for call center automation, interactive responses from IoT devices, or turning text into audio where reading is impractical, for example while driving or for visually impaired users.
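Because the output is a WAV file, its basic properties can be checked with Python's standard `wave` module. The sketch below is illustrative only: `make_silent_wav` is a hypothetical stand-in for the model's output, and the 22050 Hz mono 16-bit format is an assumption, not a documented property of this model.

```python
import io
import wave


def make_silent_wav(seconds: float, sample_rate: int = 22050) -> bytes:
    """Build a mono 16-bit PCM WAV of silence (hypothetical stand-in for model output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono (assumed)
        wav.setsampwidth(2)           # 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration in seconds of a WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


if __name__ == "__main__":
    audio = make_silent_wav(2.0)
    print(f"duration: {wav_duration_seconds(audio):.2f} s")
```

The duration obtained this way is the utterance duration that a real-time-factor calculation needs.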
Average RTF with Batch Size 1 and Precision FP16: 4.04
Average RTF with Batch Size 4 and Precision FP16: 1.65
This model was trained on the LJSpeech-1.1 dataset. The results were obtained by running the training script in the PyTorch-19.06-py3 NGC container on an NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in output mel-spectrograms per second for Tacotron 2 and output samples per second for WaveGlow) were averaged over an entire training epoch. More detailed performance information can be found here. With FP16 precision, the model achieves an average real-time factor (RTF) of 4.04 at batch size 1 and 1.65 at batch size 4. With FP32 precision, it achieves an average RTF of 3.71 at batch size 1 and 1.43 at batch size 4.
Average RTF with Batch Size 1 and Precision FP16 measures the ratio of speech synthesis response time to the utterance duration when the model runs in half-precision floating-point (FP16) format with a batch size of 1.
Average RTF with Batch Size 4 and Precision FP16 measures the ratio of speech synthesis response time to the utterance duration when the model runs in half-precision floating-point (FP16) format with a batch size of 4.
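Taking the ratio described above at face value, RTF can be computed from a measured response time and the duration of the resulting utterance. The following is a minimal sketch: the function names are hypothetical, and averaging the per-utterance ratios (rather than dividing total time by total duration) is an assumption, since the text does not say how the reported averages were formed.

```python
from typing import Iterable, Tuple


def real_time_factor(response_time_s: float, utterance_duration_s: float) -> float:
    """Ratio of response time to utterance duration, both in seconds."""
    if utterance_duration_s <= 0:
        raise ValueError("utterance duration must be positive")
    return response_time_s / utterance_duration_s


def average_rtf(measurements: Iterable[Tuple[float, float]]) -> float:
    """Mean of per-utterance RTFs (assumed averaging scheme).

    Each measurement is a (response_time_s, utterance_duration_s) pair.
    """
    ratios = [real_time_factor(t, d) for t, d in measurements]
    if not ratios:
        raise ValueError("at least one measurement is required")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Made-up timings for illustration only, not measured results.
    print(average_rtf([(2.0, 1.0), (3.0, 2.0)]))
```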
This text-to-speech (TTS) system is a combination of two neural network models:
a modified Tacotron 2 model
WaveGlow, a flow-based neural network model
Together, the Tacotron 2 and WaveGlow models form a text-to-speech system that lets users synthesize natural-sounding speech from raw transcripts without any additional information, such as the patterns or rhythms of speech.
The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder, which produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet vocoder used in the original Tacotron 2 is replaced by the flow-based generative WaveGlow.
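The frame-by-frame behavior of an autoregressive decoder can be illustrated with a toy loop. This is a sketch under stated assumptions, not the Tacotron 2 decoder: `toy_step` is a hypothetical stand-in for the attention and recurrent layers, the frame has the same size as the encoder context purely for simplicity, and the stop-probability threshold is an illustrative choice.

```python
from typing import Callable, List, Tuple

Frame = List[float]
StepFn = Callable[[Frame, Frame], Tuple[Frame, float]]


def decode_autoregressively(
    context: Frame,
    step_fn: StepFn,
    max_frames: int,
    stop_threshold: float = 0.5,
) -> List[Frame]:
    """Generate frames one at a time, feeding each frame back as the next input."""
    frames: List[Frame] = []
    prev = [0.0] * len(context)  # all-zero initial frame
    for _ in range(max_frames):
        frame, stop_prob = step_fn(prev, context)
        frames.append(frame)
        if stop_prob > stop_threshold:
            break  # the step function signals the end of the utterance
        prev = frame
    return frames


def toy_step(prev: Frame, context: Frame) -> Tuple[Frame, float]:
    """Hypothetical decoder step: mixes the previous frame with the context."""
    frame = [0.5 * p + c for p, c in zip(prev, context)]
    stop_prob = frame[0] / 2.0  # toy stop signal that grows as frames converge
    return frame, stop_prob


if __name__ == "__main__":
    for f in decode_autoregressively([1.0, 2.0], toy_step, max_frames=10, stop_threshold=0.8):
        print(f)
```

Each iteration depends on the frame produced by the previous one, which is why this kind of decoding cannot be parallelized across time steps.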
The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution, conditioned on mel-spectrograms. During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted, and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.
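The reason such a network can be inverted at inference time is that an affine coupling layer is invertible by construction. The toy sketch below shows only that property: `toy_conditioner` is a hypothetical stand-in for the modified WaveNet, the invertible convolution is omitted, and the vectors are tiny Python lists rather than audio tensors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def toy_conditioner(x_a: Vector, mel: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the WaveNet-like network.

    Returns a log-scale and a shift for each element of the other half.
    """
    log_s = [math.tanh(a + m) for a, m in zip(x_a, mel)]
    t = [0.5 * a - m for a, m in zip(x_a, mel)]
    return log_s, t


def coupling_forward(x: Vector, mel: Vector) -> Vector:
    """Training direction: keep the first half, affinely transform the second."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, mel)
    z_b = [math.exp(ls) * b + sh for ls, b, sh in zip(log_s, x_b, t)]
    return x_a + z_b


def coupling_inverse(z: Vector, mel: Vector) -> Vector:
    """Inference direction: recompute scale and shift, then undo the transform."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = toy_conditioner(z_a, mel)  # same inputs as in the forward pass
    x_b = [(b - sh) / math.exp(ls) for ls, b, sh in zip(log_s, z_b, t)]
    return z_a + x_b


if __name__ == "__main__":
    x = [0.1, -0.4, 0.7, 0.2]
    mel = [0.3, -0.2]
    z = coupling_forward(x, mel)
    print(coupling_inverse(z, mel))  # recovers x up to floating-point error
```

The first half passes through unchanged, so the inverse can recompute exactly the same scale and shift; the conditioner itself never needs to be inverted.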
The Tacotron 2 and WaveGlow models were trained separately and independently on the LJSpeech-1.1 dataset. Both models obtain mel-spectrograms from the short-time Fourier transform (STFT) during training. These mel-spectrograms are used for loss computation in the case of Tacotron 2 and as conditioning input to the network in the case of WaveGlow.
The training loss is averaged over an entire training epoch, whereas the validation loss is averaged over the validation dataset. Performance is reported in total output mel-spectrograms per second for the Tacotron 2 model and in total output samples per second for the WaveGlow model. Both measures are recorded as train_iter_items/sec (after each iteration) and train_epoch_items/sec (averaged over epoch) in the output log file. The result is averaged over an entire training epoch and summed over all GPUs that were included in the training.
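The aggregation described above (average over the epoch, then sum over GPUs) can be sketched as follows. The function name and input layout are hypothetical; how the actual training script collects per-GPU rates is not stated here, so this shows only the arithmetic.

```python
from typing import List


def epoch_items_per_sec(per_gpu_iter_rates: List[List[float]]) -> float:
    """Average each GPU's per-iteration items/sec over the epoch, then sum across GPUs.

    per_gpu_iter_rates[g][i] is the throughput of GPU g at iteration i.
    """
    if not per_gpu_iter_rates or any(not rates for rates in per_gpu_iter_rates):
        raise ValueError("every GPU needs at least one iteration measurement")
    per_gpu_average = [sum(rates) / len(rates) for rates in per_gpu_iter_rates]
    return sum(per_gpu_average)


if __name__ == "__main__":
    # Made-up per-iteration rates for two GPUs, for illustration only.
    print(epoch_items_per_sec([[100.0, 200.0], [300.0, 500.0]]))
```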
The performance of the model was tested on a validation dataset that is a subset of the dataset the models were trained on.
The input(s) to this model must adhere to the following specifications:
This model will output the following: