Speech Synthesis

Model by Open Source

This model converts a piece of text into an audio file. It accepts a string of UTF-8 text as input and returns a WAV audio file containing a spoken rendering of that text. This model can be used for call center automation, interactive responses from IoT devices, or transforming text to be consumed as audio, for example while driving or for the visually impaired.

  • Description

    Product Description

    PERFORMANCE METRICS:

    4.04 Average RTF with Batch Size 1 and Precision FP16

    1.65 Average RTF with Batch Size 4 and Precision FP16

    This model was trained on the LJSpeech-1.1 dataset. The results were obtained by running the training script in the PyTorch-19.06-py3 NGC container on an NVIDIA DGX-1 with 8x V100 16G GPUs. Training performance numbers (in output mel-spectrograms per second for Tacotron 2 and output audio samples per second for WaveGlow) were averaged over an entire training epoch. At inference, the system achieves an average real-time factor (RTF) of 4.04 with Batch Size 1 and Precision FP16, 1.65 with Batch Size 4 and Precision FP16, 3.71 with Batch Size 1 and Precision FP32, and 1.43 with Batch Size 4 and Precision FP32.

    Average RTF with Batch Size 1 and Precision FP16 measures the ratio of the duration of the synthesized speech to the time taken to generate it, with inference run in half-precision floating-point format.


    Average RTF with Batch Size 4 and Precision FP16 measures the ratio of the duration of the synthesized speech to the time taken to generate it, with inference run in half-precision floating-point format.

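    As reported above, an RTF greater than 1 means audio is generated faster than real time. A minimal sketch of the computation (the ratio direction, audio duration over synthesis time, is our reading of the reported numbers):

```python
def real_time_factor(audio_duration_s, synthesis_time_s):
    """RTF as used here: duration of the generated audio divided by the
    time taken to synthesize it. Values above 1.0 are faster than
    real time; higher is better."""
    return audio_duration_s / synthesis_time_s

# Example: 10 s of audio synthesized in 2.5 s -> RTF 4.0
rtf = real_time_factor(10.0, 2.5)
```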

    OVERVIEW:

    This text-to-speech (TTS) system is a combination of two neural network models: Tacotron 2 and WaveGlow.

    Together, the two models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional prosodic information such as speech patterns or rhythms.

    The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder that produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet is replaced by the flow-based generative WaveGlow.
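    The autoregressive decoding loop described above, producing one spectrogram frame at a time until a stop signal, can be sketched with a toy NumPy loop. The sizes, random weights, and stop threshold are all stand-ins, not the real Tacotron 2:

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS = 80    # mel channels per frame (Tacotron 2 default)
HIDDEN = 16    # toy encoder-summary size, not the real model's

# Stand-ins for trained decoder and stop-gate weights.
W_dec = rng.standard_normal((HIDDEN + N_MELS, N_MELS)) * 0.01
W_stop = rng.standard_normal(N_MELS)

def decode(encoder_state, max_frames=100):
    """Emit mel frames one at a time, each conditioned on the previous
    frame plus a fixed encoder summary, until the stop gate fires."""
    frames = []
    prev = np.zeros(N_MELS)                  # all-zero "go" frame
    for _ in range(max_frames):
        inp = np.concatenate([encoder_state, prev])
        frame = np.tanh(inp @ W_dec)         # next spectrogram frame
        frames.append(frame)
        prev = frame                         # feed back autoregressively
        # stop-token gate: sigmoid over a projection of the frame
        if 1.0 / (1.0 + np.exp(-frame @ W_stop)) > 0.9:
            break
    return np.stack(frames)

mel_frames = decode(rng.standard_normal(HIDDEN))
```

    In the real model an attention mechanism re-weights the encoder outputs at every step; the fixed `encoder_state` here elides that for brevity.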

    The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning. During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted, and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.
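    The key property of the affine coupling layer is that it is exactly invertible without ever inverting the inner network. A minimal NumPy sketch (the `toy_net` stands in for the WaveNet-like transform; the real model also conditions it on mel-spectrograms):

```python
import numpy as np

def toy_net(h):
    """Stand-in for the WaveNet-like transform that predicts a
    log-scale and shift. Invertibility of the coupling never requires
    inverting this network, so any function works here."""
    return np.tanh(h), 0.5 * h   # (log_s, t)

def coupling_forward(x):
    """Half the channels pass through; the other half is affinely
    transformed using scale/shift predicted from the first half."""
    xa, xb = np.split(x, 2)
    log_s, t = toy_net(xa)
    return np.concatenate([xa, xb * np.exp(log_s) + t])

def coupling_inverse(z):
    """Recompute log_s and t from the untouched half, then undo the
    affine transform exactly."""
    za, zb = np.split(z, 2)
    log_s, t = toy_net(za)
    return np.concatenate([za, (zb - t) * np.exp(-log_s)])

x = np.random.default_rng(1).standard_normal(8)
x_rec = coupling_inverse(coupling_forward(x))
# x_rec matches x up to floating-point error
```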

    TRAINING:

    The Tacotron 2 and WaveGlow models were trained separately and independently on the LJSpeech-1.1 dataset. Both models obtain mel-spectrograms via the short-time Fourier transform (STFT) during training. These mel-spectrograms are used for loss computation in the case of Tacotron 2 and as conditioning input to the network in the case of WaveGlow.
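    The mel-spectrogram extraction can be sketched as follows. The equal-width band pooling is a simplification of the real triangular mel filterbank, and the parameter values (1024-point FFT, 256-sample hop, 80 mel bands at 22050 Hz) are typical defaults rather than confirmed settings:

```python
import numpy as np

def stft_mel(signal, n_fft=1024, hop=256, n_mels=80):
    """Toy mel-spectrogram: frame and window the signal, take the FFT
    power spectrum, then pool bins into coarse bands and take the log."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # crude "mel" pooling: average adjacent FFT bins into n_mels bands
    # (a real filterbank uses overlapping triangles on the mel scale)
    bands = np.array_split(np.arange(power.shape[1]), n_mels)
    mel = np.stack([power[:, b].mean(axis=1) for b in bands], axis=1)
    return np.log(mel + 1e-10)

# one second of noise at 22050 Hz -> (frames, 80) log-mel matrix
mel = stft_mel(np.random.default_rng(2).standard_normal(22050))
```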

    The training loss is averaged over an entire training epoch, whereas the validation loss is averaged over the validation dataset. Performance is reported in total output mel-spectrograms per second for the Tacotron 2 model and in total output samples per second for the WaveGlow model. Both measures are recorded as train_iter_items/sec (after each iteration) and train_epoch_items/sec (averaged over epoch) in the output log file. The result is averaged over an entire training epoch and summed over all GPUs that were included in the training.
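    The train_epoch_items/sec figure as described above (per-iteration throughput averaged over the epoch, then summed across GPUs) can be sketched as follows; the function name is ours, not the training script's:

```python
def epoch_items_per_sec(items_per_iteration, iteration_times_s, num_gpus):
    """Per-GPU throughput for each iteration, averaged over the epoch,
    then summed (i.e. scaled) across identical GPUs."""
    per_iter = [n / t for n, t in zip(items_per_iteration, iteration_times_s)]
    return num_gpus * sum(per_iter) / len(per_iter)

# e.g. two iterations of 100 items taking 1.0 s and 2.0 s, on 8 GPUs
throughput = epoch_items_per_sec([100, 100], [1.0, 2.0], 8)  # -> 600.0
```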

    VALIDATION:

    The performance of the model was tested on a validation dataset that is a subset of the dataset the models were trained on.

    INPUT SPECIFICATION

    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.txt    1M              .txt

    OUTPUT DETAILS

    This model will output the following:

    Filename       Maximum Size    Format
    results.wav    100M            .wav
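    For sanity-checking output of this shape, Python's standard wave module can read and write PCM .wav files. A minimal sketch; 22050 Hz matches the LJSpeech training data, but the service's actual sample rate is an assumption:

```python
import math
import struct
import wave

def write_sine_wav(path, seconds=1.0, freq=440.0, rate=22050):
    """Write a 16-bit mono PCM .wav file containing a sine tone,
    the same container format as the model's results.wav output."""
    n = int(seconds * rate)
    samples = (int(32767 * math.sin(2 * math.pi * freq * i / rate))
               for i in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))
```

    Reading the result back with wave.open(path, "rb") exposes the channel count, sample width, frame rate, and frame count for verification.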