Speech Transcription

Model by Open Source

This model is an open source speech-to-text engine built on a model trained with machine learning techniques based on Baidu’s Deep Speech research paper. It takes a WAV file as input and outputs the text of the transcribed speech.

This model can be used to transcribe recorded speech, for example to analyze customer service phone calls or to convert spoken messages into email or text messages.

  • Description

    Product Description


    8.2% Average Word Error Rate

    The model was tested on the LibriSpeech clean test corpus, which was created from the LibriVox project, which has many open source audiobooks. This dataset has about 1,000 hours of speech sampled at 16 kHz along with its accompanying transcribed text. The model achieved a word error rate of 8.22% on this dataset, which is competitive with existing state-of-the-art speech recognition software. The word error rate is derived from the Levenshtein distance computed at the word level instead of the phoneme level. The validation and testing audio of this corpus are each about 5 hours long. Further information here.

    Measures the average word error rate over multiple texts.
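    The metric can be illustrated with a minimal sketch: a word-level Levenshtein distance divided by the number of reference words, then averaged over several texts. The function names are hypothetical, and the evaluation script actually used for the 8.22% figure may normalize text or aggregate differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_word_error_rate(pairs) -> float:
    """Mean of the per-text WER over (reference, hypothesis) pairs."""
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    return sum(rates) / len(rates)
```

    Averaging per-text rates is only one way to aggregate; another common choice is to divide the total number of word errors by the total number of reference words, which weights long texts more heavily.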



    This model was trained on the Common Voice dataset by Mozilla, which contains transcribed speech in many different languages. Training used a well-optimized RNN training system that runs on multiple GPUs, as well as a set of novel data synthesis techniques that allowed the developers to efficiently obtain a large amount of varied training data. More information can be found at the DeepSpeech repository. The audio is first converted to a spectrogram and is then fed into a customized RNN for training. The RNN model is composed of 5 hidden layers. The first three layers are not recurrent, the fourth is a bidirectional recurrent layer, and the fifth is again non-recurrent.
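    As an illustration of the spectrogram step, the following sketch computes a log-power spectrogram from raw samples with NumPy. The 20 ms window, 10 ms stride, and Hann taper are assumptions chosen for the example and are not taken from this model's configuration; the function name is hypothetical.

```python
import numpy as np


def spectrogram(samples, sample_rate=16000, window_ms=20, stride_ms=10):
    """Return a (frames, frequency_bins) log-power spectrogram.

    Window and stride lengths are illustrative assumptions.
    """
    samples = np.asarray(samples, dtype=np.float64)
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    if len(samples) < window:
        raise ValueError("audio is shorter than one analysis window")

    n_frames = 1 + (len(samples) - window) // stride
    taper = np.hanning(window)  # reduces spectral leakage at frame edges
    frames = np.stack(
        [samples[i * stride : i * stride + window] * taper for i in range(n_frames)]
    )

    # rfft keeps only the non-negative frequencies: window // 2 + 1 bins.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes ** 2 + 1e-10)  # small constant avoids log(0)
```

    With these assumed settings, one second of 16 kHz audio yields 99 frames of 161 frequency bins each.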

    For the first layer, the output at each time step depends on the corresponding spectrogram frame along with a context of a given number of frames on each side. The remaining non-recurrent layers do not look at the context and operate on each time step independently. The fourth layer is a bidirectional recurrent layer with two sets of hidden units: a set with forward recurrence and a set with backward recurrence. The fifth, non-recurrent, layer takes both the forward and backward units as inputs and feeds them into a softmax layer. This yields predicted character probabilities for each time slice and character in the alphabet.
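    The layer stack described above can be sketched as a forward pass in NumPy. This is a simplified illustration with random weights, not the trained model: the layer sizes and parameter names are hypothetical, biases are omitted, the clipped ReLU activation follows the Deep Speech paper rather than this page, and concatenating the forward and backward units before the fifth layer is an assumption of the sketch.

```python
import numpy as np


def clipped_relu(x, cap=20.0):
    return np.minimum(np.maximum(x, 0.0), cap)


def stack_context(frames, context):
    """Join each frame with `context` frames on each side (zero-padded)."""
    n_frames = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)))
    return np.stack(
        [padded[t : t + 2 * context + 1].ravel() for t in range(n_frames)]
    )


def init_params(n_features, context, hidden, n_chars, seed=0):
    """Random weights for illustration only."""
    rng = np.random.default_rng(seed)

    def weights(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    return {
        "W1": weights(n_features * (2 * context + 1), hidden),
        "W2": weights(hidden, hidden),
        "W3": weights(hidden, hidden),
        "W4f": weights(hidden, hidden), "U4f": weights(hidden, hidden),
        "W4b": weights(hidden, hidden), "U4b": weights(hidden, hidden),
        "W5": weights(2 * hidden, hidden),
        "W6": weights(hidden, n_chars),
    }


def forward(frames, params, context):
    """Return per-frame character probabilities, shape (frames, n_chars)."""
    # Layers 1-3: non-recurrent; only layer 1 sees neighbouring frames.
    h = clipped_relu(stack_context(frames, context) @ params["W1"])
    h = clipped_relu(h @ params["W2"])
    h3 = clipped_relu(h @ params["W3"])

    # Layer 4: bidirectional recurrence over time.
    n_frames, hidden = h3.shape
    fwd = np.zeros((n_frames, hidden))
    bwd = np.zeros((n_frames, hidden))
    state = np.zeros(hidden)
    for t in range(n_frames):
        state = clipped_relu(h3[t] @ params["W4f"] + state @ params["U4f"])
        fwd[t] = state
    state = np.zeros(hidden)
    for t in reversed(range(n_frames)):
        state = clipped_relu(h3[t] @ params["W4b"] + state @ params["U4b"])
        bwd[t] = state

    # Layer 5: non-recurrent, consumes both directions.
    h5 = clipped_relu(np.concatenate([fwd, bwd], axis=1) @ params["W5"])

    # Softmax over the alphabet for every time step.
    logits = h5 @ params["W6"]
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

    Each row of the returned array is a probability distribution over the alphabet for one time slice, which is the quantity the training loss consumes.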

    The loss function is connectionist temporal classification (CTC) loss, which is commonly used for training RNNs when the alignment between input frames and output characters is unknown. The predicted probabilities are compared with the reference transcription to compute the training loss.
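    To make the loss concrete, here is a minimal sketch of the CTC forward algorithm in plain Python. It sums the probability of every frame-level path that collapses to the target (after merging repeats and removing blanks) and returns the negative log of that sum. The function name and the choice of index 0 for the blank symbol are assumptions; production implementations work in log space to avoid underflow on long utterances.

```python
import math


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    probs:  probs[t][k] is the probability of symbol k at time step t
    target: list of label indices, without blanks
    """
    # Interleave blanks around the labels: [blank, l1, blank, l2, ..., blank].
    extended = [blank]
    for label in target:
        extended += [label, blank]
    n_states = len(extended)

    # alpha[s]: total probability of all path prefixes ending in state s.
    alpha = [0.0] * n_states
    alpha[0] = probs[0][blank]
    if n_states > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]  # stay in the same state
            if s >= 1:
                total += alpha[s - 1]  # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][extended[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)
```

    For example, with two time steps, a two-symbol alphabet (blank and one character), and uniform probabilities, three of the four possible paths collapse to the single-character target, so the likelihood is 0.75.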

    Additionally, dropout was applied at a rate between 5% and 10%, and an ensemble of several RNNs was used, with their outputs averaged for the final result.
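    Both techniques can be sketched in a few lines of plain Python. The helper names are hypothetical, the dropout shown is the common "inverted" variant (kept activations are rescaled during training), and averaging per-frame character probabilities is an assumption about how the ensemble outputs were combined.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def average_ensemble(model_outputs):
    """Average per-frame character probabilities across ensemble members.

    model_outputs[m][t][k] is member m's probability of character k
    at time step t; all members must share the same shape.
    """
    n_models = len(model_outputs)
    n_steps = len(model_outputs[0])
    n_chars = len(model_outputs[0][0])
    return [
        [
            sum(member[t][k] for member in model_outputs) / n_models
            for k in range(n_chars)
        ]
        for t in range(n_steps)
    ]


# Usage: 10% dropout on one layer's activations during training.
rng = random.Random(0)
dropped = dropout([1.0, 2.0, 3.0, 4.0], rate=0.10, rng=rng)
```

    Because each member outputs a probability distribution per time step, the averaged rows still sum to one.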


    The DeepSpeech model was trained on the Fisher, LibriSpeech, and Switchboard corpora. The Fisher dataset contains about 2,000 hours of audio from 23,000 speakers. The Switchboard corpus contains about 300 hours of audio from 4,000 speakers. The LibriSpeech dataset contains about 250 hours of speech from 500 speakers.


    A subset of the LibriSpeech dataset was used for validation and testing. This subset contains about 5 hours of audio for validation and 5 hours for testing. The model has a word error rate of 8.22% on this dataset, which is competitive with existing state-of-the-art speech recognition software.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size   Accepted Format(s)
    input.wav    100M           .wav
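    A client could check a file against these limits before submitting it. The sketch below uses only the Python standard library; the function name is hypothetical, and reading "100M" as 100 MiB is an assumption, since the table does not say whether the unit is decimal or binary.

```python
import os
import wave

MAX_INPUT_BYTES = 100 * 1024 * 1024  # assumption: "100M" means 100 MiB


def validate_input(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if not path.lower().endswith(".wav"):
        problems.append("file extension must be .wav")
    if not os.path.isfile(path):
        problems.append("file does not exist")
        return problems
    if os.path.getsize(path) > MAX_INPUT_BYTES:
        problems.append("file exceeds 100M")
    try:
        # wave.open parses the RIFF/WAVE header and rejects other content.
        with wave.open(path, "rb") as wav:
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError):
        problems.append("file is not a readable WAV file")
    return problems
```

    Note that the standard `wave` module only reads uncompressed PCM data, so a compressed WAV variant would be reported as unreadable by this check even if the service itself accepts it.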


    This model will output the following:

    Filename       Maximum Size   Format
    results.json   1M             .json