Audio Keyword Spotting

Model by Modzy

This model searches for a given set of keywords within speech in an audio file, and if they are found, returns the timestamps at which they occur. This capability can be especially useful for identifying sections and entities of interest within audio and video recordings.

    Product Description


    7.5% Word Error Rate

    This model achieves a 7.5% word error rate on the clean test set of LibriSpeech, a corpus of approximately 1,000 hours of English speech derived from read audiobooks.

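    Word error rate (WER) measures the performance of a speech recognition or machine translation system at the word level. It is derived from the Levenshtein distance: the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference.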
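
    To make the metric concrete, here is a minimal sketch of a word-level Levenshtein computation. The function name word_error_rate and the plain whitespace tokenization are assumptions made for illustration; the evaluation behind the 7.5% figure may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokens are produced by splitting on whitespace,
    which is an assumption, not the model's documented evaluation procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

    For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on a mat" gives one substitution over six reference words, a WER of about 16.7%.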


    This model uses DeepSpeech, Mozilla's open-source speech-to-text engine, which is based on the Deep Speech algorithm. It uses a recurrent neural network architecture with 5 hidden layers, each containing 2,048 neurons.


    This model was trained on the combined Fisher, LibriSpeech, Switchboard, and Common Voice English datasets, in addition to approximately 1,700 hours of transcribed WAMU (NPR) radio shows. It was trained for 75 epochs using a learning rate of 0.0001 and a batch size of 128. After training, the weights with the best validation loss were selected. This model was trained using Quadro RTX 6000 GPUs.


    This model was validated on the clean test set of LibriSpeech, a corpus of approximately 1,000 hours of 16 kHz English speech derived from read audiobooks from the LibriVox project.


    The input(s) to this model must adhere to the following specifications:

    Filename Maximum Size Accepted Format(s)

    The “word.txt” file should contain the words to be spotted in the audio, one word per line; matching is case-insensitive. The “input.wav” file contains the audio to be searched for occurrences of the words listed in “word.txt”.
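
    As a rough illustration of preparing the two inputs, the sketch below writes a keyword list in the one-word-per-line layout and reads basic properties of a WAV file with Python's standard library. The helper names write_keyword_file and describe_wav are hypothetical, and lowercasing the keywords is only a convenience that relies on matching being case-insensitive. No sample rate or size limit is assumed here, because none is stated above.

```python
import wave
from pathlib import Path


def write_keyword_file(keywords, path="word.txt"):
    """Write one keyword per line, skipping blanks and case-insensitive repeats.

    Hypothetical helper: the file layout follows the description above,
    everything else is an illustrative choice.
    """
    seen = set()
    lines = []
    for keyword in keywords:
        word = keyword.strip()
        if not word:
            continue
        if any(ch.isspace() for ch in word):
            # The specification asks for a single word per line.
            raise ValueError(f"expected a single word, got {keyword!r}")
        key = word.lower()
        if key not in seen:
            seen.add(key)
            lines.append(key)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def describe_wav(path="input.wav"):
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": wav.getnframes() / rate,
        }
```

    Calling describe_wav before submitting a job is a cheap way to catch a file that is not a readable PCM WAV, since wave.open raises wave.Error in that case.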


    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1M              .json

    The “results.json” file will contain the detected word occurrences and their timestamps in the following JSON format:

    [{"word": "keyword1", "start_time": startTime, "duration": duration}, {"word": "keyword2", "start_time": startTime, "duration": duration}]
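
    One way to consume this output is to group the detections by keyword. The sketch below is illustrative only: the function name group_occurrences is hypothetical, the time unit is whatever the model reports (it is not stated above), and stripping whitespace from the key names is a defensive choice in case a key carries a stray space.

```python
import json
from collections import defaultdict


def group_occurrences(results_json: str):
    """Map each detected keyword to a sorted list of (start, end) pairs.

    Assumes the list-of-objects layout described above, with the fields
    "word", "start_time", and "duration".
    """
    grouped = defaultdict(list)
    for entry in json.loads(results_json):
        # Normalize key names so "start_time " and "start_time" both work.
        clean = {key.strip(): value for key, value in entry.items()}
        start = float(clean["start_time"])
        end = start + float(clean["duration"])
        grouped[clean["word"]].append((start, end))
    return {word: sorted(spans) for word, spans in grouped.items()}
```

    With the contents of “results.json” read into a string, group_occurrences returns a dictionary such as {"keyword1": [(start, end), ...]}, which is convenient for jumping to each occurrence in the recording.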