Audio Fingerprinting

Model by Modzy

An acoustic fingerprint is a digital summary generated from an audio signal. Given an audio file, this model creates a representative digital signature by detecting unique sound patterns. The corresponding digital embedding can then be used in various ways: identifying a song name from a snippet of music, voice identification for access control, and detecting equipment running in the background of a video.

  • Description

    Product Description


    94.6% Top-1 Accuracy – The fraction of samples for which the single highest-scoring predicted class matches the true class.

    97.3% Top-5 Accuracy – The fraction of samples for which the true class appears among the five highest-scoring predicted classes.

    To evaluate the performance of this unsupervised model, the generated audio fingerprints were fed to a clustering algorithm for speaker classification. The model achieves a Top-1 accuracy of 0.9458 and a Top-5 accuracy of 0.9725 on the TIMIT corpus test set.
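The Top-1 and Top-5 metrics above can be computed the same way for any classifier; a minimal sketch with illustrative scores and labels (not the actual TIMIT evaluation data):

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)   # is the true label among them?
    return float(hits.mean())

# Three samples, three classes; scores and labels are made up for illustration.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 1, 2])
print(top_k_accuracy(scores, labels, 1))  # two of three correct at top-1
print(top_k_accuracy(scores, labels, 2))  # all three correct within top-2
```

Setting k=5 over the speaker-class scores yields the Top-5 figure reported above.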


    This model uses the SincNet architecture as its base for creating digital signatures, i.e. audio fingerprints, of audio files. SincNet is a Convolutional Neural Network (CNN) that constrains the filters in its first convolutional layer to be parametrized sinc functions, so the network learns only each band-pass filter's low and high cutoff frequencies rather than every filter coefficient, a more efficient scheme than the first layer of a standard CNN. While SincNet was originally designed to return speaker class probabilities, this model removes the SoftMax output layer, yielding a condensed fingerprint of the audio file.


    This model was trained on the DARPA TIMIT dataset – a speech corpus designed to provide speech data for the development and evaluation of automatic speech recognition systems and other audio-specific deep learning systems. TIMIT contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. This model was trained for 360 epochs on a subset of TIMIT containing 462 speakers.


    This model was validated using a clustering-based approach on a mixed dataset consisting of 10 speech samples produced by 15 unique speakers, as well as 40 sound samples covering 12 unique environmental sounds.
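A hedged sketch of this clustering-based validation: fingerprints are grouped with k-means and each cluster is scored against the known speaker labels. The embeddings below are synthetic stand-ins (random vectors around per-speaker centers), not the real fingerprints, and the purity metric is one common way to score such clusterings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_speakers, per_speaker = 15, 10

# Stand-in fingerprints: noisy copies of one random center per speaker
# (the real fingerprints are 2,048-dimensional model outputs).
centers = rng.normal(size=(n_speakers, 2048))
embeddings = (np.repeat(centers, per_speaker, axis=0)
              + 0.05 * rng.normal(size=(n_speakers * per_speaker, 2048)))
speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)

clusters = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(embeddings)

# Purity: assign each cluster its majority speaker and count the matches.
purity = sum(np.bincount(speaker_ids[clusters == c]).max()
             for c in range(n_speakers)) / len(speaker_ids)
print(f"cluster purity: {purity:.2f}")
```

High purity means samples from the same speaker consistently land in the same cluster, i.e. the fingerprints separate speakers well.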


    The input(s) to this model must adhere to the following specifications:

    Filename    Maximum Size    Accepted Format(s)
    input       1 MB            .wav, .ogg, .flac



    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1 MB            .json

    The output JSON file will hold a 2,048-dimensional vector embedding for the given audio input.
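An illustrative way to consume that output: load two fingerprints and compare them by cosine similarity. The JSON key name ("embedding") and file layout are assumptions for this sketch; inspect the actual results.json your deployment returns before relying on them.

```python
import json
import numpy as np

def load_fingerprint(path: str) -> np.ndarray:
    """Read a fingerprint vector from a results.json file ("embedding" key is hypothetical)."""
    with open(path) as f:
        return np.asarray(json.load(f)["embedding"], dtype=np.float64)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 for identical direction, near 0 for unrelated fingerprints."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Demo with synthetic 2,048-dimensional vectors in place of real fingerprints.
rng = np.random.default_rng(1)
a, b = rng.normal(size=2048), rng.normal(size=2048)
print(cosine_similarity(a, a))  # identical fingerprints score 1.0
print(cosine_similarity(a, b))  # unrelated random vectors score near 0
```

Thresholding this similarity is the usual basis for the use cases listed above, such as matching a music snippet against a catalog of known fingerprints.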