Audio Fingerprinting

Model by Modzy

An acoustic fingerprint is a digital summary generated from an audio signal. Given an audio file, this model creates a representative digital signature by detecting unique sound patterns. The corresponding digital embedding can then be used in various ways: identifying a song name from a snippet of music, voice identification for access control, and detecting equipment running in the background of a video.

  • Description

    Product Description


    94.6% Top-1 Accuracy – The fraction of samples for which the single highest-scoring predicted class matches the true class.

    97.3% Top-5 Accuracy – The fraction of samples for which the true class appears among the five highest-scoring predicted classes.

    To evaluate the performance of this unsupervised model, the generated audio fingerprints were fed to a clustering algorithm for speaker classification. The model achieves a Top-1 accuracy of 0.9458 and a Top-5 accuracy of 0.9725 on the TIMIT corpus test set.
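The Top-1 and Top-5 metrics above can be computed the same way for any classifier; a minimal sketch with illustrative scores and labels (not the actual TIMIT evaluation data):

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)   # is the true label among them?
    return float(hits.mean())

# Three samples, three classes; scores and labels are made up for illustration.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 1, 2])
print(top_k_accuracy(scores, labels, 1))  # two of three correct at top-1
print(top_k_accuracy(scores, labels, 2))  # all three correct within top-2
```

Setting k=5 over the speaker-class scores yields the Top-5 figure reported above.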


    This model uses the SincNet architecture as its base for creating digital signatures, i.e. audio fingerprints, of audio files. SincNet is a Convolutional Neural Network (CNN) that constrains the filters in its first convolutional layer to be parametrized sinc functions, so the network learns only each band-pass filter's low and high cutoff frequencies rather than every filter coefficient, a more efficient scheme than the first layer of a standard CNN. While SincNet was originally designed to return speaker class probabilities, this model removes the SoftMax output layer, yielding a condensed fingerprint of the audio file.


    This model was trained on the DARPA TIMIT dataset – a speech corpus designed to provide speech data for the development and evaluation of automatic speech recognition systems and other audio-specific deep learning systems. TIMIT contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. This model was trained for 360 epochs on a subset of TIMIT containing 462 speakers.


    This model was validated using a clustering-based approach on a mixed dataset consisting of 10 speech samples produced by 15 unique speakers, as well as 40 sound samples covering 12 unique environmental sounds.
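A hedged sketch of this clustering-based validation: fingerprints are grouped with k-means and each cluster is scored against the known speaker labels. The embeddings below are synthetic stand-ins (random vectors around per-speaker centers), not the real fingerprints, and the purity metric is one common way to score such clusterings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_speakers, per_speaker = 15, 10

# Stand-in fingerprints: noisy copies of one random center per speaker
# (the real fingerprints are 2,048-dimensional model outputs).
centers = rng.normal(size=(n_speakers, 2048))
embeddings = (np.repeat(centers, per_speaker, axis=0)
              + 0.05 * rng.normal(size=(n_speakers * per_speaker, 2048)))
speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)

clusters = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(embeddings)

# Purity: assign each cluster its majority speaker and count the matches.
purity = sum(np.bincount(speaker_ids[clusters == c]).max()
             for c in range(n_speakers)) / len(speaker_ids)
print(f"cluster purity: {purity:.2f}")
```

High purity means samples from the same speaker consistently land in the same cluster, i.e. the fingerprints separate speakers well.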


    The input(s) to this model must adhere to the following specifications:

    Filename    Maximum Size    Accepted Format(s)
    input       1 MB            .wav, .ogg, .flac



    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1 MB            .json

    The output JSON file will hold a 2,048-dimensional vector embedding for the given audio input.
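An illustrative way to consume that output: load two fingerprints and compare them by cosine similarity. The JSON key name ("embedding") and file layout are assumptions for this sketch; inspect the actual results.json your deployment returns before relying on them.

```python
import json
import numpy as np

def load_fingerprint(path: str) -> np.ndarray:
    """Read a fingerprint vector from a results.json file ("embedding" key is hypothetical)."""
    with open(path) as f:
        return np.asarray(json.load(f)["embedding"], dtype=np.float64)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 for identical direction, near 0 for unrelated fingerprints."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Demo with synthetic 2,048-dimensional vectors in place of real fingerprints.
rng = np.random.default_rng(1)
a, b = rng.normal(size=2048), rng.normal(size=2048)
print(cosine_similarity(a, a))  # identical fingerprints score 1.0
print(cosine_similarity(a, b))  # unrelated random vectors score near 0
```

Thresholding this similarity is the usual basis for the use cases listed above, such as matching a music snippet against a catalog of known fingerprints.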