An acoustic fingerprint is a digital summary generated from an audio signal. Given an audio file, this model creates a representative digital signature by detecting unique sound patterns. The corresponding digital embedding can then be used in various ways: identifying a song name from a snippet of music, voice identification for access control, and detecting equipment running in the background of a video.
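As a sketch of the identification use cases above, fingerprints from a known library can be matched against a query embedding. The model card does not specify a similarity measure, so cosine similarity is assumed here purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two fingerprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query, library):
    """Return the library key whose fingerprint is most similar to the query.

    `library` maps names (e.g. song titles or speaker IDs) to fingerprint
    vectors. This nearest-neighbor lookup is a hypothetical example, not
    the model's prescribed matching procedure.
    """
    return max(library, key=lambda name: cosine_similarity(query, library[name]))
```

In practice the same lookup applies whether the library holds songs, enrolled voices, or equipment sound signatures.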
94.6% Top 1 Accuracy – The fraction of input samples for which the classifier's highest-ranked prediction matches the true class.
97.3% Top 5 Accuracy – The fraction of input samples for which the true class appears among the classifier's five highest-ranked predictions.
To evaluate the performance of this unsupervised model, the generated audio fingerprints were tested with a clustering algorithm for speaker classification. The model achieves a Top 1 accuracy of 0.9458 and a Top 5 accuracy of 0.9725 on the TIMIT corpus test set.
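The exact clustering procedure is not described in the model card; one common way to score fingerprints for speaker classification, assumed here as an illustration, is nearest-centroid ranking, where each test fingerprint is compared against per-speaker centroids and counted correct if the true speaker appears in the k closest:

```python
import numpy as np

def top_k_accuracy(test_vecs, test_labels, centroids, centroid_labels, k):
    """Fraction of test fingerprints whose true speaker is among the
    k nearest speaker centroids (Euclidean distance).

    This nearest-centroid scheme is an illustrative stand-in for the
    unspecified clustering algorithm used in the evaluation.
    """
    correct = 0
    for vec, label in zip(test_vecs, test_labels):
        dists = np.linalg.norm(centroids - vec, axis=1)
        nearest = [centroid_labels[i] for i in np.argsort(dists)[:k]]
        correct += label in nearest
    return correct / len(test_labels)
```

Top 1 and Top 5 accuracy then correspond to calling this with k=1 and k=5.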
This model uses the SincNet architecture as its base for creating digital signatures, i.e. audio fingerprints, of audio files. SincNet is a Convolutional Neural Network (CNN) that constrains the filters in its first convolutional layer to parametrized sinc functions, so the network learns only each band-pass filter's low and high cutoff frequencies. This makes the first layer far more parameter-efficient to train than a standard CNN's. While SincNet was originally designed to return speaker class probabilities, this model removes the softmax output layer, yielding a condensed fingerprint of the audio file.
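The band-pass filters described above can be sketched as the difference of two windowed sinc low-pass filters, with only the two cutoff frequencies as learnable parameters. The filter length, 16 kHz sample rate, and Hamming window below are illustrative assumptions, not the model's exact configuration:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, length=251, sample_rate=16000):
    """Band-pass FIR filter built SincNet-style: the difference of two
    ideal low-pass sinc filters, tapered by a Hamming window.

    Only f_low and f_high (in Hz) would be learned during training;
    everything else is fixed.
    """
    n = np.arange(-(length // 2), length // 2 + 1)

    def lowpass(fc):
        f = fc / sample_rate              # normalized cutoff (cycles/sample)
        return 2 * f * np.sinc(2 * f * n)  # ideal low-pass impulse response

    h = lowpass(f_high) - lowpass(f_low)   # pass band between the cutoffs
    return h * np.hamming(length)          # window to reduce spectral ripple
```

Stacking many such filters, each with its own learned cutoff pair, forms the first convolutional layer of a SincNet-style network.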
This model was trained on the DARPA TIMIT dataset, a speech corpus designed to provide data for the development and evaluation of automatic speech recognition systems and other audio-specific deep learning systems. TIMIT contains 6,300 sentences in total: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. This model was trained for 360 epochs on a subset of TIMIT containing 462 speakers.
This model was validated using a clustering-based approach on a mixed dataset consisting of 10 speech samples from 15 unique speakers and 40 sound samples covering 12 unique environmental sound classes.
The input(s) to this model must adhere to the following specifications:
This model will output the following:
The output JSON file will hold a 2,048-dimensional vector embedding for the given audio input.
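Downstream code would typically parse this file into an array before matching. The key name `"embedding"` below is a hypothetical layout, since the model card does not document the JSON schema:

```python
import json
import numpy as np

def load_fingerprint(path):
    """Read the model's JSON output and return the 2,048-dim embedding.

    Assumes the vector is stored under a top-level "embedding" key;
    adjust the key to match the actual output schema.
    """
    with open(path) as f:
        data = json.load(f)
    vec = np.asarray(data["embedding"], dtype=np.float32)
    assert vec.shape == (2048,), "expected a 2,048-dimensional fingerprint"
    return vec
```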