Voice Re-Identification

Model by Modzy

This model compares a given voice of interest (the query) to a set of known voices (the gallery) and returns the probability of a match. This capability is useful in a variety of fields, including biometric authentication, forensics, and electronic security.

    Product Description


    This model was validated on roughly 10,000 speaker pairs (about 5,000 similar and 5,000 dissimilar) from the LibriSpeech corpus. With a similarity threshold of 0.5, the model achieves 0.951 accuracy, 0.963 precision, 0.938 recall, and an F1 score of 0.950.
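These metrics follow the standard confusion-matrix definitions once similarity scores are thresholded into match / no-match decisions. As a minimal illustration with toy scores and labels (not the model's actual validation data):

```python
def binary_metrics(scores, labels, threshold=0.5):
    """Threshold similarity scores into match decisions and compute
    accuracy, precision, recall, and F1 against the true labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example: 3 similar (label 1) and 3 dissimilar (label 0) pairs.
acc, prec, rec, f1 = binary_metrics([0.9, 0.8, 0.3, 0.6, 0.4, 0.2],
                                    [1, 1, 1, 0, 0, 0])
```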


    This model was designed with a two-part architecture: a Siamese network (the branch component) and a comparison module (the head component). The branch acts as the feature extractor for both the query and gallery inputs, and the head compares the two resulting feature vectors by computing their similarity score. The branch uses a modified version of the SincNet architecture; SincNet is based on parametrized sinc functions and learns band-pass filters for the audio inputs during training. The head then computes similarity metrics for the two feature vectors and processes them with a small neural network to produce a final similarity score.
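The branch/head data flow can be sketched in plain Python. This is only an illustration of the Siamese layout, not the real implementation: the actual branch is a SincNet feature extractor, and the actual head feeds several similarity metrics through a small neural network; here the branch is a stub and the head is cosine similarity squashed through a logistic function.

```python
import math

def branch(waveform):
    # Stand-in for the SincNet feature extractor. The real branch learns
    # band-pass filters; here we pretend the input is already an embedding.
    return waveform

def head(feat_q, feat_g):
    # Stand-in for the comparison module: cosine similarity passed
    # through a logistic function to yield a score in (0, 1).
    dot = sum(a * b for a, b in zip(feat_q, feat_g))
    norm_q = math.sqrt(sum(a * a for a in feat_q))
    norm_g = math.sqrt(sum(b * b for b in feat_g))
    cosine = dot / (norm_q * norm_g)
    return 1.0 / (1.0 + math.exp(-cosine))

def compare(query, gallery_item):
    # Siamese layout: the SAME branch embeds both inputs, then the head
    # scores the pair of feature vectors.
    return head(branch(query), branch(gallery_item))
```

The key property of the Siamese design is weight sharing: because one branch embeds both inputs, query and gallery features live in the same space and can be compared directly.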


    The two components of this model were trained separately. The feature extractor, or branch model, was trained on the DARPA TIMIT dataset, a corpus of read speech designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.

    The branch model was trained for 360 epochs on a subset of TIMIT containing 462 speakers. The head model was trained on a manually created dataset of approximately 17,000 similar and 17,000 dissimilar pairs taken from the LibriSpeech corpus.


    This model was validated on roughly 10,000 speaker pairs (about 5,000 similar and 5,000 dissimilar) from the LibriSpeech corpus using a similarity threshold of 0.5.


    The input(s) to this model must adhere to the following specifications:

    Filename   Maximum Size   Accepted Format(s)
    data.zip   50M            .zip

    The zip input file must contain both query and gallery inputs, where a query input is defined as the audio file of interest to be compared to a set of gallery audio files from an existing database. The model will only process audio samples within the zip file that contain “query” or “gallery” in their respective filepaths. For example, the filepaths “/data/query.wav”, “/query/audio1.wav”, “/data/test/gallery/audio.wav” would all be processed, but the filepath “/data/q/audio.wav” would not.
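That substring rule can be sketched as a small filter over archive member paths. One detail is an assumption here: the description does not say how a path containing both substrings is resolved, so this sketch simply checks "query" first.

```python
def route(path):
    """Classify an archive member by the substring rule described above:
    only paths containing "query" or "gallery" are processed."""
    if "query" in path:
        return "query"
    if "gallery" in path:
        return "gallery"
    return None  # skipped by the model

# The example paths from the specification:
assert route("/data/query.wav") == "query"
assert route("/query/audio1.wav") == "query"
assert route("/data/test/gallery/audio.wav") == "gallery"
assert route("/data/q/audio.wav") is None
```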


    This model will output the following:

    Filename       Maximum Size   Format
    results.json   10M            .json

    The output JSON file contains a list of similarity scores between the query input(s) and gallery input(s).
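A consumer of the output would typically parse the JSON and apply the 0.5 similarity threshold used in validation to call matches. The exact field names below (`query`, `gallery`, `similarity`) are hypothetical, since the schema is not spelled out here; adjust them to the actual output.

```python
import json

# Hypothetical example of the output structure: one score per
# query/gallery pair. The real field names may differ.
raw = """
[
  {"query": "query.wav", "gallery": "g1.wav", "similarity": 0.91},
  {"query": "query.wav", "gallery": "g2.wav", "similarity": 0.27}
]
"""

results = json.loads(raw)
# Keep only pairs at or above the 0.5 threshold used in validation.
matches = [r for r in results if r["similarity"] >= 0.5]
```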