Spoken Language Identification

Model by Open Source

This model identifies the language spoken in an audio file. It supports English, German, French, Spanish, Chinese and Russian. It accepts a WAV file as input and returns a JSON file containing the probability of each language. It is an implementation of the model described here. The model can be used to determine the language of customer or constituent service calls, or to detect the language of a spoken message so that it can be transcribed or translated correctly.

    Product Description


    91% F1 Score – The F1 score is the harmonic mean of precision and recall, with a best value of 1. It measures the balance between the two metrics. Further information here.

    91% Precision – A higher precision score indicates that the majority of labels predicted by the model for different classes are accurate. Further information here.

    91% Recall – A higher recall score indicates that the model finds and predicts correct labels for the majority of the classes it is supposed to find. Further information here.

    This model was tested on 10% of the YouTube News Dataset, using precision, recall and the F1 score as its metrics. These are commonly used metrics in classification. In the context of this model, precision denotes the ability of the model not to mislabel results, recall denotes its ability not to miss data that should be labeled, and the F1 score is the harmonic mean of the two, giving an overall sense of model success. This model achieved a precision, recall and F1 score of 91%. There was some variation across languages: English scored the worst with an 88% F1 score and Chinese the best with a 96% F1 score.
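    As a rough illustration of how these three metrics relate, the sketch below computes them for a single language from a list of true and predicted labels. The labels are made up for the example and are not results from this model; the function name is hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Compute precision, recall and F1 for one class (`target`)."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: clips of the target language that were labeled as such.
    tp = sum(1 for t, p in pairs if t == target and p == target)
    # False positives: other languages mislabeled as the target.
    fp = sum(1 for t, p in pairs if t != target and p == target)
    # False negatives: target-language clips the model missed.
    fn = sum(1 for t, p in pairs if t == target and p != target)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
truth = ["en", "en", "de", "fr", "en", "de"]
preds = ["en", "de", "de", "fr", "en", "en"]
p, r, f = precision_recall_f1(truth, preds, "en")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

    An overall score such as the 91% quoted above would typically be an average of the per-language values, although the averaging method used for this model is not stated here.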


    This model combines a Convolutional Neural Network with a Recurrent Neural Network to take advantage of visual features as well as time-based features. The audio is first converted to spectrograms, which are then fed into a convolutional feature extractor. This extractor convolves the input image in several steps and produces a feature map. The feature map is then sliced along the x-axis, and each slice is used as a time step for the subsequent BLSTM network. The design of the convolutional feature extractor is based on the well-known VGG architecture. The network has five convolutional layers, each followed by a ReLU activation, Batch Normalization and a max pooling layer. The BLSTM consists of two single LSTMs whose outputs are concatenated and fed into a fully connected layer that serves as the classifier.
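    The step that connects the convolutional part to the BLSTM, slicing the feature map along the x-axis, can be sketched in plain Python. This is only an illustration: the [channel][row][column] layout, the flattening order and the function name are assumptions, and the toy sizes are not those of the real network.

```python
def slice_into_time_steps(feature_map):
    """Turn feature_map[channel][row][column] into one vector per column.

    Each column (x position) of the spectrogram-derived feature map
    becomes one time step; its vector holds every channel/row value
    found at that column.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    steps = []
    for x in range(width):
        vector = [
            feature_map[c][y][x]
            for c in range(channels)
            for y in range(height)
        ]
        steps.append(vector)
    return steps


# Toy feature map: 2 channels, 3 rows, 4 columns.
# Each value encodes its own position as c*100 + y*10 + x.
toy = [
    [[c * 100 + y * 10 + x for x in range(4)] for y in range(3)]
    for c in range(2)
]
steps = slice_into_time_steps(toy)
print(len(steps), len(steps[0]))
```

    In this toy case the result is 4 time steps of 6 features each, which is the kind of sequence a recurrent layer expects as input.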


    The authors trained on a custom dataset called the YouTube News Dataset. This dataset was gathered from YouTube channels such as the official BBC News YouTube channel. The audio data obtained this way has many desirable properties. The quality of the recordings is very high, and hundreds of hours are available online. News programs often feature guests or remote correspondents, resulting in a good mix of different speakers. Further, news programs contain the kind of noise one would expect in a real-world situation: music jingles, nonspeech audio from video clips and transitions between reports. In total, the authors gathered 1,508 hours of audio data for this dataset. 70 percent of the dataset was used for training.


    This model was tested on 10% of the YouTube News Dataset and was validated on the remaining 10%.


    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.wav    100M            .wav
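    A file can be checked against these specifications before it is submitted. The sketch below uses only the Python standard library; the function name is hypothetical, and reading "100M" as 100 MiB is an assumption, since the exact unit is not stated.

```python
import os
import wave

# Assumption: "100M" in the specification is read as 100 MiB.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def check_input(path):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not path.lower().endswith(".wav"):
        problems.append("file extension is not .wav")
    if os.path.getsize(path) > MAX_INPUT_BYTES:
        problems.append("file is larger than the assumed 100 MiB limit")
    try:
        # wave.open raises wave.Error (or EOFError for very short files)
        # when the content is not a valid WAV file.
        with wave.open(path, "rb") as wav:
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError):
        problems.append("file is not a readable WAV file")
    return problems
```

    Calling check_input("input.wav") on a valid recording returns an empty list. Any other properties the model may require, such as a particular sample rate, are not listed here and are therefore not checked.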


    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1M              .json
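    Once results.json has been produced, the most likely language can be read from it. The sketch below assumes the file is a flat JSON object mapping each language name to its probability. That layout, the key names and the example values are all hypothetical, so the code may need adjusting to the real output.

```python
import json


def top_language(results_text):
    """Return (language, probability) for the most probable language.

    Assumes `results_text` is a JSON object of the form
    {"<language>": <probability>, ...}; the real schema may differ.
    """
    probabilities = json.loads(results_text)
    language = max(probabilities, key=probabilities.get)
    return language, probabilities[language]


# Hypothetical output, for illustration only.
example = (
    '{"English": 0.05, "German": 0.02, "French": 0.85, '
    '"Spanish": 0.04, "Chinese": 0.01, "Russian": 0.03}'
)
language, probability = top_language(example)
print(language, probability)
```

    To use it with an actual output file, read the file contents first, for example with open("results.json").read(), and pass the text to top_language.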