Language Identification

Model by Modzy

This model computes the probabilities that a string of UTF-8 text belongs to each of 235 languages, using character-sequence patterns learned for each language. It accepts a string of text as input and outputs the predicted language along with the probability that the prediction is correct. The model can be used to determine the language of text from news, social media, and other sources so that content can be routed to the appropriate audience or translated correctly.

  • Description

    Product Description

    PERFORMANCE METRICS:

    Explainable: This model has a built-in explainability feature.

    95% Average F1 Score

    95% Average Precision

    95% Average Recall

    This model was tested on a subset of the Wikipedia Language Identification Dataset and achieved an average recall of 0.95, an average precision of 0.95, and an average F1 score of 0.95. During evaluation, the model's performance was observed to improve as the length of the input text increased.

    Average F1 score is the harmonic mean of the average precision and average recall, with a best value of 1. It measures the balance between the two metrics.
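    With both the average precision and the average recall reported at 0.95, the F1 figure above follows directly from the harmonic-mean formula:

    \[
    F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
        = \frac{2 \cdot 0.95 \cdot 0.95}{0.95 + 0.95}
        = 0.95
    \]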

    A higher precision score indicates that, averaged across classes, most of the labels the model predicts are correct.

    A higher recall score indicates that, averaged across classes, the model finds and correctly labels most of the instances it is supposed to find.

    OVERVIEW:

    This model was trained on the Wikipedia Language Identification Dataset. It leverages language models in conjunction with machine learning algorithms to uncover the most discriminative morphological (word-structure) features for probabilistic language classification. The language classifier is implemented in Scala on Apache Spark for high-performance parallel computing on distributed clusters.

    The algorithm uses character n-gram features as input to a set of 235 L2-regularized logistic regression models, which are combined in a one-versus-all manner to output the most probable languages for a piece of text.
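    The pipeline described above maps naturally onto standard Spark ML components. The Scala sketch below illustrates that shape on a toy two-language dataset; it is not the production implementation, and the column names, n-gram order (3), regularization strength (0.1), and toy data are assumptions made purely for the example.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array_remove, col, split}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
    import org.apache.spark.ml.feature.{CountVectorizer, NGram, StringIndexer}

    object LangIdSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("langid-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Toy stand-in for the WiLI training split: (paragraph, language label).
        val train = Seq(
          ("the quick brown fox jumps over the lazy dog", "eng"),
          ("der schnelle braune fuchs springt ueber den faulen hund", "deu")
        ).toDF("text", "lang")
          // Split each paragraph into single characters (dropping the empty trailing
          // element) so that NGram emits character n-grams rather than word n-grams.
          .withColumn("chars", array_remove(split(col("text"), ""), ""))

        val pipeline = new Pipeline().setStages(Array(
          new NGram().setN(3).setInputCol("chars").setOutputCol("ngrams"),       // n = 3 chosen for illustration
          new CountVectorizer().setInputCol("ngrams").setOutputCol("features"),  // n-gram counts as features
          new StringIndexer().setInputCol("lang").setOutputCol("label"),
          // Pure L2 penalty (elasticNetParam = 0); OneVsRest fits one binary model per language.
          new OneVsRest().setClassifier(
            new LogisticRegression().setRegParam(0.1).setElasticNetParam(0.0))
        ))

        val model = pipeline.fit(train)
        model.transform(train).select("text", "prediction").show(false)
        spark.stop()
      }
    }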

    TRAINING:

    This model was trained on the Wikipedia Language Identification Dataset (WiLI), which consists of 235,000 paragraphs in 235 languages.

    VALIDATION:

    The performance of the model was tested on a validation subset of the Wikipedia Language Identification Dataset.

    INPUT SPECIFICATION:

    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.txt    1 MB            .txt

    OUTPUT DETAILS:

    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1 MB            .json
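    The exact schema of results.json is not specified here; as a purely hypothetical illustration, a result carrying the predicted language and its probability (as described in the product description) might look like:

    {
      "language": "eng",
      "probability": 0.98
    }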