This model computes the probability that a string of UTF-8 text belongs to each of 235 languages, using character-sequence patterns learned for each language. It accepts a string of text as input and outputs the predicted language along with the probability that the prediction is correct. The model can be used to determine the language of text from news, social media, and other sources, so that the text can be routed to the appropriate audience or translated correctly.
Explainable: This model has a built-in explainability feature.
95% Average F1 Score
95% Average Precision
95% Average Recall
This model was tested on a subset of the Wikipedia Language Identification Database (WiLI) and achieved an average recall of 0.95, an average precision of 0.95, and an average F1 score of 0.95. During evaluation, the model’s performance improved as the length of the input text increased.
Average F1 score is the harmonic mean of the average precision and average recall, with a best value of 1. It measures the balance between the two metrics.
A higher precision score indicates that on average the majority of labels predicted by the model for different classes are accurate.
A higher recall score indicates that on average the model finds and predicts correct labels for the majority of the classes it is supposed to find.
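As an illustration of the three metrics described above, the following is a minimal Python sketch, not the evaluation code used for this model. The function name `macro_scores` and the sample labels are hypothetical. It follows the definition given here, where F1 is the harmonic mean of the averaged precision and averaged recall; some libraries instead average the per-class F1 values, which can give a slightly different number.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (avg_precision, avg_recall, f1) averaged over the labels in y_true."""
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    precisions, recalls = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precisions.append(tp[label] / predicted if predicted else 0.0)
        recalls.append(tp[label] / actual if actual else 0.0)

    avg_p = sum(precisions) / len(labels)
    avg_r = sum(recalls) / len(labels)
    # Harmonic mean of the averaged precision and recall.
    f1 = 2 * avg_p * avg_r / (avg_p + avg_r) if (avg_p + avg_r) else 0.0
    return avg_p, avg_r, f1


# Hypothetical labels, for illustration only.
y_true = ["en", "en", "fr", "fr", "de", "de"]
y_pred = ["en", "fr", "fr", "fr", "de", "en"]
avg_p, avg_r, f1 = macro_scores(y_true, y_pred)
print(f"precision={avg_p:.3f} recall={avg_r:.3f} f1={f1:.3f}")
```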
This model was trained on the Wikipedia Language Identification Database. It combines language models with machine learning algorithms to uncover the most discriminative morphological (word structure) features for probabilistic language classification. The language classifier is implemented in Scala and runs on Spark for high-performance parallel computing on distributed clusters.
The algorithm uses character n-gram features as input to a set of 235 L2-regularized logistic regression models, one per language, combined in a one-versus-all manner. The output is the most probable languages for a piece of text.
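The scoring step described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the model's Scala/Spark implementation: the helper names (`char_ngrams`, `score_one_vs_all`), the n-gram range of 1 to 3, the normalization by total n-gram count, and the hand-set weights in `TOY_MODELS` are all hypothetical. In the real model the weights would be learned by L2-regularized training, which is not shown here.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_one_vs_all(text, models):
    """Score text with one binary logistic model per language.

    models maps a language code to (weights, bias), where weights maps
    an n-gram to its coefficient. Returns (language, probability) pairs
    sorted from most to least probable. Each probability comes from an
    independent binary model, so the values need not sum to 1.
    """
    feats = char_ngrams(text.lower())
    total = sum(feats.values()) or 1  # avoid division by zero on empty input
    scores = {}
    for lang, (weights, bias) in models.items():
        z = bias + sum(weights.get(g, 0.0) * c / total for g, c in feats.items())
        scores[lang] = sigmoid(z)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical hand-set weights for two languages, for illustration only.
TOY_MODELS = {
    "eng": ({"th": 4.0, "he": 2.0, "ing": 3.0}, -0.5),
    "deu": ({"ch": 4.0, "ei": 2.0, "sch": 3.0}, -0.5),
}

for lang, prob in score_one_vs_all("the thing", TOY_MODELS):
    print(lang, round(prob, 3))
```

With 235 languages, the same loop would run over 235 weight vectors, and the top-ranked entries would be reported as the most probable languages.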
This model was trained on the Wikipedia Language Identification Database (WiLI), which consists of 235,000 paragraphs in 235 languages.
The performance of the model was tested on a validation subset of the Wikipedia Language Identification Database.
The input(s) to this model must adhere to the following specifications:
This model will output the following: