This model computes the probability that a string of UTF-8 text belongs to each of 235 languages, using character-sequence patterns learned for each language. It accepts a string of text as input and outputs the predicted language along with the probability that the prediction is correct. This model can be used to determine the language of text from news, social media, and other sources, so that the text can be routed to the appropriate audience or translated correctly.
Explainable: This model has a built-in explainability feature.
95% Average F1 Score
95% Average Precision
95% Average Recall
This model was tested on a subset of the Wikipedia Language Identification Dataset and achieved an average recall of 0.95, an average precision of 0.95, and an average F1 score of 0.95. During evaluation, the model's performance was observed to improve as the length of the input text increased.
Average F1 Score is the harmonic mean of average precision and average recall, with a best value of 1; it measures the balance between the two metrics.
A higher precision score indicates that, on average, most of the labels the model predicts across classes are correct.
A higher recall score indicates that, on average, the model finds most of the correct labels for each class it is supposed to identify.
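The relationship between these three metrics can be sketched in a few lines of Python. The per-class scores below are made-up values for illustration only; they are not the model's actual per-language results:

```python
# Illustrative computation of macro-averaged precision, recall, and F1
# for a multi-class classifier. The per-class numbers are hypothetical,
# not the model's real per-language scores.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; best value is 1."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores for three of the 235 languages.
per_class = {
    "eng": {"precision": 0.97, "recall": 0.96},
    "deu": {"precision": 0.95, "recall": 0.94},
    "fra": {"precision": 0.93, "recall": 0.95},
}

avg_p = sum(c["precision"] for c in per_class.values()) / len(per_class)
avg_r = sum(c["recall"] for c in per_class.values()) / len(per_class)
avg_f1 = f1(avg_p, avg_r)
print(round(avg_p, 3), round(avg_r, 3), round(avg_f1, 3))
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two metrics, which is why it is a useful single-number summary of the precision/recall balance.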
This model was trained on the Wikipedia Language Identification Database. It combines language modeling with machine learning to uncover the most discriminative morphological (word-structure) features for probabilistic language classification. The language classifier is implemented in Scala on a modern Spark engine for high-performance parallel computing on distributed clusters.
The algorithm uses character n-gram features as input to 235 L2-regularized logistic regression models, combined in a one-versus-all scheme, to identify the most probable languages for a given piece of text.
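The technique described above can be sketched on a toy scale. The following is NOT the model's actual Scala/Spark implementation; it is a minimal Python illustration using scikit-learn, with three languages and a handful of invented training sentences standing in for the 235-language WiLI corpus:

```python
# Toy sketch: character n-gram features feeding one-vs-rest
# L2-regularized logistic regression, as the model card describes.
# Languages, sentences, and n-gram range are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "this is a sentence written in english",
    "der schnelle braune fuchs springt ueber den faulen hund",
    "dies ist ein satz in deutscher sprache",
    "le renard brun rapide saute par dessus le chien paresseux",
    "ceci est une phrase ecrite en francais",
]
train_labels = ["eng", "eng", "deu", "deu", "fra", "fra"]

clf = make_pipeline(
    # Character 1- to 3-grams as features (range chosen for illustration).
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    # One binary L2-regularized logistic regression model per language.
    OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000)),
)
clf.fit(train_texts, train_labels)

# Probabilities over all languages for a new piece of text.
probs = clf.predict_proba(["ein kurzer deutscher beispieltext"])[0]
for label, p in sorted(zip(clf.classes_, probs), key=lambda t: -t[1]):
    print(label, round(p, 3))
```

In the one-versus-all scheme, each language gets its own binary classifier ("this language" vs. "everything else"), and the per-classifier scores are normalized into a probability distribution over all languages.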
This model was trained on the Wikipedia Language Identification Database (WiLI), which consists of 235,000 paragraphs in 235 languages.
The performance of the model was tested on a validation subset of the Wikipedia Language Identification Database.
The input(s) to this model must adhere to the following specifications:
This model will output the following: