Named Entity Recognition, Arabic

Model by Modzy

This model identifies the following entity types in Arabic text: Person, Organization, Location, and Miscellaneous (“Other”). It accepts UTF-8 encoded Arabic text up to 200,000 characters and returns a JSON file listing each identified entity together with its entity type label.
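
The exact JSON schema is not documented here, so purely as a hypothetical illustration of “a list of identified entities with their type labels,” a caller might consume the results like this (the field names are assumptions):

```python
import json

# Hypothetical results.json content; the real schema may differ.
payload = '[{"text": "دمشق", "type": "Location"}, {"text": "أحمد", "type": "Person"}]'

for entity in json.loads(payload):
    print(entity["text"], "->", entity["type"])
```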

This model can be used to automatically categorize news articles, speed up search, improve content recommendations, or enrich customer-feedback analysis, all in the Arabic language.

  • Description

    Product Description

    PERFORMANCE METRICS:

    75% F1 Score, 81% Precision, and 71% Recall

    This model was trained and validated on a hand-annotated Arabic Wikipedia dataset, described under TRAINING and VALIDATION below, on which it obtains a precision of 81%, a recall of 71%, and an F1 score of 75%. The model’s strengths include its high precision (95.19%) in detecting person names and its ability to capture the grammar and semantic relationships between words in a piece of text. Its accuracy is lower when detecting miscellaneous entities.

    F1 is the harmonic mean of precision and recall, with a best value of 1; it measures the balance between the two metrics.
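
    Plugging in the headline figures above (P = 0.81, R = 0.71):

    $$\mathrm{F1} = \frac{2PR}{P + R} = \frac{2 \times 0.81 \times 0.71}{0.81 + 0.71} \approx 0.757$$

    which agrees with the reported F1 of roughly 75% (the precision and recall figures are themselves rounded).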

    A higher precision score indicates that most of the labels the model predicts are correct.

    A higher recall score indicates that the model finds and correctly labels most of the entities it is supposed to find.

    OVERVIEW:

    The model was built with a heterogeneous framework that incorporates several techniques for Named Entity Recognition (NER):

    1. Representation learning and sequence labeling as the basic neural model.
    2. Ensemble learning to combine the output of different NER models.
    3. A dictionary-based string-matching model.

    The neural model assumes there are “n” different representation modules, and the outputs of these modules are concatenated to form the final token representation. Building on this token representation of the input sequence, the model extracts entities with an LSTM-CRF architecture, in which a linear-chain CRF models the whole label sequence jointly rather than scoring each token independently.
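
    As an illustration only, not Modzy’s released implementation, the sketch below shows the shape of that neural component in PyTorch: two hypothetical representation modules whose outputs are concatenated per token and fed to a bidirectional LSTM that produces per-tag emission scores. All names and sizes are assumptions, and the linear-chain CRF is omitted; it would score whole tag sequences in place of the per-token argmax shown here.

    ```python
    import torch
    import torch.nn as nn

    class NerTagger(nn.Module):
        """Minimal sketch: n = 2 representation modules, concatenated per
        token, then a BiLSTM. All sizes and the tag set are illustrative."""

        def __init__(self, vocab_size=10_000, n_tags=9,
                     word_dim=100, feat_dim=50, hidden=128):
            super().__init__()
            # Two "representation modules"; the framework allows n of them.
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.feat_emb = nn.Embedding(vocab_size, feat_dim)  # stand-in for a second module
            self.lstm = nn.LSTM(word_dim + feat_dim, hidden,
                                bidirectional=True, batch_first=True)
            self.emissions = nn.Linear(2 * hidden, n_tags)  # per-token tag scores

        def forward(self, token_ids):
            # Concatenate the outputs of all representation modules into one
            # vector per token, then encode context with the BiLSTM.
            x = torch.cat([self.word_emb(token_ids), self.feat_emb(token_ids)], dim=-1)
            h, _ = self.lstm(x)
            return self.emissions(h)

    tagger = NerTagger()
    scores = tagger(torch.randint(0, 10_000, (1, 12)))  # 1 sentence, 12 tokens
    print(scores.argmax(dim=-1))  # greedy decode; the real model uses CRF Viterbi decoding
    ```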

    TRAINING:

    This model was trained on a dataset consisting of 28 hand-annotated Arabic Wikipedia articles, totaling 74,000 tokens. Each article has one line per token, and each line consists of the UTF-8 encoded token and its corresponding tag. The tags follow the “BIO” format: for example, “B-PER” marks the first token of a “Person” entity mention, “I-PER” marks a token inside a multi-token “Person” mention other than the first, and “O” marks a token that does not belong to any entity mention. The Miscellaneous annotations are coded “MIS”.
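
    To make the scheme concrete, here is a small, hypothetical helper (not part of the model) that groups BIO tags back into entity spans; the sample sentence and the ORG tags, which follow the same B-/I- pattern for Organization mentions, are invented for the example:

    ```python
    def bio_to_spans(tokens, tags):
        """Group BIO tags into (entity_text, entity_type) spans.
        Illustrative helper, not part of the model's API."""
        spans, current, etype = [], [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if current:
                    spans.append((" ".join(current), etype))
                current, etype = [token], tag[2:]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:  # "O" closes any open span
                if current:
                    spans.append((" ".join(current), etype))
                current, etype = [], None
        if current:
            spans.append((" ".join(current), etype))
        return spans

    # "أحمد" tagged as a Person, "جامعة دمشق" (Damascus University) as an Organization.
    print(bio_to_spans(["أحمد", "درس", "في", "جامعة", "دمشق"],
                       ["B-PER", "O", "O", "B-ORG", "I-ORG"]))
    ```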

    VALIDATION:

    The performance of the model was tested on a validation subset of the dataset described above, consisting of seven of the articles.

    INPUT SPECIFICATION:

    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size    Accepted Format(s)
    input.txt    1 MB            .txt
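
    As a quick sketch of preparing a conforming input file, using the UTF-8 encoding and character limit stated in the description above (the sample sentence is invented):

    ```python
    text = "افتتحت الأمم المتحدة مكتبا جديدا في الرياض."  # sample Arabic sentence
    assert len(text) <= 200_000, "input is limited to 200,000 characters"
    with open("input.txt", "w", encoding="utf-8") as f:  # UTF-8 encoding is required
        f.write(text)
    ```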

    OUTPUT DETAILS:

    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1 MB            .json