Multilingual Concept Extraction

Model by Blackbird.AI

Blackbird.AI’s Multilingual Concept Extraction model detects and classifies entities (persons, organizations, locations, and temporal expressions), context drivers (ideology, cause of death, criminal charges, religion), key phrases, social tags, and URLs in English text, with beta support for Arabic and Spanish. The output is the extracted text, labeled by category. The model can be used for natural language understanding of digital narratives in social media, news media, or web data.

    Product Description

    PERFORMANCE METRICS

    This model was trained using a combination of fully supervised and semi-supervised learning on a large (millions of documents) and diverse dataset of Twitter and reputable news content covering a wide variety of topics. It obtains a macro precision of 89.7%, a macro recall of 89.6%, and a macro F1 score of 89.6%. Its strengths are that it works well on both short-form (social) and long-form (news article) text, and that its low latency makes it suitable for processing streaming feeds. Its weakness is that, although it is intrinsically multilingual, it performs less well on non-English languages without additional training. It can, however, be adapted to and improved for other languages on customer request.

    89.6% F1 Score – The harmonic mean of precision and recall, with a best value of 100%. It measures the balance between the two metrics.

    89.7% Precision – The share of labels predicted by the model that are correct. A higher score means fewer false positives across the classes.

    89.6% Recall – The share of true labels that the model finds. A higher score means fewer missed labels across the classes.
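
    As an illustration of how macro-averaged scores like those above are typically computed, the sketch below derives precision, recall, and F1 per class from true-positive, false-positive, and false-negative counts, then averages them across classes. The function names and the example counts are hypothetical; this page does not state how its figures were calculated, so treat this as one common convention rather than the vendor's method.

```python
from typing import Dict, Tuple


def harmonic_mean(p: float, r: float) -> float:
    """F1 for a single precision/recall pair; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def macro_scores(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, macro F1).

    counts maps a class label to (true positives, false positives, false negatives).
    Each metric is computed per class and then averaged with equal class weight.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(harmonic_mean(p, r))
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for two classes, for illustration only.
example = {"PERSON": (90, 10, 10), "LOCATION": (80, 20, 20)}
macro_p, macro_r, macro_f1 = macro_scores(example)
```

    Because macro F1 averages per-class F1 values, it need not equal the harmonic mean of macro precision and macro recall. For the figures reported here the two are close: harmonic_mean(0.897, 0.896) is roughly 0.8965.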

    OVERVIEW:

    The model extracts text spans and labels each one with a category. The categories are grouped by type as follows.

    Involved Parties:
    – PERSON e.g. Anthony Fauci, Donald Trump, Neera Tanden
    – ORGANIZATION e.g. FDA, New York Times, PETA
    – TITLE e.g. president, actor, farmer
    – USER MENTION e.g. @nytimes, @BorisJohnson

    Lead Category:
    – KEY PHRASE e.g. election results in multiple states, mail in ballot, bill gates vaccine, moderna covid vaccine research
    – HASHTAG e.g. #thehandmaidstale, #vaccines, #election
    – URL e.g. http://www.blackbird.ai

    Place:
    – LOCATION e.g. Wisconsin, China, Brazil, US
    – NATIONALITY e.g. American, Indian, Chinese

    Context Drivers:
    – IDEOLOGY e.g. capitalism, Democrats, Republicans, socialism, right wing
    – RELIGION e.g. Hindu, Muslim, independent
    – CAUSE OF DEATH e.g. war, disease, cancer
    – CRIMINAL CHARGE e.g. extortion, bribery, rape
    – MISC e.g. DNA, 11 million americans

    Temporal:
    – SET e.g. daily, annually
    – TIME e.g. night, overnight, morning
    – DATE e.g. 2019, last year, now
    – DURATION e.g. years, the last two years
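
    The category groups above can be captured as a small lookup table, which is handy when aggregating extracted concepts by group. This is a minimal sketch: the group and label names are copied from the lists above, but the exact label strings the model emits are not documented here, so the spellings and the group_of helper are assumptions.

```python
from typing import Dict, List

# Group -> labels, transcribed from the category lists above.
# Assumption: the model's output uses these exact label spellings.
CATEGORY_GROUPS: Dict[str, List[str]] = {
    "Involved Parties": ["PERSON", "ORGANIZATION", "TITLE", "USER MENTION"],
    "Lead Category": ["KEY PHRASE", "HASHTAG", "URL"],
    "Place": ["LOCATION", "NATIONALITY"],
    "Context Drivers": ["IDEOLOGY", "RELIGION", "CAUSE OF DEATH", "CRIMINAL CHARGE", "MISC"],
    "Temporal": ["SET", "TIME", "DATE", "DURATION"],
}

# Reverse index: label -> group.
LABEL_TO_GROUP: Dict[str, str] = {
    label: group for group, labels in CATEGORY_GROUPS.items() for label in labels
}


def group_of(label: str) -> str:
    """Return the group for a label, ignoring case and surrounding whitespace."""
    return LABEL_TO_GROUP[label.strip().upper()]
```

    For example, group_of("hashtag") returns "Lead Category", and an unknown label raises KeyError so that unexpected output is not silently dropped.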

    TRAINING:

    This model was trained using a combination of fully supervised and semi-supervised learning on a large (millions of documents) and diverse dataset of Twitter and reputable news content covering a wide variety of topics.

    VALIDATION:

    The model was validated on a test split drawn from the training dataset, containing 16K social media and news documents, to quantify how well it identifies concepts.

    INPUT SPECIFICATION

    The input(s) to this model must adhere to the following specifications:

    Filename     Maximum Size   Accepted Format(s)
    input.txt    1M             .txt
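
    A client can check a file against this specification before submitting it. The sketch below is illustrative only: check_input is a hypothetical helper, "1M" is read here as 1,000,000 bytes (it may instead mean 1 MiB), and UTF-8 encoding is assumed because the specification does not name one.

```python
from pathlib import Path

# Assumption: "1M" in the table means 1,000,000 bytes.
MAX_INPUT_BYTES = 1_000_000


def check_input(path: str, max_bytes: int = MAX_INPUT_BYTES) -> str:
    """Validate a candidate input file and return its text.

    Raises ValueError if the extension is not .txt or the file is too large.
    """
    p = Path(path)
    if p.suffix.lower() != ".txt":
        raise ValueError(f"expected a .txt file, got '{p.suffix}'")
    size = p.stat().st_size
    if size > max_bytes:
        raise ValueError(f"file is {size} bytes, limit is {max_bytes}")
    # Assumption: the text is UTF-8 encoded.
    return p.read_text(encoding="utf-8")
```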

    OUTPUT DETAILS

    This model will output the following:

    Filename       Maximum Size   Format
    results.json   1M             .json
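
    This page does not document the structure of results.json beyond "extracted text labeled by category". Purely as an illustration, the sketch below assumes a hypothetical schema, a JSON list of objects with "text" and "label" fields, and groups the extracted text by label. The field names and the group_by_label helper are assumptions and should be adjusted to the actual output.

```python
import json
from collections import defaultdict
from typing import Dict, List


def group_by_label(results_json: str) -> Dict[str, List[str]]:
    """Group extracted text by category label.

    Assumes a hypothetical schema: [{"text": "...", "label": "..."}, ...].
    """
    records = json.loads(results_json)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["label"]].append(record["text"])
    return dict(grouped)


# Hypothetical payload built from the example values listed in the overview.
sample = json.dumps([
    {"text": "Anthony Fauci", "label": "PERSON"},
    {"text": "FDA", "label": "ORGANIZATION"},
    {"text": "#vaccines", "label": "HASHTAG"},
    {"text": "Neera Tanden", "label": "PERSON"},
])
grouped = group_by_label(sample)
```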