Blackbird.AI’s Multilingual Concept Extraction model detects and classifies entities (persons, locations, organizations, places, and temporal expressions), context drivers (ideology, cause of death, criminal charges, religion), key phrases, social tags, and URLs in English text, with beta support for Arabic and Spanish. The output is the extracted text, labeled by category. The model can be used for natural language understanding of digital narratives in social media, news media, or web data.
This model was trained using a combination of fully supervised and semi-supervised learning on a large (millions of documents) and diverse dataset of Twitter and reputable news content covering a wide variety of topics. It obtains a macro precision score of 89.7%, a macro recall score of 89.6%, and a macro F1 score of 89.6%. The model’s strengths are that it works well on both short-form (social) and long-form (news article) text, and that its very low latency makes it suitable for processing streaming feeds. The model’s weakness is that, while it is intrinsically multilingual, it will not perform as well on non-English languages without additional training. However, it can be readily adapted to and improved for any language on customer request.
89.6% F1 Score – The harmonic mean of precision and recall, which measures the balance between the two metrics. Its best possible value is 1 (100%).
89.7% Precision – The fraction of the labels predicted by the model that are correct. A higher precision score indicates that the majority of labels predicted for the different classes are accurate.
89.6% Recall – The fraction of the labels the model is supposed to find that it actually finds. A higher recall score indicates that the model finds and predicts correct labels for the majority of the classes.
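To make the three metrics concrete, here is a minimal Python sketch of macro averaging: precision, recall, and F1 are computed per class and then averaged with equal weight per class. The function name macro_scores, the (text, label) pair representation, and the toy data are assumptions made for illustration. The source does not say how the reported scores were computed, for example whether macro F1 is the mean of per-class F1 (as done here) or the harmonic mean of macro precision and macro recall.

```python
def macro_scores(gold, predicted):
    """Return (macro precision, macro recall, macro F1).

    gold and predicted are sets of (text, label) pairs. Each label is
    scored separately, then the per-label scores are averaged.
    """
    labels = sorted({label for _, label in gold | predicted})
    if not labels:
        return 0.0, 0.0, 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        gold_items = {item for item in gold if item[1] == label}
        pred_items = {item for item in predicted if item[1] == label}
        true_pos = len(gold_items & pred_items)

        precision = true_pos / len(pred_items) if pred_items else 0.0
        recall = true_pos / len(gold_items) if gold_items else 0.0
        # F1 is the harmonic mean of precision and recall.
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (illustrative only, not taken from the model's test set).
gold = {
    ("Anthony Fauci", "PERSON"),
    ("FDA", "ORGANIZATION"),
    ("Wisconsin", "LOCATION"),
    ("China", "LOCATION"),
}
predicted = {
    ("Anthony Fauci", "PERSON"),
    ("FDA", "ORGANIZATION"),
    ("Wisconsin", "LOCATION"),
    ("PETA", "LOCATION"),  # wrong label, and "China" was missed
}

macro_p, macro_r, macro_f1 = macro_scores(gold, predicted)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

In this toy case PERSON and ORGANIZATION score 1.0 and LOCATION scores 0.5 on every metric, so each macro score is 2.5 / 3. Because every class counts equally, a weak rare class lowers the macro score as much as a weak common one.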
– PERSON e.g. Anthony Fauci, Donald Trump, Neera Tanden
– ORGANIZATION e.g. FDA, New York Times, PETA
– TITLE e.g. president, actor, farmer
– USER MENTION e.g. @nytimes, @BorisJohnson
– KEY PHRASE e.g. election results in multiple states, mail in ballot, bill gates vaccine, moderna covid vaccine research
– HASHTAG e.g. #thehandmaidstale, #vaccines, #election
– URL e.g. http://www.blackbird.ai
– LOCATION e.g. Wisconsin, China, Brazil, US
– NATIONALITY e.g. American, Indian, Chinese
– IDEOLOGY e.g. capitalism, Democrats, Republicans, socialism, right wing
– RELIGION e.g. Hindu, Muslim, independent
– CAUSE OF DEATH e.g. war, disease, cancer
– CRIMINAL CHARGE e.g. extortion, bribery, rape
– MISC e.g. DNA, 11 million americans
– SET e.g. daily, annually
– TIME e.g. night, overnight, morning
– DATE e.g. 2019, last year, now
– DURATION e.g. years, the last two years
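Since the output is extracted text labeled with one of the categories above, a downstream consumer will often want to group results by category. The Python sketch below shows one way to do that. The Concept class, the group_by_category function, and the flat list-of-pairs shape are hypothetical and chosen for illustration; the model's actual output schema is not given here, and only the category names come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category names copied from the list above.
CATEGORIES = frozenset({
    "PERSON", "ORGANIZATION", "TITLE", "USER MENTION", "KEY PHRASE",
    "HASHTAG", "URL", "LOCATION", "NATIONALITY", "IDEOLOGY", "RELIGION",
    "CAUSE OF DEATH", "CRIMINAL CHARGE", "MISC", "SET", "TIME", "DATE",
    "DURATION",
})


@dataclass(frozen=True)
class Concept:
    """One extracted span and its category (hypothetical structure)."""
    text: str
    category: str


def group_by_category(concepts):
    """Group extracted texts by category, preserving input order."""
    grouped = defaultdict(list)
    for concept in concepts:
        if concept.category not in CATEGORIES:
            raise ValueError(f"Unknown category: {concept.category!r}")
        grouped[concept.category].append(concept.text)
    return dict(grouped)


# Example values reuse the samples listed above.
concepts = [
    Concept("Anthony Fauci", "PERSON"),
    Concept("FDA", "ORGANIZATION"),
    Concept("#vaccines", "HASHTAG"),
    Concept("Donald Trump", "PERSON"),
]
grouped = group_by_category(concepts)
print(grouped)
```

Rejecting unknown categories up front is a design choice in this sketch: it surfaces schema drift early instead of silently creating new groups.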
Model validation was performed on a test split of the training data containing 16K social media and news documents to quantify the model’s performance on identifying concepts.
The input(s) to this model must adhere to the following specifications:
This model will output the following: