This model takes unstructured text as input and uses the Latent Dirichlet Allocation (LDA) algorithm to return the top ten topics found in the text. It can be used in several ways, such as reducing a large set of documents to a smaller subset covering topics of interest, or supplying the returned topics as input features for document classification.
This is an unsupervised model, used to draw inferences from datasets without labels, so no metrics can be recorded. The model was built using the entire English Wikipedia corpus, which contains 1.9 billion words in more than 4.4 million articles.
This model uses the Latent Dirichlet Allocation (LDA) algorithm to reduce the dimensionality of a submitted text vector. LDA is an industry-standard technique for topic modeling: for a given number of topics, it learns a distribution over words for each topic. The model then compares the word distribution of the submitted document with the distribution of each topic and returns the most similar topics.
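The final selection step can be illustrated with a minimal sketch. It assumes the document's topic weights have already been inferred and are available as a mapping from topic ID to probability; the function name `top_topics` and the example values are hypothetical and are not part of this model's interface.

```python
import heapq

def top_topics(doc_topic_probs, k=10):
    """Return the k (topic_id, probability) pairs with the highest probability.

    doc_topic_probs is an assumed, already-inferred mapping of
    topic ID -> probability for a single document.
    """
    return heapq.nlargest(k, doc_topic_probs.items(), key=lambda item: item[1])

# Hypothetical distribution over five topics for one document.
example = {0: 0.05, 1: 0.40, 2: 0.10, 3: 0.30, 4: 0.15}
print(top_topics(example, k=3))  # [(1, 0.4), (3, 0.3), (4, 0.15)]
```

With the default `k=10`, the same call would mirror the "top ten topics" behavior described above.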
LDA is an unsupervised algorithm. It was trained on the entire English Wikipedia corpus and generated 1,200 topics in one pass over the full dataset, in batches of 158,000 documents. The LDA prior distribution parameters, η and α, were set to be symmetric, with values equal to 0.00083. The text was lemmatized and stop words were removed. Words that appeared in fewer than 20 documents or in more than 10% of the corpus were discarded.
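The vocabulary pruning rule can be sketched in plain Python as follows. This is an illustration only, not the code used to build the model: the function name `filter_vocabulary` is hypothetical, and "10% of the corpus" is interpreted here as 10% of the documents. The last lines also show that the stated prior value of 0.00083 is consistent with 1 divided by the number of topics, which is an observation about the numbers rather than a documented design choice.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_docs=20, max_doc_fraction=0.10):
    """Keep words that occur in at least min_docs documents and in
    at most max_doc_fraction of all documents (assumed interpretation)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = max_doc_fraction * len(tokenized_docs)
    return {word for word, df in doc_freq.items() if min_docs <= df <= max_docs}

NUM_TOPICS = 1200
# 1 / 1200 is about 0.000833, which rounds to the stated value of 0.00083.
symmetric_prior = round(1 / NUM_TOPICS, 5)
```

Given a list of token lists (one per lemmatized, stop-word-free document), `filter_vocabulary` returns the set of words that would survive both thresholds.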
No validation was performed because this model is unsupervised.
The input(s) to this model must adhere to the following specifications:
The input file, “input.txt”, should contain UTF-8 encoded English text.
This model will output the following:
The output file, “results.json”, will contain the list of topics returned by the model.
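As a rough sketch of working with these two files, the helpers below write text to “input.txt” with UTF-8 encoding and parse “results.json”. The function names and the `directory` parameter are hypothetical, and because the exact JSON structure of the results is not specified here, the parsed content is returned without assuming any schema.

```python
import json
from pathlib import Path

def write_input(text, directory="."):
    """Write English text to input.txt using UTF-8 encoding."""
    path = Path(directory) / "input.txt"
    path.write_text(text, encoding="utf-8")
    return path

def read_results(directory="."):
    """Parse results.json and return its content as-is.

    The structure of the topic list is not documented in this section,
    so no particular schema is assumed.
    """
    path = Path(directory) / "results.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

For example, `write_input("Some English text to analyze.")` prepares the input file in the current directory, and `read_results()` loads the topics once the model has produced its output there.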