This model takes unstructured text as input and returns the top ten topics that are found within the text using the Latent Dirichlet Allocation (LDA) algorithm. This model can be used in a variety of ways, such as reducing a large set of documents to a smaller subset containing topics of interest or using the returned topics as input features for document classification.
This is an unsupervised model: it draws inferences from unlabeled datasets, so no performance metrics can be recorded. The model was built using the entire English Wikipedia corpus, which contains 1.9 billion words in more than 4.4 million articles.
This model uses the Latent Dirichlet Allocation (LDA) algorithm to reduce the dimensionality of a submitted text vector. LDA is an industry standard for topic modeling: for a given number of topics, it learns a distribution of words for each topic. The model then determines how similar the word distribution of the submitted document is to the distribution of each topic and returns the most similar topics.
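The final ranking step can be sketched as follows. This is a minimal illustration, assuming the model has already inferred one weight per topic for the submitted document; the function name `top_topics` and the sample weights are hypothetical and are not part of the model's interface.

```python
from typing import List, Sequence, Tuple


def top_topics(topic_weights: Sequence[float], n: int = 10) -> List[Tuple[int, float]]:
    """Return the n highest-weighted (topic_id, weight) pairs, best first."""
    # Pair each weight with its topic index, then sort by weight, descending.
    ranked = sorted(enumerate(topic_weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical weights for a toy 5-topic model (the real model has 1,200 topics).
weights = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_topics(weights, n=3))
```

With 1,200 topics and n left at its default of 10, the same function would yield the ten best-matching topics.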
LDA is an unsupervised algorithm. The model was trained on the entire English Wikipedia corpus, and LDA generated 1,200 topics in one pass over the full dataset, in batches of 158,000 documents. The LDA prior distribution parameters, η and α, were set to be symmetric, each with a value of 0.00083. The text was lemmatized and stop words were removed. Words that appeared in fewer than 20 documents or in more than 10% of the corpus were discarded.
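The vocabulary pruning rule described above can be sketched in plain Python. This is an illustration only, not the original training code: the function name `filter_vocabulary` and the toy corpus are hypothetical, and the sketch assumes "10% of the corpus" means 10% of the documents. It also shows that the stated prior value of 0.00083 is consistent with 1 divided by the number of topics, although the source does not say how the value was chosen.

```python
from collections import Counter
from typing import List, Set

NUM_TOPICS = 1200
# 1 / 1200 rounds to 0.00083, matching the stated symmetric prior value.
SYMMETRIC_PRIOR = round(1 / NUM_TOPICS, 5)


def filter_vocabulary(
    docs: List[List[str]], min_docs: int = 20, max_fraction: float = 0.10
) -> Set[str]:
    """Keep tokens found in at least min_docs documents and in at most
    max_fraction of all documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = max_fraction * len(docs)
    return {tok for tok, df in doc_freq.items() if min_docs <= df <= max_docs}


# Toy corpus with relaxed thresholds so the effect is visible on four documents.
toy_docs = [["cat", "the"], ["cat", "the"], ["dog", "the"], ["the", "fish"]]
kept = filter_vocabulary(toy_docs, min_docs=2, max_fraction=0.5)
print(sorted(kept), SYMMETRIC_PRIOR)
```

In the toy corpus, "the" is dropped for being too common and "dog" and "fish" for being too rare, which mirrors how the thresholds trim both ends of the vocabulary.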
No validation was performed because this model is unsupervised.
The input(s) to this model must adhere to the following specifications:
The input file, “input.txt”, should contain UTF-8 encoded English text.
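One way to prepare and check such a file is sketched below. The helper names `write_input` and `is_valid_utf8` are hypothetical and only illustrate the encoding requirement; they are not part of the model's interface.

```python
from pathlib import Path


def write_input(text: str, path: str = "input.txt") -> Path:
    """Write text to disk with explicit UTF-8 encoding."""
    out = Path(path)
    out.write_text(text, encoding="utf-8")
    return out


def is_valid_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    written = write_input("The café served crème brûlée after the lecture.")
    print(written, is_valid_utf8(str(written)))
```

Passing the encoding explicitly avoids depending on the platform default, which is not UTF-8 on every system.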
This model will output the following:
The output file, “results.json”, will contain the list of topics returned by the model.
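Reading the output might look like the sketch below. The exact schema of “results.json” is not specified here, so this assumes the top level of the file is a JSON array of topics and raises an error otherwise; the function name `load_topics` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def load_topics(path: str = "results.json") -> List[Any]:
    """Parse the results file and return its topics.

    Assumption: the top-level JSON value is a list. Adjust this if the
    actual file wraps the list in an object.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON list, got {type(data).__name__}")
    return data


if __name__ == "__main__":
    for topic in load_topics():
        print(topic)
```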