Text Topic Modeling

Model by Modzy

This model takes unstructured text as input and returns the top ten topics that are found within the text using the Latent Dirichlet Allocation (LDA) algorithm. This model can be used in a variety of ways, such as reducing a large set of documents to a smaller subset containing topics of interest or using the returned topics as input features for document classification.

  • Description

    Product Description


    This is an unsupervised model: it draws inferences from unlabeled datasets, so no performance metrics can be recorded. The model was built using the entire English Wikipedia corpus, which contains 1.9 billion words in more than 4.4 million articles.


    This model uses the Latent Dirichlet Allocation (LDA) algorithm to reduce the dimensionality of a submitted text vector. LDA is an industry standard for topic modeling: it learns a distribution over words for each of a given number of topics. The model then measures how similar the word distribution of the submitted document is to the distribution of each topic and returns the most similar topics.
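
    The "compare the document against every topic, then keep the best matches" step can be illustrated with a small sketch. This is not the model's actual inference procedure: real LDA inference estimates a topic mixture for the document, whereas the sketch below substitutes plain cosine similarity between word distributions. The function names and the toy topics are hypothetical and exist only for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * q.get(word, 0.0) for word, weight in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def top_topics(tokens, topics, k=10):
    """Return the ids of the k topics whose word distribution best matches the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    # Normalize raw counts into a word-frequency distribution for the document.
    doc = {word: n / total for word, n in counts.items()}
    scored = ((cosine(doc, dist), topic_id) for topic_id, dist in topics.items())
    # nlargest returns (score, topic_id) pairs ordered from best to worst.
    return [topic_id for _, topic_id in heapq.nlargest(k, scored)]


# Hypothetical toy topics; the real model has 1,200 topics learned from Wikipedia.
TOY_TOPICS = {
    "sports": {"team": 0.5, "game": 0.3, "score": 0.2},
    "cooking": {"recipe": 0.5, "oven": 0.3, "salt": 0.2},
    "finance": {"market": 0.6, "stock": 0.4},
}

print(top_topics(["team", "game", "game", "salt"], TOY_TOPICS, k=2))
```

    With k=10 the same function would return ten topic ids, matching the number of topics this model reports per input.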


    LDA is an unsupervised algorithm. The model was trained on the entire English Wikipedia corpus: LDA generated 1,200 topics in one pass over the full dataset, in batches of 158,000 documents. The LDA prior distribution parameters, η and α, were set to be symmetric with values equal to 0.00083. The text was lemmatized and stop words were removed. Words that appeared in fewer than 20 documents or in more than 10% of the corpus were discarded.
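
    The vocabulary pruning rule and the prior value can be sketched as follows. The sketch assumes that "10% of the corpus" means 10% of the documents, and that the 0.00083 prior corresponds to 1 divided by the number of topics (1/1200 rounds to that value). Neither assumption is stated in the description, and the function name is hypothetical.

```python
from collections import Counter

NUM_TOPICS = 1200
# 1 / 1200 rounds to the 0.00083 quoted above. That this is how the value
# was chosen is an assumption; the description does not say.
SYMMETRIC_PRIOR = 1.0 / NUM_TOPICS


def filter_vocabulary(docs, min_docs=20, max_fraction=0.10):
    """Keep words found in at least min_docs documents and in at most
    max_fraction of all documents. Each document is a list of tokens."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = max_fraction * len(docs)
    return {word for word, df in doc_freq.items() if min_docs <= df <= max_docs}
```

    On a small test corpus the thresholds need to be lowered, for example filter_vocabulary(docs, min_docs=2, max_fraction=0.5), because the defaults would discard every word.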


    No validation was performed because this model is unsupervised.


    The input(s) to this model must adhere to the following specifications:

    Filename       Maximum Size    Accepted Format(s)
    input.txt      5M              .txt

    The input file, “input.txt”, should contain UTF-8 encoded English text.
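
    A file can be checked against these input specifications before it is submitted. The sketch below is a minimal local check, not part of the model or of any Modzy tooling. It assumes "5M" means 5 MiB, and the function name is hypothetical. It does not verify that the text is English.

```python
from pathlib import Path

# Assumes "5M" means 5 MiB; the specification does not define the unit.
MAX_INPUT_BYTES = 5 * 1024 * 1024


def check_input(path):
    """Return a list of problems found with the input file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".txt":
        problems.append("file extension is not .txt")
    data = path.read_bytes()
    if len(data) > MAX_INPUT_BYTES:
        problems.append("file is larger than the assumed 5 MiB limit")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

    Calling check_input("input.txt") on a conforming file returns an empty list.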


    This model will output the following:

    Filename        Maximum Size    Format
    results.json    1M              .json

    The output file, “results.json”, will contain the list of topics that the model returns for the input text.