Python Topic Modeling

Shittu Olumide Ayodeji
Published in Heartbeat · Nov 22, 2022


(Image source: Pixabay)

Topic modeling is a type of statistical modeling used to discover the abstract “topics” that occur in a collection of texts. It has grown in importance over the past few years as an unsupervised machine learning technique, and it can be used to organize text (or image, DNA, etc.) data so that related fragments can be recognized. In LDA, the per-document topic mixture and the per-topic word mixture are both modeled as Dirichlet distributions.

In this article, we will take an in-depth look at topic modeling and Latent Dirichlet Allocation (LDA), and finally we will perform topic modeling on a dataset using the LDA approach.

Let’s ease into it by understanding what topic modeling means.

Topic Modeling

Topic modeling is an unsupervised method that aims to analyze massive amounts of text data by grouping documents into categories. There are no labels associated with the text data; instead, topic modeling organizes the text into clusters based on shared traits.

Topic modeling’s most well-known use is clustering news stories that belong to the same category; in other words, grouping documents that deal with the same subject. It is important to emphasize that assessing the success of topic modeling is difficult because there are no ground-truth answers. Furthermore, the user must decide on a title or theme for each cluster based on the commonalities among its documents.

Advantages of topic modeling

Topic modeling lets you organize, comprehend, and summarize many different topics at once. You can make data-driven decisions by quickly identifying the latent thematic patterns that exist across the data. Some business use cases of topic modeling include:

  • Tagging customer support tickets automatically
  • Routing conversations to the appropriate teams based on topic
  • Applying customer feedback at scale
  • Producing content that is data-driven and problem-focused
  • Enhancing sales strategies

In the next section, we’ll go through a method for topic modeling that uses Latent Dirichlet Allocation and demonstrate how it can be applied in Python.

Latent Dirichlet Allocation (LDA)

In LDA, each topic is assumed to be a mixture of an underlying collection of words, and each document is assumed to be a mixture of a set of topic probabilities.

LDA can be viewed as a matrix factorization technique that assumes documents are built from a variety of topics, with words then generated from those topics according to their probability distributions. Confronted with a collection of documents, LDA works backwards and tries to determine which topics would have generated those documents in the first place. LDA rests on two broad assumptions:

  • Documents with similar wording generally cover the same topic.
  • Documents in which groups of words frequently appear together typically cover the same subject.

These assumptions are reasonable: documents on the same subject, such as sports, will contain words like “athletics,” “champion,” “competition,” “field,” “player,” and so on. The second assumption says that if certain terms frequently co-occur across several documents, those documents likely fall under the same category.
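To make this generative story concrete, here is a minimal toy sketch of how LDA assumes a single document is produced; the vocabulary and parameters below are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
vocab = ['athletics', 'champion', 'player', 'recipe', 'oven', 'flour']  # toy vocabulary
n_topics = 2

# Each topic is a Dirichlet-distributed mixture over words,
# and each document is a Dirichlet-distributed mixture over topics.
topic_word = rng.dirichlet([0.5] * len(vocab), size=n_topics)
doc_topics = rng.dirichlet([0.1] * n_topics)

document = []
for _ in range(10):  # generate a 10-word document
    z = rng.choice(n_topics, p=doc_topics)  # sample a topic for this word
    w = rng.choice(len(vocab), p=topic_word[z])  # sample a word from that topic
    document.append(vocab[w])
print(document)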

Performing LDA in Python

In this section, we’ll break down every step required to conduct LDA using Python, providing you with the skills you need to finish this lesson on your own.

- Prerequisites

To continue with this article, we need to have the following Python libraries installed: pandas, numpy, and scikit-learn. We can install these libraries using pip, as shown below.
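For example, the following command installs all three from PyPI:

pip install pandas numpy scikit-learn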

- Importing the required libraries

import random
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

- Selecting the dataset

The dataset we will use for topic modeling is a very popular dataset on Kaggle: it contains 568,454 food reviews that Amazon users left up until October 2012. We will divide the customer reviews into 5 groups using LDA.
Download the dataset from Kaggle.

data = pd.read_csv('amazon-reviews.csv')  # load the reviews CSV
data = data.head(2500)  # keep only the first 2,500 rows
data.head()

The dataset is quite voluminous, so we will be using just 2,500 rows.

data.shape  # inspect the dimensions of the data frame
data.isnull().sum()  # count missing values per column
data = data.dropna()  # drop rows with nulls (reassign, since dropna is not in-place)

Here we ran a quick check on the dataset: we inspected its shape, counted the null values, and removed them so that they won’t cause us trouble later on. Note that dropna() returns a new data frame, so we reassign the result back to data.

- Picking the column

Since the “Text” column contains the reviews, LDA will only be applied to that column; the other columns will be disregarded. Let's have a quick look at the “Text” column, row 4:

data['Text'][4]

The terms in our data must first be compiled into a vocabulary before we can use LDA. We can do this with the aid of a count vectorizer.

vec = CountVectorizer(max_df=0.85, min_df=2, stop_words='english')
v_matrix = vec.fit_transform(data['Text'].values.astype('U'))
v_matrix

To construct the document-term matrix, we use the CountVectorizer class from the sklearn.feature_extraction.text module. Only words that appear in at most 85% of the documents (max_df=0.85) and in at least two documents (min_df=2) are included. We also remove all the English stop words, since they don’t contribute much to topic modeling.

With this, each of our 2,500 documents is represented as a 4,663-dimensional vector, which means that our vocabulary has 4,663 words.
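We can verify these numbers directly (the exact vocabulary size may vary slightly depending on the rows you keep):

print(v_matrix.shape)  # (2500, 4663): documents x vocabulary terms
print(len(vec.get_feature_names()))  # size of the vocabulary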

- Creating topics

We will use LDA to generate topics and the probability distribution for each word in each topic’s vocabulary.

LDA = LatentDirichletAllocation(n_components=5, random_state=45)
LDA.fit(v_matrix)

To run LDA on our document-term matrix, we use the LatentDirichletAllocation class from the sklearn.decomposition package. The n_components parameter defines how many categories or topics we wish to segment our content into. We set random_state to 45 so that the results are reproducible.

Let’s choose some terms at random from our vocabulary. Every word in our vocabulary is stored in the count vectorizer; get_feature_names() returns the full list of words, which we can index by word ID.
Let’s fetch 20 random words from our vocabulary.

for i in range(20):
    random_id = random.randint(0, len(vec.get_feature_names()) - 1)  # randint is inclusive on both ends
    print(vec.get_feature_names()[random_id])

Next, run a quick check to find the words with the highest probability for the first topic.

topic_1 = LDA.components_[0]  # word weights for the first topic

Using the argsort() method, we can sort the indexes according to the probabilities. After sorting, the 10 words with the highest probabilities occupy the last 10 indexes of the array.

top_topics = topic_1.argsort()[-10:]
top_topics

We can then look up the actual words in the vec object using these indexes.

for i in top_topics:
    print(vec.get_feature_names()[i])

From the output, the words suggest that the first topic may revolve around food and pastries. Let’s print the words that have the highest probability for each of the five topics.

for a, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{a}:')
    print([vec.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

The results show that the second topic mostly contains reviews about dog food, while the third topic could be reviews of online delivery services. You may notice that all of the categories share a few terms in common.

As the last step, we will add a topic column to the original data frame. We can achieve this by passing our document-term matrix to the LDA.transform() method, which assigns each document a probability for each topic.

new_topic_results = LDA.transform(v_matrix)
new_topic_results.shape

You should see (2500, 5), which indicates that each of the 2,500 documents has 5 columns, each corresponding to the probability of one topic. Calling the argmax() method with the axis argument set to 1 returns the topic index with the highest probability for each document. Let’s add a new column called “Topic” to the data frame that assigns each row its topic value.

data['Topic'] = new_topic_results.argmax(axis=1)
data.head()

You can see the new “Topic” column and the corresponding topic values.
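As a quick sanity check, we can also count how many reviews landed in each topic:

print(data['Topic'].value_counts())  # number of reviews assigned to each of the 5 topics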

Conclusion

Topic modeling is one of the most popular subfields of NLP, used to organize vast volumes of unlabeled text data. In this article, we focused on a single approach, applying Latent Dirichlet Allocation to topic modeling of an Amazon reviews dataset. You can also check out other common approaches to topic modeling, such as Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA/LSI), and Probabilistic Latent Semantic Analysis (pLSA).
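For instance, here is a minimal sketch of how Non-Negative Matrix Factorization could be applied to the same document-term matrix using scikit-learn’s NMF class; the parameters mirror our LDA setup and are purely illustrative:

from sklearn.decomposition import NMF

nmf = NMF(n_components=5, random_state=45)  # factor the matrix into 5 topics
nmf.fit(v_matrix)
for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for NMF topic #{i}:')
    print([vec.get_feature_names()[j] for j in topic.argsort()[-10:]])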
