To illustrate the application of Latent Dirichlet Allocation (LDA) in Python, consider a dataset of 3,500 tweets mentioning prominent political figures: Bernie Sanders, Kamala Harris, Joe Biden, and Elizabeth Warren. Collected in early November 2019, this dataset offers a valuable opportunity to analyze the underlying topics discussed in these tweets.
By applying LDA to this dataset, we can effectively identify and visualize the dominant topics present within the tweets. Tools like pyLDAvis
can be employed to create interactive visualizations that aid in understanding the topic distribution and relationships.
Exhibit 25.40 provides the Python code for performing topic modelling on this political tweet dataset, demonstrating the steps involved in applying LDA and visualizing the results.
# Load tweets data to dataframe df
import pandas as pd
df = pd.read_csv(r"data/tweet_data.csv", names= ["id", "text", "date", "name", "username", "followers", "loc"])
df
The next step is to clean the text: remove punctuation, strip out URLs and the candidates' Twitter handles, and apply lower() to each tweet.
import string
ppl = ["berniesanders", "kamalaharris", "joebiden", "ewarren"]
# data cleaning function
def clean(txt):
    '''
    The next line of code is used to clean and preprocess a string of text (txt)
    by removing punctuation and converting all characters to lowercase. Here’s a
    breakdown of how it works:
    string.punctuation: This is a string constant from the string module
    in Python that contains all the common punctuation characters
    (e.g., !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~). It is used as the target for removal
    from the original text.
    str.maketrans("", "", string.punctuation): The str.maketrans() method
    creates a translation table that can be used with the translate() method. The
    first two arguments are empty strings (""), meaning no characters are being
    replaced or mapped to other characters. The third argument is string.punctuation,
    indicating that all punctuation characters should be removed.
    The resulting translation table will map all punctuation characters to None,
    effectively removing them from the string.
    txt.translate(...): The translate() method is called on the string txt,
    using the translation table created by str.maketrans(). It returns a new string
    with all the punctuation characters removed.
    '''
    txt = str(txt.translate(str.maketrans("", "", string.punctuation))).lower()
    txt = str(txt).split()      # Split into items (i.e., words)
    for word in list(txt):      # Iterate over a copy so removal is safe
        if "http" in word:      # Remove http items (links)
            txt.remove(word)
    for hashtag in ppl:         # Remove “berniesanders”, “kamalaharris” ...
        if hashtag in txt:
            txt.remove(hashtag)
    txt = " ".join(txt)         # Join back the words in txt into a sentence
    return txt
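# A quick sanity check of clean() on a made-up tweet (illustrative only, not
# part of the original exhibit):
# clean("Feel the Bern!!! https://t.co/xyz") -> "feel the bern"
# (punctuation stripped, text lowercased, and the http token removed).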
df.text = df.text.apply(clean) # Clean the text field in df with def clean().
df.text
0                                                    text
1       greggonzalez68 we had a party and every one wa...
2       mcuban yes healthcare is a human right and eve...
3       mcuban there will also be private doctors who ...
4       mollycrabapple lsarsour sensanders ilhan aoc r...
                              ...
3496                                   yes 100 trump 2020
3497    levibullen addresses their “impulse” but i’m s...
3498    so do something quit playing the victim amongs...
3499    bsptx1 yeezyeezy234 ellyngail petercoffin read...
3500    seditio hyapatialee yup nra controls all calif...
Name: text, Length: 3501, dtype: object
import warnings
warnings.simplefilter("ignore")
import gensim
from gensim.utils import simple_preprocess
'''
The simple_preprocess function from the Gensim library is a helpful tool for NLP
tasks. It prepares text data by converting it into a list of lowercase words
(tokens), removing special characters, and filtering out words that are too short
or too long. This standardization makes it easier to work with text data in
subsequent NLP processes.
'''
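# For instance (illustrative, not part of the original exhibit):
# simple_preprocess("Healthcare is a HUMAN right!!")
# returns ['healthcare', 'is', 'human', 'right'] - lowercased, punctuation
# stripped, and the one-character token "a" dropped (default min_len=2).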
from gensim.parsing.preprocessing import STOPWORDS as stopwords
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm
from nltk.stem.porter import *
import numpy as np
np.random.seed(0)
stemmer = stemm(language="english")
# function that lemmatizes and stems.
def lemm_stemm(txt):
    return stemmer.stem(lemm().lemmatize(txt, pos="v"))
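# For example (illustrative check): lemm_stemm("running") first lemmatizes
# "running" to "run" (pos="v"); the stemmer leaves "run" unchanged, so "run"
# is returned.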
# function that removes stopwords, lemmatizes and stems
def preprocess(txt):
    r = [lemm_stemm(token) for token in simple_preprocess(txt)
         if token not in stopwords and len(token) > 2]
    '''
    This line uses a list comprehension to create a list r that stores processed
    tokens (words) from the input text.
    It calls simple_preprocess(txt) to break down the text (txt) into individual
    words (tokens), keeps only tokens that are not stopwords and are longer than
    two characters, and passes each remaining token through lemm_stemm(token),
    which performs lemmatization and stemming.
    '''
    return r
# Assign cleaned and prepared documents to a new variable, proc_docs.
proc_docs = df.text.apply(preprocess)
proc_docs
0                                                  [text]
1       [greggonzalez, parti, sob, obama, win, tear, f...
2       [mcuban, yes, healthcar, human, right, famili,...
3       [mcuban, privat, doctor, work, outsid, charg, ...
4       [mollycrabappl, lsarsour, sensand, ilhan, aoc,...
                              ...
3496                                         [yes, trump]
3497     [levibullen, address, impuls, sure, studi, redu]
3498                   [quit, play, victim, real, victim]
3499    [bsptx, yeezyeezi, ellyngail, petercoffin, rea...
3500    [seditio, hyapatiale, yup, nra, control, calif...
Name: text, Length: 3501, dtype: object
The dictionary (in LDA) is the list of all unique terms that occur throughout our collection of documents, with each term mapped to an integer id.
# Using gensim’s corpora package to construct the dictionary.
dictionary = gensim.corpora.Dictionary(proc_docs)
dictionary.filter_extremes(no_below=5, no_above= .90)
'''
filter_extremes() is a method used to remove very rare words (those that appear in
very few documents) and very common words (those that appear in too many documents).
These extreme cases can often be uninformative or even detrimental in text analysis.
no_below=5: Removes words that appear in fewer than 5 documents.
no_above=0.90: Removes words that appear in more than 90% of the documents.
'''
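# A surviving token can be looked up via dictionary.token2id, e.g.
# dictionary.token2id.get("healthcar") returns its integer id, while a pruned
# or unseen token returns None (illustrative check, not in the original exhibit).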
print("Dictionary length:", len(dictionary))
items = list(dictionary.items()) # Convert dictionary to a list of key-value pairs
print(items[:100]) # Print the first 100 items
Dictionary length: 972
[(0, 'fear'), (1, 'obama'), (2, 'parti'), (3, 'sad'), (4, 'tear'), (5, 'win'), (6, 'care'), (7, 'deserv'), (8, 'famili'), (9, 'healthcar'), (10, 'human'), (11, 'includ'), (12, 'mcuban'), (13, 'right'), (14, 'yes'), (15, 'charg'), (16, 'doctor'), (17, 'privat'), (18, 'want'), (19, 'work'), (20, 'aoc'), (21, 'gop'), (22, 'here'), (23, 'lsarsour'), (24, 'sensand'), (25, 'truth'), (26, 'amp'), (27, 'educ'), (28, 'elizabeth'), (29, 'law'), (30, 'tell'), (31, 'today'), (32, 'warren'), (33, 'world'), (34, 'ggreenwald'), (35, 'open'), (36, 'primari'), (37, 'state'), (38, 'tulsigabbard'), (39, 'immigr'), (40, 'love'), (41, 'one'), (42, 'real'), (43, 'teach'), (44, 'way'), (45, 'agre'), (46, 'fail'), (47, 'thank'), (48, 'ball'), (49, 'leav'), (50, 'like'), (51, 'bankrupt'), (52, 'fuck'), (53, 'go'), (54, 'secur'), (55, 'social'), (56, 'point'), (57, 'republican'), (58, 'sound'), (59, 'talk'), (60, 'biden'), (61, 'expect'), (62, 'person'), (63, 'presid'), (64, 'run'), (65, 'year'), (66, 'coup'), (67, 'feel'), (68, 'free'), (69, 'time'), (70, 'awjedward'), (71, 'dictat'), (72, 'make'), (73, 'sens'), (74, 'fool'), (75, 'gov'), (76, 'interest'), (77, 'unfortun'), (78, 'clear'), (79, 'funni'), (80, 'jeremycorbyn'), (81, 'statement'), (82, 'sorri'), (83, 'campaign'), (84, 'cut'), (85, 'middl'), (86, 'month'), (87, 'promis'), (88, 'tax'), (89, 'petebuttigieg'), (90, 'second'), (91, 'wait'), (92, 'bolivia'), (93, 'liber'), (94, 'militari'), (95, 'peac'), (96, 'play'), (97, 'role'), (98, 'dont'), (99, 'malagrav')]
Bag-of-words is our collection of documents broken down into term counts: each document becomes a list of pairs consisting of a term’s identifier and the number of times it occurs in that document.
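As a minimal sketch of what doc2bow produces, consider a made-up token list (the tokens are drawn from the dictionary printed above; the exact ids depend on the fitted dictionary):
sample = ["healthcar", "right", "healthcar", "famili"]
print(dictionary.doc2bow(sample))
# e.g. [(8, 1), (9, 2), (13, 1)] - each pair is (term id, count); tokens not in
# the dictionary are simply ignored.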
'''
This code is preparing and training an LDA model to discover 5 topics from a
preprocessed corpus of documents (proc_docs).
First, the documents are converted into a Bag-of-Words format (bow).
Then, the LdaMulticore model is initialized and trained on the BoW data, using 5
topics, iterating over the data 2 times, and utilizing 2 workers for parallelization.
The resulting lda model can then be used to inspect the topics, assign topics to
new documents, or infer word-topic distributions.
'''
n = 5 # Number of clusters we want to fit our data into.
bow = [dictionary.doc2bow(doc) for doc in proc_docs]
lda = gensim.models.LdaMulticore(bow, num_topics= n, id2word=dictionary, passes=2, workers=2)
'''
Parameters:
bow: The corpus represented in the Bag-of-Words format. This is the input
data (i.e., the list of documents in BoW form).
num_topics=n: The number of topics (clusters) to find in the corpus. In
this case, it's set to n = 5, meaning the model will attempt to group the documents
into 5 distinct topics.
id2word=dictionary: The dictionary that maps word IDs to actual words,
helping interpret the topics.
passes=2: The number of passes (iterations) over the entire corpus. More
passes may improve the model but will take longer to train. Here, it will run
through the data twice.
workers=2: This specifies the number of worker threads to use for parallel processing, speeding up the model training process.
'''
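# Once trained, the model can also assign topics to any BoW document, e.g.
# lda.get_document_topics(bow[1]) returns a list of (topic id, probability)
# pairs for the second tweet (an illustrative check, not in the original exhibit).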
print(bow[:20])
# View clusters
'''
print_topics() is a method in Gensim's LDA model that displays the words and
their importance (weights) for each topic.
The method's parameter specifies the number of topics to print:
If you provide a positive integer (e.g., 3), it will display that many topics.
If you use -1, it tells the method to print all topics in the model.
'''
for id, topic in lda.print_topics(-1):
    print()
    print(f"TOPIC: {id} \n WORDS: {topic}")
[[], [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)], [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(39, 1)], [(18, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1)], [(8, 1), (45, 1), (46, 1)], [(47, 1)], [(48, 1), (49, 1), (50, 1)], [(51, 1), (52, 1), (53, 2), (54, 1), (55, 1)], [(37, 1)], [(50, 1), (56, 1), (57, 1), (58, 1), (59, 1)], [(30, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)], [(66, 1), (67, 1), (68, 1), (69, 1)], [(50, 1), (70, 1), (71, 1), (72, 1), (73, 2)], [(63, 1), (74, 1), (75, 1), (76, 1), (77, 1)], [(42, 1), (78, 1), (79, 1), (80, 1), (81, 1)]]

TOPIC: 0
 WORDS: 0.021*"cenkuygur" + 0.020*"anakasparian" + 0.019*"realdonaldtrump" + 0.018*"krystalbal" + 0.016*"emmavigeland" + 0.015*"like" + 0.012*"johniadarola" + 0.010*"repadamschiff" + 0.009*"believ" + 0.007*"endors"

TOPIC: 1
 WORDS: 0.022*"berni" + 0.017*"know" + 0.013*"yes" + 0.013*"biden" + 0.012*"joe" + 0.012*"pay" + 0.010*"democrat" + 0.009*"need" + 0.009*"right" + 0.009*"plan"

TOPIC: 2
 WORDS: 0.018*"realdonaldtrump" + 0.017*"your" + 0.013*"aoc" + 0.012*"peopl" + 0.011*"joe" + 0.010*"like" + 0.010*"presid" + 0.010*"win" + 0.009*"run" + 0.009*"think"

TOPIC: 3
 WORDS: 0.022*"vote" + 0.020*"trump" + 0.015*"like" + 0.012*"want" + 0.011*"way" + 0.011*"talk" + 0.011*"say" + 0.010*"year" + 0.009*"fan" + 0.008*"xjrh"

TOPIC: 4
 WORDS: 0.019*"peopl" + 0.019*"ananavarro" + 0.012*"thank" + 0.011*"cenkuygur" + 0.011*"like" + 0.011*"go" + 0.010*"trump" + 0.010*"theyoungturk" + 0.009*"look" + 0.009*"support"
Most good machine learning models and applications have a feedback loop, a way to evaluate the model’s performance, scalability, and overall quality. In the topic modeling space, we use coherence scores to determine how “coherent” our model is. Coherence is a float value between 0 and 1; the higher the value, the more coherent, and hence more interpretable, the topics.
# Eval via coherence scoring
from gensim import corpora, models
from gensim.models import CoherenceModel
from pprint import pprint
# Instance of CoherenceModel is created and assigned to the variable coh.
coh = CoherenceModel(model=lda, texts= proc_docs, dictionary = dictionary, coherence = "c_v")
'''
Parameters:
model=lda: This specifies the trained LDA model for which coherence is being
calculated. lda is the model created above.
texts=proc_docs: This is the list of tokenized and preprocessed documents
used in training the LDA model. It helps the coherence model understand the context
of the topics.
dictionary=dictionary: The dictionary used to map word IDs to actual words.
This is necessary for interpreting the topics and calculating coherence.
coherence="c_v": This specifies the coherence measure to use. The "c_v"
coherence measure is based on a sliding window, which considers the top words in
each topic and calculates coherence based on their co-occurrences in the documents.
'''
coh_lda = coh.get_coherence()
'''
The get_coherence() method is called on the coh object to calculate the
coherence score of the LDA model. This score indicates how semantically meaningful
the topics are. High scores (range 0 - 1) generally indicate better topic coherence.
'''
print("Coherence Score:", coh_lda)
Coherence Score: 0.3434103753069543
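A common follow-up, sketched here under the assumption that bow, dictionary and proc_docs from above are still in scope, is to retrain the model with different numbers of topics and compare coherence scores before settling on n.
# Compare coherence across candidate topic counts (illustrative sketch).
for k in (3, 5, 7, 10):
    m = gensim.models.LdaMulticore(bow, num_topics=k, id2word=dictionary, passes=2, workers=2)
    c = CoherenceModel(model=m, texts=proc_docs, dictionary=dictionary, coherence="c_v").get_coherence()
    print(f"{k} topics -> coherence {c:.3f}")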
# !pip install pyLDAvis ... install pyLDAvis if not installed
import pyLDAvis.gensim as pyldavis  # note: newer pyLDAvis releases name this module pyLDAvis.gensim_models
import pyLDAvis
# Create a visualization data structure for the LDA model.
lda_display = pyldavis.prepare(lda, bow, dictionary)
'''Render the interactive visualization in a Jupyter Notebook or other
interactive environments.
'''
pyLDAvis.display(lda_display)
'''
Topic size: The size of each circle represents the prevalence of that topic in the
corpus. This is calculated based on the proportion of words in the corpus that are
assigned to that topic.
Axes: represent the 2D projection of the topics in a high-dimensional space.
Specifically, the axes are derived from Principal Component Analysis (PCA) or
Multidimensional Scaling (MDS) to visualise the distance between topics.
Proximity of circles: Topics that are closer together are more similar in terms of
word distribution.
'''
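The prepared visualization can also be written to a standalone HTML file for sharing outside the notebook (a minimal sketch; the output filename is arbitrary).
# Save the interactive visualization as a self-contained HTML page.
pyLDAvis.save_html(lda_display, "lda_tweets.html")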