Topic Modelling Tweets with LDA in Python

To illustrate the application of Latent Dirichlet Allocation (LDA) in Python, consider a dataset of 3,500 tweets mentioning prominent political figures: Bernie Sanders, Kamala Harris, Joe Biden, and Elizabeth Warren. Collected in early November 2019, this dataset offers a valuable opportunity to analyze the underlying topics discussed in these tweets.

By applying LDA to this dataset, we can effectively identify and visualize the dominant topics present within the tweets. Tools like pyLDAvis can be employed to create interactive visualizations that aid in understanding the topic distribution and relationships.

Exhibit 25.40 provides the Python code for performing topic modelling on this political tweet dataset, demonstrating the steps involved in applying LDA and visualizing the results.

Load Tweets Data
# Load tweets data to dataframe df
import pandas as pd
df = pd.read_csv(r"data/tweet_data.csv", names= ["id", "text", "date", "name", "username", "followers", "loc"])
df 
Tweets dataset
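A quick optional sanity check (values shown are indicative) confirms the shape and columns of the loaded frame. Note that because names= is supplied to read_csv, the file's header row is read in as an ordinary data row, which is why the frame holds 3,501 rows for roughly 3,500 tweets.

# Optional sanity check on the loaded data.
print(df.shape)            # e.g., (3501, 7)
print(list(df.columns))    # ['id', 'text', 'date', 'name', 'username', 'followers', 'loc']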
Data Cleaning
  • Remove punctuation marks, special characters and URLs, and then apply lower() to each tweet.
  • Remove instances of “berniesanders”, “kamalaharris”, “joebiden”, and “ewarren” to prevent skewing term frequencies, as each document contains at least one of these terms.
import string
ppl = ["berniesanders", "kamalaharris", "joebiden", "ewarren"]

# data cleaning function
def clean(txt):
    
    '''
    The next line of code is used to clean and preprocess a string of text (txt) 
    by removing punctuation and converting all characters to lowercase. Here’s a 
    breakdown of how it works: 
    
    string.punctuation: This is a string constant from the string module 
    in Python that contains all the common punctuation characters 
    (e.g., !"#$%&'()*+,-./:;<=>?@[\]^_{|}~`). It is used as the target for removal 
    from the original text.
    
    str.maketrans("", "", string.punctuation): The str.maketrans() method 
    creates a translation table that can be used with the translate() method. The 
    first two arguments are empty strings (""), meaning no characters are being 
    replaced or mapped to other characters. The third argument is string.punctuation, 
    indicating that all punctuation characters should be removed.
    The resulting translation table will map all punctuation characters to None, 
    effectively removing them from the string.
    
    txt.translate(...): The translate() method is called on the string txt, 
    using the translation table created by str.maketrans(). It returns a new string 
    with all the punctuation characters removed.
    '''
    txt = str(txt.translate(str.maketrans("", "", string.punctuation))).lower() 
    txt = str(txt).split()  # Split into tokens (i.e., words)
    txt = [word for word in txt if "http" not in word]  # Drop URL tokens
    txt = [word for word in txt if word not in ppl]  # Drop "berniesanders", "kamalaharris", ...
    txt = " ".join(txt)  # Join the remaining words back into a sentence
    return txt
    
df.text = df.text.apply(clean)  # Clean the text field in df with def clean().
df.text
0                                                    text
1       greggonzalez68 we had a party and every one wa...
2       mcuban yes healthcare is a human right and eve...
3       mcuban there will also be private doctors who ...
4       mollycrabapple lsarsour sensanders ilhan aoc r...
                              ...                        
3496                                   yes 100 trump 2020
3497    levibullen addresses their “impulse” but i’m s...
3498    so do something quit playing the victim amongs...
3499    bsptx1 yeezyeezy234 ellyngail petercoffin read...
3500    seditio hyapatialee yup nra controls all calif...
Name: text, Length: 3501, dtype: object
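To see what clean() does in isolation, here is an illustrative call on a hypothetical tweet (not taken from the dataset):

# Hypothetical example tweet, for illustration only.
sample = "RT @BernieSanders: Healthcare is a RIGHT! https://t.co/xyz #MedicareForAll"
print(clean(sample))
# -> "rt healthcare is a right medicareforall"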
Data Processing
  • Stemming and Lemmatization.
  • Remove stopwords: Gensim’s STOPWORDS is a built-in collection of terms considered irrelevant or likely to clutter our bag-of-words. In NLP, "stopwords" are terms we want to exclude from our model. We will use this collection to filter these unnecessary terms out of the tweets (a quick look at its contents follows below).
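A quick look at this collection (counts are indicative and may vary slightly across Gensim versions):

# Inspect Gensim's built-in stopword collection (a frozenset of common English terms).
from gensim.parsing.preprocessing import STOPWORDS
print(len(STOPWORDS))            # roughly 300+ entries
print(sorted(STOPWORDS)[:10])    # e.g., ['a', 'about', 'above', ...]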
import warnings 
warnings.simplefilter("ignore")
import gensim
from gensim.utils import simple_preprocess
'''
The simple_preprocess function from the Gensim library is a helpful tool for NLP 
tasks. It prepares text data by converting it into a list of lowercase words 
(tokens), removing special characters, and filtering out words that are too short 
or too long. This standardization makes it easier to work with text data in 
subsequent NLP processes.
'''
from gensim.parsing.preprocessing import STOPWORDS as stopwords
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer as lemm, SnowballStemmer as stemm
from nltk.stem.porter import *
import numpy as np
np.random.seed(0)
stemmer = stemm(language="english")

#  function that lemmatizes and stems.
def lemm_stemm(txt):
    return stemmer.stem(lemm().lemmatize(txt, pos="v"))

# function that removes stopwords, lemmatizes and stems
def preprocess(txt):
    r = [lemm_stemm(token) for token in simple_preprocess(txt) if token not in stopwords and len(token) > 2]
    '''
    This line uses a list comprehension to build a list r of processed tokens 
    (words) from the input text.
    
    It calls simple_preprocess(txt) to break the text (txt) down into individual 
    tokens, keeps only tokens that are not stopwords and are longer than two 
    characters, and passes each surviving token through lemm_stemm(token), which 
    performs lemmatization and stemming.
    '''
    return r

# Assign cleaned and prepared documents to a new variable, proc_docs.
proc_docs = df.text.apply(preprocess)
proc_docs
0                                                  [text]
1       [greggonzalez, parti, sob, obama, win, tear, f...
2       [mcuban, yes, healthcar, human, right, famili,...
3       [mcuban, privat, doctor, work, outsid, charg, ...
4       [mollycrabappl, lsarsour, sensand, ilhan, aoc,...
                              ...                        
3496                                         [yes, trump]
3497     [levibullen, address, impuls, sure, studi, redu]
3498                   [quit, play, victim, real, victim]
3499    [bsptx, yeezyeezi, ellyngail, petercoffin, rea...
3500    [seditio, hyapatiale, yup, nra, control, calif...
Name: text, Length: 3501, dtype: object
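As a quick illustration, applying preprocess() to a hypothetical sentence (not from the dataset) shows the combined effect of stopword removal, lemmatization and stemming; the exact tokens produced depend on the NLTK and Gensim versions installed.

# Hypothetical input, for illustration only.
print(preprocess("Voters are running campaigns about healthcare policies"))
# -> something like ['voter', 'run', 'campaign', 'healthcar', 'polici']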
Constructing Dictionary

A Dictionary (in Gensim’s LDA workflow) is a mapping between integer ids and all the unique terms that occur throughout our collection of documents.

# Using gensim’s corpora package to construct the dictionary.
dictionary = gensim.corpora.Dictionary(proc_docs)

dictionary.filter_extremes(no_below=5, no_above= .90)
'''
filter_extremes() is a method used to remove very rare words (those that appear in 
very few documents) and very common words (those that appear in too many documents). 
These extreme cases can often be uninformative or even detrimental in text analysis.

no_below=5: Removes words that appear in fewer than 5 documents.

no_above=0.90: Removes words that appear in more than 90% of the documents.
'''

print("Dictionary length:", len(dictionary))
items = list(dictionary.items())  # Convert dictionary to a list of key-value pairs
print(items[:100])  # Print the first 100 items 

Dictionary length: 972
[(0, 'fear'), (1, 'obama'), (2, 'parti'), (3, 'sad'), (4, 'tear'), (5, 'win'), (6, 'care'), (7, 'deserv'), (8, 'famili'), (9, 'healthcar'), (10, 'human'), (11, 'includ'), (12, 'mcuban'), (13, 'right'), (14, 'yes'), (15, 'charg'), (16, 'doctor'), (17, 'privat'), (18, 'want'), (19, 'work'), (20, 'aoc'), (21, 'gop'), (22, 'here'), (23, 'lsarsour'), (24, 'sensand'), (25, 'truth'), (26, 'amp'), (27, 'educ'), (28, 'elizabeth'), (29, 'law'), (30, 'tell'), (31, 'today'), (32, 'warren'), (33, 'world'), (34, 'ggreenwald'), (35, 'open'), (36, 'primari'), (37, 'state'), (38, 'tulsigabbard'), (39, 'immigr'), (40, 'love'), (41, 'one'), (42, 'real'), (43, 'teach'), (44, 'way'), (45, 'agre'), (46, 'fail'), (47, 'thank'), (48, 'ball'), (49, 'leav'), (50, 'like'), (51, 'bankrupt'), (52, 'fuck'), (53, 'go'), (54, 'secur'), (55, 'social'), (56, 'point'), (57, 'republican'), (58, 'sound'), (59, 'talk'), (60, 'biden'), (61, 'expect'), (62, 'person'), (63, 'presid'), (64, 'run'), (65, 'year'), (66, 'coup'), (67, 'feel'), (68, 'free'), (69, 'time'), (70, 'awjedward'), (71, 'dictat'), (72, 'make'), (73, 'sens'), (74, 'fool'), (75, 'gov'), (76, 'interest'), (77, 'unfortun'), (78, 'clear'), (79, 'funni'), (80, 'jeremycorbyn'), (81, 'statement'), (82, 'sorri'), (83, 'campaign'), (84, 'cut'), (85, 'middl'), (86, 'month'), (87, 'promis'), (88, 'tax'), (89, 'petebuttigieg'), (90, 'second'), (91, 'wait'), (92, 'bolivia'), (93, 'liber'), (94, 'militari'), (95, 'peac'), (96, 'play'), (97, 'role'), (98, 'dont'), (99, 'malagrav')]
Developing Model

Bag-of-words (BoW) is a representation in which each document is broken down into a list of pairs, each pair consisting of a term’s identifier and the number of times that term occurs in the document.
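A minimal illustration of this representation, using the dictionary built above on a hypothetical token list (the ids shown correspond to the dictionary entries printed earlier):

# Hypothetical token list; each pair in the output is (term id, count in the document).
example_doc = ["healthcar", "right", "healthcar", "famili"]
print(dictionary.doc2bow(example_doc))
# -> [(8, 1), (9, 2), (13, 1)]   # 8='famili', 9='healthcar', 13='right'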

'''
This code is preparing and training an LDA model to discover 5 topics from a 
preprocessed corpus of documents (proc_docs).

First, the documents are converted into a Bag-of-Words format (bow).

Then, the LdaMulticore model is initialized and trained on the BoW data, using 5 
topics, iterating over the data 2 times, and utilizing 2 workers for parallelization.

The resulting lda model can then be used to inspect the topics, assign topics to 
new documents, or infer word-topic distributions.
'''
n = 5 # Number of clusters we want to fit our data into.
bow = [dictionary.doc2bow(doc) for doc in proc_docs]
lda = gensim.models.LdaMulticore(bow, num_topics= n, id2word=dictionary, passes=2, workers=2)
'''
Parameters:
bow: The corpus represented in the Bag-of-Words format. This is the input 
data (i.e., the list of documents in BoW form).

num_topics=n: The number of topics (clusters) to find in the corpus. In 
this case, it's set to n = 5, meaning the model will attempt to group the documents 
into 5 distinct topics.

id2word=dictionary: The dictionary that maps word IDs to actual words, 
helping interpret the topics.

passes=2: The number of passes (iterations) over the entire corpus. More 
passes may improve the model but will take longer to train. Here, it will run 
through the data twice.

workers=2: This specifies the number of worker threads to use for parallel processing, speeding up the model training process.
'''

print(bow[:20])

# View clusters
'''
print_topics() is a method in Gensim's LDA model that displays the words and 
their importance (weights) for each topic.
The method's parameter specifies the number of topics to print: 
    If you provide a positive integer (e.g., 3), it will display that many topics.
    If you use -1, it tells the method to print all topics in the model.
'''
for id, topic in lda.print_topics(-1):
    print() 
    print(f"TOPIC: {id} \n WORDS: {topic}")
[[], [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)], [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(34, 1), (35, 1), (36, 1), (37, 1), (38, 1)], [(39, 1)], [(18, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1)], [(8, 1), (45, 1), (46, 1)], [(47, 1)], [(48, 1), (49, 1), (50, 1)], [(51, 1), (52, 1), (53, 2), (54, 1), (55, 1)], [(37, 1)], [(50, 1), (56, 1), (57, 1), (58, 1), (59, 1)], [(30, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)], [(66, 1), (67, 1), (68, 1), (69, 1)], [(50, 1), (70, 1), (71, 1), (72, 1), (73, 2)], [(63, 1), (74, 1), (75, 1), (76, 1), (77, 1)], [(42, 1), (78, 1), (79, 1), (80, 1), (81, 1)]]

TOPIC: 0 
 WORDS: 0.021*"cenkuygur" + 0.020*"anakasparian" + 0.019*"realdonaldtrump" + 0.018*"krystalbal" + 0.016*"emmavigeland" + 0.015*"like" + 0.012*"johniadarola" + 0.010*"repadamschiff" + 0.009*"believ" + 0.007*"endors"

TOPIC: 1 
 WORDS: 0.022*"berni" + 0.017*"know" + 0.013*"yes" + 0.013*"biden" + 0.012*"joe" + 0.012*"pay" + 0.010*"democrat" + 0.009*"need" + 0.009*"right" + 0.009*"plan"

TOPIC: 2 
 WORDS: 0.018*"realdonaldtrump" + 0.017*"your" + 0.013*"aoc" + 0.012*"peopl" + 0.011*"joe" + 0.010*"like" + 0.010*"presid" + 0.010*"win" + 0.009*"run" + 0.009*"think"

TOPIC: 3 
 WORDS: 0.022*"vote" + 0.020*"trump" + 0.015*"like" + 0.012*"want" + 0.011*"way" + 0.011*"talk" + 0.011*"say" + 0.010*"year" + 0.009*"fan" + 0.008*"xjrh"

TOPIC: 4 
 WORDS: 0.019*"peopl" + 0.019*"ananavarro" + 0.012*"thank" + 0.011*"cenkuygur" + 0.011*"like" + 0.011*"go" + 0.010*"trump" + 0.010*"theyoungturk" + 0.009*"look" + 0.009*"support"
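As noted in the model-training comments above, the trained lda model can also assign a topic mixture to individual documents. A minimal sketch, assuming the bow list and the clean()/preprocess() functions defined earlier (the probabilities shown are purely illustrative):

# Topic mixture for the second document in the corpus (illustrative output).
print(lda.get_document_topics(bow[1]))   # e.g., [(1, 0.62), (3, 0.21), ...]

# For a new, unseen tweet: clean and preprocess it, convert to BoW, then infer.
new_bow = dictionary.doc2bow(preprocess(clean("Healthcare and taxes dominate the debate")))
print(lda.get_document_topics(new_bow))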
Evaluation via Coherence Scoring

Most good machine learning models and applications have a feedback loop, a way to evaluate the model’s performance, scalability, and overall quality. In topic modelling, we use coherence scores to determine how “coherent” our model is. With the “c_v” measure used here, coherence is a float between 0 and 1; the higher the value, the more semantically coherent the topics.

# Eval via coherence scoring
from gensim import corpora, models
from gensim.models import CoherenceModel
from pprint import pprint

# Instance of CoherenceModel is created and assigned to the variable coh.
coh = CoherenceModel(model=lda, texts= proc_docs, dictionary = dictionary, coherence = "c_v")
'''
Parameters:
model=lda: This specifies the trained LDA model for which coherence is being 
calculated. lda is the model created above.

texts=proc_docs: This is the list of tokenized and preprocessed documents 
used in training the LDA model. It helps the coherence model understand the context 
of the topics.

dictionary=dictionary: The dictionary used to map word IDs to actual words. 
This is necessary for interpreting the topics and calculating coherence.

coherence="c_v": This specifies the coherence measure to use. The "c_v" 
coherence measure is based on a sliding window, which considers the top words in 
each topic and calculates coherence based on their co-occurrences in the documents.
'''

coh_lda = coh.get_coherence()
'''
The get_coherence() method is called on the coh object to calculate the 
coherence score of the LDA model. This score indicates how semantically meaningful 
the topics are. High scores (range 0 - 1) generally indicate better topic coherence.
'''

print("Coherence Score:", coh_lda) 
Coherence Score: 0.3434103753069543
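Because coherence yields a single comparable number, one common feedback loop is to retrain the model for several values of num_topics and keep the best-scoring one. A hedged sketch, reusing the bow, dictionary and proc_docs objects built above (the scores it prints are not part of the original exhibit):

# Illustrative sweep over candidate topic counts; results vary from run to run.
for k in (3, 5, 8, 10):
    m = gensim.models.LdaMulticore(bow, num_topics=k, id2word=dictionary, passes=2, workers=2)
    score = CoherenceModel(model=m, texts=proc_docs, dictionary=dictionary, coherence="c_v").get_coherence()
    print(f"num_topics={k}: coherence={score:.3f}")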
Visualize with pyLDAvis
# !pip install pyLDAvis ... install pyLDAvis if not installed
import pyLDAvis.gensim as pyldavis
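# Note: in newer pyLDAvis releases (3.x and later) this submodule was renamed;
# there you would use: import pyLDAvis.gensim_models as pyldavis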
import pyLDAvis

# Create a visualization data structure for the LDA model.
lda_display = pyldavis.prepare(lda, bow, dictionary)

'''Render the interactive visualization in a Jupyter Notebook or other 
interactive environments.
'''
pyLDAvis.display(lda_display) 
'''
Topic size: The size of each circle represents the prevalence of that topic in the 
corpus. This is calculated based on the proportion of words in the corpus that are 
assigned to that topic.

Axes: represent the 2D projection of the topics in a high-dimensional space. 
Specifically, the axes are derived from Principal Component Analysis (PCA) or 
Multidimensional Scaling (MDS) to visualise the distance between topics.

Proximity of circles: Topics that are closer together are more similar in terms of 
word distribution.
'''
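If the inline notebook display is not available (for example, when running the code as a plain script), pyLDAvis can also write the same interactive view to a standalone HTML file; a minimal sketch using pyLDAvis.save_html (the file name is arbitrary):

# Alternative to inline display: save the interactive visualization to an HTML file.
pyLDAvis.save_html(lda_display, "lda_topics.html")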
Topic Modelling - Visualization    

Exhibit 25.40 Topic modelling political tweets with LDA in Python. Download this notebook to follow along with the Python implementation on Jupyter.

