As mentioned in the Section Text Data Analysis Process, text data processing involves cleaning, tokenizing and tagging to extract hashtags and keywords such as noun phrases. Exhibit 25.20 provides the Python code for these processes. The code takes some Facebook data stored in a text file named fb_data.txt and extracts items of interest such as hashtags and keywords (noun phrases). Here is a breakdown of what it does:

Read: It opens the fb_data.txt file and reads its contents into a variable called text.

Clean: It then performs some basic cleaning:
- Removes leading and trailing whitespace using strip().
- Removes punctuation using re.sub. This also removes the hashtags, which are therefore extracted from the uncleaned text at a later stage.
- Converts the text to lower case using lower().

Tokenize: It uses the nltk library to split the cleaned text (text) into individual words or phrases called "tokens". This is done with nltk.word_tokenize. It then prints the first 100 tokens and reports that there are a total of 21888 tokens.

Extract hashtags: It uses re.findall to search the original text for any text starting with "#" followed by word characters (\w+), and stores the matches in a list called hashtags. It then prints the extracted hashtags, which in this case are ['Kindle', '2', 'Kindle', 'books', 'reading'].

Tag: It downloads the averaged_perceptron_tagger module from nltk, and then uses nltk.pos_tag to tag each token with its part of speech (e.g., noun, verb, adjective). The output is a list of tuples, where each tuple contains a word and its corresponding tag (token, tag).

Extract keywords: It creates a tuple lst_pos containing the tags for nouns (NN), adjectives (JJ) and verbs (VB). It then iterates through the list of token-tag tuples (pos) and checks whether each tag starts with any of the tags in lst_pos. If it does, the word (token) is added to a new list called keywords. Finally, it prints the list of extracted keywords, which includes words like "drug", "runners", "senator", "read", "story", etc.

Extract noun phrases: It defines a grammar "NP: {<DT>?<JJ>*<NN>}" using regular expressions. This pattern says that a noun phrase (NP) consists of an optional determiner (<DT>?), followed by zero or more adjectives (<JJ>*), and ending with a noun (<NN>). It then creates a parser object cp using nltk.RegexpParser with the defined grammar, and calls cp.parse(pos) to create a parse tree from the tokens and their tags; the tree structure shows how words are connected grammatically. It then iterates through the subtrees (branches) of the parse tree and checks whether each subtree's label is "NP" (noun phrase). Noun phrases containing more than one word are saved in a list called result, and the code prints the first 100 noun phrases, which include "the murder", "the state", "this 19th century", etc.
import re # regular expressions - used for data cleaning
import nltk # natural language toolkit
import pandas as pd # panel data analysis/python data analysis
import numpy as np # numeric python
# Open fb_data.txt file and assign to the variable f
with open('data/fb_data.txt') as f:
# read file f and assign the resulting string to variable text
text = f.read()
print(text[:500] + "...") # print the first 500 characters in text
Drug Runners and a U.S. Senator have something to do with the Murder http://www.amazon.com/Circumstantial-Evidence-Getting-Florida-Bozarth-ebook/dp/B004FPZ452/ref=pd_rhf_p_t_1 The State Attorney Knows... NOW So Will You. GET Ypur Copy TODAY Heres a single, to add, to Kindle. Just read this 19th century story: "The Ghost of Round Island". Its about a man (French/American Indian) and his dog sled transporting a woman across the ice, from Mackinac Island to Cheboygan - and the ghost that... If you...
# Clean (basic cleaning)
text0 = text.strip() # remove whitespaces. Save in text0 to retrieve hashtags.
text = re.sub(r'[^\w\s]','',text0) # remove punctuations. Also removes hashtags.
text = text.lower() # convert to lower case
# Tokenize
tokens = nltk.word_tokenize(text)
print(len(tokens), " tokens\n")
print("Tokens (100): \n", tokens[:100]) # print the first 100 tokens
21888 tokens Tokens (100): ['drug', 'runners', 'and', 'a', 'us', 'senator', 'have', 'something', 'to', 'do', 'with', 'the', 'murder', 'httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'the', 'state', 'attorney', 'knows', 'now', 'so', 'will', 'you', 'get', 'ypur', 'copy', 'today', 'heres', 'a', 'single', 'to', 'add', 'to', 'kindle', 'just', 'read', 'this', '19th', 'century', 'story', 'the', 'ghost', 'of', 'round', 'island', 'its', 'about', 'a', 'man', 'frenchamerican', 'indian', 'and', 'his', 'dog', 'sled', 'transporting', 'a', 'woman', 'across', 'the', 'ice', 'from', 'mackinac', 'island', 'to', 'cheboygan', 'and', 'the', 'ghost', 'that', 'if', 'you', 'tire', 'of', 'nonfiction', 'check', 'out', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0', 'ghost', 'of', 'round', 'island', 'is', 'supposedly', 'nonfiction', 'why', 'is', 'barnes', 'and', 'nobles', 'version', 'of', 'the', 'kindle', 'so', 'much', 'more', 'expensive', 'than', 'the', 'kindle']
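The effect of the cleaning steps can be traced on a short sample. A minimal sketch, assuming a hypothetical input string (not taken from fb_data.txt):

```python
import re

sample = "  GET Ypur Copy TODAY! #Kindle  "  # hypothetical sample line
text0 = sample.strip()                       # trim surrounding whitespace
text = re.sub(r'[^\w\s]', '', text0)         # drop punctuation, including '#'
text = text.lower()                          # normalize case
print(text)  # get ypur copy today kindle
```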
# Extract Hashtags
hashtags = re.findall(r"#(\w+)", text0)
print("Hashtags: \n", hashtags) # print hashtags
Hashtags: ['Kindle', '2', 'Kindle', 'books', 'reading']
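The hashtag pattern can also be checked in isolation; the sample string below is hypothetical:

```python
import re

# hypothetical sample post containing hashtags
sample = "Reading on my #Kindle today. #books #reading"
hashtags = re.findall(r"#(\w+)", sample)  # capture the word after each '#'
print(hashtags)  # ['Kindle', 'books', 'reading']
```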
# Tag (POS Tagging)
nltk.download('averaged_perceptron_tagger') # download the module
# tag the tokens based on their syntactic categories and grammatical roles.
pos = nltk.pos_tag(tokens) # list of tuples
print("POS tuples (50):\n", pos[:50]) # output first 50 POS tuples (token, tag)
POS tuples (50): [('drug', 'NN'), ('runners', 'NNS'), ('and', 'CC'), ('a', 'DT'), ('us', 'PRP'), ('senator', 'NN'), ('have', 'VBP'), ('something', 'NN'), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), ('the', 'DT'), ('murder', 'NN'), ('httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'VBD'), ('the', 'DT'), ('state', 'NN'), ('attorney', 'NN'), ('knows', 'NNS'), ('now', 'RB'), ('so', 'RB'), ('will', 'MD'), ('you', 'PRP'), ('get', 'VB'), ('ypur', 'JJ'), ('copy', 'NN'), ('today', 'NN'), ('heres', 'VBZ'), ('a', 'DT'), ('single', 'JJ'), ('to', 'TO'), ('add', 'VB'), ('to', 'TO'), ('kindle', 'VB'), ('just', 'RB'), ('read', 'VB'), ('this', 'DT'), ('19th', 'JJ'), ('century', 'NN'), ('story', 'NN'), ('the', 'DT'), ('ghost', 'NN'), ('of', 'IN'), ('round', 'NN'), ('island', 'NN'), ('its', 'PRP$'), ('about', 'IN'), ('a', 'DT'), ('man', 'NN'), ('frenchamerican', 'JJ'), ('indian', 'JJ')]
# Extract Keywords - Nouns, Verbs and Adjectives
# Filter tokens with tags of noun, verb, adjective
lst_pos = ('NN','JJ','VB')
keywords = [] # initialize keywords list
for tup in pos: # for each (token, tag) tuple
# if the tag, i.e. tup[1], starts with either 'NN','JJ' or 'VB'
if tup[1].startswith(lst_pos):
# then add the token to the list of keywords
keywords.append(tup[0])
'''
The above loop can be reduced to a single line using a list comprehension:
keywords = [tup[0] for tup in pos if tup[1].startswith(lst_pos)]
'''
print("Keywords - Nouns, Verbs and Adjectives: \n", keywords) # print the keywords
Keywords - Nouns, Verbs and Adjectives: ['drug', 'runners', 'senator', 'have', 'something', 'do', 'murder', 'httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'state', 'attorney', 'knows', 'get', 'ypur', 'copy', 'today', 'heres', 'single', 'add', 'kindle', 'read', '19th', 'century', 'story', 'ghost', 'round', 'island', 'man', 'frenchamerican', 'indian', 'dog', 'sled', 'transporting', 'woman', 'ice', 'mackinac', 'island', 'cheboygan', 'ghost', 'tire', 'nonfiction', 'check', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0', 'ghost', 'round', 'island', 'is', 'supposedly', 'nonfiction', 'is', 'barnes', 'nobles', 'version', 'kindle', 'expensive', 'kindle', 'maria', 'do', 'mean', 'nook', 'be', 'careful', 'books', 'buy', 'kindle', 'are', 'piece', 'electronics', 'vice', 'versa', 'i', 'love', 'kindle', 'are', 'people', 'swear', 'nook', 'color', 'screenme', 'i', 'want', 'ereader', 'is', 'reader', 'i', 'dont', 'need', 'color', 'kindle', 'battery', 'lasts', 'longer', 'unit', 'isnt', 'heavy', 'make', 'difference', 'reading', 'few', 'hours', 'kindle']
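Note that str.startswith accepts a tuple of prefixes, which is why related tags such as NNS (plural noun), VBD (past-tense verb) or JJR (comparative adjective) also pass the filter. A quick sketch:

```python
lst_pos = ('NN', 'JJ', 'VB')

# startswith() with a tuple returns True if the string starts with any prefix
print('NNS'.startswith(lst_pos))  # True  - plural noun matches 'NN'
print('VBD'.startswith(lst_pos))  # True  - past-tense verb matches 'VB'
print('RB'.startswith(lst_pos))   # False - adverbs are filtered out
```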
# Extract Noun Phrases
grammar = "NP: {<DT>?<JJ>*<NN>}" # regular expression pattern
'''
The regular expression pattern "NP: {<DT>?<JJ>*<NN>}" is a syntactic
pattern defined in the context of chunking or parsing text using POS tags. This specific
pattern is often used in NLP tasks, particularly in syntactic parsing or chunking, to
identify noun phrases (NP).
Here's a breakdown of what each component in the pattern represents:
"NP:" is a label for the pattern. It indicates that the following pattern is intended to
match noun phrases.
"{...}" denotes the beginning and end of the chunk pattern definition.
Within the curly braces, the pattern components are defined as follows:
"<DT>": Matches a determiner. In English grammar, determiners include words like
"a", "an", "the", "this", "that", etc.
"?": Indicates that the preceding element (in this case, "<DT>") is optional. This
means that a noun phrase may or may not contain a determiner.
"<JJ>*": Matches zero or more adjectives. "<JJ>" represents an adjective, and
the asterisk "*" indicates zero or more occurrences of adjectives. This allows for
noun phrases that may have multiple adjectives preceding the noun.
"<NN>": Matches a singular noun. This part ensures that the noun phrase contains at
least one noun.
Therefore, the overall pattern "NP: {<DT>?<JJ>*<NN>}" describes a noun phrase that
can optionally start with a determiner, followed by zero or more adjectives, and ending
with a singular noun. This pattern can match noun phrases such as "the big house", "a
black cat", "this old book", etc.
'''
# Generate a parse tree from the POS tagged list using the regular expression grammar.
cp = nltk.RegexpParser(grammar) # cp is a nltk.RegexpParser object created using grammar.
tree = cp.parse(pos) # parse() method generates a parse tree for pos
print("Tree (20): \n", tree[:20], "\n")
result = []
'''
Iterate over all subtrees of 'tree' that match the condition specified by the lambda
function, which checks if the label of the subtree is 'NP' (noun phrase).
'''
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
# Extract noun phrases. (Exclude single words)
'''
Each subtree representing a noun phrase is processed. If the length of the leaves
(terminal nodes) of the subtree is greater than 1, it indicates that the subtree
contains more than one word, hence it's a valid noun phrase. The words of the noun
phrase are extracted and joined together to form a string, which is then appended
to the result list.
'''
if(len(subtree.leaves())>1):
outputs = [tup[0] for tup in subtree.leaves()] # Extract words into list 'outputs'
outputs = " ".join(outputs) # string the words together
result.append(outputs)
print("Noun Phrases (100): \n", result[:100])
Tree (20): [Tree('NP', [('drug', 'NN')]), ('runners', 'NNS'), ('and', 'CC'), ('a', 'DT'), ('us', 'PRP'), Tree('NP', [('senator', 'NN')]), ('have', 'VBP'), Tree('NP', [('something', 'NN')]), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), Tree('NP', [('the', 'DT'), ('murder', 'NN')]), ('httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'VBD'), Tree('NP', [('the', 'DT'), ('state', 'NN')]), Tree('NP', [('attorney', 'NN')]), ('knows', 'NNS'), ('now', 'RB'), ('so', 'RB'), ('will', 'MD'), ('you', 'PRP')] Noun Phrases (100): ['the murder', 'the state', 'ypur copy', 'this 19th century', 'the ghost', 'a man', 'a woman', 'the ice', 'the ghost', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0 ghost', 'supposedly nonfiction', 'the nook', 'the kindle', 'that piece', 'the nook', 'the color', 'an ereader', 'a reader', 'the kindle', 'the unit', 'a difference', 'a bad idea', 'big name', 'the market', 'a huge factor', 'the money', 'the menu button', 'a kindle', 'a book', 'the patience', 'im gon', 'new book', 'a fan', 'the kindle book', 'tick toc', 'each chapter', 'the real dealmy', 'each page', 'real big textthank', 'the new york', 'a kindle', 'a single anyone', 'the time', 'the kindle', 'the update', 'any difference', 'a fan', 'oen way', 'the time', 'the original kindle', 'the dust', 'this kindle', 'this mystery', 'a thing', 'this game', 'this straightkindle', 'a gaming', 'the shark', 'the money', 'the kindle app', 'p yes', 'either text', 'simple time', 'ive heard', 'the one', 'favorite try', 'triple town', 'the beginning', 'the middle', 'the end', 'every word', 'any standalone', 'interactive fiction', 'wifi kindle', 'a book', 'the option', 'triple town', 'the fact', 'an airplane', 'a car', 'those fancy schmancy', 'the kindle', 'an option', 'the experience', 'any sense yes', 'the ipad', 'kindle love triple', 'the kindle', 'much fun', 'the original i', 'a ereader', 'ok i bit', 'wont turn', 'this thing', 
'no one', 'kindle old school', 'beloved i', 'awesome i', 'favorite thing', 'kindle love']
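To see the chunk grammar in action on its own, it can be run on a small hand-tagged sentence (the tagged input below is hypothetical):

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"  # same pattern as above
cp = nltk.RegexpParser(grammar)

# hand-tagged tokens; no tagger download is needed for chunking alone
tagged = [('the', 'DT'), ('big', 'JJ'), ('house', 'NN'), ('sold', 'VBD')]
tree = cp.parse(tagged)

# collect the words of every NP subtree
nps = [" ".join(tup[0] for tup in st.leaves())
       for st in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(nps)  # ['the big house']
```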