Text Data Processing in Python

As mentioned in the section Text Data Analysis Process, text data processing involves cleaning, tokenizing, and tagging text in order to extract hashtags and keywords such as noun phrases.

Exhibit 25.20 provides the Python code for achieving these processes.

The code reads Facebook posts stored in a text file named fb_data.txt and extracts hashtags and keywords (noun phrases). Here is a breakdown of what it does:

  1. Opening the File and Cleaning Up Text: The code opens the fb_data.txt file and reads its contents into a variable called text. It then performs some basic cleaning:
    • Removes leading and trailing whitespaces using strip().
    • Removes punctuation using regular expressions (re.sub). This also removes hashtags at this stage, which we will handle later.
    • Converts all letters to lowercase using lower().
  2. Splitting Text into Words (Tokenization): The code uses the nltk library to split the cleaned text (text) into individual units called “tokens”. This is done with nltk.word_tokenize, which relies on NLTK’s punkt tokenizer models. It then prints the first 100 tokens and reports a total of 21888 tokens.
  3. Extracting Hashtags: Because the punctuation removal in step 1 also stripped the “#” symbols, the code searches the pre-cleaning text (text0) instead. It uses re.findall to find any “#” followed by one or more word characters (\w+) and stores the captured matches in a list called hashtags. It then prints the extracted hashtags, which in this case are ['Kindle', '2', 'Kindle', 'books', 'reading'].
  4. Tagging Words with Parts of Speech (POS): This part involves understanding the grammatical role of each word. The code downloads a tagging module called averaged_perceptron_tagger from nltk. It then uses nltk.pos_tag to tag each token with its part of speech (e.g., noun, verb, adjective). The output is a list of tuples where each tuple contains a word and its corresponding tag (token, tag).
    It prints the first 50 of these tuples, showing words like “drug” with tag “NN” (noun), “runners” with tag “NNS” (plural noun), etc.
  5. Extracting Keywords (Nouns, Verbs, and Adjectives): Here, we want to find words that describe things (nouns), actions (verbs), or qualities (adjectives). The code defines a tuple called lst_pos containing the tag prefixes for nouns (NN), adjectives (JJ), and verbs (VB). It then iterates through the list of token-tag tuples (pos) and checks whether each tag starts with one of these prefixes. If it does, the token is added to a new list called keywords. Finally, it prints the list of extracted keywords, which includes words like “drug”, “runners”, “senator”, “read”, “story”, etc.
  6. Extracting Noun Phrases: This part focuses on finding groups of words that act as nouns together, like “the murder” or “a new book”. The code defines a chunk grammar "NP: {<DT>?<JJ>*<NN>}". This pattern says a noun phrase (NP) is an optional determiner (<DT>?), followed by zero or more adjectives (<JJ>*), ending in a singular noun (<NN>). It then creates a parser object cp using nltk.RegexpParser with the defined grammar.
    The code uses cp.parse(pos) to create a parse tree based on the tokens and their tags. Imagine the tree structure showing how words are connected grammatically. It then iterates through subtrees (branches) of the parse tree and checks if the subtree’s label is “NP” (noun phrase).
    If it is a noun phrase with more than one word (to avoid single words), the code extracts the individual words from the subtree and joins them with spaces to form a string.
    Finally, it adds this extracted noun phrase to a list called result, and prints the first 100 noun phrases, which include “the murder”, “the state attorney”, “a 19th century story”, etc.
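Steps 1 and 3 interact: stripping punctuation destroys the “#” markers, so hashtags must be pulled from the pre-cleaning text. A minimal sketch (using a made-up snippet in place of fb_data.txt) illustrates the ordering:

```python
import re

raw = "  GET Ypur Copy TODAY! #Kindle #books  "
text0 = raw.strip()                      # strip whitespace but keep punctuation
hashtags = re.findall(r"#(\w+)", text0)  # grab hashtags before removing '#'
clean = re.sub(r"[^\w\s]", "", text0).lower()  # now '#' (and '!') are gone

print(hashtags)  # ['Kindle', 'books']
print(clean)     # get ypur copy today kindle books
```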
Import Libraries and Read Facebook Data
import re  # regular expressions - used for data cleaning 
import nltk  # natural language toolkit
import pandas as pd  # panel data analysis/python data analysis 
import numpy as np  # numeric python

# Open fb_data.txt file and assign to the variable f
with open('data/fb_data.txt') as f: 
    # read file f and assign the resulting string to variable text
    text = f.read()  
    print(text[:500] + "...")  # print the first 500 characters in text    
Drug Runners and  a U.S. Senator have something to do with the Murder http://www.amazon.com/Circumstantial-Evidence-Getting-Florida-Bozarth-ebook/dp/B004FPZ452/ref=pd_rhf_p_t_1 The State Attorney Knows... NOW So Will You. GET Ypur Copy TODAY
Heres a single, to add, to Kindle. Just read this 19th century story: "The Ghost of Round Island". Its about a man (French/American Indian) and his dog sled transporting a woman across the ice, from Mackinac Island to Cheboygan - and the ghost that...
If you...
Clean
# Clean (basic cleaning) 
text0 = text.strip()  # remove leading/trailing whitespace. Save as text0 to retrieve hashtags later.
text = re.sub(r'[^\w\s]','',text0)  # remove punctuation. This also removes the '#' in hashtags.
text = text.lower()  # convert to lower case

# Tokenize
nltk.download('punkt')  # word_tokenize requires the 'punkt' tokenizer models
tokens = nltk.word_tokenize(text)
print(len(tokens), " tokens\n")
print("Tokens (100): \n", tokens[:100]) # print the first 100 tokens
21888  tokens

Tokens (100):
['drug', 'runners', 'and', 'a', 'us', 'senator', 'have', 'something', 'to', 'do', 'with', 'the', 'murder', 'httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'the', 'state', 'attorney', 'knows', 'now', 'so', 'will', 'you', 'get', 'ypur', 'copy', 'today', 'heres', 'a', 'single', 'to', 'add', 'to', 'kindle', 'just', 'read', 'this', '19th', 'century', 'story', 'the', 'ghost', 'of', 'round', 'island', 'its', 'about', 'a', 'man', 'frenchamerican', 'indian', 'and', 'his', 'dog', 'sled', 'transporting', 'a', 'woman', 'across', 'the', 'ice', 'from', 'mackinac', 'island', 'to', 'cheboygan', 'and', 'the', 'ghost', 'that', 'if', 'you', 'tire', 'of', 'nonfiction', 'check', 'out', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0', 'ghost', 'of', 'round', 'island', 'is', 'supposedly', 'nonfiction', 'why', 'is', 'barnes', 'and', 'nobles', 'version', 'of', 'the', 'kindle', 'so', 'much', 'more', 'expensive', 'than', 'the', 'kindle']
Extract Hashtags
# Extract Hashtags
hashtags = re.findall(r"#(\w+)", text0)
print("Hashtags: \n", hashtags) # print hashtags
Hashtags:
['Kindle', '2', 'Kindle', 'books', 'reading']
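A note on the pattern itself: re.findall returns only the parenthesised group, so the “#” is excluded from the results, and \w+ matches digits as well as letters, which is why '2' appears as a hashtag above. A small illustration on an invented string:

```python
import re

sample = "New on #Kindle: #2 great #books for #reading"
# findall returns the captured group only, so '#' is dropped;
# \w+ matches digits too, so '#2' yields '2'.
print(re.findall(r"#(\w+)", sample))  # ['Kindle', '2', 'books', 'reading']
```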
Tag (POS Tagging)
# Tag (POS Tagging)
nltk.download('averaged_perceptron_tagger')  # download the module

# tag the tokens based on their syntactic categories and grammatical roles.
pos = nltk.pos_tag(tokens)  # list of tuples
print("POS tuples (50):\n", pos[:50]) # output first 50 POS tuples (token, tag) 
POS tuples (50):
[('drug', 'NN'), ('runners', 'NNS'), ('and', 'CC'), ('a', 'DT'), ('us', 'PRP'), ('senator', 'NN'), ('have', 'VBP'), ('something', 'NN'), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), ('the', 'DT'), ('murder', 'NN'), ('httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'VBD'), ('the', 'DT'), ('state', 'NN'), ('attorney', 'NN'), ('knows', 'NNS'), ('now', 'RB'), ('so', 'RB'), ('will', 'MD'), ('you', 'PRP'), ('get', 'VB'), ('ypur', 'JJ'), ('copy', 'NN'), ('today', 'NN'), ('heres', 'VBZ'), ('a', 'DT'), ('single', 'JJ'), ('to', 'TO'), ('add', 'VB'), ('to', 'TO'), ('kindle', 'VB'), ('just', 'RB'), ('read', 'VB'), ('this', 'DT'), ('19th', 'JJ'), ('century', 'NN'), ('story', 'NN'), ('the', 'DT'), ('ghost', 'NN'), ('of', 'IN'), ('round', 'NN'), ('island', 'NN'), ('its', 'PRP$'), ('about', 'IN'), ('a', 'DT'), ('man', 'NN'), ('frenchamerican', 'JJ'), ('indian', 'JJ')]
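For reference, here is a partial glossary of the Penn Treebank tags appearing in the output above (the full tag set can be browsed with nltk.help.upenn_tagset()):

```python
# Partial glossary of the Penn Treebank tags seen in the POS output.
tag_glossary = {
    'NN':  'noun, singular',
    'NNS': 'noun, plural',
    'DT':  'determiner',
    'JJ':  'adjective',
    'VB':  'verb, base form',
    'VBP': 'verb, present tense (non-3rd person singular)',
    'RB':  'adverb',
    'MD':  'modal',
    'PRP': 'personal pronoun',
    'IN':  'preposition or subordinating conjunction',
    'CC':  'coordinating conjunction',
}
for tag, meaning in tag_glossary.items():
    print(f"{tag:4} {meaning}")
```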
Extract Keywords - Nouns, Verbs and Adjectives
# Extract Keywords - Nouns, Verbs and Adjectives
# Filter tokens with tags of noun, verb, adjective 
lst_pos = ('NN','JJ','VB')

keywords = [] # initialize keywords list
for tup in pos: # for each (token, tag) tuple
    # if the tag, i.e. tup[1], starts with either 'NN','JJ' or 'VB'
    if tup[1].startswith(lst_pos): 
        # then add the token to the list of keywords
        keywords.append(tup[0])  

''' 
The loop above can be reduced to a single list comprehension:
keywords = [tup[0] for tup in pos if tup[1].startswith(lst_pos)]
'''

print("Keywords - Nouns, Verbs and Adjectives: \n", keywords)  # print the keywords
Keywords - Nouns, Verbs and Adjectives:
['drug', 'runners', 'senator', 'have', 'something', 'do', 'murder', 'httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'state', 'attorney', 'knows', 'get', 'ypur', 'copy', 'today', 'heres', 'single', 'add', 'kindle', 'read', '19th', 'century', 'story', 'ghost', 'round', 'island', 'man', 'frenchamerican', 'indian', 'dog', 'sled', 'transporting', 'woman', 'ice', 'mackinac', 'island', 'cheboygan', 'ghost', 'tire', 'nonfiction', 'check', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0', 'ghost', 'round', 'island', 'is', 'supposedly', 'nonfiction', 'is', 'barnes', 'nobles', 'version', 'kindle', 'expensive', 'kindle', 'maria', 'do', 'mean', 'nook', 'be', 'careful', 'books', 'buy', 'kindle', 'are', 'piece', 'electronics', 'vice', 'versa', 'i', 'love', 'kindle', 'are', 'people', 'swear', 'nook', 'color', 'screenme', 'i', 'want', 'ereader', 'is', 'reader', 'i', 'dont', 'need', 'color', 'kindle', 'battery', 'lasts', 'longer', 'unit', 'isnt', 'heavy', 'make', 'difference', 'reading', 'few', 'hours', 'kindle']
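The filtering step relies on str.startswith accepting a tuple of prefixes, so a single call covers NN/NNS/NNP…, JJ/JJR/JJS, and VB/VBD/VBG… . This can be checked on a small hand-tagged list:

```python
# str.startswith accepts a tuple, so one test matches any of the prefixes.
pos = [('drug', 'NN'), ('runners', 'NNS'), ('and', 'CC'),
       ('have', 'VBP'), ('ypur', 'JJ'), ('the', 'DT')]
lst_pos = ('NN', 'JJ', 'VB')

keywords = [tok for tok, tag in pos if tag.startswith(lst_pos)]
print(keywords)  # ['drug', 'runners', 'have', 'ypur']
```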
Extract Noun Phrases
# Extract Noun Phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"  # regular expression pattern
'''
The regular expression pattern "NP: {<DT>?<JJ>*<NN>}" is a syntactic 
pattern defined in the context of chunking or parsing text using POS tags. This specific 
pattern is often used in NLP tasks, particularly in syntactic parsing or chunking, to 
identify noun phrases (NP).

Here's a breakdown of what each component in the pattern represents:
"NP:" is a label for the pattern. It indicates that the following pattern is intended to 
    match noun phrases.
"{...}" denotes the beginning and end of the chunk pattern definition.

Within the curly braces, the pattern components are defined as follows:
    "<DT>": Matches a determiner. In English grammar, determiners include words like 
        "a", "an", "the", "this", "that", etc.
    "?": Indicates that the preceding element (in this case, "<DT>") is optional. This 
        means that a noun phrase may or may not contain a determiner.
    "<JJ>*": Matches zero or more adjectives. "<JJ>" represents an adjective, and 
        the asterisk "*" indicates zero or more occurrences of adjectives. This allows for 
        noun phrases that may have multiple adjectives preceding the noun.
    "<NN>": Matches a singular noun. This part ensures that the noun phrase contains at 
        least one noun.

Therefore, the overall pattern "NP: {<DT>?<JJ>*<NN>}" describes a noun phrase that 
can optionally start with a determiner, followed by zero or more adjectives, and ending 
with a singular noun. This pattern can match noun phrases such as  "the big house", "a 
black cat", "this old book", etc.
'''
# Generate a parse tree from the POS-tagged list using the regular expression grammar. 
cp = nltk.RegexpParser(grammar)  # cp is an nltk.RegexpParser object created from the grammar. 
tree = cp.parse(pos)  # parse() method generates a parse tree for pos
print("Tree (20): \n", tree[:20], "\n")

result = []
'''
Iterate over all subtrees of 'tree' that match the condition specified by the lambda 
function, which checks if the label of the subtree is 'NP' (noun phrase).
'''
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    # Extract noun phrases. (Exclude single words)
    '''
    Each subtree representing a noun phrase is processed. If the length of the leaves 
    (terminal nodes) of the subtree is greater than 1, it indicates that the subtree 
    contains more than one word, hence it's a valid noun phrase. The words of the noun 
    phrase are extracted and joined together to form a string, which is then appended 
    to the result list.
    '''
    if len(subtree.leaves()) > 1:
        outputs = [tup[0] for tup in subtree.leaves()]  # extract words into list 'outputs'
        outputs = " ".join(outputs)  # join the words into a single string
        result.append(outputs)

print("Noun Phrases (100): \n", result[:100])
Tree (20): 
[Tree('NP', [('drug', 'NN')]), ('runners', 'NNS'), ('and', 'CC'), ('a', 'DT'), ('us', 'PRP'), Tree('NP', [('senator', 'NN')]), ('have', 'VBP'), Tree('NP', [('something', 'NN')]), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), Tree('NP', [('the', 'DT'), ('murder', 'NN')]), ('httpwwwamazoncomcircumstantialevidencegettingfloridabozarthebookdpb004fpz452refpd_rhf_p_t_1', 'VBD'), Tree('NP', [('the', 'DT'), ('state', 'NN')]), Tree('NP', [('attorney', 'NN')]), ('knows', 'NNS'), ('now', 'RB'), ('so', 'RB'), ('will', 'MD'), ('you', 'PRP')]

Noun Phrases (100):
['the murder', 'the state', 'ypur copy', 'this 19th century', 'the ghost', 'a man', 'a woman', 'the ice', 'the ghost', 'httpwwwamazoncomsrefnb_sb_nossurlsearchalias3dapsfieldkeywordsdanielleleezwisslerx0y0 ghost', 'supposedly nonfiction', 'the nook', 'the kindle', 'that piece', 'the nook', 'the color', 'an ereader', 'a reader', 'the kindle', 'the unit', 'a difference', 'a bad idea', 'big name', 'the market', 'a huge factor', 'the money', 'the menu button', 'a kindle', 'a book', 'the patience', 'im gon', 'new book', 'a fan', 'the kindle book', 'tick toc', 'each chapter', 'the real dealmy', 'each page', 'real big textthank', 'the new york', 'a kindle', 'a single anyone', 'the time', 'the kindle', 'the update', 'any difference', 'a fan', 'oen way', 'the time', 'the original kindle', 'the dust', 'this kindle', 'this mystery', 'a thing', 'this game', 'this straightkindle', 'a gaming', 'the shark', 'the money', 'the kindle app', 'p yes', 'either text', 'simple time', 'ive heard', 'the one', 'favorite try', 'triple town', 'the beginning', 'the middle', 'the end', 'every word', 'any standalone', 'interactive fiction', 'wifi kindle', 'a book', 'the option', 'triple town', 'the fact', 'an airplane', 'a car', 'those fancy schmancy', 'the kindle', 'an option', 'the experience', 'any sense yes', 'the ipad', 'kindle love triple', 'the kindle', 'much fun', 'the original i', 'a ereader', 'ok i bit', 'wont turn', 'this thing', 'no one', 'kindle old school', 'beloved i', 'awesome i', 'favorite thing', 'kindle love']
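The chunking behaviour of the grammar can be verified on a tiny hand-tagged fragment taken from the output shown earlier; nltk.RegexpParser needs no downloaded corpora, so this runs standalone:

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

# Hand-tagged fragment from the sample data above.
pos = [('the', 'DT'), ('state', 'NN'), ('attorney', 'NN'), ('knows', 'NNS')]
tree = cp.parse(pos)

nps = [" ".join(word for word, tag in st.leaves())
       for st in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(nps)  # ['the state', 'attorney']
```

Note that the chunker matches greedily left to right: "the state" is chunked first (DT NN), leaving "attorney" to form its own single-noun chunk, exactly as in the parse tree printed above.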

Exhibit 25.20   This code demonstrates how to clean, tokenize, tag, and extract valuable information like hashtags and keywords (noun phrases) from a textual data file containing Facebook posts. Jupyter notebook.


