Data Cleaning Techniques

Text cleaning is a fundamental step in preparing data for NLP tasks. Some basic techniques include:

  • Whitespace Removal: Eliminate unnecessary spaces in the text.
  • Punctuation Removal: Remove punctuation to avoid unwanted noise in the data.
  • HTML Tag Removal: Strip out any HTML elements that are irrelevant to the analysis.
  • URL Removal: Exclude links that do not contribute to the meaning of the text.
  • Word Standardization: Normalize repetitive or extended characters (e.g., "coooooool" to "cool").
  • Lowercasing: Convert all text to lowercase for uniformity.
  • Stop Word Removal: Remove common words that do not add significant meaning, such as "the," "and," or "is."

These techniques are illustrated in Exhibit 25.9, a Python example of data cleaning and tokenization.

Import Libraries
import re  # regular expressions - used for cleaning text data
import itertools  # iterator tools - functions that work on iterators
import nltk  # natural language toolkit
from nltk.corpus import stopwords

verbatim = "    Amaaaaaaazing!!! This product is lit 🔥🔥🔥 <br> I can't believe it! http://www.fantasticproduct.com <p>Check it out!</p> 😍💯 #musthave #bestever"
Remove Whitespace
# Remove whitespace:
verbatim = verbatim.strip()
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥 <br> I can't believe it! http://www.fantasticproduct.com <p>Check it out!</p> 😍💯 #musthave #bestever
Remove HTML
# Remove HTML tags:
'''
The re module provides support for regular expressions, which are powerful
tools for pattern matching and string manipulation. Regular expressions let
you search for patterns within strings, perform substitutions, and more.
'''

verbatim = re.sub(r'<[^>]+?>', '', verbatim)   # match anything between < and > (non-greedy)
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥  I can't believe it! http://www.fantasticproduct.com  Check it out! 😍💯 #musthave #bestever
Remove URLs
# Remove URLs:
# Note: this pattern deletes the URL and everything after it up to the end of
# the line, which is why "Check it out!" and the hashtags do not appear below.
verbatim = re.sub(r'https?:\/\/.*[\r\n]*', ' ', verbatim, flags=re.MULTILINE)
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥  I can't believe it!
Remove Punctuation
# Remove punctuation:
verbatim = re.sub(r'[^\w\s]', '', verbatim)     

'''
To understand the statement above, let's break it down:
  • re: The Python module for regular expressions.
  • sub(): The function in the re module used for substitution. It searches for a pattern in a string and replaces every match with a specified replacement.
  • r'[^\w\s]': The pattern, with the following elements:
    • r: Denotes a raw string literal in Python, so backslashes are treated as literal characters rather than escape characters.
    • [^\w\s]: A character class that matches any character that is not a word character (\w: letters, digits, and underscores) or a whitespace character (\s). The ^ inside the square brackets negates the class.
  • '': The replacement, an empty string, so every matched character is removed.
  • verbatim: The string on which the substitution is performed.
Overall, re.sub(r'[^\w\s]', '', verbatim) removes every character that is neither a word character nor whitespace, stripping punctuation and special characters (including, in this example, the emojis).
'''
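
# A quick illustration of the same pattern on a toy string (not part of the
# exhibit; the input "Hello, world!!" is made up for demonstration):
#   re.sub(r'[^\w\s]', '', "Hello, world!!")  ->  'Hello world'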
print(verbatim)
Amaaaaaaazing This product is lit  I cant believe it
Standardise Words
# Standardise words:
# itertools.groupby splits the string into runs of identical characters;
# keeping at most two characters per run collapses elongated words,
# e.g., "coooooool" -> "cool" and "Amaaaaaaazing" -> "Amaazing":
verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))
print(verbatim)
Amaazing This product is lit I cant believe it
Split Attached Words
# Split attached words:
# Split CamelCase words by cutting before each capital letter,
# e.g., re.findall('[A-Z][^A-Z]*', 'MustHave') -> ['Must', 'Have'].
# (This leaves the current example unchanged, as it has no attached words.)
verbatim = ' '.join(re.findall('[A-Z][^A-Z]*', verbatim))
print(verbatim)
Amaazing This product is lit I cant believe it
Lowercase
# Lowercase:
verbatim = verbatim.lower()
print(verbatim)
amaazing this product is lit i cant believe it
Remove Stopwords
# Remove stopwords (requires the stopwords corpus;
# run nltk.download('stopwords') once if it is not installed):
verbatim = ' '.join([word for word in verbatim.split() if word not in stopwords.words('english')])
print(verbatim)
amaazing product lit cant believe
Tokenize
# Tokenize (word_tokenize requires the punkt tokenizer models;
# run nltk.download('punkt') once if they are not installed):
tokens = nltk.word_tokenize(verbatim)
print(tokens)
['amaazing', 'product', 'lit', 'cant', 'believe']

Exhibit 25.9   Example of Data Cleaning and Tokenization using Python. Jupyter notebook.
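
For reuse across verbatims, the individual steps of Exhibit 25.9 can be collected into a single helper function. The sketch below is illustrative rather than part of the exhibit: the function name clean_text is an assumption, the URL pattern is tightened to remove only the URL itself rather than the rest of the line, and the stop words are placed in a set so the list is not rebuilt for every word.

import re
import itertools
import nltk
from nltk.corpus import stopwords

# One-time downloads, if the corpora are not already installed:
# nltk.download('stopwords')
# nltk.download('punkt')

def clean_text(text):
    # Illustrative helper combining the cleaning steps of Exhibit 25.9.
    text = text.strip()                          # whitespace
    text = re.sub(r'<[^>]+?>', '', text)         # HTML tags
    text = re.sub(r'https?://\S+', ' ', text)    # URLs only, not the rest of the line
    text = re.sub(r'[^\w\s]', '', text)          # punctuation and emojis
    # Collapse runs of three or more identical characters to two:
    text = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
    # Split attached CamelCase words (note: this pattern drops any text
    # appearing before the first capital letter, a quirk of the exhibit's approach):
    text = ' '.join(re.findall('[A-Z][^A-Z]*', text))
    text = text.lower()
    stop = set(stopwords.words('english'))       # set membership is O(1)
    return ' '.join(word for word in text.split() if word not in stop)

raw = "    Amaaaaaaazing!!! This product is lit 🔥🔥🔥 <br> I can't believe it! http://www.fantasticproduct.com <p>Check it out!</p> 😍💯 #musthave #bestever"
tokens = nltk.word_tokenize(clean_text(raw))

Because this version removes only the URL itself, the words that follow it ("Check it out!" and the hashtags) survive the URL step and flow through the remaining steps, unlike in the exhibit, where the broader pattern deletes them along with the URL.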

In addition to data cleaning, the example in Exhibit 25.9 also tokenizes the text using the nltk (Natural Language Toolkit) library. This leads us to the next section, Natural Language Processing, where tokenization is covered in detail.

