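Text cleaning is a fundamental step in preparing data for NLP tasks. Basic techniques include stripping surrounding whitespace, removing HTML tags, URLs and punctuation, standardising words with repeated characters, splitting attached words, converting text to lowercase, and removing stopwords.
These techniques are illustrated in Exhibit 25.9 through an example of data cleaning and tokenization using Python.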
import re # regular expressions - used for cleaning text data
import itertools # iterator tools - functions that work on iterators
import nltk # natural language toolkit
from nltk.corpus import stopwords
verbatim = " Amaaaaaaazing!!! This product is lit 🔥🔥🔥 <br> I can't believe it! http://www.fantasticproduct.com <p>Check it out!</p> 😍💯 #musthave #bestever"
# Remove whitespace:
verbatim = verbatim.strip()
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥 <br> I can't believe it! http://www.fantasticproduct.com <p>Check it out!</p> 😍💯 #musthave #bestever
# Remove html:
'''
The re module provides support for regular expressions, which are powerful tools for pattern matching and string manipulation. Regular expressions allow you to search for patterns within strings, perform substitutions, and more.
'''
verbatim = re.sub(r'<[^>]+?>', '', verbatim)
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥 I can't believe it! http://www.fantasticproduct.com Check it out! 😍💯 #musthave #bestever
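# Remove URLs:
# Note: '.*' matches up to the end of the line, so any text that follows the URL on the same line is removed as well.
verbatim = re.sub(r'https?:\/\/.*[\r\n]*', ' ', verbatim, flags=re.MULTILINE)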
print(verbatim)
Amaaaaaaazing!!! This product is lit 🔥🔥🔥 I can't believe it!
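The output above no longer contains "Check it out!" or the hashtags, because the URL pattern matches everything up to the end of the line. The sketch below compares the exhibit's pattern with one possible alternative; the `\S+` pattern and the shortened sample string are assumptions for illustration, not part of the exhibit.

```python
import re

# Shortened sample (an assumption, modelled on the exhibit's verbatim).
text = "I can't believe it! http://www.fantasticproduct.com Check it out! #musthave"

# Pattern used in the exhibit: '.*' runs to the end of the line,
# so the words after the URL are removed together with it.
greedy = re.sub(r'https?:\/\/.*[\r\n]*', ' ', text, flags=re.MULTILINE)

# Alternative (assumption): '\S+' stops at the first whitespace,
# so only the URL itself is replaced.
url_only = re.sub(r'https?://\S+', ' ', text)

print(repr(greedy))
print(repr(url_only))
```

Which behaviour is preferable depends on the data: if the text after a link is usually noise, the exhibit's pattern may be adequate; if it carries sentiment, the narrower pattern keeps it.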
# Remove punctuation:
verbatim = re.sub(r'[^\w\s]', '', verbatim)
'''
To understand the above statement, let's break it down:
- re: This is the Python module for regular expressions, which provides support for working with regular expressions.
- sub(): This is a function in the re module used for substitution. It searches for a pattern in a string and replaces it with a specified string.
- r'[^\w\s]': This is a regular expression pattern with the following elements:
- r: This denotes a raw string literal in Python, which means that backslashes are treated as literal characters and not as escape characters.
- [^\w\s]: This is a character class that matches any character that is not a word character (\w, which includes letters, digits, and underscores) or whitespace character (\s). The ^ inside the square brackets negates the character class, so [^\w\s] matches any character that is not a word character or whitespace character.
- '': This is an empty string, which means that any characters matched by the regular expression pattern will be replaced with nothing (i.e., removed).
- verbatim: This is the string on which the substitution operation is performed.
Overall, the command re.sub(r'[^\w\s]', '', verbatim) removes every character that is neither a word character nor a whitespace character, cleaning the string of punctuation and special characters such as emojis.
'''
print(verbatim)
Amaaaaaaazing This product is lit I cant believe it
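As the output shows, the pattern removes apostrophes and emojis along with the exclamation marks, so "can't" becomes "cant". The following is a minimal sketch of the same substitution next to a variant that keeps apostrophes and the hash sign; the variant pattern and the sample string are assumptions for illustration, not part of the exhibit.

```python
import re

sample = "I can't believe it! 😍💯 #musthave"

# Exhibit's pattern: drop every character that is neither
# a word character (\w) nor whitespace (\s).
stripped = re.sub(r'[^\w\s]', '', sample)

# Variant (assumption): also keep apostrophes and '#',
# so contractions and hashtags survive the cleaning step.
kept = re.sub(r"[^\w\s'#]", '', sample)

print(stripped.split())
print(kept.split())
```

Keeping the apostrophe matters later in the pipeline, since stopword lists typically contain contractions in their apostrophised form.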
# Standardise words:
verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))
print(verbatim)
Amaazing This product is lit I cant believe it
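The standardisation step relies on itertools.groupby, which groups consecutive identical characters; each run is then truncated to at most two characters. A minimal sketch of that idea as a reusable function follows. The function name `squeeze_repeats` and the `max_run` parameter are hypothetical, introduced here only for illustration.

```python
from itertools import groupby

def squeeze_repeats(text, max_run=2):
    """Shorten every run of a repeated character to at most max_run characters."""
    # groupby yields (character, iterator over the consecutive run of that character).
    return ''.join(''.join(run)[:max_run] for _, run in groupby(text))

print(squeeze_repeats("Amaaaaaaazing"))  # elongated word from the exhibit
print(squeeze_repeats("committee"))      # legitimate double letters are left alone
```

A limit of two preserves ordinary double letters, which is why the result is "Amaazing" rather than "Amazing"; mapping it to the dictionary spelling would need a further step such as spell correction.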
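# Split attached words (e.g. 'MustHave' becomes 'Must Have'):
# Note: findall keeps only segments that start with a capital letter, so any text before the first capital is dropped.
verbatim = ' '.join(re.findall('[A-Z][^A-Z]*', verbatim))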
print(verbatim)
Amaazing This product is lit I cant believe it
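The verbatim contains no attached words at this point, so the output is unchanged. The sketch below applies the exhibit's pattern to a hypothetical string that does contain one, alongside an alternative based on zero-width lookarounds; both the sample string and the lookaround pattern are assumptions for illustration, not part of the exhibit.

```python
import re

sample = "this is MustHave"

# Exhibit's approach: collect segments that start with a capital letter.
# Text before the first capital ("this is ") is not matched and is lost.
exhibit_result = ' '.join(re.findall('[A-Z][^A-Z]*', sample))

# Alternative (assumption): insert a space wherever a lowercase letter
# is immediately followed by an uppercase letter, leaving the rest intact.
split_result = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', sample)

print(exhibit_result)
print(split_result)
```

Either approach has to run before lowercasing, since the capital letters are the only signal of where one word ends and the next begins.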
# Lowercase:
verbatim = verbatim.lower()
print(verbatim)
amaazing this product is lit i cant believe it
# Remove Stopwords:
stop_words = set(stopwords.words('english'))  # build the set once; needs nltk.download('stopwords') on first use
verbatim = ' '.join(word for word in verbatim.split() if word not in stop_words)
print(verbatim)
amaazing product lit cant believe
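Stopword removal is a membership test on each word, so it is usually done against a set that is built once. The sketch below shows the idea without requiring the NLTK corpus; the four-word `stop_words` set is a hypothetical stand-in chosen to cover this example, not the NLTK English list.

```python
# Hypothetical stand-in for the NLTK English stopword list (assumption);
# with NLTK installed, set(stopwords.words('english')) would be used instead.
stop_words = {"this", "is", "i", "it"}

text = "amaazing this product is lit i cant believe it"

# Keep only the words that are not in the stopword set.
cleaned = ' '.join(word for word in text.split() if word not in stop_words)

print(cleaned)
```

Note that "cant" survives because, with the apostrophe stripped earlier, it no longer matches a contraction such as "can't" in a stopword list.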
# Tokens (word_tokenize needs NLTK's Punkt tokenizer data: 'punkt', or 'punkt_tab' in recent NLTK releases):
tokens = nltk.word_tokenize(verbatim)
print(tokens)
['amaazing', 'product', 'lit', 'cant', 'believe']
In addition to data cleaning, the example in Exhibit 25.9 also tokenizes the text using the nltk (Natural Language Toolkit) library. This leads to the next section, Natural Language Processing, where tokenization is covered in detail.
Contact | Privacy Statement | Disclaimer: Opinions and views expressed on www.ashokcharan.com are the author’s personal views, and do not represent the official views of the National University of Singapore (NUS) or the NUS Business School | © Copyright 2013-2025 www.ashokcharan.com. All Rights Reserved.