Stemming and Lemmatization in Python

The Porter Stemmer is a commonly used stemming algorithm that applies a sequence of suffix-stripping rules to reduce words to their root form. Exhibit 25.13 demonstrates how to implement stemming in Python using the Porter Stemmer from the Natural Language Toolkit (NLTK) library. This example tokenizes and stems a sequence of notable quotes.

Stemming
import nltk

# If you haven't already, download these resources
# (wordnet is needed later for the lemmatization example):
'''
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
'''

# Get the set of English stopwords from NLTK
from nltk.corpus import stopwords

text = """
    May the Force be with you.
    There's no place like home.
    I'm the king of the world!
    Carpe diem. 
    Elementary, my dear Watson.
    It's alive! 
    My mama always said life was like a box of chocolates. 
    I'll be back.
    """

''' 
Tokenize the sentences:
sent_tokenize is a function from the Natural Language Toolkit (NLTK) library in Python, 
used to split a given text into a list of individual sentences. It is a sentence tokenizer. 
It identifies sentence boundaries in a text, even when the text contains complex structures 
like abbreviations, punctuation, or special characters.
'''
sentences = nltk.sent_tokenize(text)
print("Original text: \n", sentences)

# Initialize a Porter Stemmer
stemmer = nltk.PorterStemmer()

# Build the stopword set once, rather than rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])  # tokenize the words in each sentence

    '''
    The 'for word in words' loop below can be replaced by this one-line list comprehension:
        stemmed_words = [stemmer.stem(word) for word in words if word not in stop_words]
    '''

    stemmed_words = []
    for word in words:
        if word not in stop_words:  # exclude stopwords
            stemmed_words.append(stemmer.stem(word))
    sentences[i] = ' '.join(stemmed_words)

print("\n\nFiltered and Stemmed Words: \n", sentences)
Output:
Original text: 
['May the Force be with you.', "There's no place like home.", "I'm the king of the world!", 'Carpe diem.', 'Elementary, my dear Watson.', "It's alive!", 'My mama always said life was like a box of chocolates.', "I'll be back."]

Filtered and Stemmed Words: 
['may forc', 'theres place like home', 'im king world', 'carp diem', 'elementari dear watson', 'aliv', 'mama alway said life like box chocol', 'ill back']

Exhibit 25.13   Stemming in Python: This example tokenizes and stems a sequence of notable quotes using the Porter Stemmer. Jupyter notebook.
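Because Porter stemming is purely rule-based, the stemmer can be probed directly on individual words, without the tokenization and stopword machinery of the exhibit. The short sketch below (not part of the exhibit) reproduces several of the stems seen in the output above, and shows why stems need not be real words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter stemming strips suffixes by rule, without consulting a
# dictionary, so the output need not be a valid English word.
for word in ["force", "elementary", "chocolates", "alive", "carpe"]:
    print(word, "->", stemmer.stem(word))
```

Note that "forc", "elementari", and "chocol" match the stems in the exhibit's output; this lossy truncation is the trade-off stemming makes for speed and simplicity.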

Exhibit 25.14 demonstrates how to implement lemmatization in Python using the WordNet Lemmatizer from the NLTK library. This example tokenizes and lemmatizes the same sequence of notable quotes used in the stemming example.

Lemmatization
import nltk

# Lemmatizer
from nltk.stem import WordNetLemmatizer

# Get the set of English stopwords from NLTK
from nltk.corpus import stopwords

text = """
    May the Force be with you.
    There's no place like home.
    I'm the king of the world!
    Carpe diem. 
    Elementary, my dear Watson.
    It's alive! 
    My mama always said life was like a box of chocolates. 
    I'll be back.
    """

# Tokenize the sentences
sentences = nltk.sent_tokenize(text)
print("Original text:\n", sentences)

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once, rather than rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Lemmatization
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])  # tokenize the words in each sentence

    # Filter out stopwords and lemmatize each word
    lemma_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(lemma_words)

print("\n\n Filtered and Lemmatized Words: \n", sentences)
Output:
Original text:
 ['May the Force be with you.', "There's no place like home.", "I'm the king of the world!", 'Carpe diem.', 'Elementary, my dear Watson.', "It's alive!", 'My mama always said life was like a box of chocolates.', "I'll be back."]

 Filtered and Lemmatized Words: 
 ['May Force .', "There 's place like home .", "I 'm king world !", 'Carpe diem .', 'Elementary , dear Watson .', "It 's alive !", 'My mama always said life like box chocolate .', "I 'll back ."]

Exhibit 25.14   Lemmatization in Python: This example tokenizes and lemmatizes a sequence of notable quotes using the WordNet Lemmatizer. Jupyter notebook.


Natural Language Processing, coupled with data cleaning and text processing techniques, is revolutionizing how we interact with and analyze vast amounts of unstructured text data. From improving customer service through chatbots to driving insights from social media conversations, NLP is at the forefront of many technological advancements. As these techniques continue to evolve, their applications will become even more integral to various fields, unlocking new possibilities in data-driven decision-making.

