Word tokenization splits text into individual words using delimiters such as spaces and punctuation marks. For instance, the sentence “What is tokenization?” is split into the tokens “What”, “is” and “tokenization”.
While word tokenization is widely used, it faces challenges with unknown or Out Of Vocabulary (OOV) words: the model replaces any word it has not seen before with a generic placeholder token, which can lead to less accurate results.
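The sketch below illustrates this OOV handling in Python. The toy vocabulary and the <UNK> placeholder are illustrative assumptions, not part of the exhibit code.

# Illustrative sketch: map word tokens to a fixed vocabulary and
# replace out-of-vocabulary words with a generic '<UNK>' token.
vocabulary = {"what", "is", "tokenization"}   # toy vocabulary for illustration

def map_to_vocabulary(tokens):
    """Replace any token not found in the vocabulary with '<UNK>'."""
    return [t if t.lower() in vocabulary else "<UNK>" for t in tokens]

print(map_to_vocabulary(["What", "is", "lemmatization"]))
# ['What', 'is', '<UNK>']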
Towards the end of the code in Exhibit 25.9, Section Data Cleaning Techniques, you will see an example of tokenization. There the text is tokenized with NLTK’s word_tokenize function:
tokens = nltk.word_tokenize(verbatim)   # split the verbatim text into word tokens
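As a self-contained illustration of the call (the variable name verbatim and the sample sentence are assumptions, not taken from the exhibit), it can be run as follows:

import nltk
nltk.download("punkt")   # tokenizer models required by word_tokenize, if not already installed
verbatim = "What is tokenization?"
tokens = nltk.word_tokenize(verbatim)
print(tokens)            # ['What', 'is', 'tokenization', '?']

Note that word_tokenize keeps the punctuation mark as a separate token rather than discarding it.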