Word Tokenization

Word tokenization splits text into individual words using delimiters such as spaces and punctuation marks. For instance, a sentence like “What is tokenization?” would be tokenized into the words “What”, “is” and “tokenization”, with most tokenizers, including NLTK’s, also emitting the question mark as a separate token.
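A minimal sketch of this idea, using only Python’s standard library (a simple regex stand-in for the more elaborate rules real tokenizers apply):

```python
import re

def word_tokenize_simple(text):
    """Split text into word and punctuation tokens.

    Illustrative only: matches runs of word characters, or any single
    character that is neither a word character nor whitespace.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize_simple("What is tokenization?"))
# ['What', 'is', 'tokenization', '?']
```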

While it is widely used, word tokenization struggles with unknown, or out-of-vocabulary (OOV), words: the model typically replaces them with a generic placeholder token such as <UNK>, which can reduce accuracy.

Towards the end of the code in Exhibit 25.9, in the section Data Cleaning Techniques, you will see an example of tokenization. There, the text is tokenized with NLTK’s word_tokenize method:
tokens = nltk.word_tokenize(verbatim)

