Sub-Word Tokenization

"unhappiness" → ["un", "happ", "iness"]
"subword" → ["sub", "word"]
"impossible" → ["im", "poss", "ible"]
"revision" → ["re", "vision"]
"cats" → ["cat", "s"]
"running" → ["run", "ning"]
"understandable" → ["under", "stand", "able"]
"happily" → ["hap", "pily"]
"misunderstood" → ["mis", "under", "stood"]
"unbelievable" → ["un", "believ", "able"]
"machinating" → ["unknown", "ing"]

Exhibit 25.11 Sub-word tokenization examples using byte pair encoding.

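As the examples in Exhibit 25.11 show, sub-word tokenization breaks words into smaller units. The pieces often resemble prefixes, stems and suffixes, although byte pair encoding, the method used in the exhibit, derives them from how frequently character sequences occur together in the training text rather than from linguistic rules. This approach is particularly useful for handling out-of-vocabulary (OOV) words.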
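To make the idea concrete, here is a minimal sketch of how byte pair encoding could learn its sub-word units from word frequencies. The function name `learn_bpe_merges`, the toy word counts and the tie-breaking rule (first pair seen wins) are illustrative assumptions, not taken from any particular library.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair counted first (an assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Toy word counts, invented for illustration only.
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one new unit to the vocabulary, so frequent character sequences end up as single tokens while rare words remain split into smaller pieces.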

For instance, the word “unhappiness” can be tokenized as [“un”, “happ”, “iness”], allowing the model to make inferences about the word’s function in a sentence. Similarly, the word “machinating” can be broken down into “unknown” and “ing”: the stem is not recognized, but the suffix still provides insight into the word’s function. Affixes like “ing” can indicate whether a word functions as a gerund (a verb used as a noun) or as a present participle, narrowing down possible meanings.
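The behaviour described above can be sketched with a simple segmenter. Note that this is a greedy longest-match lookup against a fixed vocabulary, not byte pair encoding itself, and that the vocabulary `VOCAB`, the function `segment` and the `"unknown"` placeholder are hypothetical choices made to mirror the exhibit.

```python
UNKNOWN = "unknown"  # placeholder label, mirroring the exhibit

# Hypothetical vocabulary of known sub-word units.
VOCAB = {"un", "happ", "iness", "under", "stand", "able", "ing"}


def segment(word, vocab):
    """Split `word` into vocabulary units, scanning left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing matches here: fold this character into a single
            # unknown token covering the whole unrecognized run.
            if not tokens or tokens[-1] != UNKNOWN:
                tokens.append(UNKNOWN)
            i += 1
    return tokens


print(segment("unhappiness", VOCAB))
print(segment("machinating", VOCAB))
```

Even when the stem falls outside the vocabulary, the recognized suffix survives as its own token, which is the signal a model can use to infer the word's grammatical role.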

