"unhappiness" | ["un", "happ", "iness"] |
---|---|
"subword" | ["sub", "word"] |
"impossible" | ["im", "poss", "ible"] |
"revision" | ["re", "vision"] |
"cats" | ["cat", "s"] |
"running" | ["run", "ning"] |
"understandable" | ["under", "stand", "able"] |
"happily" | ["hap", "pily"] |
"misunderstood" | ["mis", "under", "stood"] |
"unbelievable" | ["un", "believ", "able"] |
"machinating" | ["unknown", "ing"] |
As the examples in Exhibit 25.11 illustrate, sub-word tokenization breaks words into smaller units based on linguistic patterns, such as separating prefixes and suffixes from stems. This approach is particularly useful for handling OOV words.
For instance, the word “unhappiness” can be tokenized as [“un”, “happ”, “iness”], allowing the model to infer the word’s role in a sentence from its parts. Similarly, the word “machinating” can be broken down into “unknown” and “ing”: even though the stem is not in the vocabulary, the suffix still offers a clue to the word’s function. Affixes like “ing” can indicate whether a word functions as a present participle or as a verb turned into a noun, narrowing down its possible meanings.
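To make the idea concrete, below is a minimal Python sketch of greedy longest-match sub-word tokenization (in the spirit of WordPiece). The vocabulary, the `tokenize` function and the `unknown` placeholder are hypothetical and hand-built from the pieces shown in Exhibit 25.11 purely for illustration; a real tokenizer learns its vocabulary from a large corpus, so the exact splits it produces may differ.

```python
# Minimal sketch of greedy longest-match sub-word tokenization.
# The vocabulary below is hand-built from Exhibit 25.11 for illustration;
# production tokenizers learn theirs from large corpora.

VOCAB = {
    "un", "happ", "iness", "sub", "word", "im", "poss", "ible",
    "re", "vision", "cat", "s", "run", "ning", "under", "stand",
    "able", "hap", "pily", "mis", "stood", "believ", "ing",
}
UNK = "unknown"  # placeholder for out-of-vocabulary fragments


def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until one matches.
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1
        if end == start:
            # No piece matches here: emit one unknown token, then skip ahead
            # to the next position where some vocabulary piece does match.
            tokens.append(UNK)
            start += 1
            while start < len(word) and not any(
                word[start:j] in VOCAB for j in range(start + 1, len(word) + 1)
            ):
                start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens


if __name__ == "__main__":
    for w in ["unhappiness", "subword", "cats", "running", "machinating"]:
        print(w, "->", tokenize(w))
        # e.g. "unhappiness" -> ['un', 'happ', 'iness']
        #      "machinating" -> ['unknown', 'ing']
```

With this toy vocabulary, the OOV stem of “machinating” collapses to the unknown token while the familiar suffix “ing” is still recovered, which is exactly the behaviour described above.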