
Tokenization is one of the most fundamental concepts in Natural Language Processing, yet most people don't think about it until it starts causing problems. Token limits, API costs, and unexpected model behaviour all trace back to how text is tokenized before a model ever sees it.
This article explains what tokenization is, why it matters, and exactly how it works, with a practical example so you can see what's actually happening under the hood.
Tokenization is the process of splitting text into smaller units called tokens so machines can process language efficiently. These tokens may represent whole words, subwords, characters, or other linguistic units, depending on the tokenizer design.
The purpose is not just to split text. It is to convert human language into structured components that can be mapped into numerical representations for machine learning models. In modern NLP systems, tokenization directly influences how a model interprets, stores, and processes input.
Machine learning models do not understand raw text. They operate entirely on numbers. Tokenization serves as the bridge between natural language and numerical computation.
It is the first operational step in converting text into embeddings or model inputs, and it has downstream effects on nearly everything else. Tokenization directly impacts token limits, API costs, context window usage, and how the model ultimately behaves on your input.
Without tokenization, text cannot be processed, embedded, or analyzed by any NLP system.
Let's walk through a practical example. Take this sentence:
"F22 Labs: A software studio based out of Chennai."
Tokenization typically follows two steps.
The first step is breaking the sentence into smaller units. Depending on the tokenizer, those units can be words, subwords, or individual characters.
Word-level tokenization breaks text into complete words and punctuation:
["F22", "Labs", ":", "A", "software", "studio", "based", "out", "of", "Chennai", "."]Subword tokenization splits words into smaller meaningful pieces. This is what most modern LLMs use:
["F22", "Lab", "s", ":", "A", "software", "studio", "based", "out", "of", "Chennai", "."]Character-level tokenization splits every single character including spaces:
["F", "2", "2", " ", "L", "a", "b", "s", ":", " ", "A", " ", "s", "o", "f", "t", "w", "a", "r", "e", ...]Most large language models use subword tokenization because it balances vocabulary efficiency with flexibility when handling rare or domain-specific terms.
After splitting, each token is mapped to a unique numerical ID using a predefined vocabulary. The vocabulary assigns a fixed number to every known token.
Example vocabulary:
{
"F22": 1501,
"Labs": 1022,
":": 3,
"A": 4,
"software": 2301,
"studio": 2302,
"based": 2303,
"Chennai": 2306,
".": 5
}Resulting token IDs:
[1501, 1022, 3, 4, 2301, 2302, 2303, 2306, 5]At this point, the model no longer sees text. It processes a sequence of numerical identifiers that represent the structured form of the original sentence. This numerical transformation is what allows neural networks to compute patterns, relationships, and meaning.
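As a rough sketch, here is that lookup step in Python. The vocabulary mirrors the toy example above; a real tokenizer's vocabulary is learned during training and contains tens of thousands of entries.

tokens = ["F22", "Labs", ":", "A", "software", "studio",
          "based", "out", "of", "Chennai", "."]

# Toy vocabulary from the example above; real vocabularies are far larger.
vocab = {"F22": 1501, "Labs": 1022, ":": 3, "A": 4, "software": 2301,
         "studio": 2302, "based": 2303, "out": 2304, "of": 2305,
         "Chennai": 2306, ".": 5}

# Map each token to its ID; a real tokenizer would fall back to a special
# unknown token or split the word further instead of failing on a miss.
token_ids = [vocab[token] for token in tokens]
print(token_ids)
# [1501, 1022, 3, 4, 2301, 2302, 2303, 2304, 2305, 2306, 5]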
You can explore real tokenization using OpenAI's Tokenizer tool. Enter any sentence and the tool shows you exactly how GPT splits it into tokens and what ID each token receives.
This is useful for understanding why certain words split unexpectedly, how your prompt's token count affects cost, and how close you are to hitting a model's context limit.
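If you prefer to inspect tokens in code, OpenAI's open-source tiktoken package exposes the same encodings. The sketch below assumes tiktoken is installed, and the exact splits depend on which encoding you load.

import tiktoken

# cl100k_base is the encoding used by the GPT-3.5 and GPT-4 family of models
enc = tiktoken.get_encoding("cl100k_base")

text = "F22 Labs: A software studio based out of Chennai."
ids = enc.encode(text)

print(len(ids), "tokens")
print(ids)                             # numerical token IDs
print([enc.decode([i]) for i in ids])  # each token rendered back as text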
Byte Pair Encoding (BPE) starts with individual characters and merges the most frequent pairs iteratively until it builds a vocabulary of common subwords. Used by GPT models.
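To make the merge loop concrete, here is a toy BPE training sketch in Python. The corpus, word frequencies, and number of merges are made up for illustration, and the simple string replace glosses over edge cases that real implementations handle more carefully.

from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, corpus):
    """Merge every occurrence of the chosen pair into a single symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: each word is pre-split into characters, with its frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    corpus = merge(best, corpus)
    print(f"merge {step + 1}: {best}")
# Frequent pairs like ('e', 's') and ('es', 't') merge first,
# gradually building subwords such as "est" and "low".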
SentencePiece tokenizes text without relying on whitespace, making it language-agnostic. Used by models like T5 and LLaMA.
WordPiece is similar to BPE but selects merges based on likelihood rather than frequency. Used by BERT.
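For comparison, the sketch below loads two real tokenizers through the Hugging Face transformers library (assuming transformers and sentencepiece are installed). The splits described in the comments are typical but can vary between tokenizer versions.

from transformers import AutoTokenizer

text = "Tokenization shapes model behaviour."

# T5 uses a SentencePiece tokenizer; word boundaries appear as a "▁" prefix.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
print(t5_tok.tokenize(text))

# BERT uses WordPiece; continuation pieces carry a "##" prefix.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize(text))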
Character-level tokenization splits everything into individual characters. Simple but produces very long sequences and struggles with meaning.
Word-level tokenization is the most intuitive but creates large vocabularies and fails on out-of-vocabulary words.
Word-level tokenization struggles with rare words. If a word isn't in the vocabulary, the model cannot process it. Character-level tokenization handles any input but produces sequences that are too long and too hard for models to learn patterns from.
Subword tokenization solves both problems. It keeps common words as single tokens and breaks rare words into recognizable pieces. The word "tokenization" might become ["token", "ization"], which the model can still interpret meaningfully even if it hasn't seen the full word before.
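A small sketch makes the contrast visible. The vocabularies below are hypothetical; the point is that a whole-word lookup hits a dead end on an unseen word, while a greedy longest-match subword segmenter (in the spirit of WordPiece) still finds usable pieces.

word_vocab = {"the", "model", "reads", "text"}
subword_vocab = {"token", "ization", "iz", "ation", "the", "model"}

def word_level(word, vocab):
    # Whole-word lookup: unknown words collapse into a single <unk> token.
    return [word] if word in vocab else ["<unk>"]

def subword_level(word, vocab):
    # Greedy longest-match: repeatedly take the longest known piece from the left.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

print(word_level("tokenization", word_vocab))        # ['<unk>']
print(subword_level("tokenization", subword_vocab))   # ['token', 'ization']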
Tokenization is the foundational layer of every NLP pipeline. It converts raw language into structured tokens and numerical IDs that models can process. Whether you are building chat systems, working with embeddings, or optimizing prompts, understanding tokenization helps you control cost, manage context limits, and make better decisions about how you structure your inputs.
It looks like a preprocessing step, but it shapes nearly every downstream decision in modern AI systems.
Tokenization is the process of splitting text into smaller units called tokens so that models can process language numerically. These tokens can be words, subwords, or characters depending on the tokenizer.
Token count determines how much text fits within a model's context window and directly affects API cost. Most models charge per token, so more efficient tokenization means lower cost and better use of the context limit.
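As a quick sanity check, you can estimate cost directly from a token count; the rate below is a placeholder, not a real price, so substitute your provider's current pricing.

import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01   # hypothetical USD rate; check your provider

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following customer feedback in three bullet points."

n_tokens = len(enc.encode(prompt))
estimated_cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} tokens, roughly ${estimated_cost:.5f} for the input")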
Most modern LLMs use subword tokenization methods such as Byte Pair Encoding (BPE) or SentencePiece. These approaches balance vocabulary size with the ability to handle rare and domain-specific words.
A token ID is the unique number assigned to a token within a model's vocabulary. Once text is tokenized, the model processes these numerical IDs rather than the original text.
Yes, the same text can tokenize differently across models. Each model may use a different tokenizer, which can result in different token splits and counts for the same input. GPT models use BPE, BERT uses WordPiece, and LLaMA uses SentencePiece.