Definition
Tokenization
Tokenization splits text into pieces called tokens so language models can process, count and generate text.
Short definition
Tokenization is the process of breaking text into smaller units called tokens. A token can be a word, part of a word, punctuation mark or symbol, depending on the tokenizer.
How it works
Language models do not operate on raw text the way human readers do. A tokenizer converts text into a sequence of integer token IDs, the model processes those IDs, and the IDs it produces are converted back into readable text. Token counts affect context windows, pricing, latency and output length.
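The round trip from text to token IDs and back can be sketched with a toy tokenizer. This is a minimal illustration, not how any particular model tokenizes: the vocabulary, the ID values and the greedy longest-match rule are all assumptions made for the example.

```python
import string

# Hypothetical vocabulary: a few multi-character pieces plus single
# characters as a fallback. Real vocabularies are learned from data
# and contain tens of thousands of entries.
PIECES = ["token", "ization", " splits", " text"]
VOCAB = PIECES + list(string.ascii_lowercase + " ")
PIECE_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}
MAX_LEN = max(len(piece) for piece in VOCAB)


def encode(text):
    """Convert text to token IDs by greedy longest-match against VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in PIECE_TO_ID:
                ids.append(PIECE_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode character {text[pos]!r}")
    return ids


def decode(ids):
    """Convert token IDs back to text by concatenating their pieces."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("tokenization splits text")
print(ids)          # four IDs, one per vocabulary piece
print(decode(ids))  # the original string
```

In this sketch the text "tokenization splits text" maps to four IDs, while "tokens split" needs eight, because most of it falls back to single characters.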
Example
The phrase "artificial intelligence" may be split into two or more tokens, depending on the model's tokenizer. A long report can consume thousands of tokens before the model generates any of its answer.
Why it matters
Understanding tokens helps explain why models have context limits, why prompts cost money and why shorter instructions can reduce latency. Tokenization also affects multilingual performance, because the same content can take more tokens in some languages than in others.
A token is not a word
A short common word may be one token, while an unusual name or code fragment may be split into several pieces. Spaces, punctuation and formatting also matter. There is therefore no universal conversion rate between visible words and tokens.
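The varying ratio between words and tokens can be illustrated with a deliberately simple stand-in tokenizer. Everything here is an assumption for the sake of the example: the list of common words, the chunk size and the fixed-size splitting rule do not match any real tokenizer, which would use a learned vocabulary instead.

```python
import re

# Hypothetical set of words the toy tokenizer treats as one token each.
COMMON_WORDS = {"the", "model", "reads", "a", "name"}
CHUNK = 3  # hypothetical piece length for words outside COMMON_WORDS


def toy_tokenize(text):
    """Split text into word runs and single punctuation marks, then
    break unfamiliar words into fixed-size chunks."""
    tokens = []
    for piece in re.findall(r"\w+|[^\w\s]", text):
        if piece.lower() in COMMON_WORDS or len(piece) <= CHUNK:
            tokens.append(piece)
        else:
            tokens.extend(piece[i:i + CHUNK] for i in range(0, len(piece), CHUNK))
    return tokens


def tokens_per_word(text):
    """Ratio of toy tokens to whitespace-separated words."""
    return len(toy_tokenize(text)) / len(text.split())


print(tokens_per_word("the model reads a name"))
print(tokens_per_word("Xochitl calls parse_args()"))
```

The first sentence consists only of common words, so the ratio is one token per word. The second contains an unusual name, a code fragment and punctuation, so the same number of visible words costs several times as many tokens.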
Practical implications
An application must budget for the system prompt, conversation history, retrieved documents and generated answer. Exceeding the context limit may cause the request to be rejected or earlier information to be dropped, depending on the application. Use the tokenizer associated with the actual model and leave headroom, rather than estimating cost and limits from character count alone.
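One way to apply this budgeting is sketched below. The names Budget, fit_history and whitespace_count are hypothetical, and dropping the oldest messages first is only one possible policy. The whitespace counter is a placeholder so the example runs on its own; a real application would pass in a function backed by the tokenizer of the model it actually calls.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    context_limit: int      # total tokens the model accepts
    max_output_tokens: int  # reserved for the generated answer
    headroom: int           # safety margin for counting differences


def fit_history(count_tokens, system_prompt, history, documents, budget):
    """Return the most recent history messages that fit in the budget.

    count_tokens is any callable mapping a string to a token count.
    """
    fixed = count_tokens(system_prompt) + sum(count_tokens(d) for d in documents)
    available = (budget.context_limit - budget.max_output_tokens
                 - budget.headroom - fixed)
    if available < 0:
        raise ValueError("system prompt and documents alone exceed the budget")

    kept = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > available:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def whitespace_count(text):
    """Placeholder counter: NOT a real tokenizer, used only for the demo."""
    return len(text.split())


budget = Budget(context_limit=20, max_output_tokens=5, headroom=2)
history = ["a b c d", "e f g", "h i j"]
print(fit_history(whitespace_count, "you are helpful", history,
                  ["doc one text"], budget))
```

With these made-up numbers, the system prompt and document use 6 of the 13 tokens left after reserving output and headroom, so only the two most recent messages are kept.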