Tokenization
Tokenization splits text into pieces called tokens so language models can process, count, and generate text.
Short definition
Tokenization is the process of breaking text into smaller units called tokens. A token can be a word, part of a word, a punctuation mark, or a symbol, depending on the tokenizer.
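A minimal sketch of the idea, assuming a toy rule-based splitter (real tokenizers such as BPE learn subword splits from data rather than using a fixed rule):

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Toy tokenizer for illustration: runs of word characters become one
    # token, and each punctuation mark or symbol becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note how the punctuation marks come out as separate tokens, matching the definition above.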
How it works
Language models do not read raw text the way humans do. Text is converted into token IDs, processed by the model, and then converted back into readable text. Token counts affect context windows, pricing, latency, and output length.
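The text-to-IDs round trip described above can be sketched with a tiny hypothetical vocabulary (real models use learned vocabularies of tens of thousands of subword tokens):

```python
# Hypothetical four-token vocabulary, for illustration only.
vocab = {"artificial": 0, "intelligence": 1, "is": 2, "useful": 3}
inv_vocab = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    # Map each token to its integer ID; this is what the model consumes.
    return [vocab[w] for w in text.split()]

def decode(ids: list[int]) -> str:
    # Map IDs back to tokens to produce readable text.
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("artificial intelligence is useful")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # artificial intelligence is useful
```

The model itself only ever sees the integer IDs, which is why token counts, not character counts, determine context-window usage.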
Example
The phrase artificial intelligence may be split into two or more tokens, depending on the model's tokenizer. A long report can consume thousands of tokens before the model writes a single word of its answer.
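A rough sense of that cost can be sketched with the commonly quoted heuristic of about four characters per token for English text; actual counts depend entirely on the model's tokenizer, so this estimator is an assumption, not a real count:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic often quoted for English: ~4 characters per token.
    # Real token counts vary by tokenizer and language.
    return max(1, len(text) // 4)

report = "word " * 2000           # a mock report of about 10,000 characters
print(estimate_tokens(report))    # roughly 2500 tokens consumed as input
```

Even before generating a reply, a prompt this size would use a few thousand tokens of the context window under this estimate.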
Why it matters
Understanding tokens helps explain why models have context limits, why prompts cost money, and why shorter instructions can improve speed. Tokenization also affects multilingual performance, because the same text splits into different numbers of tokens in different languages.