
Definition

Tokenization

Tokenization splits text into pieces called tokens so language models can process, count and generate text.

Also known as: tokens, tokenizer

Short definition

Tokenization is the process of breaking text into smaller units called tokens. A token can be a word, part of a word, punctuation mark or symbol, depending on the tokenizer.

How it works

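Language models do not operate on raw text the way human readers do. A tokenizer converts the text into a sequence of integer token IDs, the model processes those IDs, and the IDs it produces are converted back into readable text. Because limits and usage are typically measured in tokens, token counts affect context windows, pricing, latency and output length.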
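The round trip from text to token IDs and back can be sketched with a toy tokenizer. This is an illustration only: the vocabulary, the greedy longest-match rule and the names VOCAB, encode and decode are assumptions made for this example, and real tokenizers use learned vocabularies and more elaborate algorithms.

```python
# Hypothetical toy vocabulary; a token's ID is simply its index in this list.
VOCAB = ["token", "ization", " ", "s", "a", "e", "i", "k", "n", "o", "t", "z"]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def encode(text):
    """Convert text to token IDs using greedy longest-match (toy rule)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids):
    """Convert token IDs back to text by concatenating their pieces."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("tokenization")
print(ids)          # [0, 1]
print(decode(ids))  # tokenization
```

In this sketch the word "tokenization" becomes two IDs, and decoding those IDs restores the original string.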

Example

The phrase "artificial intelligence" may be split into two or more tokens, depending on the model's tokenizer. A long report can consume thousands of tokens of the context window before the model generates any output.

Why it matters

Understanding tokens helps explain why models have context limits, why prompts cost money and why shorter instructions can improve speed. Tokenization also affects multilingual performance, because the same tokenizer may split text in some languages into more tokens than equivalent text in others.

A token is not a word

A short common word may be one token, while an unusual name or code fragment may be split into several pieces. Spaces, punctuation and formatting also matter. There is therefore no universal conversion rate between visible words and tokens.
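The gap between word count and token count can be illustrated with a deliberately simplified splitter. Everything here is an assumption for the example: the COMMON_WORDS set, the three-character fallback chunks and the function name toy_tokenize are hypothetical, and unlike many real tokenizers this sketch ignores spaces entirely.

```python
import re

# Hypothetical set of words the toy tokenizer keeps as a single token.
COMMON_WORDS = {"the", "cat", "sat", "on", "a", "mat"}


def toy_tokenize(text):
    """Split text into toy tokens: known words whole, unknown words in chunks."""
    tokens = []
    # Pieces are runs of word characters or single punctuation marks.
    for piece in re.findall(r"\w+|[^\w\s]", text):
        if len(piece) == 1 or piece.lower() in COMMON_WORDS:
            tokens.append(piece)
        else:
            # Unknown word: fall back to 3-character chunks (toy rule).
            tokens.extend(piece[i:i + 3] for i in range(0, len(piece), 3))
    return tokens


sentence = "The cat sat on Zbigniew's mat."
print(len(sentence.split()))        # 6 whitespace-separated words
print(len(toy_tokenize(sentence)))  # 11 toy tokens
print(toy_tokenize("Zbigniew"))     # ['Zbi', 'gni', 'ew']
```

Six visible words become eleven toy tokens, mostly because of the unusual name and the punctuation. The exact ratio would differ with another vocabulary, which is why no fixed conversion rate applies.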

Practical implications

An application must budget tokens for the system prompt, conversation history, retrieved documents and generated answer. If the total exceeds the context limit, earlier information may be removed to make room. Count tokens with the tokenizer associated with the actual model, and preserve headroom, instead of estimating cost and limits from character count alone.
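The budgeting described above can be sketched as simple arithmetic. This is a minimal illustration under stated assumptions: the function names history_budget and trim_history, the example numbers and the oldest-first trimming policy are all hypothetical, and the token counts are assumed to come from the tokenizer of the model actually in use.

```python
def history_budget(context_limit, system_tokens, retrieved_tokens,
                   max_output_tokens, headroom_tokens):
    """Tokens left for conversation history after reserving everything else."""
    return (context_limit - system_tokens - retrieved_tokens
            - max_output_tokens - headroom_tokens)


def trim_history(turn_token_counts, budget):
    """Drop the oldest turns until the remaining turns fit within the budget."""
    kept = list(turn_token_counts)
    while kept and sum(kept) > budget:
        kept.pop(0)  # remove the oldest turn first (assumed policy)
    return kept


# Hypothetical numbers for illustration only.
budget = history_budget(
    context_limit=8000,
    system_tokens=500,
    retrieved_tokens=3000,
    max_output_tokens=1000,
    headroom_tokens=400,
)
print(budget)                                          # 3100
print(trim_history([1200, 900, 1500, 800], budget))    # [1500, 800]
```

Reserving space for the generated answer and a safety margin before filling the window means the history is trimmed deliberately by the application, not cut off unexpectedly. Other policies, such as summarizing older turns, are equally possible.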