Definition

Tokenization

Tokenization splits text into pieces called tokens so language models can process, count and generate text.

Updated May 3, 2026

Also known as: tokens, tokenizer

Short definition

Tokenization is the process of breaking text into smaller units called tokens. A token can be a word, part of a word, punctuation mark or symbol, depending on the tokenizer.

How it works

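Language models do not read raw text the way humans do. A tokenizer first converts the text into a sequence of integer token IDs, the model processes those IDs, and the IDs it produces are converted back into readable text. Because the model works in tokens rather than characters or words, token counts determine how much fits in the context window and affect pricing, latency and output length.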
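The round trip from text to token IDs and back can be sketched in a few lines. This is a toy illustration only: the helper names (build_vocab, encode, decode) and the rule of splitting on words, punctuation and whitespace are assumptions made for the example, not how any particular model's tokenizer works. Production tokenizers usually learn a subword vocabulary from large amounts of text.

```python
import re

# Toy splitting rule (an assumption for this sketch): runs of word
# characters, single punctuation marks, and runs of whitespace.
PIECE = re.compile(r"\w+|[^\w\s]|\s+")

def build_vocab(corpus):
    """Give each distinct piece in the corpus its own integer ID."""
    vocab = {}
    for piece in PIECE.findall(corpus):
        vocab.setdefault(piece, len(vocab))
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs. Raises KeyError on an unseen piece."""
    return [vocab[piece] for piece in PIECE.findall(text)]

def decode(ids, vocab):
    """List of token IDs -> text."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)

text = "Tokens cost money, tokens take time."
vocab = build_vocab(text)
ids = encode(text, vocab)

print(ids)                  # the integer sequence a model would consume
print(len(ids))             # the token count for this text
print(decode(ids, vocab))   # decoding restores the original string
```

Note that "Tokens" and "tokens" receive different IDs here because the toy vocabulary is case-sensitive, and that whitespace counts as tokens too. Both details vary between real tokenizers.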

Example

The phrase "artificial intelligence" may be split into two or more tokens, depending on the model's tokenizer. A long report can consume thousands of tokens of the context window before the model generates any output.
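To see why the same phrase can yield different token counts, the sketch below runs one splitter against two hypothetical vocabularies. The splitter uses greedy longest-match, a simplification chosen for readability and not the algorithm of any specific model. Both vocabularies (vocab_a, vocab_b) are invented for the example.

```python
def greedy_split(text, vocab):
    """Repeatedly take the longest vocabulary entry that matches at the
    current position; fall back to a single character if none matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

phrase = "artificial intelligence"

# Hypothetical vocabulary with whole words (leading space kept on the second).
vocab_a = {"artificial", " intelligence"}
# Hypothetical vocabulary that only knows smaller subword pieces.
vocab_b = {"art", "ificial", " intel", "ligence"}

split_a = greedy_split(phrase, vocab_a)
split_b = greedy_split(phrase, vocab_b)

print(len(split_a), split_a)  # 2 tokens
print(len(split_b), split_b)  # 4 tokens
```

The text is identical in both cases; only the vocabulary changes, and with it the number of tokens the phrase consumes.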

Why it matters

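Understanding tokens helps explain why models have context limits, why prompts cost money when usage is billed per token, and why shorter prompts can reduce latency. Tokenization also affects multilingual performance, because the same content can split into more tokens in some languages than in others, depending on the tokenizer's vocabulary.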
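One rough way to see why languages split differently is to look at UTF-8 byte counts. Some tokenizers start from raw bytes and then merge frequent byte sequences into larger tokens, so the byte count is only the starting point before any merging, not a real token count. The sample greetings below are chosen for illustration, and the comparison is a sketch of the general effect under that assumption.

```python
# Greetings in three scripts (sample strings chosen for this sketch).
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

def utf8_units(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Map each language to (characters, UTF-8 bytes).
counts = {lang: (len(s), utf8_units(s)) for lang, s in samples.items()}

for lang, (chars, units) in counts.items():
    print(f"{lang}: {chars} characters, {units} bytes")
```

Latin letters take one byte each, Cyrillic letters two and Japanese kana three, so a byte-based scheme begins with more units per character for the latter two. How far that gap closes depends on how well the tokenizer's learned vocabulary covers each language.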