Definition

Inference

Inference is the phase where a trained AI model receives new input and produces an output, prediction or generated response.

Updated May 3, 2026

Also known as: model serving, prediction time

Short definition

Inference is what happens when a trained model is used. The model receives input, processes it and returns an output such as a classification, prediction, generated text or image.

How it works

During inference, the model does not learn: its parameters stay fixed, and it applies what it learned during training. For language models, inference often means generating output one token at a time, with each new token conditioned on the prompt and on the tokens generated so far.
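The token-by-token loop described above can be sketched with a toy example. This is a minimal illustration, not how any production model is implemented: BIGRAM_SCORES, next_token and generate are hypothetical names, the "model" is a hand-written lookup table standing in for trained parameters, and it looks only at the last token, whereas a real language model conditions on the whole prompt and all previous output.

```python
from typing import Dict, List

# Hypothetical "trained" parameters: fixed scores for which token follows which.
# Nothing in the inference code below modifies this table.
BIGRAM_SCORES: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"dog": 1.0},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}


def next_token(context: List[str]) -> str:
    """Greedy choice: return the highest-scoring token after the last one."""
    scores = BIGRAM_SCORES.get(context[-1], {"</s>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Append one token per step until the end marker or the token limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "</s>":  # end-of-sequence marker: stop generating
            break
        tokens.append(token)  # the new token becomes part of the context
    return tokens


output = generate(["<s>", "the"])
```

Each pass through the loop is one inference step: the parameters are read but never updated, and the output of one step becomes input to the next.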

Example

When you ask a chatbot to summarize a document, training is not happening in that moment. The model is performing inference: using its existing parameters plus the context you supply (your request and the document) to produce a response.

Why it matters

Inference drives both user experience and operating cost, because the model runs on every request. Latency, token usage, model size, hardware and caching all affect whether an AI feature feels fast and affordable enough for real use.
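As one illustration of the caching point, the sketch below reuses a stored response when the exact same input arrives again. It is a simplified example under stated assumptions: run_model is a hypothetical stand-in for an expensive model call, and CALLS is a counter added only to show how many times that call actually runs.

```python
from functools import lru_cache

# Counts how often the (hypothetical) model is really invoked.
CALLS = {"model": 0}


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for a slow, costly inference call."""
    CALLS["model"] += 1
    return prompt.upper()


@lru_cache(maxsize=1024)
def cached_inference(prompt: str) -> str:
    """Return a stored response for a repeated prompt instead of re-running the model."""
    return run_model(prompt)


first = cached_inference("summarize this")
second = cached_inference("summarize this")  # served from the cache
```

Exact-match caching like this only helps when identical inputs recur and returning the same output each time is acceptable.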