Inference
Inference is the phase in which a trained AI model receives new input and produces an output, such as a prediction or a generated response.
Short definition
Inference is what happens when a trained model is used. The model receives input, processes it and returns an output such as a classification, prediction, generated text or image.
How it works
During inference, the model is not learning; its parameters stay fixed, and it applies what it learned during training. For language models, inference typically means generating one token at a time while considering the prompt and the output produced so far.
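The token-at-a-time loop described above can be sketched in a few lines of Python. The "model" here is a hypothetical stand-in that maps a context to next-token scores with fixed rules; a real language model would instead run a forward pass through a trained neural network. Greedy decoding (always picking the highest-scoring token) is just one decoding strategy.

```python
# Minimal sketch of autoregressive inference with a toy stand-in model.
# Nothing is learned here: the "parameters" are the fixed rules below,
# analogous to frozen weights at inference time.

VOCAB = ["<eos>", "hello", "world", "!"]

def toy_model(tokens):
    """Return a score per vocab entry given the context so far."""
    last = tokens[-1] if tokens else None
    if last == "hello":
        return [0.0, 0.0, 1.0, 0.0]   # "hello" -> "world"
    if last == "world":
        return [0.0, 0.0, 0.0, 1.0]   # "world" -> "!"
    return [1.0, 0.0, 0.0, 0.0]       # otherwise: end of sequence

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)                      # forward pass only
        next_token = VOCAB[scores.index(max(scores))]   # greedy decoding
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["hello"]))  # ['hello', 'world', '!']
```

Each new token is appended to the context and fed back in, which is why long outputs take longer to generate than long inputs take to read.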
Example
When you ask a chatbot to summarize a document, training is not happening in that moment. The model is performing inference: using its existing parameters plus your context to produce a response.
Why it matters
Inference determines user experience and operating cost. Latency, token usage, model size, hardware and caching all affect whether an AI feature feels fast and affordable enough for real use.
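To make the cost point concrete, here is a back-of-the-envelope estimate of monthly inference spend from token counts. The per-1,000-token prices and the request volume are made-up illustrative assumptions, not real pricing for any provider.

```python
# Rough monthly inference cost from token usage.
# Prices below are illustrative placeholders, not real provider rates.

def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    per_request = (input_tokens / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    return requests_per_day * 30 * per_request

# e.g. 10,000 requests/day, 2,000 prompt tokens, 500 generated tokens each
print(round(monthly_cost(10_000, 2_000, 500), 2))  # 525.0
```

Even with small per-token prices, volume multiplies quickly, which is why prompt length, output length, and caching are common levers for controlling inference cost.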