Definition
Inference
Inference is the phase in which a trained AI model receives new input and produces an output, such as a prediction or a generated response.
Short definition
Inference is what happens when a trained model is used. The model receives input, processes it and returns an output such as a classification, prediction, generated text or image.
How it works
During inference, the model is not learning. Its parameters stay fixed, and it applies what it learned during training. For language models, inference often means generating one token at a time, with each new token conditioned on the prompt and the output produced so far.
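The token-by-token loop can be sketched in a few lines of Python. This is a toy illustration only: TOY_PARAMS, next_token and generate are hypothetical names, and the lookup table stands in for a real model's learned parameters. A real language model scores the next token from the whole context, not just the last token.

```python
# Toy stand-in for a trained model: a fixed table of next-token scores.
# The table plays the role of learned parameters and is never updated here.
TOY_PARAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "dog": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}


def next_token(tokens):
    """Return the highest-scoring next token (greedy choice).

    Simplification: this toy looks only at the last token, whereas a real
    model conditions on the full prompt and previous output.
    """
    scores = TOY_PARAMS.get(tokens[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Generate one token at a time, feeding each output back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<eos>":  # the model signals that the response is complete
            break
        tokens.append(token)
    return tokens


print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Nothing in the loop changes TOY_PARAMS, which is the point: inference reads the parameters, while training is what writes them.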
Example
When you ask a chatbot to summarize a document, training is not happening in that moment. The model is performing inference: using its existing parameters plus your context to produce a response.
Why it matters
Inference determines user experience and operating cost. Latency, token usage, model size, hardware and caching all affect whether an AI feature feels fast and affordable enough for real use.
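As a rough illustration of how token usage turns into operating cost, the sketch below assumes a service that bills per 1,000 input and output tokens. That billing model, the function name estimate_request_cost and every number in it are assumptions chosen for easy arithmetic, not real prices.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          price_in_per_1k, price_out_per_1k):
    """Estimate the cost of one request under per-1,000-token pricing.

    The prices are whatever unit the caller uses (for example cents);
    the function only does the arithmetic.
    """
    input_cost = input_tokens / 1000 * price_in_per_1k
    output_cost = output_tokens / 1000 * price_out_per_1k
    return input_cost + output_cost


# Hypothetical figures: 2,000 prompt tokens, 500 generated tokens,
# 0.5 units per 1k input tokens and 1.5 units per 1k output tokens.
per_request = estimate_request_cost(2000, 500, 0.5, 1.5)
print(per_request)           # 1.75
print(per_request * 10_000)  # 17500.0 for 10,000 such requests
```

Even in this simplified form, the calculation shows why trimming prompts or shortening responses changes the bill directly.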
Local and cloud inference
A model can run on a user's device, a company server or a cloud platform. Local inference can support offline use and reduce data transfer, but it is limited by available hardware. Cloud inference gives access to larger models and elastic capacity, but it introduces network dependency, usage charges and additional data-governance decisions.
What to measure
For generative models, useful metrics include time to first token, tokens per second and complete response time. Product teams should also measure the full workflow, including retrieval and tool calls. Optimizing the model alone helps little when most of the latency comes from another service or a poorly designed application pipeline.
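A minimal sketch of collecting these three metrics from a streamed response follows. fake_token_stream is a hypothetical stand-in for a real streaming client, and measure_stream is an illustrative helper, not part of any library. Tokens per second is computed here over the period after the first token, which is one common convention among several.

```python
import time


def fake_token_stream(n_tokens=5, first_delay=0.05, per_token_delay=0.01):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay)  # simulated wait before the first token arrives
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated gap between tokens
        yield f"tok{i}"


def measure_stream(stream):
    """Consume a token stream and return basic latency metrics."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    ttft = None if first_token_at is None else first_token_at - start
    tokens_per_second = None
    if count > 1 and end > first_token_at:
        # Throughput after the first token; other definitions also exist.
        tokens_per_second = (count - 1) / (end - first_token_at)

    return {
        "tokens": count,
        "time_to_first_token_s": ttft,
        "tokens_per_second": tokens_per_second,
        "total_s": end - start,
    }


metrics = measure_stream(fake_token_stream())
for name, value in metrics.items():
    print(name, value)
```

The same perf_counter pattern can be wrapped around retrieval and tool calls, so that the time spent in the model can be compared with the time spent in the rest of the pipeline.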