Definition

Model Evaluation

Model evaluation measures whether an AI model is accurate, reliable, safe and useful enough for its intended task.

Also known as: AI evaluation, evals

Short definition

Model evaluation is the process of testing an AI model against criteria that matter for a use case. It can measure accuracy, completeness, safety, latency, cost, bias, robustness and user satisfaction.

How it works

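Teams create test datasets, benchmark tasks, human review rubrics or automated checks. For generative AI, evaluation often combines automated metrics, such as exact match against a reference answer, with human judgement, because a correct answer can be worded in many ways and strict string comparison alone would penalize valid paraphrases.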
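
To illustrate why exact metrics alone fall short for generated text, here is a minimal sketch of two automated scores: a normalized exact match and a token-overlap F1. The function names and the normalization rule (lowercase, alphanumeric tokens only) are assumptions chosen for this example, not a standard that the text prescribes.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings contain the same tokens in the same order."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers agree; one empty answer shares nothing.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With the reference "Refunds usually take 5 business days", the answer "Refunds take 5 days" fails exact match but scores about 0.8 on token F1. Neither score says whether the shorter answer is acceptable for the use case, which is where a human rubric comes in.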

Example

Before deploying a customer support assistant, a team can test whether it answers policy questions correctly, refuses unsafe requests, cites sources and escalates uncertain cases.
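
The checks in this example could be encoded roughly as follows. This is a sketch under stated assumptions: the `Response` and `Case` structures, the three action labels and the substring check are all hypothetical, and a real assistant would expose its own response format.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical shape of one assistant reply."""
    text: str
    action: str  # assumed labels: "answer", "refuse" or "escalate"
    sources: list[str] = field(default_factory=list)


@dataclass
class Case:
    """One test case and the behaviour it expects."""
    prompt: str
    expected_action: str
    must_contain: str = ""      # text a correct answer has to include
    needs_source: bool = False  # whether a citation is required


def check(case: Case, response: Response) -> list[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    if response.action != case.expected_action:
        failures.append(f"expected {case.expected_action}, got {response.action}")
    if case.must_contain and case.must_contain.lower() not in response.text.lower():
        failures.append(f"missing required text: {case.must_contain!r}")
    if case.needs_source and not response.sources:
        failures.append("no source cited")
    return failures
```

A policy question would use `expected_action="answer"` with `needs_source=True`, an unsafe request would use `"refuse"`, and an uncertain case would use `"escalate"`. Substring matching is deliberately crude; it catches obvious misses but cannot judge whether an answer is actually correct.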

Why it matters

Without evaluation, teams mostly guess whether a model is good enough. Evals turn AI adoption into an engineering process in which teams compare versions, catch regressions and decide when a system needs human review.
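
Comparing versions and catching regressions can be as simple as diffing per-case results from two runs. The sketch below assumes each run is a mapping from a case ID to a pass/fail flag; the function names and the result format are illustrative.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs on the cases that both of them include."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before, fails now: the cases to investigate first.
        "regressions": sorted(c for c in shared if baseline[c] and not candidate[c]),
        # Failed before, passes now.
        "fixes": sorted(c for c in shared if not baseline[c] and candidate[c]),
        "baseline_pass_rate": pass_rate({c: baseline[c] for c in shared}),
        "candidate_pass_rate": pass_rate({c: candidate[c] for c in shared}),
    }
```

Listing regressions by case ID matters because an unchanged or improved pass rate can hide cases that newly fail.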

What a useful test set contains

Include normal requests, edge cases, ambiguous instructions, missing information and attempts to misuse the system. Examples should resemble real traffic rather than easy demonstrations selected by the development team.
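
One way to keep a test set balanced is to tag each case with a category and report which categories are under-represented. This is a minimal sketch; the category labels and the `category` field are assumptions that mirror the list above.

```python
from collections import Counter

# Assumed labels, one per kind of request named in the text.
REQUIRED_CATEGORIES = ("normal", "edge_case", "ambiguous", "missing_info", "misuse")


def coverage_gaps(cases: list[dict], minimum: int = 1) -> dict[str, int]:
    """Return each required category with fewer than `minimum` cases, with its count."""
    counts = Counter(case["category"] for case in cases)
    return {c: counts[c] for c in REQUIRED_CATEGORIES if counts[c] < minimum}
```

An empty result means every category meets the minimum. It does not show that the examples resemble real traffic; that still requires sampling from actual requests.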

Continuous evaluation

A benchmark run before launch is not enough. Models, prompts, retrieval sources and user behavior change over time. Maintain a stable regression set and add failures discovered in production. Automated judges can accelerate iteration, but domain experts are still necessary for specialized or high-impact tasks where an apparently fluent answer may contain a consequential error.
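
Adding production failures to the regression set could look like the sketch below. It assumes the set is a list of dictionaries with `prompt` and `expected` fields and skips near-duplicate prompts; the field names and the `origin` label are hypothetical.

```python
def normalize_prompt(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(prompt.lower().split())


def add_failure(regression_set: list[dict], prompt: str, expected: str) -> bool:
    """Append a production failure unless an equivalent prompt is already present.

    Returns True when the case was added and False when it was a duplicate.
    """
    key = normalize_prompt(prompt)
    if any(normalize_prompt(case["prompt"]) == key for case in regression_set):
        return False
    regression_set.append(
        {"prompt": prompt, "expected": expected, "origin": "production"}
    )
    return True
```

Existing cases are never modified or removed here, which keeps the set stable, so results from different dates remain comparable on the older cases.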