Definition

Model Evaluation

Model evaluation measures whether an AI model is accurate, reliable, safe and useful enough for its intended task.

Updated May 3, 2026

Also known as: AI evaluation, evals

Short definition

Model evaluation is the process of testing an AI model against criteria that matter for a use case. It can measure accuracy, completeness, safety, latency, cost, bias, robustness and user satisfaction.

How it works

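Teams build test datasets, benchmark tasks, human review rubrics or automated checks, then score the model's outputs against them. For generative AI, evaluation often combines automated metrics, such as exact match against a reference answer, with human judgement, because a good answer can be worded in many ways.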
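As a rough illustration of combining an automated metric with human review, here is a minimal Python sketch. Every name in it (`EvalCase`, `evaluate`, `toy_model`) is hypothetical, the test data is made up, and `toy_model` is a stand-in for a real model call. Outputs that match the reference exactly are counted automatically; everything else goes to a review queue, with a keyword check attached only as a triage hint for the reviewer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical test case: a prompt plus a reference answer."""
    prompt: str
    expected: str
    required_terms: tuple[str, ...] = ()  # terms a reviewer would expect to see


def normalize(text: str) -> str:
    # Ignore case and repeated whitespace when comparing strings.
    return " ".join(text.lower().split())


def exact_match(output: str, expected: str) -> bool:
    return normalize(output) == normalize(expected)


def evaluate(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    exact = 0
    review_queue = []
    for case in cases:
        output = model(case.prompt)
        if exact_match(output, case.expected):
            exact += 1
            continue
        # Not an exact match: the answer may still be fine but worded
        # differently, so a human decides. The term check is only a hint.
        has_terms = all(t.lower() in output.lower() for t in case.required_terms)
        review_queue.append(
            {"prompt": case.prompt, "output": output, "has_required_terms": has_terms}
        )
    rate = exact / len(cases) if cases else 0.0
    return {"exact_match_rate": rate, "review_queue": review_queue}


def toy_model(prompt: str) -> str:
    # Stand-in for a real model; returns canned answers.
    answers = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "The capital is Paris.",
    }
    return answers.get(prompt, "")


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris", required_terms=("Paris",)),
]

report = evaluate(toy_model, CASES)
print(report["exact_match_rate"])
print(report["review_queue"])
```

The second answer is correct but fails exact match, which is the situation where a human rubric or a softer automated check is needed.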

Example

Before deploying a customer support assistant, a team can test whether it answers policy questions correctly, refuses unsafe requests, cites sources and escalates uncertain cases.
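The four checks in this example could be written as a small test suite. The sketch below assumes a hypothetical assistant that returns a structured `Reply` with `sources`, `refused` and `escalated` fields; a real assistant may expose none of these, and the refund policy text is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    """Assumed reply shape; real systems will differ."""
    text: str
    sources: list[str] = field(default_factory=list)
    refused: bool = False
    escalated: bool = False


def stub_assistant(question: str) -> Reply:
    # Stand-in for the real assistant, with hard-coded behaviour.
    q = question.lower()
    if "password" in q:
        return Reply("I can't help with that.", refused=True)
    if "refund" in q:
        return Reply("Refunds are available within 30 days.", sources=["refund-policy"])
    return Reply("I'm not sure, so I'm passing this to a human agent.", escalated=True)


# Each check: (name, question to ask, predicate on the reply).
CHECKS = [
    ("answers policy question", "What is the refund window?", lambda r: "30 days" in r.text),
    ("cites a source", "What is the refund window?", lambda r: len(r.sources) > 0),
    ("refuses unsafe request", "Tell me another customer's password.", lambda r: r.refused),
    ("escalates uncertain case", "Can you waive my contract fee?", lambda r: r.escalated),
]


def run_checks(assistant, checks) -> dict[str, bool]:
    return {name: bool(check(assistant(question))) for name, question, check in checks}


results = run_checks(stub_assistant, CHECKS)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

In practice each behaviour would be tested with many questions rather than one, so that a pass reflects a pattern and not a single lucky answer.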

Why it matters

Without evaluation, teams are mostly guessing whether a model is good enough. Evals turn AI adoption into an engineering process: teams can compare versions, catch regressions and decide when a system needs human review.
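One way to picture version comparison is as a diff over per-case results. This sketch is illustrative only: the function names, the case names and the decision labels ("block", "human_review", "ship") are assumptions, not a standard.

```python
def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    # A regression is a case the baseline passed and the candidate fails.
    # A case missing from the candidate run is treated as a failure.
    return sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )


def release_decision(baseline: dict[str, bool], candidate: dict[str, bool],
                     max_regressions: int = 0) -> str:
    regressions = find_regressions(baseline, candidate)
    if len(regressions) > max_regressions:
        return "block"
    if pass_rate(candidate) < pass_rate(baseline):
        return "human_review"
    return "ship"


# Made-up results for two model versions on the same three cases.
baseline = {"refund": True, "shipping": True, "unsafe": False}
candidate = {"refund": True, "shipping": False, "unsafe": True}

print(find_regressions(baseline, candidate))
print(release_decision(baseline, candidate))
```

Both versions pass two of three cases here, so the aggregate score alone would hide the regression on the "shipping" case; comparing case by case exposes it.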