Definition
Model Evaluation
Model evaluation measures whether an AI model is accurate, reliable, safe and useful enough for its intended task.
Short definition
Model evaluation is the process of testing an AI model against criteria that matter for a use case. It can measure accuracy, completeness, safety, latency, cost, bias, robustness and user satisfaction.
How it works
Teams create test datasets, benchmark tasks, human review rubrics or automated checks. For generative AI, evaluation often combines exact metrics with human judgement because good answers can vary in wording.
Example
Before deploying a customer support assistant, a team can test whether it answers policy questions correctly, refuses unsafe requests, cites sources and escalates uncertain cases.
Why it matters
Without evaluation, teams mostly guess whether a model is good enough. Evals turn AI adoption into an engineering process: compare versions, catch regressions and decide when a system needs human review.