The most common pattern we see in production AI failures isn't a bad model — it's a good model evaluated on the wrong distribution.
Teams typically validate AI systems using benchmark datasets: held-out splits from the same distribution they trained on, or standard academic evaluations like MMLU or HellaSwag. These are reasonable proxies for general capability. They are poor proxies for your specific production task.
The distribution problem
Consider a language model deployed to classify customer support tickets. The team evaluated it on a labelled sample of historical tickets and found 91% accuracy. They shipped it. Three months later, the model was misrouting a disproportionate share of tickets from enterprise accounts, whose messages were longer and more formally worded than those in the historical sample used for evaluation.
The model hadn't degraded. The eval had never covered that subpopulation.
This failure mode is structural. Aggregate accuracy conceals performance on the subsets that matter most. An overall accuracy of 91% that includes 60% accuracy on a 15% slice of high-value inputs is a much worse outcome than the headline number suggests.
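To make the arithmetic concrete, here is a minimal sketch that works backwards from the figures above. It assumes only that overall accuracy is the share-weighted average of accuracy on the slice and accuracy on everything else; the variable names are ours.

```python
# Figures from the example above: 91% overall accuracy,
# 60% accuracy on a slice that makes up 15% of inputs.
overall_acc = 0.91
slice_share = 0.15
slice_acc = 0.60

# Overall accuracy is the share-weighted average of the two groups:
#   overall = slice_share * slice_acc + (1 - slice_share) * rest_acc
# Solve for the accuracy on everything outside the slice.
rest_acc = (overall_acc - slice_share * slice_acc) / (1 - slice_share)

# Fraction of all errors that land on the slice.
slice_error_share = slice_share * (1 - slice_acc) / (1 - overall_acc)

print(f"accuracy outside the slice: {rest_acc:.3f}")
print(f"share of all errors on the slice: {slice_error_share:.3f}")
```

Under these numbers the model is about 96.5% accurate on the other 85% of inputs, and roughly two-thirds of all its errors fall on the 15% slice. The headline figure averages a very good model and a poor one into a single number.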
What an evaluation harness actually requires
A useful eval isn't just a held-out test set. It requires:
- Task decomposition. What are the distinct sub-tasks the model is performing? A document summarisation system is doing entity recognition, coreference resolution, compression, and factual retention simultaneously. Each should be evaluated separately.
- Failure mode enumeration. What are the ways the model can fail, and what is the cost of each? In fraud detection, a false negative carries a very different cost from a false positive. An evaluation that only tracks aggregate F1 obscures this.
- Slice analysis. Does performance degrade on particular input subsets — by length, domain, writing register, or recency? If so, the aggregate metric is hiding a real problem.
- Representative sampling. The eval set should reflect production inputs, not just the training distribution. If production will include edge cases, adversarial inputs, or novel phrasings, those need to be in the eval.
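Two of the requirements above, slice analysis and failure costs, can be sketched in a few lines. This is an illustration rather than a full harness: the records, the slice names, and the cost table are all hypothetical, and in practice the costs would come from whoever owns the business outcome.

```python
from collections import defaultdict

# Hypothetical eval records: (slice name, true label, predicted label).
records = [
    ("smb", "billing", "billing"),
    ("smb", "outage", "outage"),
    ("smb", "billing", "billing"),
    ("enterprise", "outage", "billing"),
    ("enterprise", "billing", "billing"),
    ("enterprise", "outage", "billing"),
]

# Hypothetical cost of each (true, predicted) confusion.
# Pairs not listed here, including correct predictions, cost 0.
error_cost = {("outage", "billing"): 10.0, ("billing", "outage"): 1.0}


def evaluate_by_slice(records, error_cost):
    """Return per-slice stats plus an "overall" entry.

    Assumes no slice is itself named "overall".
    """
    groups = defaultdict(list)
    for slice_name, y_true, y_pred in records:
        groups[slice_name].append((y_true, y_pred))
        groups["overall"].append((y_true, y_pred))

    report = {}
    for name, pairs in groups.items():
        correct = sum(t == p for t, p in pairs)
        cost = sum(error_cost.get((t, p), 0.0) for t, p in pairs if t != p)
        report[name] = {
            "n": len(pairs),
            "accuracy": correct / len(pairs),
            "mean_cost": cost / len(pairs),
        }
    return report


report = evaluate_by_slice(records, error_cost)
for name, stats in report.items():
    print(f"{name:<12} n={stats['n']} "
          f"acc={stats['accuracy']:.2f} mean_cost={stats['mean_cost']:.2f}")
```

Reporting the overall row next to the per-slice rows is deliberate: the gap between them is the thing you are looking for. Weighting errors by cost rather than counting them also keeps a cheap, frequent mistake from masking a rare, expensive one.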
The practical implication
Build the eval harness before you build the system. This is counterintuitive — it feels like you're writing tests for code that doesn't exist yet. But the act of constructing the eval forces you to be precise about what the system is supposed to do. Ambiguity in the eval almost always reflects ambiguity in the specification.
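One way to do this, sketched under our own assumptions, is to write the eval as a function that accepts any text-to-label callable and to run it against a placeholder until a real model exists. The cases, labels, and function names below are hypothetical.

```python
from typing import Callable

# Hypothetical eval cases. Each one records which behaviour it checks,
# so a failure points back at a specific line of the specification.
EVAL_CASES = [
    {"text": "I was charged twice this month.",
     "expected": "billing", "checks": "plain billing complaint"},
    {"text": "Pursuant to our agreement, we request a review of invoice 1042.",
     "expected": "billing", "checks": "formal enterprise register"},
    {"text": "The dashboard has been down since 09:00.",
     "expected": "outage", "checks": "outage report"},
]


def run_eval(predict: Callable[[str], str], cases=EVAL_CASES):
    """Run any text -> label callable over the cases.

    Returns (accuracy, list of failed cases).
    """
    failures = [c for c in cases if predict(c["text"]) != c["expected"]]
    return 1 - len(failures) / len(cases), failures


def placeholder_predict(text: str) -> str:
    # Stand-in until a real model exists: always predicts one label.
    return "billing"


accuracy, failures = run_eval(placeholder_predict)
print(f"accuracy: {accuracy:.2f}")
for case in failures:
    print("failed:", case["checks"])
```

The placeholder is expected to fail. What matters is that every case forced a decision about what the right answer is, and that the same `run_eval` call will score the real system once it exists.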
If you can't write a clear evaluation for a system, you don't yet understand the problem well enough to build the solution.