The agent evaluation problem: why 87% passing is worse than 40% failing loudly.

Most production AI agents pass their evals. That's not the reassurance it sounds like. The real failure mode isn't a low score — it's a system that silently degrades while the dashboard stays green.

There's a particular kind of failure I keep seeing in production agentic systems, and it almost always presents the same way: someone runs the eval suite, reports a pass rate in the high eighties, and the system is shipped. Six weeks later a customer discovers the agent has been quietly wrong about something important for most of that time. When the team looks back at the eval dashboard, every run is green.

This is not a story about bad engineers. The engineers I see doing this are careful people. It's a story about an evaluation discipline that hasn't caught up with the systems it's meant to evaluate.

Why 87% is a lie

A pass rate is a single number summarising behaviour across a test set. For deterministic software that's fine — you change the code, the number moves, you investigate. For agentic systems the number is a compressed signal over an enormous latent space, and the compression hides almost everything that matters.

Consider two systems with identical 87% scores:

FIG 01: Same score, very different systems

Property            | System A                        | System B
--------------------|---------------------------------|--------------------------------------
Pass rate           | 87%                             | 87%
Failure shape       | Randomly distributed edge cases | All failures on high-value customers
Failure mode        | Declines confidently, asks for human | Fabricates plausible answer
Drift over 4 weeks  | Stable                          | Steadily worse, still passing
Operational posture | Shippable                       | Actively dangerous

The pass rate tells you nothing about which system you have. And in my experience you almost always have System B, because the tests that survive in an eval suite over time are the tests that pass — selection pressure makes your suite an optimistic portrait of the system.

The four failure modes evals miss

1. Loud failure vs. silent failure

A system that fails 60% of the time but knows it's failing — declining to answer, escalating to a human, flagging uncertainty — is a system you can operate. A system that fails 13% of the time and doesn't know it is a system that will injure customers. Evals that score on correctness without weighting failure-awareness are scoring the wrong thing.
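One way to make failure-awareness part of the score rather than an afterthought is to weight outcomes by how the system failed, not just whether it did. A minimal sketch, in Python; the outcome categories and weight values here are illustrative policy choices, not a standard:

```python
from enum import Enum

class Outcome(Enum):
    CORRECT = "correct"
    DECLINED = "declined"                # flagged uncertainty, escalated to a human
    WRONG_FLAGGED = "wrong_flagged"      # wrong, but the system knew it was unsure
    WRONG_CONFIDENT = "wrong_confident"  # wrong and presented as certain

# Illustrative weights: a decline on an ambiguous query counts as a pass;
# a confident wrong answer is penalised beyond an ordinary failure.
WEIGHT = {
    Outcome.CORRECT: 1.0,
    Outcome.DECLINED: 1.0,
    Outcome.WRONG_FLAGGED: 0.0,
    Outcome.WRONG_CONFIDENT: -1.0,
}

def failure_aware_score(outcomes: list[Outcome]) -> float:
    """Average weighted outcome; confident wrongness drags the score below
    what a simple pass rate would report."""
    return sum(WEIGHT[o] for o in outcomes) / len(outcomes)
```

Under this scheme a correct answer plus a confident wrong answer nets to zero, whereas a plain pass rate would report 50%: the metric itself now distinguishes loud failure from silent failure.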

2. Drift inside a passing score

The underlying model changes. The tool ecosystem changes. The data distribution your users send you changes. An 87% score today and an 87% score in six weeks can hide a complete change in which 13% is failing. If your eval suite doesn't track which tests flipped between runs, you're flying blind.
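Tracking which tests flipped is mechanically simple; the hard part is deciding to report it instead of the headline number. A sketch, assuming each run is stored as a mapping from test id to pass/fail (the test names are hypothetical):

```python
def diff_runs(prev: dict[str, bool], curr: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs keyed by test id -> passed.

    An unchanged headline score can hide churn: regressions and new passes
    cancel out in the aggregate, so surface both sets explicitly.
    """
    regressed = sorted(t for t in prev if prev[t] and not curr.get(t, False))
    recovered = sorted(t for t in prev if not prev[t] and curr.get(t, False))
    return {"regressed": regressed, "recovered": recovered}

# Both runs score 2/3, but which third is failing has changed.
last_week = {"refund-flow": True, "kyc-check": True, "vip-upsell": False}
this_week = {"refund-flow": True, "kyc-check": False, "vip-upsell": True}
```

Running the diff here reports `kyc-check` as regressed and `vip-upsell` as recovered, even though both weeks show the same pass rate.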

3. Tail-value concentration

A 13% failure rate is fine if failures are random. It is catastrophic if failures cluster on your top customers, your highest-revenue workflows, or your most regulated cases. Evals that don't weight by business value turn into averages that obscure the only thing the board cares about: where does this actually hurt us?
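Value-weighting is a one-line change to the aggregate, which makes it hard to excuse not doing. A sketch, assuming each case carries some business-value weight (revenue, regulatory exposure, whatever the board actually cares about):

```python
def value_weighted_pass_rate(results: list[tuple[bool, float]]) -> float:
    """results: (passed, business value) per case. Weighting passes by value
    means failures concentrated on high-value cases drag the score down."""
    total = sum(value for _, value in results)
    return sum(value for passed, value in results if passed) / total

# 87% of cases pass, but the 13 failures sit on accounts worth 10x the rest:
cases = [(True, 1.0)] * 87 + [(False, 10.0)] * 13
raw = sum(passed for passed, _ in cases) / len(cases)  # 0.87
weighted = value_weighted_pass_rate(cases)             # 87 / 217, roughly 0.40
```

Same system, same test set: the headline says 87%, the value-weighted view says roughly 40%. The second number is the one that describes where it hurts.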

4. Overfitting to the eval set

This is the most common and the most embarrassing. The eval suite becomes part of the development loop; prompts are tuned until the score goes up; the system gets better at the eval and not at the world. I've seen teams discover this only after a customer-visible incident, which is the most expensive place to discover it.

Rule of thumb
If the same team writes the prompts, the tools, and the evals, your score is marketing. An honest evaluation function has to have some measure of independence from the system it's grading — different author, different data source, or ideally both.

What an evaluation discipline actually looks like

The teams I see getting this right treat evaluation as a first-class engineering discipline, not a CI gate. It has these characteristics:

  • E.01: Live-traffic evaluation, not just a frozen test set. Sampled real queries, graded continuously, with humans in the loop for the high-value tail.
  • E.02: Failure-aware scoring. A decline-to-answer on an ambiguous query is a pass; a confident wrong answer is a compound fail.
  • E.03: Value-weighted aggregates. “87% passing” is meaningless; “99.4% passing on top-decile revenue, 78% elsewhere” is operational information.
  • E.04: Independent authorship. The people writing evals are not the people writing prompts. The eval set is versioned, diffed, and reviewed like production code.
  • E.05: Kill-switch criteria. Defined in advance: at what score on which metric does the system get rolled back? If there is no number, there is no discipline.

The purpose of evaluation is not to prove the system works. It's to find the ways in which it doesn't, faster than your customers will.
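The kill-switch point deserves to be concrete, because “defined in advance” means a number in code, not a sentiment in a meeting. A minimal sketch; the metric names and floor values are hypothetical and would be agreed per system before launch:

```python
def should_roll_back(metrics: dict[str, float],
                     floors: dict[str, float]) -> tuple[bool, dict[str, float]]:
    """floors: minimum acceptable value per metric, agreed before launch.
    A metric missing from the latest run is treated as a breach: if you
    stopped measuring it, you cannot claim it is healthy."""
    breaches = {m: metrics.get(m, float("-inf"))
                for m, floor in floors.items()
                if metrics.get(m, float("-inf")) < floor}
    return bool(breaches), breaches

# Hypothetical pre-agreed floors; the point is that they exist in writing.
FLOORS = {"top_decile_pass_rate": 0.99, "overall_pass_rate": 0.75}
roll_back, why = should_roll_back(
    {"top_decile_pass_rate": 0.982, "overall_pass_rate": 0.87}, FLOORS)
```

Here the overall number looks healthy, but the top-decile floor is breached, so the answer to “do we roll back?” was decided weeks ago, not in the incident channel.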

For boards and exec sponsors

If you oversee an AI programme, the single most useful question you can ask in your next review is not “what's the pass rate?” It's: “show me the last three things our evaluation system caught that we wouldn't have known about otherwise”.

If the answer is a thoughtful list of three specific regressions, you have a functioning eval discipline. If the answer is a restatement of the headline score, or worse, silence — you have a dashboard. And dashboards, in this domain, are how organisations discover their AI systems are broken roughly at the same time their customers do.


If you're standing up an agentic programme and want a hand pressure-testing the evaluation setup, that's part of what I do.

Robert Kehoe
Board advisory · AI & cyber