2026-05-27 · 2 min read

Evals are the product

Most AI products don't fail on the model. They fail on the absence of a way to know whether a change made things better. The eval harness — not the model — is the asset you own.

Models are rented. You'll swap yours two or three times this year, and the swap will take an afternoon. The eval harness is owned — the curated cases, the calibrated judge, the regression history that answers "is it actually better?" with a number. That asymmetry is the whole argument: the durable asset isn't the model, it's the thing that tells you the truth about the model.

Teams that fail with AI fail "almost universally" on the absence of that thing. And the trap is predictable — they jump straight to A/B testing, the one rung that requires production traffic, and skip the two below it that would have let them improve at all.

L1 · assertions: per-commit, cheap, no model
L2 · calibrated judge: trace logs, agreement-tracked
L3 · A/B: production only

Fig 1: the ladder. Cheap, fast assertions at the bottom; expensive A/B at the top. Most teams wrongly start at the top.

You can't eyeball your way up this ladder. LLM output is non-deterministic and context-dependent, so a vibe check can't scale past the handful of traces you happen to read — and generic, off-the-shelf metrics are often worse than useless. The method that works is unglamorous: read a hundred real traces, cluster the failures into five or ten modes, and let frequency set your priorities. Deterministic failures become code assertions on a golden set. Subjective ones go to a model judge — scored binary PASS/FAIL, never a 1–5 Likert, because nobody agrees on a 3 versus a 4.
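A minimal sketch of the first two steps, assuming the traces have already been read and hand-labelled. The failure-mode names, the `labelled_traces` records, and the JSON contract being asserted are all hypothetical; substitute whatever your own traces surface.

```python
import json
from collections import Counter

# Hypothetical hand-labelled traces; the failure-mode names are illustrative only.
labelled_traces = [
    {"id": 1, "failure_mode": "invalid_json"},
    {"id": 2, "failure_mode": None},  # this trace was fine
    {"id": 3, "failure_mode": "invalid_json"},
    {"id": 4, "failure_mode": "ignored_instruction"},
    {"id": 5, "failure_mode": "invalid_json"},
    {"id": 6, "failure_mode": "ignored_instruction"},
    {"id": 7, "failure_mode": "wrong_tone"},
]


def rank_failure_modes(traces):
    """Count failure modes across labelled traces, most frequent first."""
    counts = Counter(
        t["failure_mode"] for t in traces if t["failure_mode"] is not None
    )
    return counts.most_common()


def passes_json_contract(output: str) -> bool:
    """Deterministic assertion (assumed contract): the output must be a JSON
    object with a non-empty "answer" field. No model is involved."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("answer"))


# Hypothetical golden set: recorded outputs that must all satisfy the contract.
golden_outputs = ['{"answer": "Paris"}', '{"answer": "42"}']
golden_failures = [o for o in golden_outputs if not passes_json_contract(o)]

if __name__ == "__main__":
    for mode, count in rank_failure_modes(labelled_traces):
        print(f"{mode}: {count}")
    print("golden-set failures:", golden_failures)
```

The ranking decides which assertion to write first. A subjective mode such as `wrong_tone` would go to the judge instead, with a binary PASS/FAIL verdict.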

And here's the part people skip: an uncalibrated judge is a random number generator wearing a lab coat.

Judge bias: swing in win-rate / preference
position (swap the order): 80% (2.5% → 82.5%)
self-enhancement (judges own output): 40% (87.8% vs 47.6%)
verbosity (longer looks better): 20% (before length control)

Fig 2: how far a judge swings on things that aren't the answer. Calibrate before you trust.

Swap which answer you show first and a judge's win-rate can move from 2.5% to 82.5%: an eighty-point swing on ordering alone. A model asked to grade its own output prefers it ~88% of the time, where human raters prefer that same output only ~48% of the time. So you calibrate: show every pair in both orders and count only consistent wins, hold temperature low, give an explicit rubric, and measure the judge's agreement against a human-labeled set. Do that and a judge reaches ~80% agreement with humans, which is also roughly what two humans reach with each other. Calibration is the difference between a second opinion and a coin.
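Two of those calibration steps are mechanical enough to sketch: counting only order-consistent wins, and measuring raw agreement with human labels. The `Judge` signature here is an assumption (a callable that sees two answers and returns `"first"` or `"second"`), and the two toy judges stand in for a real model call.

```python
from typing import Callable, Sequence

# Assumed interface: a judge sees two answers in display order and
# returns "first" or "second" for the one it prefers.
Judge = Callable[[str, str], str]


def consistent_verdict(judge: Judge, a: str, b: str) -> str:
    """Show the pair in both orders; count a win only if it survives the swap."""
    a_wins_when_first = judge(a, b) == "first"
    a_wins_when_second = judge(b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "inconsistent"  # the verdict flipped with the ordering: discard it


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge's PASS/FAIL matches the human label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy stand-ins for a model judge (hypothetical, for illustration only).
def position_biased_judge(first: str, second: str) -> str:
    return "first"  # always prefers whatever is shown first


def longer_is_better_judge(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(consistent_verdict(position_biased_judge, "short", "a longer answer"))
    print(consistent_verdict(longer_is_better_judge, "short", "a longer answer"))
    print(agreement_rate(["PASS", "FAIL", "PASS", "PASS"],
                         ["PASS", "FAIL", "FAIL", "PASS"]))
```

The position-biased judge never produces a consistent win, which is the point of the swap. The second toy judge is perfectly order-consistent and still wrong (it has a verbosity bias), which is why agreement against a human-labeled set is tracked separately.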

One honest wrinkle to design around: humans aren't ground truth either. We rate assertive-but-wrong answers ~15–20% higher than cautious-but-right ones. Calibrate a judge naïvely against human preference and you teach it to reward confident nonsense — the exact failure a good system should punish. So your probes should reward abstention, not just correctness.
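One way to make a probe set reward abstention is to score wrong answers below abstentions, rather than lumping both in as "not correct". The sketch below assumes a simple +1 / 0 / -1 rule with a tunable `wrong_penalty`; that scoring rule and the two example systems are illustrative assumptions, not a standard.

```python
from typing import Sequence


def accuracy(outcomes: Sequence[str]) -> float:
    """Plain accuracy: abstaining and being wrong both count as misses."""
    return sum(o == "correct" for o in outcomes) / len(outcomes)


def abstention_aware_score(outcomes: Sequence[str], wrong_penalty: float = 1.0) -> float:
    """Assumed rule: correct = +1, abstain = 0, wrong = -wrong_penalty."""
    total = 0.0
    for outcome in outcomes:
        if outcome == "correct":
            total += 1.0
        elif outcome == "wrong":
            total -= wrong_penalty
        elif outcome != "abstain":
            raise ValueError(f"unknown outcome: {outcome!r}")
    return total / len(outcomes)


# Hypothetical results on the same ten probes.
confident_system = ["correct"] * 6 + ["wrong"] * 4    # answers everything
cautious_system = ["correct"] * 5 + ["abstain"] * 5   # declines when unsure

if __name__ == "__main__":
    for name, outcomes in [("confident", confident_system), ("cautious", cautious_system)]:
        print(name, accuracy(outcomes), abstention_aware_score(outcomes))
```

Under plain accuracy the confident system wins (0.6 vs 0.5). Under the penalized score the cautious one does (0.5 vs 0.2), which matches the ordering a system that punishes confident nonsense should produce.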

Finally, gate on it. When you compare two versions, they faced the same probes, so compare them paired, with McNemar's test, not by subtracting two noisy accuracies. A merge that can't beat the current version on the held-out set doesn't land. That's not bureaucracy; it's the only thing standing between "feels better" and "is better."
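A paired comparison looks only at the probes where the two versions disagree. Here is a sketch of the exact (binomial) form of McNemar's test using only the standard library. The `alpha = 0.05` threshold and the rule "more fixes than regressions, and significant" are assumptions about how a gate might be set, and the pass/fail lists are made up.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(old_pass: Sequence[bool], new_pass: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired PASS/FAIL results.

    Returns (regressions, fixes, p_value). Only discordant pairs matter:
    probes that both versions pass, or both fail, carry no information.
    """
    if len(old_pass) != len(new_pass):
        raise ValueError("both versions must be run on the same probes")
    regressions = sum(o and not n for o, n in zip(old_pass, new_pass))
    fixes = sum(n and not o for o, n in zip(old_pass, new_pass))
    discordant = regressions + fixes
    if discordant == 0:
        return regressions, fixes, 1.0
    # Under the null, each discordant pair is a fair coin flip.
    k = min(regressions, fixes)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return regressions, fixes, min(1.0, 2 * tail)


def may_merge(old_pass: Sequence[bool], new_pass: Sequence[bool], alpha: float = 0.05) -> bool:
    """Assumed gate: the new version must fix more than it breaks, significantly."""
    regressions, fixes, p_value = mcnemar_exact(old_pass, new_pass)
    return fixes > regressions and p_value < alpha


# Hypothetical held-out results for 20 probes, in the same order for both versions.
old_results = [True] * 8 + [True] * 1 + [False] * 9 + [False] * 2
new_results = [True] * 8 + [False] * 1 + [True] * 9 + [False] * 2

if __name__ == "__main__":
    print(mcnemar_exact(old_results, new_results))
    print(may_merge(old_results, new_results))
```

In this made-up run the new version breaks 1 probe and fixes 9, for an exact p-value of 22/1024 (about 0.021), so the gate opens. With few discordant pairs the test has little power, so a small held-out set will block most merges regardless of quality.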