Evals are the new unit tests

A national prototype kilogram replica — measurement only means something against a standard. Public domain.

Prompting is the part that doesn't scale

There is a comfortable story the LLM-app gold rush tells about itself, in which the skilled work is prompt engineering: the right incantation, the cleverly structured context, the system prompt that unlocks the model's true potential. I have written before that the system prompt is mostly the product for the AI IDE vendors, and that prompting matured into context engineering for everyone else. Both of those are real. Neither of them is the discipline that separates teams who ship reliable LLM features from teams who ship demos that fall over in production.

That discipline is evals — the systematic, repeatable measurement of model and agent output quality against a fixed standard. And the claim of this post is straightforward: evals are to LLM applications what unit tests are to deterministic code. A team that ships an LLM feature without an eval harness is doing the equivalent of shipping a payments system with no tests and a good feeling about it. The reason this is not obvious yet is that the analogy has one load-bearing difference, and that difference is the whole game.

Unit tests assert equality. Evals measure distributions.

A unit test asserts that add(2, 2) returns exactly 4. It is binary, deterministic, and repeatable: same input, same output, green or red. That contract is the entire reason the test is useful — it pins behaviour to an exact value and screams the moment that value changes.

LLM output breaks every part of that contract. The same prompt produces different completions across runs (temperature, sampling, model updates underneath you). "Correct" is rarely a single string: there are many acceptable summaries of an article, many valid SQL queries that answer a question, many phrasings of a polite refusal. You cannot assert exact equality against a moving, non-deterministic target. So the unit-test contract has to be rewritten. Instead of asking "does this exact output equal the expected output?", an eval asks "across N runs on a representative dataset, what fraction of outputs meet the bar?", where the bar is a graded criterion, not a string match.

flowchart LR
    subgraph UT["Unit test (deterministic)"]
        A["add(2,2)"] --> B["== 4"] --> C["pass / fail"]
    end
    subgraph EV["Eval (non-deterministic)"]
        D["prompt × N runs"] --> E["golden dataset<br/>+ graded criterion"]
        E --> F["pass-rate / score<br/>distribution"]
        F --> G["regression vs.<br/>last known-good"]
    end

This is why "it worked when I tried it" is not evidence about an LLM feature. One sample from a distribution tells you almost nothing about the distribution. The eval is the instrument that turns anecdote into measurement: run the case fifty times, score each output, report the pass-rate and how it moved since the last model or prompt change. Without that instrument you are not engineering the feature, you are vibing it — and I have already written about where that bill arrives.
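
To make "run the case fifty times, score each output, report the pass-rate" concrete, here is a minimal sketch. Everything in it is illustrative: `fake_model` stands in for a real model call, `is_correct` for a real grader, and the baseline value is a hypothetical last known-good number, not a measurement.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              grade: Callable[[str], bool],
              prompt: str,
              n_runs: int = 50) -> float:
    """Run one prompt n_runs times and return the fraction of outputs that pass."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(grade(generate(prompt)) for _ in range(n_runs))
    return passes / n_runs


# Hypothetical stand-in for a non-deterministic model: right roughly 80% of the time.
rng = random.Random(0)


def fake_model(prompt: str) -> str:
    return "4" if rng.random() < 0.8 else "5"


def is_correct(output: str) -> bool:
    # Hypothetical grader; a real one would apply a graded criterion.
    return output.strip() == "4"


rate = pass_rate(fake_model, is_correct, "What is 2 + 2?", n_runs=50)
baseline = 0.80  # hypothetical last known-good pass-rate
print(f"pass-rate: {rate:.2f} (change vs. baseline: {rate - baseline:+.2f})")
```

The point of the shape is that the unit of evidence is the rate over many runs, not any single completion.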

The four kinds of eval, and when each earns its keep

"Eval" is an umbrella over several distinct techniques, and the mistake teams make is reaching for the cheapest one for everything. They trade off cost, scalability, and trustworthiness, and a serious harness uses more than one.

Golden datasets. A curated set of inputs paired with known-good outputs or acceptance criteria, scored by exact match, fuzzy match, or a rubric. This is the closest thing to a classic test suite, and it is the backbone of any harness because it is cheap to re-run and fully deterministic in its scoring even when the model is not. The work is front-loaded: building a dataset that actually represents production traffic — including the ugly edge cases — is the expensive, unglamorous part, and it is the part that pays back. Platforms like Braintrust lean on dataset management precisely because the dataset is the asset, not the prompt.
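
As a rough illustration of exact versus fuzzy scoring over a golden dataset, the sketch below uses only the Python standard library. The `GoldenCase` type, the two example cases, and the 0.8 similarity threshold are assumptions made up for the example; a real dataset would come from production traffic and a real threshold from calibration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class GoldenCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    # Case- and whitespace-insensitive equality.
    return output.strip().lower() == expected.strip().lower()


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    # Character-level similarity; the threshold is an arbitrary illustrative value.
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return ratio >= threshold


def score_dataset(cases: List[GoldenCase],
                  outputs: List[str],
                  scorer: Callable[[str, str], bool]) -> float:
    """Fraction of outputs that the scorer accepts against the golden answers."""
    if not cases or len(cases) != len(outputs):
        raise ValueError("need one output per golden case, and at least one case")
    hits = sum(scorer(out, case.expected) for case, out in zip(cases, outputs))
    return hits / len(cases)


cases = [
    GoldenCase("Capital of France?", "Paris"),
    GoldenCase("Status of ticket 12?", "Refund issued to customer"),
]
outputs = ["paris", "Refund was issued to the customer"]  # hypothetical model outputs

exact_score = score_dataset(cases, outputs, exact_match)
fuzzy_score = score_dataset(cases, outputs, fuzzy_match)
print(f"exact: {exact_score:.2f}, fuzzy: {fuzzy_score:.2f}")
```

The scoring is deterministic even though the outputs are not, which is what makes the dataset cheap to re-run.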

LLM-as-judge. A second model scores the output of the first against a natural-language rubric. This scales where golden datasets and humans do not: you can grade thousands of open-ended outputs for "is this a faithful summary" without writing a regex for faithfulness. It is also the technique with the sharpest failure mode, which the next section is entirely about.

Human eval. Domain experts rate outputs directly. It is the slowest and most expensive option and the only one that reliably catches the things automated graders miss — tone, subtle factual drift, domain nuance, the "almost but not quite right" category that experienced reviewers flag and machines wave through. Human eval does not scale, so the right use is to calibrate the cheaper methods against it: sample, have humans grade, and check that your LLM-judge agrees with them often enough to be trusted.
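
One way to check whether the judge "agrees with them often enough" is to compare its verdicts with human labels on the same sample. The sketch below reports raw agreement plus Cohen's kappa, which discounts agreement expected by chance. Kappa is my suggestion here rather than something the sources above prescribe, and the label lists are invented for illustration.

```python
from collections import Counter
from typing import List


def agreement_rate(human: List[bool], judge: List[bool]) -> float:
    """Fraction of cases where the judge's verdict equals the human's."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: List[bool], judge: List[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail labels on ten sampled outputs.
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, True, True, False, True, False, True, True, True]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

In this made-up sample the judge only ever errs in the lenient direction, which raw agreement alone would not reveal; looking at the disagreements themselves matters as much as the number.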

Regression evals. The CI/CD-facing layer. Pin a known-good pass-rate, run the suite on every prompt change, model upgrade, or dependency bump, and fail the build if the score drops. This is where evals stop being a research activity and become an engineering gate. Promptfoo — a CLI and library that runs headless, integrates with CI/CD, and is used by both OpenAI and Anthropic — is built for exactly this; tellingly, it was acquired by OpenAI in 2026 and kept open source, which is a reasonable signal about where the industry thinks the value sits.
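
The gate itself can be very small. The following is a generic sketch of the idea, not Promptfoo's API: the function names, the 0.02 tolerance, and the two pass-rates are all hypothetical. A CI job would compute the current rate from the eval suite and hand the returned code to `sys.exit` so that a regression fails the build.

```python
def regression_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the current pass-rate is within tolerance of the pinned baseline."""
    return current >= baseline - tolerance


def main(current: float, baseline: float) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    if regression_gate(current, baseline):
        print(f"OK: pass-rate {current:.2f} vs. pinned {baseline:.2f}")
        return 0
    print(f"FAIL: pass-rate dropped to {current:.2f} (pinned {baseline:.2f})")
    return 1


# Hypothetical numbers; in CI these would come from the eval run and a pinned file.
exit_code = main(current=0.87, baseline=0.91)
```

The tolerance exists because the score is itself noisy; too tight a gate produces flaky builds, too loose a gate lets slow regressions through.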

The pattern that experienced teams converge on, per the tooling roundups, is two tools: a lightweight framework for CI/CD gating paired with a heavier platform for human annotation and regression tracking.

The trap: LLM-as-judge is a circular guardrail

LLM-as-judge is the technique everyone reaches for, because it is the one that scales, and it is also the one most likely to quietly lie to you. The problem is structural, and it is the same problem I argued was fundamental in AI cannot guardrail against AI: when the verifier and the generator are cut from the same cloth, their blind spots overlap, and the verification is worth less than it looks.

The bias is documented and measurable. A NeurIPS-era paper, Self-Preference Bias in LLM-as-a-Judge (Wataoka, Takahashi, Ri), found that GPT-4 exhibits significant self-preference bias — it scores its own outputs higher than human evaluators do. The interesting part is the root cause: the authors traced it not to conscious self-recognition but to perplexity. LLMs "assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated." Because a model naturally generates the low-perplexity, familiar-looking text it also finds easy to read, it inadvertently rewards exactly the kind of fluent-but-wrong output that is hardest for humans to catch. The judge is biased toward the failure mode you most need it to catch.
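
For readers who have not met the term: perplexity is the exponential of the negative mean log-probability a model assigns to each token, so text the model finds predictable scores low. A small sketch, using invented per-token log-probabilities rather than output from any real model:

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean log-probability), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical values: fluent, familiar text gets high-probability tokens.
fluent = [-0.1, -0.2, -0.1, -0.3]
unusual = [-1.5, -2.0, -0.9, -2.4]

print(f"fluent:  {perplexity(fluent):.2f}")
print(f"unusual: {perplexity(unusual):.2f}")
```

Nothing in that formula measures correctness, which is why a judge that favours low perplexity can favour fluent errors.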

And self-preference is only one of the documented biases. The broader literature finds verbosity bias (preferring longer answers), positional bias (preferring an option because of where it appears in the prompt, often the first slot), and self-enhancement bias, with the choice of judge model itself being the single largest driver of positional bias. A 2026 audit of LLM-as-judge for software-engineering tasks found the same pattern of systematic, exploitable bias. None of this means LLM-as-judge is useless. It means it is an instrument with a known calibration error, and an instrument you have not calibrated against ground truth is decoration.

The discipline that makes it trustworthy is the same one that makes any measurement trustworthy: calibrate against a reference. Use a different model family for the judge than for the generator so their blind spots do not perfectly overlap. Anchor the judge on a golden dataset with human-verified labels and measure the judge's agreement with humans before you trust it on unlabelled traffic. Swap option order to detect positional bias. And keep a human in the loop on the cases that matter, because the only way to know your automated grader is still honest is to periodically check it against someone who is not made of the same training data.
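
The order-swap check can be sketched as follows. The judge interface (a function that returns "first" or "second") and the two toy judges are assumptions for illustration; a real judge would be a model call from a different family than the generator.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the judge picks the same answer in both orders."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for question, a, b in pairs:
        winner_ab = a if judge(question, a, b) == "first" else b  # a shown first
        winner_ba = b if judge(question, b, a) == "first" else a  # b shown first
        consistent += winner_ab == winner_ba
    return consistent / len(pairs)


def always_first(question: str, first: str, second: str) -> str:
    # Toy judge with pure positional bias.
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    # Toy judge with pure verbosity bias.
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "long answer here", "tiny"),
]
print(position_consistency(always_first, pairs))
print(position_consistency(prefers_longer, pairs))
```

Note what the second toy judge shows: it is perfectly consistent under swapping and still biased, so this check detects positional bias only and does not replace calibration against human labels.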

What an eval harness actually looks like in practice

Concretely, for a team shipping an LLM feature — say, a support-ticket summariser — the harness is less exotic than it sounds:

A golden dataset of real tickets with human-written acceptance criteria for each ("must name the product area," "must not invent a refund amount," "must flag if the customer is angry"). A few hundred real cases beat a few thousand synthetic ones.
Deterministic checks first. Anything you can assert without a model — does the output parse as JSON, does it stay under the length limit, does it cite a ticket ID that actually exists — runs as cheap, exact assertions. These are real unit tests, and they catch the dumbest failures for free.
An LLM-judge for the graded criteria, from a different model family, anchored to the human labels in the golden set and reported with its agreement rate against those labels so you know how much to trust it.
A regression gate in CI. Every prompt edit, model bump, or context-pipeline change re-runs the suite. A drop below the pinned pass-rate fails the build, the same way a failing unit test does.
A human-eval sample on a cadence — weekly, or per release — to recalibrate the judge and catch the drift no automated grader sees.
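
The "deterministic checks first" layer from the list above might look like this for the summariser. The JSON shape (`ticket_id`, `summary`), the 500-character limit, and the set of known ticket IDs are all hypothetical; the real values depend on your output schema and ticket store.

```python
import json
from typing import List

MAX_SUMMARY_CHARS = 500                   # hypothetical length limit
KNOWN_TICKET_IDS = {"T-1001", "T-1002"}   # hypothetical; would come from the ticket system


def check_summary(raw_output: str) -> List[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        failures.append("missing or empty 'summary'")
    elif len(summary) > MAX_SUMMARY_CHARS:
        failures.append("summary exceeds length limit")

    ticket_id = data.get("ticket_id")
    if not isinstance(ticket_id, str) or ticket_id not in KNOWN_TICKET_IDS:
        failures.append("unknown ticket_id")
    return failures


good = '{"ticket_id": "T-1001", "summary": "Customer reports login failure."}'
print(check_summary(good))
print(check_summary("not json"))
```

These run before any judge is invoked, so an output that fails here never costs a grading call.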

The tooling exists and is mature: OpenAI Evals (OpenAI-only); Promptfoo, LangSmith, and Braintrust (model-agnostic); and a long tail besides. The choice of framework matters far less than whether the harness exists at all and runs on every change. This is the same lesson as the diff-review one in Review LLM diffs as a team: the deterministic, machine-produced artifact is the thing the model cannot talk its way around, and it is the artifact you build the gate on.

A short close

The reason evals deserve the "new unit tests" framing is not that they are fashionable. It is that they occupy the same structural slot in the engineering process — the repeatable measurement that turns "seems fine" into "verified against a standard, and here is the number" — adapted for the one fact that everything about LLMs comes back to: the output is a distribution, not a value, so you measure pass-rates, not equality.

Teams that ship reliable LLM features have an eval harness, run it on every change, and treat a dropping pass-rate as a build failure. Teams that do not are flying blind, mistaking a good demo for a working system, and they will find out which they had the same way the vibe coders are finding out — later, in production, in front of a customer. Prompting got you the demo. Evals are how you find out whether you have a product. And the one thing the eval harness must not become is a hall of mirrors where the model that wrote the answer also decides the answer was good — because as the self-preference data shows, it will, and it will be wrong in exactly the way you most needed it to be right.