
Eval engineering: the regression gate no one built

You wouldn't ship a web app without tests. You probably ship AI features without evals. Here's the minimum viable eval engineering setup that catches regressions before users do.

evals · llm-as-judge · regression-testing · ai-engineering

Most AI features in production have less regression coverage than a junior dev's weekend project. The feature ships, works in demos, and then silently degrades when someone changes a prompt, the model, or a tool. Users hit it, support tickets pile up, and engineering can't reproduce it because the output is nondeterministic.

The fix is eval engineering: treat evals as infrastructure, gate merges on them, and spend as much on quality as on features. Here's the minimum viable version.

Three layers, not one

Evals aren't one thing. You need three:

| Layer | Size | Cadence | What it catches |
|---|---|---|---|
| Golden set | 20-100 items | Every PR (< 2 min) | Core-use-case regressions |
| Regression set | 200-500 items | Nightly / pre-release | Broad quality drift |
| Full eval | 1000-5000 items | Per major release | Statistical validation |

Teams that skip the golden set try to run the full eval on every PR. It's too slow; reviewers bypass it; it stops gating. Teams that only have golden sets catch P0s but miss the gradual P2 drift. You need all three.

The golden set is the one that matters

Golden items are the handful you must never regress on. Each item qualifies only if:

  • It represents a real user workflow
  • The expected output is objectively checkable
  • A regression on it would be P0
  • It discriminates (some configurations fail it)

Reject "nice to have" items. Reject flaky items. Reject items that 100% of your configurations pass — they're not discriminating.

After every production incident, add a golden item that would have caught the incident. That's how the set earns its weight over time.
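The qualification criteria and the incident-driven growth above can be sketched as a simple item schema. This is an assumption about shape, not a standard: the field names, the `Optional` incident link, and the substring check are all illustrative.

```python
# Sketch of a golden-set item. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GoldenItem:
    id: str
    input: str                     # a real user workflow, verbatim where possible
    check: str                     # objectively checkable assertion
    incident: Optional[str] = None # production incident this item was added after

def passes(item: GoldenItem, output: str) -> bool:
    # Minimal objective check: substring match. Real checks might be
    # regexes, JSON-schema validation, or exact match.
    return item.check in output
```

Anything that needs a human to squint at the output fails the "objectively checkable" criterion and doesn't belong in the golden set.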

Calibrate your LLM judge

For generative outputs, don't rely on exact match. Use an LLM judge, but calibrate it:

  1. Label 50-200 items by hand
  2. Run the judge on the same set
  3. Compute Cohen's kappa between human and judge
  4. If kappa < 0.6, the judge prompt is bad — rewrite it
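Steps 3-4 can be sketched with a hand-rolled Cohen's kappa over pass/fail labels (a real setup might use multi-class rubric scores, or a stats library):

```python
# Sketch: agreement between human labels and judge labels, corrected for chance.
from collections import Counter

def cohens_kappa(human, judge):
    assert len(human) == len(judge)
    n = len(human)
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    p_chance = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # degenerate: both raters always emit the same label
    return (p_observed - p_chance) / (1 - p_chance)

human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
if cohens_kappa(human, judge) < 0.6:
    print("judge disagrees with humans too often — rewrite the judge prompt")
```

Raw percent agreement overstates quality when labels are imbalanced; kappa is the cheapest correction for that.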

Recalibrate quarterly or whenever you change the judge. Use a different model family from what you're evaluating (Opus judges Sonnet, not itself) to reduce self-preference bias.

Pairwise comparison outperforms rubric scoring when you can get it. Show two outputs, ask which is better. Control for position bias by running each pair twice with swapped order.
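The swapped-order control can be sketched like this, where `judge` stands in for your LLM call and is assumed to return `"first"` or `"second"` for the order shown:

```python
# Sketch of position-bias control for pairwise judging.
def pairwise_verdict(judge, output_a, output_b):
    """Run each pair twice with swapped order; only agreeing runs count."""
    v1 = judge(output_a, output_b)  # A shown first
    v2 = judge(output_b, output_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"  # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"  # B wins regardless of position
    return "tie"    # verdict flipped with order: position bias, count as a tie
```

A judge that always prefers whatever is shown first produces all ties under this scheme, which is exactly the signal you want.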

Significance testing, not eyeballing

A 2% drop on 100 items is noise. A 2% drop on 10,000 items with a tight confidence interval is real. Use paired bootstrap on scalar metrics, McNemar's test on pass/fail:

import numpy as np

def bootstrap_ci(scores, n=1000, alpha=0.05):
    """Percentile bootstrap CI for the mean. For a paired comparison,
    `scores` should be per-item deltas (candidate minus baseline)."""
    means = [np.mean(np.random.choice(scores, size=len(scores), replace=True))
             for _ in range(n)]
    return np.percentile(means, [100*alpha/2, 100*(1-alpha/2)])
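The pass/fail side can be sketched with an exact McNemar's test using only the standard library. Only discordant pairs — items where exactly one configuration passes — carry information:

```python
# Sketch: McNemar's exact test on paired pass/fail results (baseline vs candidate).
from math import comb

def mcnemar_exact(baseline_pass, candidate_pass):
    b = sum(1 for x, y in zip(baseline_pass, candidate_pass) if x and not y)
    c = sum(1 for x, y in zip(baseline_pass, candidate_pass) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of change
    # Two-sided exact binomial p-value under H0: flips are 50/50
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, p)
```

The exact binomial form is one common choice; for large eval sets a chi-square approximation is fine too.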

Gating rule we like: block merge if any stratum regresses ≥ 3% with p < 0.05, or aggregate regresses ≥ 5% with p < 0.05. Smaller drops flag for review but don't block.
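That gating rule can be sketched as a pure function; the thresholds match the rule above, and the p-values are assumed to come from the paired tests in this section:

```python
# Sketch of the gating rule. Deltas are candidate minus baseline, so
# negative means regression; thresholds are the ones from the text.
def should_block(strata, aggregate):
    """strata: list of (name, delta, p); aggregate: (delta, p)."""
    for name, delta, p in strata:
        if delta <= -0.03 and p < 0.05:
            return True, f"stratum '{name}' regressed {delta:.1%}"
    delta, p = aggregate
    if delta <= -0.05 and p < 0.05:
        return True, f"aggregate regressed {delta:.1%}"
    return False, "no significant regression; flag smaller drops for review"
```

Keeping the rule in one function, versioned with the eval code, also stops pitfall #4 below: threshold changes show up in review.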

The CI setup

# .github/workflows/evals.yml
name: evals
on: pull_request
jobs:
  golden:
    runs-on: ubuntu-latest
    timeout-minutes: 3
    steps:
      - uses: actions/checkout@v4
      - run: npm run eval:golden  # fail on any regression
  regression:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - run: npm run eval:regression  # stratum-level thresholds

Golden blocks; regression publishes a report as a PR comment. Make it human-readable: per-stratum scores, confidence intervals, top regressing items with judge reasoning.
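The report body can be sketched as a markdown table renderer; the row shape here (stratum, baseline, candidate, CI bounds) is an assumption about what your runner emits:

```python
# Sketch of a per-PR report comment. Row shape is illustrative:
# (stratum, baseline_score, candidate_score, delta_ci_low, delta_ci_high).
def render_report(rows):
    lines = ["| Stratum | Baseline | Candidate | Δ 95% CI |",
             "|---|---|---|---|"]
    for stratum, base, cand, lo, hi in rows:
        lines.append(
            f"| {stratum} | {base:.1%} | {cand:.1%} | [{lo:+.1%}, {hi:+.1%}] |")
    return "\n".join(lines)
```

Pair each regressing row with the judge's reasoning for its worst items so reviewers can tell a real regression from judge noise at a glance.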

Model upgrade gating

When a new model ships, don't just swap it in. Run the full eval on both old and new. Expect some regressions on specific strata even when the average goes up — larger models can change behavior in ways your users notice.

Canary first. 5-10% of traffic, watch live metrics for a week. If safety, quality, and business metrics hold, roll forward.
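Canary assignment should be deterministic per user, not random per request, so a given user sees one model consistently. A minimal sketch, assuming stable user IDs:

```python
# Sketch of deterministic canary routing. The 5-10% fraction comes from
# the text; the hashing scheme is an illustrative choice.
import hashlib

def in_canary(user_id: str, fraction: float = 0.05) -> bool:
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < fraction
```

Hashing beats a random coin flip because it also makes live-metric comparisons clean: the same cohort stays in the canary for the whole week.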

Baselines are artifacts, not vibes

"Main is the baseline" is not a baseline. Store versioned baselines per (prompt, model, dataset) in an artifact store:

evals/baselines/
  prompt-v42_claude-sonnet-4-6_dataset-v3.json

When a PR merges, update the baseline. New baseline is the reference for the next PR. This gives you real longitudinal trend data — without it, you can't tell "we've drifted 10% over 6 months" from "we drifted in the last week".
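Loading and diffing a baseline can be sketched like this, assuming the JSON layout above and a flat `stratum -> score` mapping inside each file (the mapping shape is an assumption):

```python
# Sketch of baseline comparison against the artifact layout above.
import json
from pathlib import Path

def load_baseline(prompt_v, model, dataset_v, root="evals/baselines"):
    path = Path(root) / f"{prompt_v}_{model}_{dataset_v}.json"
    return json.loads(path.read_text())

def compare(baseline, candidate):
    """Per-stratum deltas; both args map stratum -> score."""
    return {s: candidate[s] - baseline[s] for s in baseline if s in candidate}
```

Because the filename encodes (prompt, model, dataset), changing any one of the three forces a new baseline instead of silently polluting the old trend line.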

Things people get wrong

  1. Judge = generator — the same model evaluates itself. Scores are inflated. Always use a different (stronger) model family
  2. No calibration — judge scores look clean but don't correlate with human preference. You're reporting noise
  3. No stratified reporting — aggregate hides localized regressions. A model can improve on average while silently wrecking edge cases
  4. Lowering thresholds to pass CI — the regression is real; you're just choosing not to see it
  5. Static eval sets — evals rot. Prune items that hit 100% pass for months. Add items after every incident
  6. Manual reviews as the only gate — humans are slow, inconsistent, and expensive. Automate the repeatable 90%

Start here

Three steps, in order, each deliverable in a week:

  1. Week 1: Build a golden set of 20 items for your most-critical AI feature. Wire it into CI. Gate merges on it
  2. Week 2: Add an LLM-as-judge evaluator. Calibrate against 50 human-labeled items
  3. Week 3: Expand to a 200-item regression set. Add stratification by use case. Publish per-PR reports

After three weeks you have coverage most teams never build. After a quarter, you have a real quality flywheel: incidents feed the golden set, the judge gets smarter, every PR has a quality signal attached.

For the full playbook — dataset design, contamination checks, LLM-judge prompts, cost optimization — see our evals skills suite.

latestaiagents | MIT License

Made with </> for the AI agent community