Topic

evals

3 essays on this topic.

  1. The Eval Gap

    The scarce AI skill isn't building — it's knowing if what you built actually works.

  2. LLM evals aren't data science

    Evaluating LLM systems requires judgment, not statistics. That shifts who's qualified to do it — and where the gap is in most organisations.

  3. AI Evals: Why Teams Build Metrics Before They've Read a Trace

    Most teams build evaluators before reading a single trace. The sequence that actually works is the opposite: observe, categorise, then measure.