Topic
evals
3 essays on this topic.
Papers
- The Eval Gap
The scarce AI skill isn't building — it's knowing if what you built actually works.
- LLM evals aren't data science
Evaluating LLM systems requires judgment, not statistics. That shifts who's qualified to do it — and where the gap is in most organisations.
- AI Evals: Why Teams Build Metrics Before They've Read a Trace
Most teams build evaluators before reading a single trace. The sequence that actually works is the opposite: observe, categorise, then measure.