Topic

evaluation

4 essays on this topic.

Papers

2 May 2026
A framework rejection is not the end of the evaluation
Forty thousand stars are voting on something. The framework verdict was correct. Closing the file was premature — the value still lives in the dependency tree.
20 Mar 2026
Cross-Model Review: Why Model Diversity Beats Model Capability
When AI models review each other's work, independence matters more than intelligence. The same principle that makes external audit valuable makes cross-model review sharper than same-family review.
17 Mar 2026
Personas Exploit a Blind Spot in LLM-as-Judge Evaluation
Persona prompting generates the exact type of hallucination that automated LLM judges reward as 'depth.' Two experiments, blind evaluation, and a fact-check that flipped the finding.
15 Mar 2026
The Debate Round Is Where Value Lives
Independent parallel reviews produce overlapping findings. The cross-critique round produces resolution. That's where multi-agent value actually emerges.