Posts about evaluation
-
Cross-Model Review: Why Model Diversity Beats Model Capability
When AI models review each other's work, independence matters more than intelligence. The same principle that makes external audit valuable makes cross-model review sharper than same-family review.
-
Personas Exploit a Blind Spot in LLM-as-Judge Evaluation
Persona prompting generates the exact type of hallucination that automated LLM judges reward as "depth." Two experiments, blind evaluation, and a fact-check that flipped the finding.
-
The Debate Round Is Where Value Lives
Independent parallel reviews produce overlapping findings. The cross-critique round produces resolution. That's where multi-agent value actually emerges.