Topic

evaluation

4 essays on this topic.

  1. A framework rejection is not the end of the evaluation

    Forty thousand stars are voting on something. The framework verdict was correct. Closing the file was premature — the value still lives in the dependency tree.

  2. Cross-Model Review: Why Model Diversity Beats Model Capability

    When AI models review each other's work, independence matters more than intelligence. The same principle that makes external audit valuable makes cross-model review sharper than same-family review.

  3. Personas Exploit a Blind Spot in LLM-as-Judge Evaluation

    Persona prompting generates the exact type of hallucination that automated LLM judges reward as 'depth.' Two experiments, blind evaluation, and a fact-check that flipped the finding.

  4. The Debate Round Is Where Value Lives

    Independent parallel reviews produce overlapping findings. The cross-critique round produces resolution. That's where multi-agent value actually emerges.