I run an autonomous AI system that generates research briefs, study questions, financial analysis, and health protocols overnight. For months, the review process was simple: Opus (Claude’s strongest model) reviewed outputs from Sonnet (Claude’s workhorse model). Same family checking same family.
Today, during a taste calibration session — where I review the AI’s routing decisions and quality — I realised the review architecture has a structural flaw. It’s internal audit. And internal audit has a ceiling.
The blind spot problem
When Opus reviews Sonnet’s output, it’s checking work produced by a model trained on similar data, with similar failure modes, optimised by similar RLHF processes. If Sonnet hallucinates a regulatory date, Opus might not catch it — because both models learned from the same corpus and share the same gaps.
This isn’t hypothetical. The system caught the claim that “HSBC rose 12 places in the Evident AI Index” after it had propagated across 14 files. The correct figure was a move from #13 to #7 to #8; the 12-place jump was actually Lloyds Banking Group’s. An Opus reviewer caught it this time. But how many similar errors slipped through because the reviewer shared the same training-data blind spots as the generator?
The audit analogy
External audit exists because independence has structural value. It’s not that external auditors are smarter — they’re often less familiar with the business. But their independence means they’re not anchored to the same assumptions.
Cross-model review works the same way. A Gemini model reviewing Claude’s output brings different training data, different failure modes, different strengths. It might miss things Claude catches easily, but it catches things Claude consistently misses. The value is in the orthogonality, not the capability.
Disagreement is the signal
When you have multiple independent reviewers, three outcomes are possible:
- All agree it’s fine. High confidence. Move on.
- All agree something’s wrong. Fix it. Still no human needed.
- They disagree. This is the signal. This is where human taste actually matters.
The human’s role shifts from reviewer to tiebreaker. You’re not reading every output — you’re resolving the cases where independent models see different things. That’s where your judgment adds maximum information.
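The three-outcome triage can be sketched in a few lines. This is a minimal illustration, not any tool’s real API; the `Review` structure and model names are placeholders.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Outcome(Enum):
    PASS = auto()      # all approve: high confidence, move on
    AUTO_FIX = auto()  # all flag it: fix, still no human needed
    ESCALATE = auto()  # mixed verdicts: the human tiebreaker steps in

@dataclass
class Review:
    model: str       # e.g. "opus", "gemini" -- placeholder names
    approved: bool
    notes: str = ""

def triage(reviews: list[Review]) -> Outcome:
    """Collapse a set of independent reviews into one of three outcomes."""
    verdicts = {r.approved for r in reviews}
    if verdicts == {True}:
        return Outcome.PASS
    if verdicts == {False}:
        return Outcome.AUTO_FIX
    return Outcome.ESCALATE  # disagreement is the signal
```

Only the `ESCALATE` branch produces work for a person; the other two resolve without one.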
The frontier prerequisite
This only works because current models are individually reliable. If each model were wrong 30% of the time, disagreements would be everywhere and mean nothing — all noise, no signal.
It’s precisely because frontier models are right 95%+ of the time that their disagreements become meaningful. When two reliable-but-independent systems disagree, something interesting is happening. That’s worth a human’s attention.
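A back-of-envelope calculation makes this concrete. If each of two reviewers errs independently with probability e (a simplifying assumption; real models share correlated failures), they split whenever exactly one is wrong:

```python
def disagreement_rate(error_rate: float) -> float:
    """P(exactly one of two independent reviewers is wrong) = 2e(1 - e).
    Independence is optimistic, so treat this as a rough upper bound."""
    e = error_rate
    return 2 * e * (1 - e)

# Frontier-grade reliability: roughly 1 output in 10 escalates to a human.
print(round(disagreement_rate(0.05), 3))  # 0.095
# A 30% error rate: nearly half of all outputs escalate -- all noise.
print(round(disagreement_rate(0.30), 3))  # 0.42
```

At 5% error, escalations are rare enough that each one deserves attention; at 30%, the escalation queue drowns the human.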
But — and this is the key insight — the selection criterion isn’t “pick the smartest.” It’s “pick from the Pareto frontier of intelligence vs cost.” A model that’s slightly less capable but much cheaper adds more value to the council than a marginally smarter model at 10x the price. You want each council member to be efficient for its role, not uniformly expensive.
Plot all available models on a scatter plot: intelligence on one axis, cost on the other. The models on the efficient frontier — those where no other model is both smarter and cheaper — are your candidate pool. Then select for maximum independence across families within that pool. The result is a cost-efficient council with genuine diversity, not an expensive panel of near-identical frontier models.
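The frontier selection is a standard Pareto dominance filter. A minimal sketch, with illustrative placeholder scores rather than real benchmark numbers:

```python
def pareto_frontier(models: list[dict]) -> list[str]:
    """Keep models for which no other model is at least as smart AND at
    least as cheap (and strictly better on one axis)."""
    frontier = []
    for m in models:
        dominated = any(
            o["iq"] >= m["iq"] and o["cost"] <= m["cost"]
            and (o["iq"] > m["iq"] or o["cost"] < m["cost"])
            for o in models
        )
        if not dominated:
            frontier.append(m["name"])
    return frontier

# Illustrative candidates: intelligence score (higher is better),
# price per million tokens (lower is better). Not real figures.
candidates = [
    {"name": "big-frontier", "iq": 95, "cost": 15.0},
    {"name": "mid-fast",     "iq": 90, "cost": 3.0},
    {"name": "mid-pricier",  "iq": 88, "cost": 5.0},  # dominated by mid-fast
    {"name": "small-cheap",  "iq": 80, "cost": 1.0},
]
print(pareto_frontier(candidates))  # ['big-frontier', 'mid-fast', 'small-cheap']
```

Everything surviving the filter is a legitimate council candidate; the second pass (maximising family diversity) happens within this pool.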
Beyond review
The same principle applies to generation, not just review. An autonomous system that routes tasks to different model families — Gemini for web-grounded research, Codex for code, Claude for judgment and prose — gets genuine diversity in its outputs. Different models bring different knowledge, different reasoning patterns, different failure modes.
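At its simplest, that routing layer is just a table from task type to model family. The family assignments below mirror the text; the task-type keys are invented for illustration:

```python
# Hypothetical task-type -> model-family routing table.
ROUTES = {
    "web_research": "gemini",  # web-grounded retrieval
    "code": "codex",           # code generation and edits
    "prose": "claude",         # judgment and writing
}

def route(task_type: str, default: str = "claude") -> str:
    """Pick a model family per task; unknown types fall back to the default."""
    return ROUTES.get(task_type, default)

print(route("code"))     # codex
print(route("unknown"))  # claude
```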
The portfolio analogy holds: diversification reduces correlated risk. A portfolio of independent models is more robust than concentration in a single strong model, just as a diversified portfolio outperforms concentration in a single strong stock on a risk-adjusted basis.
The infrastructure implication
If model diversity is universally valuable, it shouldn’t be a feature of any single tool — it should be infrastructure. Every tool in the stack that currently runs single-model should be able to route through a multi-model deliberation layer. The primitive is: filesystem-aware CLI tools (so models can read local files and verify facts) plus a disagreement-surfacing mechanism (so humans see only what matters).
The tools already exist: Gemini CLI, Codex CLI, and OpenCode (which connects to Ollama, OpenRouter, and any OpenAI-compatible endpoint). The architectural work is wiring them into a council pattern where each member has full context access and disagreements bubble up.
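A minimal wiring of that council pattern, with each member abstracted as a callable so it can wrap whatever invocation your install of Gemini CLI, Codex CLI, or OpenCode actually accepts (the exact commands vary, so none are hard-coded here):

```python
from typing import Callable

# A member takes the artifact text (read upstream with full filesystem
# context) and returns an approval verdict plus notes.
Member = Callable[[str], tuple[bool, str]]

def run_council(artifact: str, members: dict[str, Member]) -> dict:
    """Run every member over the same artifact; bubble up disagreement."""
    verdicts = {name: fn(artifact) for name, fn in members.items()}
    approvals = {ok for ok, _ in verdicts.values()}
    return {
        "verdicts": verdicts,
        "needs_human": len(approvals) > 1,  # only splits reach a person
    }

# Stub members standing in for real CLI calls:
council = {
    "gemini": lambda text: (True, "looks fine"),
    "codex": lambda text: (False, "date in para 2 unverified"),
}
print(run_council("draft brief...", council)["needs_human"])  # True
```

Swapping a stub for a real member means shelling out to the relevant CLI and parsing its verdict; the deliberation logic itself stays this small.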
The design insight: consilium (multi-model deliberation) isn’t a tool. It’s the foundation layer everything else should sit on.