# Correctness is model-determined
I ran the same 12 coding tasks through four different AI coding agent harnesses — Claude Code, OpenCode, Goose, and Droid — all using the same model (GLM-5.1 via ZhiPu). The result surprised me.
## The setup
Nine synthetic tasks: implement from tests, explore a codebase, refactor across files, fix bugs, generate tests, use MCP tools, navigate a real codebase, multi-step workflows, follow coaching constraints.
Three real-world tasks on my actual codebase: a one-line surgical fix in a 1000-line Python file, adding a guard function to a hook system, implementing a PostToolUse handler.
All four harnesses got the same prompts, the same model, the same API endpoint.
## The punchline
Correctness scores were nearly identical across all four harnesses. Every harness scored 10/10 on bugfix, 6/6 on exploration, 10/10 on simple implementation, 7/7 on coaching adherence. On the real-world tasks, all four scored 5/5 on the feature add and 4/5 on the hook implementation.
The harness is a thin proxy. GLM-5.1 does the work. The system prompt differences between harnesses change speed and verbosity, not correctness.
## What does vary
Speed and thoroughness vary by task type, and no harness dominates:

| Task type | Winner |
|---|---|
| Simple implementation | Goose (30s vs 74s CC) |
| Bug fixes | Goose (37s) |
| Exploration | CC (28s) |
| Complex feature adds | OpenCode (50s vs 206s CC) |
| Coaching adherence | CC (65s) |
| Test generation thoroughness | Goose (26 functions vs 14 OpenCode) |
The speed differences are real: 2-4x on some tasks. But the winner is task-dependent, not harness-dependent: no single harness is fastest at everything.
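Timings like these are cheap to collect yourself. A minimal sketch of the measurement loop, assuming each harness exposes some headless one-shot invocation (the stand-in command below is a placeholder, not verified CLI syntax for any of the four harnesses):

```python
import subprocess
import sys
import time

def time_harness(cmd: list[str]) -> float:
    """Run one harness invocation and return elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, capture_output=True)  # discard output; we only time it
    return time.monotonic() - start

# Stand-in command so the sketch runs anywhere; swap in the real headless
# invocation for each harness once you've verified its CLI flags.
elapsed = time_harness([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f}s")
```

Run each (task, harness) pair a few times and take the median; single runs are noisy enough to flip a 2x gap.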
## What this means
If you’re choosing between AI coding agent CLIs, stop comparing them on “which writes better code.” They all write the same code — the model writes the code. Compare them on:
- Integration depth — can the harness access your tools, files, and context?
- Permission model — does headless mode work for your automation needs?
- Config burden — how many lines of JSON to get started?
- Speed profile — which is fastest for your typical task mix?
The correct architecture is to route by task type, not to pick one harness and use it for everything.
## The routing table I landed on
- implement X → Claude Code (most reliable)
- add feature Y → OpenCode (fastest on complex features)
- explore codebase → Droid (3x faster on read-only tasks)
- fix bug in X → Goose (fastest on targeted fixes)

This is for my setup — GLM-5.1 via ZhiPu, with MCP tools and coaching injection. Your model and integration surface will shift the numbers. The principle holds: measure, then route.
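The routing table above can be sketched as a small dispatcher. The harness commands here are hypothetical placeholders, not verified CLI syntax; substitute the real headless invocations for your installs:

```python
import subprocess

# Task type -> harness command prefix. All entries are placeholder
# invocations; verify the actual flags for each harness before use.
ROUTES = {
    "implement": ["claude", "-p"],
    "feature":   ["opencode", "run"],
    "explore":   ["droid", "exec"],
    "bugfix":    ["goose", "run", "-t"],
}

def route(task_type: str, prompt: str) -> list[str]:
    """Build the command line for the harness mapped to this task type."""
    try:
        base = ROUTES[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}")
    return base + [prompt]

# Example: dispatch a bug fix to the harness that won on targeted fixes.
cmd = route("bugfix", "fix the off-by-one in pagination")
# subprocess.run(cmd, check=True)  # uncomment once commands are verified
```

The point of keeping the table as data rather than branching logic is that re-benchmarking only means editing `ROUTES`, not rewriting the dispatcher.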