
Correctness is model-determined


I ran the same twelve coding tasks through four different AI coding agent harnesses — Claude Code, OpenCode, Goose, and Droid — all using the same model, GLM-5.1 via ZhiPu. The result surprised me.

Nine synthetic tasks: implement from tests, explore a codebase, refactor across files, fix bugs, generate tests, use MCP tools, navigate a real codebase, multi-step workflows, follow coaching constraints. Three real-world tasks on my actual codebase: a one-line surgical fix in a thousand-line Python file, adding a guard function to a hook system, implementing a PostToolUse handler. All four harnesses got the same prompts, the same model, the same API endpoint.

Correctness scores were nearly identical across all four harnesses. Every harness scored ten out of ten on bugfixes, six out of six on exploration, ten out of ten on simple implementation, seven out of seven on coaching adherence. On the real-world tasks, all four scored five out of five on the feature add and four out of five on the hook implementation. The harness is a thin proxy. GLM-5.1 does the work. The system prompt differences between harnesses change speed and verbosity, not correctness.

What does vary is speed, and no harness dominates. Goose was fastest on simple implementation at thirty seconds versus Claude Code’s seventy-four, and fastest on bug fixes at thirty-seven seconds. Claude Code was fastest on exploration at twenty-eight seconds. OpenCode was fastest on complex feature adds at fifty seconds versus Claude Code’s two hundred and six. Output volume varied too: Goose generated the most thorough tests at twenty-six test functions versus OpenCode’s fourteen. The speed differences are real, two to four times on some tasks. But which harness wins depends on the task; no single harness is fastest across the board.
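To make the "two to four times" figure concrete, here is a small sketch that computes the ratios from the timings quoted above. The dictionary layout and the `speedup` helper are my own illustration, not part of the benchmark tooling, and only the pairs with both timings quoted are included.

```python
# Timings in seconds, taken from the figures quoted in this post.
# Only tasks where two harness timings are quoted are listed.
timings = {
    "simple implementation": {"Goose": 30, "Claude Code": 74},
    "complex feature add": {"OpenCode": 50, "Claude Code": 206},
}


def speedup(task_timings):
    """Return (fastest, slowest, ratio) for one task's timings."""
    fastest = min(task_timings, key=task_timings.get)
    slowest = max(task_timings, key=task_timings.get)
    ratio = task_timings[slowest] / task_timings[fastest]
    return fastest, slowest, ratio


for task, per_harness in timings.items():
    fastest, slowest, ratio = speedup(per_harness)
    print(f"{task}: {fastest} is {ratio:.1f}x faster than {slowest}")
```

The two pairs come out at roughly 2.5x and 4.1x, with a different harness ahead each time.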

If you are choosing between AI coding agent CLIs, stop comparing them on which writes better code. They all write the same code, because the model writes the code. Compare them on integration depth, permission model, configuration burden, and speed profile for your typical task mix. The correct architecture is to route by task type, not to pick one harness and use it for everything. Measure, then route.
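"Measure, then route" can be sketched as a lookup table built from your own timings. This is a minimal illustration, not a tool I ran: the task-type keys, the harness labels, the `build_routes` and `pick_harness` helpers, and the fallback default are all hypothetical, and the numbers are only the ones quoted in this post.

```python
from typing import Dict

# Seconds per task type, using only the figures quoted in this post.
# A harness with no quoted timing for a task is simply absent.
# Keys and harness labels are illustrative, not real CLI names.
MEASURED: Dict[str, Dict[str, float]] = {
    "simple_implementation": {"goose": 30, "claude-code": 74},
    "bugfix": {"goose": 37},
    "exploration": {"claude-code": 28},
    "complex_feature": {"opencode": 50, "claude-code": 206},
}


def build_routes(measured: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each task type to the harness with the lowest measured time."""
    return {task: min(times, key=times.get) for task, times in measured.items()}


def pick_harness(task_type: str, routes: Dict[str, str],
                 default: str = "claude-code") -> str:
    """Look up the harness for a task type; the default is an assumption."""
    return routes.get(task_type, default)


routes = build_routes(MEASURED)
print(pick_harness("complex_feature", routes))
print(pick_harness("simple_implementation", routes))
```

Since correctness did not differ between harnesses in these runs, the table optimizes for time alone. With your own measurements you could weight in permission model or configuration burden as well.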