
I made my coding agent dispatch system improve itself


Last night I made mtor — my coding agent dispatch system — improve itself. The tool that dispatches coding tasks to AI agents was dispatched as a coding task to an AI agent. And it worked.

mtor is a CLI that dispatches coding tasks to headless AI agents via Temporal. You give it a prompt, it sends the work to a GLM-5.1 instance running on a separate ARM server, monitors progress, and reports results as structured JSON. After several rounds of overnight dispatch, mtor had grown to a 952-line monolithic CLI file. Everything in one place — Temporal client logic, JSON envelope helpers, dispatch logic, health checks, command tree definitions.

I wrote a spec describing how to decompose the CLI: split it into seven properly separated modules, every import must resolve, no circular imports, preserve exact CLI behaviour. Then I dispatched that spec through mtor itself. GLM-5.1 read the spec and the existing 952-line file, then produced the seven modules with updated imports, forty-four new integration tests alongside the thirty-seven existing ones, a README, and an updated project config. All eighty-one tests passed.
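A rule like "no circular imports" is easiest to trust when the spec pairs it with a check the agent can run. Here is a minimal sketch of such a check, assuming the modules are plain Python sources. The module names (`cli`, `dispatch`, `client`, `envelope`) are hypothetical, not mtor's actual layout, and matching on the last dotted component is a simplification that can misfire on name collisions.

```python
from __future__ import annotations

import ast


def imported_modules(source: str, known: set[str]) -> set[str]:
    """Collect names from `known` that the given source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Covers both `from client import x` and `from . import client`.
            names = [alias.name for alias in node.names]
            if node.module:
                names.append(node.module)
        else:
            continue
        for name in names:
            # Simplification: compare only the last dotted component.
            leaf = name.rsplit(".", 1)[-1]
            if leaf in known:
                found.add(leaf)
    return found


def find_cycle(sources: dict[str, str]) -> list[str] | None:
    """Return one import cycle as a list of module names, or None if acyclic."""
    known = set(sources)
    graph = {
        name: imported_modules(src, known) - {name}
        for name, src in sources.items()
    }
    state: dict[str, str] = {}  # module -> "visiting" or "done"
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        state[node] = "visiting"
        path.append(node)
        for dep in sorted(graph[node]):
            if state.get(dep) == "visiting":
                # Back edge: the cycle runs from dep to node and back to dep.
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for name in sorted(graph):
        if name not in state:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical module layout, for illustration only.
sources = {
    "cli": "import dispatch\nimport envelope\n",
    "dispatch": "from client import connect\n",
    "client": "import envelope\n",
    "envelope": "import json\n",
}
print(find_cycle(sources))
```

A command like this can go into the spec's verification section, so the implementing model gets a pass or fail signal instead of a prose rule it has to interpret.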

It did not work on the first try. The spec file had YAML frontmatter, which opens with a line of three dashes, and the Claude CLI parsed those dashes as a command-line flag. The agent never saw the prompt; it died on argument parsing. mtor reported the run as completed with exit code 1, a rejected verdict, and a one-second runtime. Without checking the actual output file, you would assume a rate limit or API error, not a CLI parsing bug. This is the kind of thing that only surfaces when you use the system under real conditions. Benchmarks do not test this. Staging environments do not reproduce it. Only bootstrapping, making the tool build itself, found it.

The fix was trivial: strip the frontmatter before dispatch. What this validates is more interesting. The architect-implementer split works. Claude Opus judges and writes specs. GLM-5.1 implements. The model quality gap is real but narrow enough that coaching, stall detection, and structured prompting close it for well-specified tasks. The spec is the interface, not the model. A well-written spec with clear file paths, dependency rules, and verification commands produces correct output regardless of which model executes it. The model is fungible. The spec quality is not.
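For reference, stripping the frontmatter before dispatch can be as small as the sketch below. This is an illustration rather than mtor's actual code: the function name `strip_frontmatter` is hypothetical, and it assumes frontmatter is a block that opens with `---` on the first line and closes with the next `---` line.

```python
def strip_frontmatter(text: str) -> str:
    """Return text without a leading YAML frontmatter block, if one is present."""
    lines = text.splitlines(keepends=True)
    # Frontmatter must open on the very first line with exactly '---'.
    if not lines or lines[0].strip() != "---":
        return text
    # Find the closing delimiter; if there is none, leave the text untouched.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Drop the block and any blank lines that directly follow it.
            return "".join(lines[i + 1:]).lstrip("\n")
    return text


spec = "---\ntitle: Decompose CLI\n---\n\nSplit the CLI into seven modules.\n"
print(strip_frontmatter(spec))
```

Leaving the text unchanged when the closing delimiter is missing is a deliberate choice: a prompt that merely starts with a horizontal rule should not be silently truncated.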

After the bootstrap succeeded, I used the newly decomposed mtor to dispatch eight more improvements to itself, including a task risk classifier, auto-routing by task type, experiment mode, structured logging, checkpoint-and-restore for failed runs, and auto-decomposition for multi-task specs. Each improvement makes the next round of self-improvement more reliable. The system is eating its own tail, but in the productive way, where each bite adds capability.