Terry Li

Last night I made mtor — my coding agent dispatch system — improve itself. The tool that dispatches coding tasks to AI agents was dispatched as a coding task to an AI agent. And it worked.

The setup

mtor is a CLI that dispatches coding tasks to headless AI agents via Temporal. Think of it as a dispatch controller: you give it a prompt, it sends the work to a GLM-5.1 instance running on a separate ARM server, monitors progress, and reports results as structured JSON.
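To make "structured JSON" concrete, here is a minimal sketch of what such a result envelope could look like. The field names and schema are my assumptions for illustration, not mtor's actual format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ResultEnvelope:
    """Illustrative shape of a dispatched run's structured result.

    Field names here are assumptions, not mtor's real schema.
    """
    task_id: str
    status: str            # e.g. "COMPLETED" or "FAILED"
    exit_code: int
    verdict: str           # e.g. "accepted" or "rejected"
    runtime_seconds: float

    def to_json(self) -> str:
        # Serialize the dataclass fields in declaration order.
        return json.dumps(asdict(self))

envelope = ResultEnvelope("bootstrap-001", "COMPLETED", 0, "accepted", 412.5)
print(envelope.to_json())
```

A fixed envelope like this is what lets a controller make decisions (retry, escalate, accept) without parsing free-form agent output.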

The problem: after several rounds of overnight dispatch, mtor had grown to a 952-line monolithic cli.py. Everything in one file — Temporal client logic, JSON envelope helpers, dispatch logic, health checks, command tree definitions, all of it.

The bootstrap

I wrote a spec describing how to decompose the CLI:

Split cli.py into: cli.py (thin dispatch), client.py, envelope.py, dispatch.py, doctor.py, tree.py, init.py. Every import must resolve. No circular imports. Preserve exact CLI behavior.

Then I ran: mtor ~/specs/mtor-bootstrap.md

mtor dispatched the task to itself. GLM-5.1 read the spec, read the existing 952-line file, and produced:

  • 7 properly separated modules
  • Updated imports with correct dependency direction
  • 44 new integration tests alongside the 37 existing ones
  • A README and updated pyproject.toml

All 81 tests passed. The decomposition was clean.
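The spec's "no circular imports" constraint is mechanically checkable. Here is a sketch of a cycle check over a module dependency graph; the edges below are illustrative, inferred from the spec's module split rather than taken from mtor's actual imports:

```python
# Illustrative dependency graph for the decomposed modules.
# Edges are assumptions based on the spec, not mtor's real imports.
deps = {
    "cli": ["dispatch", "tree", "doctor", "init"],
    "dispatch": ["client", "envelope"],
    "doctor": ["client"],
    "tree": [],
    "init": [],
    "client": ["envelope"],
    "envelope": [],
}

def find_cycle(graph):
    """Return a path containing a cycle, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {m: WHITE for m in graph}

    def visit(node, path):
        color[node] = GRAY
        for dep in graph.get(node, []):
            if color[dep] == GRAY:        # back edge: we hit a module
                return path + [dep]       # still on the current path
            if color[dep] == WHITE:
                cycle = visit(dep, path + [dep])
                if cycle:
                    return cycle
        color[node] = BLACK
        return None

    for module in graph:
        if color[module] == WHITE:
            cycle = visit(module, [module])
            if cycle:
                return cycle
    return None

print(find_cycle(deps))  # None means the dependency direction is acyclic
```

A check like this can run in CI, so a future self-improvement round cannot silently reintroduce a cycle.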

What broke first

It didn’t work on the first try. The spec file had YAML frontmatter (--- at the top), and the claude CLI parsed --- as a command-line flag. Error: Unknown flag: ---. The agent never saw the prompt — it died on argument parsing.

The fix was trivial — strip the frontmatter. But the failure mode was invisible: mtor reported “COMPLETED, exit=1, verdict=rejected” with a 1-second runtime. Without checking the actual output file, you’d assume a rate limit or API error, not a CLI parsing bug.
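The fix amounts to stripping a leading YAML frontmatter block before the spec text reaches the agent CLI, so the prompt can never begin with a bare "---" that gets parsed as a flag. A sketch of that fix (not mtor's exact code):

```python
import re

def strip_frontmatter(text: str) -> str:
    """Remove a leading YAML frontmatter block delimited by '---' lines.

    Prevents the remaining prompt from starting with '---', which a CLI
    argument parser can mistake for a flag. Sketch only, not mtor's code.
    """
    match = re.match(r"\A---\s*\n.*?\n---\s*\n", text, flags=re.DOTALL)
    return text[match.end():] if match else text

spec = "---\ntitle: bootstrap\n---\nSplit cli.py into modules."
print(strip_frontmatter(spec))  # Split cli.py into modules.
```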

This is the kind of thing that only surfaces when you use the system under real conditions. Benchmarks don’t test this. Staging environments don’t reproduce it. Only bootstrapping — making the tool build itself — found it.
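One cheap guard against this class of invisible failure is a sanity check on the result envelope: a run that "completes" in about a second with a nonzero exit almost certainly never did any work. A sketch, where the threshold and field names are my assumptions:

```python
def looks_like_silent_failure(status: str, exit_code: int,
                              runtime_seconds: float,
                              min_plausible_runtime: float = 10.0) -> bool:
    """Flag runs that report completion but plausibly died before
    doing any work, e.g. on argument parsing.

    The 10-second threshold is an illustrative assumption.
    """
    return (status == "COMPLETED"
            and exit_code != 0
            and runtime_seconds < min_plausible_runtime)

# The failure from this post: COMPLETED, exit=1, ~1-second runtime.
print(looks_like_silent_failure("COMPLETED", 1, 1.2))    # True
print(looks_like_silent_failure("COMPLETED", 0, 300.0))  # False
```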

What this validates

The architect-implementer split works. Claude Opus judges and writes specs. GLM-5.1 implements. The model quality gap (72.8% vs 75.6% on SWE-bench Verified) is real but narrow enough that coaching, stall detection, and structured prompting close it for well-specified tasks.

The key insight: the spec is the interface, not the model. A well-written spec with clear file paths, dependency rules, and verification commands produces correct output regardless of which model executes it. The model is fungible; the spec quality is not.

What’s next

After the bootstrap succeeded, I used the newly decomposed mtor to dispatch 8 more improvements to itself, including: a task risk classifier, auto-routing by task type, experiment mode, structured JSONL logging, checkpoint-and-restore for failed runs, and auto-decomposition for multi-task specs.
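A task risk classifier can start as simple keyword heuristics over the spec text. The sketch below is entirely hypothetical; mtor's actual classifier may work quite differently:

```python
# Hypothetical keyword-based risk classifier for dispatch specs.
# Word lists and tiers are illustrative assumptions, not mtor's rules.
HIGH_RISK = ("delete", "migrate", "rewrite", "schema", "auth")
MEDIUM_RISK = ("refactor", "rename", "split", "decompose")

def classify_risk(spec_text: str) -> str:
    """Return 'high', 'medium', or 'low' based on risky keywords."""
    text = spec_text.lower()
    if any(word in text for word in HIGH_RISK):
        return "high"
    if any(word in text for word in MEDIUM_RISK):
        return "medium"
    return "low"

print(classify_risk("Split cli.py into seven modules"))  # medium
print(classify_risk("Add a --verbose flag"))             # low
```

A tier like this could then drive auto-routing, e.g. sending high-risk tasks to the stronger model or requiring a human checkpoint.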

Each improvement makes the next round of self-improvement more reliable. The system is eating its own tail — but in the productive way, where each bite adds capability.