The dispatch layer was eating the quality, not the model

4 Apr 2026 · 2 min read

We run an AI coding agent on GLM-5.1 via ZhiPu for automated code generation. It had a 54 percent rejection rate. Over half the tasks it attempted failed. The natural assumption: the model is not good enough.

We were wrong.

The dispatch path had seven layers between “I want this built” and “the model starts working”: human judgment, a routing skill, a markdown queue file, a Python poller running every sixty seconds, a Temporal workflow engine, a worker process, and a shell script. Only then did the LLM run. Each layer added failure modes. The markdown queue had parsing bugs: brackets in task IDs were expanded as shell globs. The poller tracked provider concurrency in memory, and after a restart it believed all eight slots were full when only one workflow was actually running, so tasks sat in the queue for hours. The workflow reported completed even when the model produced zero files and zero commits, because the review gate checked exit codes but not outcomes.
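One way to avoid the stale-slot problem is to stop remembering slot counts at all and derive them from whatever is actually running at poll time. The sketch below is a minimal illustration of that idea, not the code we ran: `list_running` is a hypothetical callable standing in for a query to the workflow engine, and `free_slots` and `can_dispatch` are names invented for this example.

```python
from typing import Callable, Iterable

MAX_SLOTS = 8  # the eight provider slots mentioned above


def free_slots(
    list_running: Callable[[], Iterable[str]],
    max_slots: int = MAX_SLOTS,
) -> int:
    """Compute free capacity from the workflows that are running right now.

    `list_running` is a hypothetical callable that returns the IDs of live
    workflows, for example a thin wrapper around a workflow-engine query.
    Nothing is remembered between calls, so a poller restart cannot leave
    a stale "all slots full" count behind.
    """
    running = set(list_running())  # de-duplicate IDs defensively
    return max(0, max_slots - len(running))


def can_dispatch(
    list_running: Callable[[], Iterable[str]],
    max_slots: int = MAX_SLOTS,
) -> bool:
    """Return True when at least one provider slot is free."""
    return free_slots(list_running, max_slots) > 0
```

With a single live workflow, `free_slots(lambda: ["wf-1"])` returns 7, regardless of what any previous poller process believed. The trade-off is one extra query per poll in exchange for a count that cannot drift.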

We bypassed the entire dispatch stack and called the model directly. Same prompt, same model, same task: build a 232-line MCP tool with Temporal SDK integration. It worked on the first attempt. Built the file, passed all thirteen tests, committed with a clean message. The model was fine. The infrastructure between us and the model was the bottleneck.

We collapsed seven layers to four: human judgment, an MCP tool, Temporal, and the worker with the LLM. The markdown queue, the poller, and the routing skill were all removed. Direct SDK calls replaced file-based communication. We added review gates that check for actual outcomes (did a file get created, did a commit happen) rather than just exit codes.
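An outcome-checking gate of the kind described above can be quite small. The following is a sketch under stated assumptions rather than our production gate: the names `GateResult`, `git_head`, and `review_gate` are invented for illustration, and it assumes the worker records the repository's HEAD commit before and after the model runs.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List, Sequence, Union


@dataclass
class GateResult:
    passed: bool
    reasons: List[str]  # empty when the gate passes


def git_head(repo: Path) -> str:
    """Return the current HEAD commit hash of `repo` (needs git on PATH)."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def review_gate(
    exit_code: int,
    head_before: str,
    head_after: str,
    expected_files: Sequence[Union[str, Path]],
) -> GateResult:
    """Pass only if the run exited cleanly AND left observable results."""
    reasons: List[str] = []
    if exit_code != 0:
        reasons.append(f"non-zero exit code {exit_code}")
    if head_before == head_after:
        # Exit code 0 with an unchanged HEAD means nothing was committed.
        reasons.append("no new commit")
    missing = [str(p) for p in expected_files if not Path(p).is_file()]
    if missing:
        reasons.append("missing files: " + ", ".join(missing))
    return GateResult(passed=not reasons, reasons=reasons)
```

A worker would call `git_head(repo)` before and after the model run and pass both hashes in. A run that exits with code 0 but produces no commit now fails with the reason `"no new commit"` instead of being reported as completed.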

When an AI system underperforms, the instinct is to blame the model. But models sit at the bottom of a stack. Every layer above them can degrade, mask, or block their output. Before concluding the model is insufficient, measure how much signal survives the dispatch path. In our case, much of the 54 percent rejection rate turned out to be infrastructure failure wearing a model-shaped mask.
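Measuring how much signal survives can start as a simple funnel: for each task, record the furthest stage it reached, then compute the fraction of tasks that got at least that far. The sketch below is illustrative only. The stage names in `STAGES` and the function `survival_by_stage` are assumptions made up for this example, not fields from our system.

```python
from typing import Dict, List

# Hypothetical stage names, ordered from entry to a verified outcome.
STAGES = ["queued", "dispatched", "model_started", "produced_commit"]


def survival_by_stage(last_stage_reached: List[str]) -> Dict[str, float]:
    """Fraction of tasks that reached at least each stage.

    `last_stage_reached` holds one entry per task: the furthest stage that
    task got to. A sharp drop before "model_started" points at the dispatch
    path rather than the model.
    """
    total = len(last_stage_reached)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    indices = [STAGES.index(stage) for stage in last_stage_reached]
    return {
        stage: sum(1 for i in indices if i >= n) / total
        for n, stage in enumerate(STAGES)
    }
```

Tasks that never reach `model_started` cannot be model failures, so the gap between `queued` and `model_started` gives a lower bound on how much of the rejection rate belongs to the infrastructure.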