Terry Li

We run an AI coding agent (GLM-5.1 via ZhiPu) for automated code generation. It had a 54% rejection rate. Over half the tasks it attempted failed. The natural assumption: the model isn’t good enough.

We were wrong.

The seven-layer stack

The dispatch path looked like this:

Human judgment -> routing skill -> markdown queue file -> Python poller (every 60s)
-> Temporal workflow engine -> worker process -> shell script -> LLM

Seven layers between “I want this built” and “the model starts working.” Each layer added failure modes:

  • The markdown queue had parsing bugs: square brackets in task IDs got expanded as shell globs on the way through.
  • The poller tracked provider concurrency in memory. When it restarted, it thought all 8 slots were full when only 1 workflow was actually running. Tasks sat in queue for hours.
  • The workflow reported “COMPLETED” even when the model produced zero files and zero commits. The review gate checked exit codes but not outcomes.
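The glob bug is easy to reproduce. A minimal sketch (the task ID and file name here are made up, not from the real queue): a task ID containing brackets, passed through a shell, gets pattern-matched against files in the working directory and silently rewritten.

```python
import os
import subprocess
import tempfile

# Hypothetical reproduction: an ID like task-[101] is a valid string,
# but to a POSIX shell, [101] is a glob that matches one character.
os.chdir(tempfile.mkdtemp())
open("task-1", "w").close()  # any file the pattern task-[101] can match

# Through a shell (as a shell-script dispatch layer would do it):
unsafe = subprocess.run("echo task-[101]", shell=True,
                        capture_output=True, text=True)

# As a plain argument list, no shell, no glob expansion:
safe = subprocess.run(["echo", "task-[101]"],
                      capture_output=True, text=True)

print(unsafe.stdout.strip())  # task-1  (the ID was silently rewritten)
print(safe.stdout.strip())    # task-[101]
```

The fix is the boring one: never interpolate IDs into a shell command line; pass them as arguments.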
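The concurrency bug has the same shape as most restart bugs: state held only in process memory diverges from reality. A sketch of the repair, with hypothetical names (`Poller`, `recover`) that are not the author's code — on startup, rebuild the slot counter from the workflow engine's own list of open workflows instead of trusting whatever the process last believed:

```python
class Poller:
    """Hypothetical sketch of provider-slot tracking done safely."""

    def __init__(self, max_slots: int = 8) -> None:
        self.max_slots = max_slots
        self.in_flight = 0  # in-memory only: meaningless after a restart

    def recover(self, running_workflow_ids: list[str]) -> None:
        # Source of truth is the engine's view of running workflows,
        # queried at startup, not the counter's last remembered value.
        self.in_flight = len(running_workflow_ids)

    def has_capacity(self) -> bool:
        return self.in_flight < self.max_slots


p = Poller()
p.recover(["wf-1"])      # the engine reports one running workflow
print(p.has_capacity())  # True: 1 of 8 slots in use, not 8 of 8
```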

The experiment

We bypassed the entire dispatch stack and called the model directly. Same prompt, same model, same task: build a 232-line MCP enzyme with Temporal SDK integration.

It worked on the first attempt. Built the file, passed all 13 tests, committed with a clean message.

The model was fine. The infrastructure between us and the model was the bottleneck.

The fix

We collapsed seven layers to four:

Human judgment -> MCP tool -> Temporal -> worker -> LLM

The markdown queue, the poller, the routing skill — all removed. Direct SDK calls replaced file-based communication. We added review gates that check for actual outcomes (did a file get created? did a commit happen?) rather than just exit codes.
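An outcome check like that can be small. A sketch under our own assumptions (the function name is hypothetical, and we assume a Git-based workspace; the real gate presumably checks more):

```python
import subprocess
from pathlib import Path


def outcomes_exist(workspace: str, expected_file: str) -> bool:
    """Hypothetical review gate: a zero exit code is not enough.
    Demand evidence of real work -- a created file and a commit."""
    # Outcome 1: did the expected file actually get created?
    if not (Path(workspace) / expected_file).exists():
        return False
    # Outcome 2: did at least one commit land? `git log -1` fails on
    # an empty repository, which is exactly the zero-commit case.
    result = subprocess.run(
        ["git", "-C", workspace, "log", "-1", "--format=%H"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

A workflow that produced zero files and zero commits now fails the gate regardless of what exit code the shell layers reported.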

The lesson

When an AI system underperforms, the instinct is to blame the model. But models sit at the bottom of a stack. Every layer above them can degrade, mask, or block their output. Before concluding the model is insufficient, measure how much signal survives the dispatch path.

In our case, 54% of the “model failures” were infrastructure failures wearing a model-shaped mask.