Assume the LLM never ran

11 Apr 2026 · 6 min read

I dispatched one small coding task to my AI agent pipeline. Ten minutes later, the log file was 208 MB. The symptom looked like a classic LLM runaway — the model had gotten stuck in a loop, spinning on some ambiguous prompt, burning tokens. The obvious fix: tune the prompt, add a stop condition, maybe swap to a smarter model.

None of that was the problem. The LLM never ran.

The system under test is mtor — a CLI that dispatches coding tasks to headless AI agents via Temporal. You write a spec file, run the dispatch command, and a worker on a separate ARM server picks it up, spawns an agent running GLM-5.1 via ZhiPu’s API, runs it against the task, and reports results as JSON. The worker, the dispatch path, and the agent harness are all things I built myself. Every layer is my code. I am the person who is supposed to know how it works.

The dispatch failed fast. Ten seconds in, Temporal reported activity failed with exit code negative one and no output file. Second attempt, same thing. Third attempt, same thing. The worker stayed alive. Status checks returned cleanly. Diagnostics said providers were healthy. No stack trace anywhere.

Earlier in the session I had fixed a completely separate bug — a systemd unit file that was silently dropping the worker’s authentication token because EnvironmentFile cannot parse shell export syntax. That one took 30,989 crash-loops to notice. After fixing it I assumed the pipeline was healthy.
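A check like the following could have caught that class of bug before the first crash-loop. It is a minimal sketch, assuming the environment file is meant to hold only plain KEY=VALUE lines plus comments; the function name `suspicious_env_lines` is hypothetical and not part of mtor or systemd.

```python
import re

# A plain assignment: a variable name followed by "=", with no shell keyword
# such as "export" in front of it.
_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def suspicious_env_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that are not plain KEY=VALUE assignments."""
    flagged = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank lines and comments are fine
        if not _ASSIGNMENT.match(line):
            flagged.append((number, line))
    return flagged


if __name__ == "__main__":
    sample = "# worker credentials\nexport API_TOKEN=abc\nREGION=eu\n"
    for number, line in suspicious_env_lines(sample):
        print(f"line {number}: not a plain assignment: {line}")
```

Running something like this in CI or at deploy time turns a silent drop into a loud failure.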

I was wrong, but not in the way I expected.

I pulled the full workflow history out of Temporal’s admin-tools container. That turns out to be the single most useful debugging move for silent activity failures, and I should have reached for it first. The failure event had a Python traceback pinned to the exact line: the working directory did not exist. Two problems were hiding in that path. First, the repo had not been cloned on the worker host. I cloned it and re-dispatched. Same error.

Second, and this is the one that hurt, the path started with a tilde. Python’s asyncio subprocess execution does not tilde-expand. That is a shell feature. The operating system receives the literal string, looks for a directory literally named ~, does not find it, and the call fails. Fixed the expansion at the ingest boundary. Shipped it. Re-dispatched.
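The shape of that fix can be sketched in a few lines. This is an illustration, not the actual mtor patch: `normalize_repo_path` is a hypothetical name, and the existence check assumes the function runs on the host that will spawn the subprocess, since a tilde expands to a different home directory on every machine.

```python
import os
from pathlib import Path


def normalize_repo_path(raw: str) -> Path:
    """Expand a leading ~ and return an absolute path, failing early if it is missing."""
    # os.path.expanduser does what the shell would have done for "~/...".
    path = Path(os.path.expanduser(raw)).resolve()
    if not path.is_dir():
        # Fail at ingest with a readable message instead of deep inside a
        # subprocess call with an opaque exit code.
        raise FileNotFoundError(f"repo path does not exist on this host: {path}")
    return path
```

The returned path can then be passed as the `cwd` argument to `asyncio.create_subprocess_exec`, which uses whatever string it is given without any shell-style expansion.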

This time the workflow stayed in running state. Good sign. I let it cook for three minutes. Checked the log file on the worker. 107 MB. Checked again a minute later. 208 MB. That is when I stopped thinking “GLM is slow” and started thinking “something is very wrong.”

I grepped the log for distinct timestamps. 59,356 separate block headers over ten minutes. That is roughly a hundred retries per second, sustained. No LLM could possibly be called that fast, because each real call takes seconds. So the thing writing to the log was not an LLM spinning. It was a retry loop firing without ever reaching the LLM.
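The arithmetic is worth writing down, because it is the whole argument. A small sketch, assuming one sequential model call per log block and a floor of one second per call; that floor is my illustrative assumption, and real calls are usually slower.

```python
def events_per_second(event_count: int, window_seconds: float) -> float:
    """Average event rate over an observation window."""
    return event_count / window_seconds


def model_could_be_in_loop(
    event_count: int,
    window_seconds: float,
    min_call_seconds: float = 1.0,  # assumed lower bound on one model call
) -> bool:
    """True if the observed rate is achievable with one sequential model call per event."""
    max_rate = 1.0 / min_call_seconds
    return events_per_second(event_count, window_seconds) <= max_rate


if __name__ == "__main__":
    rate = events_per_second(59_356, 10 * 60)
    print(f"{rate:.1f} blocks per second")
    print("model plausibly in loop:", model_could_be_in_loop(59_356, 10 * 60))
```

With the numbers from this incident the rate comes out just under 99 blocks per second, far above anything a sequential model call could sustain.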

I looked at the first block. The header showed the full spec file contents as the task argument. Right after it, stderr said “Unknown flag: ---”. The spec file started with YAML frontmatter. That leading triple-dash was being passed through to the Claude Code CLI as a positional argument. The CLI saw it, interpreted it as a flag, and exited with an error. Empty stdout, non-zero exit. A deterministic, unfixable failure being retried a hundred times a second.
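Stripping the frontmatter before the prompt reaches any CLI is a small function. The version below is a sketch rather than the patch I shipped: `strip_frontmatter` is a hypothetical name, and it only recognises a block opened and closed by lines containing exactly three dashes.

```python
def strip_frontmatter(spec_text: str) -> str:
    """Drop a leading YAML frontmatter block delimited by '---' lines, if present."""
    lines = spec_text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return spec_text  # no frontmatter: leave the text untouched
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # Keep everything after the closing delimiter, minus blank lead-in lines.
            return "".join(lines[index + 1:]).lstrip("\n")
    return spec_text  # opening delimiter with no close: do not guess


if __name__ == "__main__":
    spec = "---\ntitle: guard spanned cells\n---\n\nFix the table handler.\n"
    print(strip_frontmatter(spec))
```

An unclosed block is returned unchanged on purpose. A caller that cannot tolerate a leading triple-dash should reject that input outright rather than pass it along.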

And here is where the amplifier lived. I read the retry loop in the bash effector. It had a heuristic: if stdout is empty and the exit code is non-zero, classify it as a rate limit and retry. The comment in the code said “Empty output plus failure equals Claude got 429 and retried internally.” That assumption is wrong. Empty stdout with non-zero exit can mean the binary rejected its own flags, or the binary is not on PATH, or the process was killed before producing output, or stdin deadlocked. All deterministic. All being classified as transient. The retry loop happily kept hammering the same poison input. Exponential backoff capped out quickly. The outer worker re-invoked with a different provider. Same error, same classification, same retry. Fresh block appended to the log each time.
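A safer default is to treat a failure as transient only when there is positive evidence for it, and to bound the retries either way. The sketch below illustrates that idea in Python rather than bash. The marker strings, function names, and defaults are all hypothetical; a real harness would match whatever its CLI actually prints on a rate limit.

```python
import time
from enum import Enum


class Failure(Enum):
    TRANSIENT = "transient"
    DETERMINISTIC = "deterministic"


# Hypothetical markers: substitute the text the real CLI emits on a rate limit.
RATE_LIMIT_MARKERS = ("429", "rate limit")


def classify(stderr: str) -> Failure:
    """Call a failure transient only when stderr positively says so."""
    text = stderr.lower()
    if any(marker in text for marker in RATE_LIMIT_MARKERS):
        return Failure.TRANSIENT
    # Empty output proves nothing, so the default is "do not retry".
    return Failure.DETERMINISTIC


def run_with_retries(attempt, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt() -> (exit_code, stdout, stderr) with bounded, classified retries."""
    for n in range(max_attempts):
        exit_code, stdout, stderr = attempt()
        if exit_code == 0:
            break
        if classify(stderr) is Failure.DETERMINISTIC:
            break  # retrying the same poison input cannot help
        if n < max_attempts - 1:
            sleep(base_delay * 2 ** n)  # exponential backoff between attempts
    return exit_code, stdout, stderr
```

Under this rule the "Unknown flag" failure would have stopped after a single attempt, and the log would have held one block instead of tens of thousands.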

Nothing in that loop was an LLM. The LLM was the one component that never ran.

Four patches shipped. Strip YAML frontmatter in the dispatch CLI before the task prompt reaches the harness. Tilde-expand spec repo fields at the ingest boundary. Replace the systemd EnvironmentFile with a sourced shell invocation so the auth token survives startup. Version-control the canonical systemd unit file. I filed the retry misclassification as a follow-up: it is the amplifier, not the primary cause, and rewriting a production retry loop late at night is a bad idea.

The debugging was layered in a specific way. Each layer’s fix revealed the next layer’s symptom. Each layer looked like it was the root cause until it was not. The temptation at every stopping point was to declare victory and move on. The discipline was to notice that the symptom — the log size, the retry count, the speed — was not consistent with the fix I had just shipped, and keep digging.

But the deeper thing is about naming. I kept thinking of this as “GLM is spinning.” Even after I knew there was a retry loop, I framed it as “GLM is spinning in a retry loop.” Only when I saw the hundred-retries-per-second rate did I realise no language model can operate at that tempo. The rate was the clue that the language model was not in the loop at all.

When someone says their AI agent is stuck in a loop, the right first question is not “what is wrong with the model.” It is “is the model actually executing?” Most AI agent pipelines have three or four layers between you and the model: a CLI wrapper, a subprocess runner, a retry loop, a workflow engine. Any of those can spin without the model ever being touched. And when they do spin, the symptom looks exactly like a model going off the rails, because you cannot see inside the pipeline — you can only see the log it produced.

The task the pipeline was trying to run was a small defensive fix to a different tool — a guard against spanned cells in a PowerPoint table handler. Fifteen lines. A ribosome could have done it in two minutes if it had ever started. I ended up writing it by hand while debugging the retry loop. Smallest commit of the day, longest debugging session of the day. That asymmetry is the signature of an infrastructure problem masquerading as a content problem.

Assume the LLM never ran. Check the pipeline first.