Terry Li

I dispatched one small coding task to my AI agent pipeline. Ten minutes later, the log file was 208 MB. The symptom looked like a classic LLM runaway — the model had gotten stuck in a loop, spinning on some ambiguous prompt, burning tokens. The obvious fix: tune the prompt, add a stop condition, maybe swap to a smarter model.

None of that was the problem. The LLM never ran.

The pipeline

The system under test is mtor — a CLI that dispatches coding tasks to headless AI agents via Temporal. You write a spec file, run mtor --spec foo.md, and a worker on a separate ARM server picks it up, spawns an agent (Claude Code on GLM-5.1 via ZhiPu’s API), runs the agent against the task, and reports results as JSON.

The worker, the dispatch path, and the agent harness are all things I’ve built myself. Every layer is my code. I am the person who is supposed to know how it works.

The symptom

The dispatch failed fast. Ten seconds in, Temporal reported activity_failed, exit_code: -1, no output file. Second attempt, same thing. Third attempt, same thing. The worker stayed alive. mtor status returned cleanly. mtor doctor said providers were healthy. No stack trace anywhere.

Earlier in the session I’d fixed a completely separate bug — a systemd unit file that was silently dropping the worker’s authentication token because EnvironmentFile= can’t parse shell export syntax. That one took 30,989 crash-loops to notice. After fixing it I assumed the pipeline was healthy.

I was wrong, but not in the way I expected.

Layer one: the repo isn’t there

I pulled the full workflow history out of Temporal’s admin-tools container (which, it turns out, is the single most useful debugging move for silent activity failures, and the one I should have reached for first). The failure event had a Python traceback pinned to the exact line. It said:

FileNotFoundError: [Errno 2] No such file or directory:
'~/code/recombinase/.worktrees/ribosome-202446'

Two problems hiding in that path. First, the repo didn’t exist on the worker host — I’d only cloned it locally. Cloned it on the worker. Re-dispatched. Same error.

Second — and this is the one that hurt — the path started with a tilde. ~/code/recombinase. Python’s asyncio.create_subprocess_exec(cwd=...) does not tilde-expand. That’s a shell feature. The kernel sees the literal string ~/code/..., looks for a directory named ~, doesn’t find it, and throws.
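The fix is a one-liner at the boundary. A minimal sketch of the idea, assuming a wrapper like the one below (the function name is mine, not mtor's):

```python
import asyncio
import os

async def run_in_repo(cmd: list[str], repo_path: str) -> int:
    # cwd= is handed straight to the OS. A literal "~" is just a
    # one-character directory name to the kernel, so an unexpanded
    # path raises FileNotFoundError before the subprocess even starts.
    cwd = os.path.expanduser(repo_path)
    proc = await asyncio.create_subprocess_exec(*cmd, cwd=cwd)
    return await proc.wait()

# The expansion itself is the whole fix: no leading "~" survives.
expanded = os.path.expanduser("~/code/recombinase")
assert not expanded.startswith("~")
```

Expanding once at ingest, rather than at every call site, means nothing downstream ever sees a tilde.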

Fixed the expansion at the ingest boundary. Shipped it. Re-dispatched.

Layer two: the task is running

This time the workflow stayed in RUNNING state. Good sign. I let it cook for three minutes. Checked the log file on the worker.

-rw-r--r-- 1 vivesca vivesca 107912538 ...

107 MB in three minutes. Checked it again a minute later: 208 MB. That’s when I stopped thinking “GLM is slow” and started thinking “something is very wrong.”

Layer three: the runaway

I grepped the log for distinct timestamps. 59,356 separate block headers over ten minutes. That’s a hundred retries per second sustained. No LLM could possibly be called that fast — each real GLM call takes seconds. So the thing writing to the log was not an LLM spinning, it was a retry loop firing without ever reaching the LLM.

I looked at the first block. The header showed the full spec file contents as the task argument. Right after it, stderr:

[stderr] Unknown flag: ---

There it was. The spec file started with YAML frontmatter. Standard stuff:

---
title: recombinase — guard _clear_cell against merged/spanned table cells
status: ready
---

That leading --- was being passed through to the Claude Code CLI as a positional argument. Claude Code CLI saw ---, interpreted it as a flag, and exited with “Unknown flag: ---”. Empty stdout, non-zero exit.
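Stripping that frontmatter before dispatch is mechanical. A minimal sketch, with a regex and function name that are mine rather than mtor's actual code:

```python
import re

# Matches a YAML frontmatter block at the very start of the text:
# an opening "---" line, any content, then a closing "---" line.
FRONTMATTER_RE = re.compile(r"\A---\s*\n.*?\n---\s*\n", re.DOTALL)

def strip_frontmatter(spec_text: str) -> str:
    """Return the spec body with any leading YAML frontmatter removed."""
    return FRONTMATTER_RE.sub("", spec_text, count=1)

spec = "---\ntitle: recombinase\nstatus: ready\n---\nFix the table handler.\n"
assert strip_frontmatter(spec) == "Fix the table handler.\n"
```

Anchoring the pattern with `\A` means a `---` in the middle of a spec body is left alone; only a block at the very top is treated as metadata.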

Layer four: the misclassification

A CLI rejecting its own arguments is a deterministic failure. You rerun it, you get the same error. There is no universe in which retrying helps. So why did the log grow to 208 MB?

I read the ribosome bash effector’s retry loop. Found this:

if echo "$output" | grep -qiE '429|rate.limit|quota'; then
  _is_ratelimit=true
elif [[ -z "${output// /}" ]]; then
  # Empty output + failure = claude got 429 and retried internally
  _is_ratelimit=true
fi

The comment is the bug. The author assumed that an empty stdout paired with a non-zero exit code could only mean one thing: Claude hit a rate limit, retried internally, and gave up. So they classified it as rate-limited and made it retryable.

That assumption is wrong. Empty stdout with non-zero exit can mean:

  • Claude rejected its own flags
  • The claude binary isn’t on PATH
  • The process was killed before producing output
  • Stdin deadlocked

All deterministic. All being classified as “transient, please retry.” The retry loop happily kept hammering the same poison input. Exponential backoff capped out quickly. The outer Python worker re-invoked the bash effector with a different provider. Same error, same classification, same retry. Fresh block appended to the log each time.
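The misclassification is a one-line decision made in the wrong direction. A more defensive classifier, sketched in Python rather than the effector's bash (the patterns and names are mine), checks for known-deterministic signatures before ever treating silence as a rate limit:

```python
import re

RATELIMIT_RE = re.compile(r"429|rate.?limit|quota", re.IGNORECASE)
# Signatures of deterministic failures: rerunning cannot change them.
DETERMINISTIC_RE = re.compile(
    r"unknown flag|command not found|no such file", re.IGNORECASE
)

def classify(exit_code: int, stdout: str, stderr: str) -> str:
    """Return 'ok', 'retryable', or 'fatal'. Checks stderr before guessing."""
    if exit_code == 0:
        return "ok"
    if DETERMINISTIC_RE.search(stderr):
        return "fatal"        # same input -> same error; do not retry
    if RATELIMIT_RE.search(stdout + stderr):
        return "retryable"    # a genuine 429 / quota signal
    # Empty stdout alone proves nothing: fail closed, not open.
    return "fatal"

assert classify(1, "", "Unknown flag: ---") == "fatal"
assert classify(1, "", "Error: 429 Too Many Requests") == "retryable"
```

The key inversion is the default: unexplained failure is fatal until proven transient, not the other way around.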

Nothing in the system was an LLM. The LLM was the one component that never ran.

The fix

Four patches, all shipped:

  1. Strip YAML frontmatter in the dispatch CLI before the task prompt ever reaches the harness. The stripping regex already existed elsewhere in the codebase; it just wasn’t firing on the --spec path, because the branch that applied it only activated when the prompt arrived as a positional file-path argument, not via the separate --spec flag.
  2. Tilde-expand spec repo fields at the ingest boundary, plus a defensive expansion inside the worker in case something else feeds it an unexpanded path.
  3. Replace EnvironmentFile= in the systemd unit with bash -c 'source .env.bootstrap && exec op run ...' so the worker’s auth token actually survives startup.
  4. Version-control the canonical systemd unit file so this particular hole can’t reopen silently.
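For patch 3, the shape of the unit change looks roughly like this (paths and binary names here are illustrative, not the real ones):

```ini
# EnvironmentFile= reads plain KEY=value lines only; "export KEY=value"
# lines don't parse, which is how the token was vanishing at startup.
# Sourcing the file through a shell handles the export syntax.
[Service]
ExecStart=/usr/bin/bash -c 'source /opt/mtor/.env.bootstrap && exec op run -- /opt/mtor/bin/worker'
```

The `exec` matters: it replaces the shell with the worker process so systemd tracks the right PID.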

The ribosome bash retry misclassification I filed as a follow-up. It’s the amplifier, not the primary, and rewriting a production retry loop late at night is a bad idea. Defense in depth, not defense right now.

What I actually learned

The debugging was layered in a specific way. Each layer’s fix revealed the next layer’s symptom. Each layer looked like it was the root cause until it wasn’t. The temptation at every stopping point was to declare victory and move on. The discipline was to notice that the symptom — the log size, the retry count, the speed — was not consistent with the fix I’d just shipped, and keep digging.

But the deeper thing is about naming. I kept thinking of this as “GLM is spinning.” Even after I knew there was a retry loop, I framed it as “GLM is spinning in a retry loop.” Only when I saw the 100-retries-per-second rate did I realize no language model can operate at that tempo. The rate was the clue that the language model was not in the loop at all.

When someone says their AI agent is stuck in a loop, the right first question is not “what’s wrong with the model” — it’s “is the model actually executing?” Most AI agent pipelines have three or four layers between you and the model: a CLI wrapper, a subprocess runner, a retry loop, a workflow engine. Any of those can spin without the model ever being touched. And when they do spin, the symptom looks exactly like a model going off the rails, because you can’t see inside the pipeline — you can only see the log it produced.

Assume the LLM never ran. Check the pipeline first.

Postscript

The task the pipeline was trying to run, ironically, was a small defensive fix to a different tool I’d shipped earlier that day — a guard against spanned cells in a PowerPoint table handler. Fifteen lines. A ribosome could have done it in two minutes if it had ever started. I ended up writing it by hand in Claude Code while debugging the retry loop. Shipped it in the same session. Smallest commit of the day, longest debugging session of the day. That asymmetry is the signature of an infrastructure problem masquerading as a content problem.

It is almost never the model.