The One Env Var That Cost a Day

9 Apr 2026 · 4 min read

I spent an entire day debugging why my AI coding pipeline could not dispatch tasks. The root cause was one environment variable name: ANTHROPIC_API_KEY should have been ANTHROPIC_AUTH_TOKEN.

mtor is a Temporal-based system that dispatches coding tasks to AI agents running on a remote server. The agents use Claude Code headlessly, connected to ZhiPu’s GLM-5.1 via their Anthropic-compatible API. Free tokens, unlimited coding. The previous night’s batch had landed six commits, so the system appeared to work.

Every task dispatched during the day produced zero output. Claude Code started, ran for thirty minutes in silence, and got killed by the stall detector. No error messages. No auth failures. Just nothing.

The wrong hypotheses arrived in order. ZhiPu is slow during daytime — tested with curl, ZhiPu responded in two seconds, eliminated. Claude Code’s startup is too heavy — switched to Goose, which completed the same task in two minutes, concluded Claude Code was the problem, wrong conclusion. The stall detector threshold is too low — raised it from thirty to sixty minutes, tasks still produced zero output. Claude Code print mode is a dumb pipe without tools — tested and proved Claude Code print is a full agent with tools, another dead end. Claude Code needs Anthropic OAuth credentials — copied credentials from another machine, Claude Code worked with Max subscription, concluded Claude Code plus GLM was impossible without Anthropic auth, wrong again.

The actual root cause was two lines in a bash script. The script set ANTHROPIC_API_KEY to the ZhiPu key and then unset ANTHROPIC_AUTH_TOKEN. It should have set ANTHROPIC_AUTH_TOKEN and never unset it. ZhiPu’s coding plan expects the key in ANTHROPIC_AUTH_TOKEN, which Claude Code sends as a bearer token in the Authorization header of requests to the Anthropic-compatible endpoint. ANTHROPIC_API_KEY is a different variable, sent as the x-api-key header and intended for direct Anthropic API access. The unset line was added during a rename refactoring months ago. It actively killed any auth token inherited from the shell environment, which is how the pipeline had been working before the rename.
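The difference is easier to see side by side. The following is a reconstruction from the description above, not the original script: the function names are invented for the comparison, and ZHIPU_API_KEY is a hypothetical name for wherever the provider key is stored.

```bash
# Sketch only. ZHIPU_API_KEY is a hypothetical variable holding the provider key.

# Broken version, as described above.
configure_env_broken() {
  export ANTHROPIC_API_KEY="$ZHIPU_API_KEY"   # wrong variable for this endpoint
  unset ANTHROPIC_AUTH_TOKEN                  # wipes any token inherited from the shell
}

# Fixed version: put the key in the variable Claude Code reads for this
# endpoint, and do not unset it afterwards. The :? expansion aborts the
# script with a message if the key is missing, instead of exporting an
# empty token.
configure_env_fixed() {
  export ANTHROPIC_AUTH_TOKEN="${ZHIPU_API_KEY:?ZHIPU_API_KEY is not set}"
}
```

Failing loudly on a missing key is a small design choice on top of the rename: an empty token would otherwise produce the same silent zero-output run that made this bug so hard to spot.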

It took so long because the system appeared to work. The overnight batch had a twenty percent success rate. We assumed that was a model quality issue, not an auth failure. In reality, that twenty percent came from inherited environment variables that occasionally survived the env cleanup. Zero output looks the same regardless of cause — auth failure, slow API, startup hang all manifest as no stdout for N minutes. Each wrong hypothesis led to a productive fix — raising the stall threshold, adding Goose support, installing earlyoom — all real improvements that were not the root cause. Productive debugging is seductive because it feels like progress. And the provider docs used different variable names from the script, which was written before those docs existed.

We built three things to prevent recurrence. The first is a preflight auth probe that sends a trivial command and verifies non-empty output before any real task, which catches auth failures in thirty seconds instead of six hours. The second is a config lockfile recording known-good env var names, URLs, and model names per provider, validated by a pre-commit hook. The third is a doctor command that makes a real API call to each provider instead of only checking that the key exists, because “key exists but does not work” was exactly this failure mode.
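The first of these, the preflight probe, fits in a few lines of shell. This is a minimal sketch rather than mtor's actual implementation: the function name preflight_probe and the example prompt are made up for illustration, and it assumes the GNU coreutils timeout command is available on the agent host.

```bash
# preflight_probe SECONDS COMMAND [ARGS...]
# Runs COMMAND under a time limit and succeeds only if it exits 0
# and prints something to stdout. Silence counts as failure, because
# silence is exactly what a broken auth setup produced.
preflight_probe() {
  local limit=$1; shift
  local out
  # timeout exits non-zero (124) if the command overruns the limit.
  out=$(timeout "$limit" "$@" 2>/dev/null) || return 1
  [ -n "$out" ]
}

# Hypothetical usage before dispatching a real task:
#   preflight_probe 30 claude -p "Reply with the single word ok" \
#     || { echo "preflight failed: agent produced no output" >&2; exit 1; }
```

The probe deliberately does not try to distinguish auth failure from a slow API or a startup hang. It only checks that the whole path from environment to model output works, and it does so before a thirty-minute task is committed to it.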

The most expensive debugging happens when the system partially works. A system that completely fails is easy to diagnose. A system with twenty percent success rate generates just enough hope to prevent you from questioning the fundamentals. If I had asked “can Claude Code even authenticate?” in the first five minutes instead of assuming the overnight success meant auth was fine, I would have found the answer immediately. The preflight probe is that question, automated. One env var. One day. Now it is a thirty-second check.