The One Env Var That Cost a Day
/ 4 min read
I spent an entire day debugging why my AI coding pipeline couldn’t dispatch tasks. The root cause was one environment variable name: ANTHROPIC_API_KEY should have been ANTHROPIC_AUTH_TOKEN.
Here’s the debugging journey and why it took so long.
The setup
mtor is a Temporal-based system that dispatches coding tasks to AI agents running on a remote server. The agents use Claude Code headlessly, connected to ZhiPu’s GLM-5.1 via their Anthropic-compatible API. Free tokens, unlimited coding. The previous night’s batch had landed 6 commits, so the system appeared to work.
The symptoms
Every task dispatched during the day produced zero output. Claude Code started, ran for 30 minutes in silence, and got killed by the stall detector. No error messages. No auth failures. Just… nothing.
The wrong hypotheses (in order)
“ZhiPu is slow during daytime.” Tested with curl — ZhiPu responded in 2 seconds. Eliminated.
“Claude Code’s startup is too heavy.” Switched to Goose (a lighter harness). Goose completed the same task in 2 minutes. Concluded CC was the problem. Wrong conclusion — CC was failing for a different reason than startup overhead.
“The stall detector threshold is too low.” Raised it from 30 to 60 minutes. Tasks still produced zero output. The threshold wasn’t the issue.
“CC —print mode is a dumb pipe without tools.” Tested and proved CC —print IS a full agent with tools. Another dead end.
“CC needs Anthropic OAuth credentials.” Copied .credentials.json from another machine. CC worked with Max subscription credentials. Concluded CC+GLM was impossible without Anthropic auth. Wrong again.
The actual root cause
Two lines in a bash script:
# What the script had:ANTHROPIC_API_KEY="$_KEY"unset ANTHROPIC_AUTH_TOKEN
# What it should have been:ANTHROPIC_AUTH_TOKEN="$_KEY"# (don't unset it)ZhiPu’s coding plan uses ANTHROPIC_AUTH_TOKEN for the x-api-key header in their Anthropic-compatible endpoint. Claude Code checks this variable specifically. ANTHROPIC_API_KEY is a different variable that CC uses for direct Anthropic API access.
The unset ANTHROPIC_AUTH_TOKEN line was added during a rename refactoring months ago. It actively killed any inherited auth token from the shell environment — which is how it had been working before the rename.
Why it took so long
-
The system appeared to work. The overnight batch had a 20% success rate. We assumed that was a model quality issue, not an auth failure. In reality, that 20% came from inherited environment variables that occasionally survived the
env -icleanup. -
Zero output looks the same regardless of cause. Auth failure, slow API, startup hang — they all manifest as “no stdout for N minutes.” Without the preflight probe we later built, there was no way to distinguish them.
-
Each wrong hypothesis led to a productive fix. Raising the stall threshold, adding Goose support, installing earlyoom — these were all real improvements. They just weren’t the root cause. Productive debugging is seductive because it feels like progress.
-
The provider docs used different variable names. ZhiPu’s official docs say
ANTHROPIC_AUTH_TOKEN. The script was written before those docs existed, usingANTHROPIC_API_KEYby analogy with OpenAI’s convention. Both are valid environment variables — they just do different things.
What we built to prevent recurrence
Preflight auth probe. Before any real task, send claude --print -p "echo ok" and verify non-empty output. Catches auth failures in 30 seconds instead of 6 hours.
Config lockfile. A JSON file recording the known-good env var names, URLs, and model names per provider. Pre-commit hook validates the script matches the lockfile.
Doctor API probe. mtor doctor now makes a real API call to each provider, not just checking if the key exists. “Key exists but doesn’t work” was exactly today’s failure mode.
The meta-lesson
The most expensive debugging happens when the system partially works. A system that completely fails is easy to diagnose. A system with 20% success rate generates just enough hope to prevent you from questioning the fundamentals.
If I’d asked “can CC even authenticate?” in the first 5 minutes instead of assuming the overnight success meant auth was fine, I’d have found the answer immediately. The preflight probe is that question, automated.
One env var. One day. Now it’s a 30-second check.