Overnight Autonomous AI Coding: What Actually Works

10 Apr 2026 · 4 min read ·

Last night I ran an autonomous AI coding pipeline for eight hours while I slept. The system dispatched specs to GLM-5.1 via ZhiPu’s free coding plan, with Claude Code monitoring and reviewing every fifteen minutes. Here is what actually happened.

The architecture is a two-tier split. Claude Code on a small Fly.io instance acts as the architect — it writes specs, reviews diffs, dispatches tasks. A separate OCI ARM instance runs the executor: Claude Code in bare mode driving GLM-5.1 through ZhiPu’s Anthropic-compatible API. Temporal orchestrates the queue with two-hour activity timeouts and automatic retry. You dispatch as many specs as you want and Temporal drains them two at a time per provider.

I started with five specs for the system to improve itself: negative feedback dispatch throttling, load sensing for the worker, toggle commands for the quality gates, dispatch deduplication, and a freeze mechanism. Then eight more during the night as issues surfaced and the architect wrote new specs in response.

Five features merged to main in the first four hours. Each with tests, proper commit messages, and correct code. The spec-status feedback loop was the best — an A grade, 281 lines, exactly what was asked for. The negative feedback loop and the wiring were solid B-plus work. The toggle commands were a B, more invasive than necessary. The load sensing was the weakest at C-plus — it deleted an existing function and replaced it with something simpler, which is GLM’s most persistent failure mode.

Average quality: B-plus. For free tokens, overnight, with no human review until morning, that is usable.

Three things broke in instructive ways. The preflight probe — an “are you alive?” check that runs before each task — was missing a single flag. Without it, the agent tried to load OAuth, hooks, and MCP servers, all of which fail silently in the isolated environment. Result: empty response, retry every thirty seconds, burn the entire two-hour Temporal timeout. Eighty percent failure rate from one missing word in one line.

The system auto-merged approved branches to main on the worker. But I was also pushing fixes from the architect instance, which caused push races five times in one night. The fix was to stop the worker from touching main at all — it produces branches, the judgment layer reviews and merges. The execution layer should never own the canonical branch.

The funniest failure was an infinite test loop. A spec said “add three tests for auto-commit.” GLM added three tests. The review said “more coverage needed.” GLM added three more. Twenty-two commits in two hours, with increasingly obscure edge cases like null byte handling and carriage return no-ops. The fix was trivial: specs must now name exact test functions with an explicit stop condition. “Add test_foo and test_bar. Stop after these two. Do not add more.”

The autonomous monitoring pattern turned out to be the most important piece. Claude Code ran a loop that fired every fifteen minutes: check the queue, triage completed tasks, kill stuck ones, re-dispatch failures, set the next timer. Twenty-one cycles ran autonomously overnight. The session survived in tmux even when my SSH client disconnected. This works because Temporal is a queue — dispatch ten specs, walk away, the system drains them and handles retries on its own.

Throughput was about two tasks per hour with a single provider, limited by GLM’s token generation speed and the per-provider concurrency limit. With three providers this becomes six concurrent and roughly six tasks per hour. The harness flag to route across providers shipped during the session itself.

The question everyone asks is whether it is worth it. Five features with tests, overnight, zero human implementation time, zero API cost. The quality requires review — I graded every commit and found real issues. But “free overnight code at B-plus quality that needs a five-minute review” is a fundamentally different economics than “I will implement it myself tomorrow.” The judgment layer — spec quality plus review quality — is where human value concentrates. The implementation is commodity.