Terry Li

Last night I ran an autonomous AI coding pipeline for 8 hours while I slept (mostly). The system dispatched specs to GLM-5.1 via ZhiPu’s free coding plan, with Claude Code monitoring and reviewing every 15 minutes. Here’s what actually happened.

The setup

  • Architect (CC): Claude Code on a small Fly.io instance (soma). Writes specs, reviews diffs, dispatches tasks.
  • Executor (ribosome): Claude Code in --bare mode on an OCI ARM instance (ganglion). 4 CPU, 24GB RAM. Runs GLM-5.1 through ZhiPu’s Anthropic-compatible API.
  • Orchestrator: Temporal, with 2hr activity timeouts and automatic retry.
  • Queue: Dispatch as many specs as you want. Temporal drains them 2 at a time per provider.
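
The drain-2-at-a-time behavior above can be sketched with an asyncio semaphore standing in for Temporal's per-provider concurrency limit (names and the in-memory queue are illustrative, not the real mtor/Temporal code):

```python
import asyncio

MAX_PER_PROVIDER = 2   # Temporal drains 2 specs at a time per provider
peak = 0               # instrumentation: highest observed concurrency
in_flight = 0

async def run_spec(spec: str, sem: asyncio.Semaphore, done: list):
    global peak, in_flight
    async with sem:                  # blocks once 2 specs are in flight
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)    # placeholder for the real activity
        in_flight -= 1
        done.append(spec)

async def drain(specs, done):
    sem = asyncio.Semaphore(MAX_PER_PROVIDER)
    await asyncio.gather(*(run_spec(s, sem, done) for s in specs))

done: list = []
asyncio.run(drain([f"spec-{i}" for i in range(5)], done))
print(len(done), peak)  # 5 2
```

Dispatch as many specs as you like; the semaphore guarantees only two run concurrently while the rest wait in the queue.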

What I dispatched

5 initial specs for mtor self-improvement (the system improving itself):

  • Negative feedback dispatch throttling
  • AMPK ganglion load sensing
  • Rapa/deptor toggle commands
  • Dispatch deduplication
  • Deptor freeze mechanism

Then 8 more during the night as issues surfaced and new specs were written.

What landed

5 features merged to main in the first 4 hours, each with tests, proper commit messages, and working code. Quality grades:

| Feature | Grade | Lines | Notes |
|---|---|---|---|
| Negative feedback loop | B+ | 113 | Clean design, tests included |
| Toggle rapa/deptor | B | 77 | More invasive than needed |
| AMPK sensing | C+ | 155 | Deleted existing function (later recovered) |
| Spec-status feedback loop | A | 281 | Best commit — exactly what was asked |
| Feedback wiring | B+ | 165 | Good provider-level tracking |

Average: B+. For free tokens, overnight, no human review until morning — that’s usable.

What broke

1. The preflight probe (80% failure rate)

The ribosome runs an “are you alive?” check before starting work. The probe was missing the --bare flag, so Claude Code tried to load OAuth, hooks, and MCP servers, all of which fail silently under the isolated env -i environment. Result: empty response → retry every 30 seconds → burn the entire 2-hour Temporal timeout.

Fix: one flag added to one line: claude --bare --print -p "echo ok".
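
Beyond the flag itself, the failure mode worth designing out is the unbounded retry. A minimal sketch of a safer preflight, with a bounded retry budget so a broken probe fails fast instead of burning the whole activity timeout (`probe` stands in for shelling out to claude; the function and parameters are hypothetical):

```python
import time

def preflight(probe, retries: int = 3, backoff_s: float = 0.0) -> bool:
    """Return True if the executor answers 'ok' within the retry budget."""
    for _ in range(retries):
        try:
            if probe().strip() == "ok":
                return True
        except Exception:
            pass              # empty or failed response: retry
        time.sleep(backoff_s)
    return False              # give up quickly; let the orchestrator decide

# A probe that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return "" if calls["n"] < 3 else "ok"

print(preflight(flaky))  # True
```

A bounded budget means a misconfigured probe surfaces as a fast, visible failure rather than a silent 2-hour stall.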

2. Auto-merge to main caused push races

The system auto-merged approved branches to main on ganglion. But soma (where I was also pushing fixes) raced with ganglion’s pushes: “error: failed to push some refs”, five times in one night.

Fix: Changed the merge function to push branches to origin instead. CC reviews and merges. The judgment layer (CC) owns main, the execution layer (GLM) produces branches.
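
The ownership split can be captured as a push policy: the execution layer only ever pushes feature branches, and only the judgment layer may advance main. A sketch with hypothetical role and branch names (the real mtor code presumably shells out to git with an equivalent refspec):

```python
def push_refspec(role: str, branch: str) -> str:
    """Decide what a given layer is allowed to push."""
    if role == "executor":
        # GLM's work always lands on its own branch; no race on main.
        return f"HEAD:refs/heads/{branch}"
    if role == "architect":
        # CC reviews, merges, and fast-forwards main itself.
        return "main:refs/heads/main"
    raise ValueError(f"unknown role: {role}")

print(push_refspec("executor", "spec-ampk-sensing"))
# HEAD:refs/heads/spec-ampk-sensing
```

With only one writer to main, the “failed to push some refs” race disappears by construction.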

3. GLM infinite test loop (22 commits in 2 hours)

A spec said “add 3 tests for auto-commit.” GLM added 3 tests, the system said “more coverage needed,” GLM added 3 more, repeat. 22 commits of increasingly obscure edge-case tests: test_auto_commit_null_byte, test_auto_commit_newline_wf_id, test_auto_commit_CR_noop.

Fix: Specs must now name exact test functions with a stop condition: “Add test_foo, test_bar. Stop after these 2 — do not add more.”
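
The stop condition works because it makes the spec a closed set: the completion check can then treat extra tests as a violation, not as bonus coverage. A hypothetical sketch of that check (function and return shape are illustrative):

```python
def coverage_complete(required: set, implemented: set):
    """Spec lists exact test names; anything outside the list is rejected."""
    missing = required - implemented
    extra = implemented - required
    if missing:
        return False, f"missing: {sorted(missing)}"
    if extra:
        # Without this branch, "more coverage needed" loops forever.
        return False, f"stop condition violated, extra tests: {sorted(extra)}"
    return True, "done; do not add more tests"

print(coverage_complete({"test_foo", "test_bar"}, {"test_foo", "test_bar"}))
# (True, 'done; do not add more tests')
```

An open-ended “add 3 tests” has no fixed point; an exact name list does.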

What I learned about quality

Correctness is model-determined, not harness-determined. I tested CC, Goose, and OpenCode as harnesses — all running GLM-5.1 through the same API. The code quality was identical. The harness is infrastructure; the model does the work.

GLM-5.1 deletes things it shouldn’t. The biggest quality issue: GLM replaces existing sophisticated functions with simpler versions. A nuanced should_dispatch() that checked provider health + feedback + cooldown was replaced with a 1-line threshold check. The spec said “add” but GLM interpreted it as “replace.”
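
One mechanical guard against the “add means replace” failure mode is to diff the set of function definitions before and after a change and flag any that vanished. A sketch using Python’s ast module (this is my illustration of the kind of check that could feed a destruction flag, not the actual mtor implementation):

```python
import ast

def removed_functions(old_src: str, new_src: str) -> set:
    """Return names of functions present before the change but gone after."""
    def defs(src):
        return {n.name for n in ast.walk(ast.parse(src))
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))}
    return defs(old_src) - defs(new_src)

old = "def should_dispatch():\n    pass\ndef helper():\n    pass\n"
new = "def should_dispatch():\n    pass\n"
print(removed_functions(old, new))  # {'helper'}
```

A non-empty result on a spec that said “add” is a strong signal the model replaced something it should have left alone.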

The verdict gate works. Flags like no_commit_on_success, destruction, and target_file_missing caught real problems. 69% of overnight merges were correct. The rejected 31% were legitimate rejections — duplicate work, empty diffs, or destructive changes.
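
The gate can be sketched as a set of independent checks, each emitting one of the flags named above, where any flag blocks the auto-merge. The flag names come from the post; the result-record shape and thresholds are my assumptions:

```python
def verdict(result: dict) -> list:
    """Inspect a task result and return blocking flags (empty = mergeable)."""
    flags = []
    if result.get("exit_ok") and not result.get("committed"):
        flags.append("no_commit_on_success")   # claimed success, no commit
    if result.get("lines_deleted", 0) > result.get("lines_added", 0) * 3:
        flags.append("destruction")            # suspiciously deletion-heavy
    if not result.get("target_exists", True):
        flags.append("target_file_missing")    # spec pointed at a dead path
    return flags

print(verdict({"exit_ok": True, "committed": False,
               "lines_deleted": 200, "lines_added": 10,
               "target_exists": False}))
# ['no_commit_on_success', 'destruction', 'target_file_missing']
```

Each check is cheap and mechanical, which is exactly why the gate can run unattended overnight.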

The autonomous monitoring pattern

Claude Code ran a “securin” loop — a background timer that fires every 15 minutes:

sleep 900 && mtor list --count 50

When the timer completes, CC reads the results, triages completed tasks, kills stuck ones, re-dispatches failures, and sets the next timer. 21 cycles ran autonomously overnight. The session survives in tmux even when the SSH client disconnects.

This works because Temporal is a queue. Dispatch 10 specs, walk away. Temporal runs 2 at a time, retries failures, times out stuck tasks. CC just watches and reacts.
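
One securin cycle reduces to reading task states and sorting them into actions. A sketch of that triage step (states and action names are illustrative; the real loop shells out to mtor list and lets Temporal handle retries and timeouts):

```python
def triage(tasks: list) -> dict:
    """Sort task records from one polling cycle into follow-up actions."""
    actions = {"merge_review": [], "kill": [], "redispatch": []}
    for t in tasks:
        if t["state"] == "completed":
            actions["merge_review"].append(t["id"])
        elif t["state"] == "stuck":
            actions["kill"].append(t["id"])
        elif t["state"] == "failed":
            actions["redispatch"].append(t["id"])
        # "running" tasks are left alone until the next 15-minute timer
    return actions

print(triage([{"id": "t1", "state": "completed"},
              {"id": "t2", "state": "stuck"},
              {"id": "t3", "state": "running"}]))
# {'merge_review': ['t1'], 'kill': ['t2'], 'redispatch': []}
```

Because the queue itself is durable, the monitor only needs to be correct per cycle, not continuously available.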

Throughput

~2 tasks per hour with a single provider (ZhiPu). Limited by:

  • GLM-5.1 token generation speed (~20-40 min per implementation)
  • Preflight flakiness (fixed, now <5% failure rate)
  • 2 concurrent per provider limit

With 3 providers (ZhiPu + Volcano + Infini), this becomes 6 concurrent → ~6 tasks/hour. The --harness flag to route across providers shipped during the session.
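
The throughput numbers are consistent with simple arithmetic. The 0.5 efficiency factor below is my assumption, fitted so the formula reproduces the observed ~2 tasks/hour at 2 concurrent slots and a ~30-minute average task:

```python
def tasks_per_hour(concurrent: int, avg_task_min: float,
                   efficiency: float = 0.5) -> float:
    """Ideal slot throughput, discounted for flakiness and review overhead."""
    return concurrent * (60 / avg_task_min) * efficiency

print(tasks_per_hour(2, 30))   # 2.0  (one provider, observed overnight)
print(tasks_per_hour(6, 30))   # 6.0  (three providers, 2 slots each)
```

The lever is concurrency, not model speed: adding providers triples throughput without touching generation latency.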

Is it worth it?

5 features with tests, overnight, zero human implementation time, zero API cost (ZhiPu’s coding plan is free). The quality requires review — I graded every commit and found real issues. But “free overnight code at B+ quality that needs a 5-minute review” is fundamentally different economics from “I’ll implement it myself tomorrow.”

The system is the message: write specs precisely, let cheap models execute, review the output. The judgment layer (spec quality + review quality) is where human value concentrates. The implementation is commodity.