900 LLM Tasks While I Slept

Last night I queued 900 coding tasks and went to sleep. By morning: 424 git commits, 7000+ tests, 36 new tools. Total compute cost: about five US dollars.

The setup

A single Fly.io machine in Tokyo — 8 cores, 32GB RAM, $11/day (I need to optimise this). On it runs a Python daemon that reads a Markdown task queue and dispatches headless Claude Code instances across three Chinese LLM providers: ZhiPu (GLM-5.1), Infini (deepseek-v3.2), and Volcano (ark-code). All three offer Anthropic-compatible API endpoints on flat-rate coding plans. Unlimited tokens for about 200 RMB/month each.

32 concurrent instances. 8 on ZhiPu, 8 on Infini, 16 on Volcano. Each picks up a task, runs Claude Code in --print --bare mode, writes code, runs tests, commits, moves on. The daemon tracks per-provider concurrency limits and auto-retries failures once.
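The dispatch shape can be sketched in a few lines of Python. This is a sketch under assumptions, not the daemon itself: the endpoint URLs are placeholders, and I am assuming the provider switch happens via `ANTHROPIC_BASE_URL` (the env var Claude Code reads for Anthropic-compatible endpoints).

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Per-provider concurrency caps from the run described above.
PROVIDERS = {"zhipu": 8, "infini": 8, "volcano": 16}

# Hypothetical endpoint map -- the real base URLs are provider-specific.
ENDPOINTS = {p: f"https://{p}.example/anthropic" for p in PROVIDERS}

def run_task(task: str, provider: str) -> bool:
    """Run one headless Claude Code instance; auto-retry once on failure."""
    env = {**os.environ, "ANTHROPIC_BASE_URL": ENDPOINTS[provider]}
    for _ in range(2):  # first attempt plus one retry
        result = subprocess.run(["claude", "--print", task],
                                env=env, capture_output=True)
        if result.returncode == 0:
            return True
    return False

# One pool per provider enforces that provider's concurrency limit.
pools = {p: ThreadPoolExecutor(max_workers=n) for p, n in PROVIDERS.items()}
```

Submitting each task to its provider's pool — `pools[p].submit(run_task, task, p)` — gives the 32-way fan-out while never exceeding any single provider's cap.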

The bug that broke everything

For the first two hours, 90% of tasks failed instantly. Exit code 2 — the bash script was dying before Claude even started.

The cause: I prepend a coaching file to every prompt. That file contains backticks and dollar signs (it’s Markdown with code examples). The daemon interpolated that prompt text into a double-quoted bash string, so the shell treated the backticks as command substitution and the dollar signs as variable expansion, mangling the prompt and killing the script. The fix was embarrassingly simple: write the prompt to a temp file with printf, cat it back.
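The failure mode is easy to reproduce from the Python side. The sample prompt below is illustrative, and the fix writes the temp file from Python rather than with printf, but the principle is the same: never let the shell re-parse the prompt bytes.

```python
import os
import subprocess
import tempfile

prompt = "Run `date` before touching $HOME"  # Markdown-ish: backticks + dollar

# BROKEN: interpolating the prompt into a shell string lets bash expand
# `date` as command substitution and $HOME as a variable.
broken = subprocess.run(f'echo "{prompt}"', shell=True,
                        capture_output=True, text=True).stdout

# FIX: write the prompt verbatim to a temp file; the script cats it back.
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write(prompt)
    path = f.name
fixed = subprocess.run(f'cat "{path}"', shell=True,
                       capture_output=True, text=True).stdout
os.unlink(path)

print("`date`" in broken)  # False: the backticks were executed
print("`date`" in fixed)   # True: the prompt survived intact
```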

A couple of shell metacharacters in one line of bash burned hundreds of task-slots. The system logged the failures, I spotted the pattern, fixed it, and the pass rate jumped from 10% to 72% within minutes.

What it actually produced

Mostly test files and effector fixes. The LLMs are good at: reading a module, writing tests for it, running pytest, fixing failures. They are bad at: creating new files in the right directories (bare mode has no file-write tools), and anything requiring judgment about what to build next.

The overnight batch also built a complete Playwright browser stack, a consulting card generator, a golem dashboard, and about 30 other small utilities. Quality is roughly 7/10 — functional but with rough edges. Some tests have hardcoded paths. Some effectors import the wrong libraries. These become tomorrow’s fix queue.

The auto-requeue loop

The most useful thing I built was making the daemon self-sustaining. When pending tasks drop below 50, it auto-generates more: scan for untested modules, find broken effectors, create fix tasks. The queue never empties. The daemon ran for 8.5 hours unattended, from 7pm to 3:30am, then spent another two hours churning through retries.
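The refill step can be sketched like this — assuming a flat `src/` layout, `tests/test_<module>.py` naming, and `- [ ]` checkbox lines in the queue, all of which are stand-ins for my actual layout:

```python
from pathlib import Path

THRESHOLD = 50  # refill when pending tasks drop below this

def refill_queue(queue: Path, src: Path, tests: Path) -> int:
    """Append a test-writing task for every module lacking a test file."""
    pending = sum(line.startswith("- [ ]")
                  for line in queue.read_text().splitlines())
    if pending >= THRESHOLD:
        return 0
    added = 0
    with queue.open("a") as q:
        for mod in sorted(src.glob("*.py")):
            if not (tests / f"test_{mod.stem}.py").exists():
                q.write(f"- [ ] read {mod.name}, write tests, run pytest, "
                        "fix failures, commit\n")
                added += 1
    return added
```

The daemon runs this scan whenever a worker goes idle, so the queue refills faster than 32 workers can drain it.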

What I learned

The bottleneck is not compute or tokens — it is task specification quality. A well-specified task (“read this file, write tests, run pytest, fix failures, commit”) passes 80% of the time. A vague task (“health check these effectors”) passes 30%. The coaching file helps, but the real leverage is in how you write the queue.
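For contrast, here is what the two ends of that spectrum look like as queue entries (the file names are illustrative, not from my actual queue):

```
- [ ] read src/queue_parser.py, write tests in tests/test_queue_parser.py,
      run pytest, fix failures, commit          # passes ~80% of the time
- [ ] health check these effectors              # passes ~30% of the time
```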

The other lesson: read your own system before building new things. I spent an hour building a crude auto-dispatch loop, only to discover my codebase already had a LangGraph-based autonomous work loop called circulation.py that does exactly this, but better. Check the map before you start digging.