Terry Li

On April 8, Anthropic launched Claude Managed Agents — a hosted platform for running long-lived AI agents in the cloud. You define the agent’s tasks, tools, and guardrails; they run the infrastructure.

The interesting part isn’t the product. It’s the architecture underneath. Anthropic published an engineering blog called “Scaling Managed Agents: Decoupling the brain from the hands” that walks through how they separated the orchestration loop from the execution environment. Reading it felt like seeing a blueprint of decisions I’ve already made — and a few I haven’t.

I run a personal agent infrastructure. An orchestrator dispatches coding tasks to headless workers on remote machines. I’ve been building this for months, iterating through the same problems Anthropic describes. This post is about what their architecture validates, what it does better, and what’s worth stealing.

The architecture, briefly

Managed Agents virtualizes three things:

  • Session — an append-only event log. Everything that happened, durably stored outside the agent’s context window.
  • Harness — the orchestration loop. Calls Claude, routes tool calls to sandboxes, handles retries. Stateless — if it crashes, a new one reads the session log and resumes.
  • Sandbox — an isolated container where code runs. Dies and gets replaced without affecting the session or harness.

The key property: all three are independently replaceable. The harness doesn’t know what’s behind the sandbox interface. The session doesn’t care which harness reads it.
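A sketch of that decomposition in TypeScript. These shapes are hypothetical — the blog describes the three pieces’ roles, not their exact signatures — and the in-memory session stands in for a durable store:

```typescript
// Hypothetical shapes for the three pieces; only the roles come from the post.
type SessionEvent = { seq: number; type: string; payload?: unknown };

// Session: durable, append-only, lives outside any single process.
class InMemorySession {
  private events: SessionEvent[] = [];
  append(type: string, payload?: unknown): number {
    this.events.push({ seq: this.events.length, type, payload });
    return this.events.length - 1;
  }
  getEvents(from = 0): SessionEvent[] {
    return this.events.slice(from);
  }
}

// Sandbox: where untrusted code runs; disposable and replaceable.
interface Sandbox {
  execute(name: string, input: string): string;
}

// Harness: stateless. A replacement harness "wakes" by re-reading the log;
// nothing it needs lives in process memory.
class Harness {
  constructor(private session: InMemorySession, private sandbox: Sandbox) {}
  wake(): SessionEvent[] {
    return this.session.getEvents(); // reconstruct state, then resume the loop
  }
}
```

The point of the shape: the harness holds references to interfaces, never to state, so killing and replacing any one piece leaves the other two untouched.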

What it validates

If you’ve been building agent infrastructure, this architecture looks familiar.

I have an orchestrator (mtor) that dispatches tasks to workers (ribosome) on a remote ARM machine (ganglion). The orchestrator writes specs, the worker executes, a verdict gate checks the output. Each piece is separate. Workers are disposable. The orchestrator doesn’t touch implementation code.

Anthropic arrived at the same decomposition for the same reasons:

Workers must be cattle, not pets. Their early design put everything in one container. Container fails, session is lost. They couldn’t even debug it because user data was in there too. My equivalent: when ganglion OOMs (it did, two days ago), I reboot it and re-dispatch. Workers are interchangeable. The task state should survive.

The orchestrator must be stateless. Their harness crashes, a new one calls wake(sessionId) and resumes. My orchestrator (mtor) uses Temporal workflows — same pattern, different implementation. The workflow state is durable; the orchestrator process is not.

Credentials never enter the sandbox. They bake git tokens into local remotes during container init and store OAuth tokens in a vault behind a proxy. I use 1Password injection (op run) — the worker process gets env vars scoped to its runtime, never persisted. Same principle: the agent that runs untrusted code must never hold credentials it could exfiltrate.

The validation matters because these aren’t obvious decisions. The simpler path is one big container, credentials in environment variables, state in memory. Every agent framework starts there. The decomposition only looks obvious in retrospect.

What they do better

Session as a queryable object

This is the biggest gap in my system.

Anthropic’s session is an append-only event log with getEvents() — you can read positional slices, rewind to before a specific action, re-read context that was compacted away. The session lives outside the context window but remains accessible to the harness.

My equivalent is a 15-line state file (tonus) that gets overwritten each checkpoint, plus raw session JSONL that nobody reads. If a worker crashes, I lose whatever wasn’t checkpointed. There’s no “go back to event N and re-read the lead-up.”

The insight: the context window is a view, not the source of truth. Context engineering (compaction, trimming, summarization) is lossy by definition — you’re guessing what future turns will need. An append-only event log means those guesses are reversible. The harness can always go back to the raw events.
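A toy illustration of “view, not source of truth,” with hypothetical event shapes. Compaction is a lossy projection over the log; rewinding is just reading an earlier slice:

```typescript
// Hypothetical event shape; the real log would be durably stored.
type Event = { seq: number; type: string; text: string };

const log: Event[] = [];
const append = (type: string, text: string) =>
  log.push({ seq: log.length, type, text });

append("task_started", "implement parser");
append("tool_called", "bash: npm test");
append("error", "3 tests failing");
append("checkpoint", "fixed 2 of 3");

// The context window is built FROM the log, not the other way around.
// This compaction keeps only the two most recent events — lossy by design.
const compacted = log.slice(-2).map(e => e.text);

// A bad compaction guess is reversible: rewind to before event 2
// and re-read the lead-up.
const leadUp = log.filter(e => e.seq < 2);
```

Because `compacted` is derived, throwing it away costs nothing; because `log` is append-only, no guess about what to keep is ever final.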

This is worth building.

Self-evaluation loops

In research preview, Managed Agents supports defining success criteria that Claude iterates against until met. In internal testing, this improved task success by up to 10 percentage points over standard prompting, with the largest gains on the hardest problems.

My system does post-hoc verdicts: the worker runs once, a separate gate checks the output, pass or fail. One shot. If the verdict fails, I re-dispatch manually.

The difference is obvious: iterate before declaring done, not after. Acceptance criteria in the spec, worker self-checks against them, loops until they pass (bounded by attempt count). The verdict gate becomes an external auditor confirming work the worker already believes is correct.

This directly addresses my biggest overnight failure mode: worker declares “done” but the output doesn’t meet the spec.
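A sketch of the bounded iterate-until-correct loop. Everything here is illustrative — the function name, the criteria shape, the feedback strings — but the control flow is the point: failed criteria feed the next attempt, and the loop is capped:

```typescript
// Hypothetical loop shape: run, self-check against criteria, retry with
// feedback, bounded by attempt count.
function runWithSelfEval(
  attempt: (feedback: string[]) => string, // one worker run
  criteria: Record<string, (output: string) => boolean>,
  maxAttempts = 3,
): { output: string; passed: boolean; attempts: number } {
  let feedback: string[] = [];
  let output = "";
  for (let i = 1; i <= maxAttempts; i++) {
    output = attempt(feedback);
    // Self-check: which acceptance criteria does this output fail?
    feedback = Object.entries(criteria)
      .filter(([, check]) => !check(output))
      .map(([name]) => `unmet criterion: ${name}`);
    if (feedback.length === 0) return { output, passed: true, attempts: i };
  }
  return { output, passed: false, attempts: maxAttempts };
}
```

The verdict gate then audits `output` only when `passed` is true — it confirms work the worker already believes is correct, rather than being the first check the output ever sees.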

The meta-harness concept

This is the subtlest insight.

Anthropic’s engineering blog opens with a specific example: Sonnet 4.5 had “context anxiety” — it would wrap up tasks prematurely as it approached the context limit. They added context resets to the harness to compensate. Then Opus 4.5 came along, the behavior disappeared, and the resets became dead weight.

The lesson: harnesses encode assumptions about what the model can’t do, and those assumptions expire. The meta-harness design is opinionated about interface shapes (session, harness, sandbox) but unopinionated about what runs behind them. When the model improves, the harness implementation changes; the interfaces don’t.

My coaching file — a list of “what the implementer model does wrong” prepended to every dispatch — is exactly this kind of expiring assumption. Some entries describe real model limitations. Others describe limitations the model has already outgrown. The coaching file should be a diff against reality, tested regularly and pruned aggressively.

Five patterns worth stealing

1. Append-only event log per task

The worker emits structured events: task_started, tool_called, file_written, error, checkpoint. Stored outside the worker. On crash, a new worker reads events and resumes from the last checkpoint. Also gives you the audit trail — no more guessing whether a commit landed.

2. Abstract the execution target

execute(name, input) → string as the universal interface. The orchestrator doesn’t know if the target is an ARM container, a Fly machine, or a local subprocess. Dispatch routes by task requirements (needs GPU? needs browser?), not by hardcoded host.
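A minimal sketch of capability-based routing. The capability tags and target names are made up; what matters is that dispatch matches requirements against capabilities instead of naming a host:

```typescript
// Hypothetical capability tags; extend as your fleet grows.
type Requirement = "gpu" | "browser" | "arm";

interface Target {
  name: string;
  capabilities: Set<Requirement>;
  execute(name: string, input: string): string; // the universal interface
}

// Route by what the task needs, not by hardcoded host.
function route(targets: Target[], needs: Requirement[]): Target {
  const match = targets.find(t => needs.every(r => t.capabilities.has(r)));
  if (!match) throw new Error(`no target satisfies: ${needs.join(", ")}`);
  return match;
}
```

Adding a new compute target is then additive: register it with its capability set and existing dispatch logic picks it up.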

3. Self-evaluation before completion

Spec frontmatter gets acceptance_criteria:. Worker runs, self-evaluates, iterates (bounded). Verdict gate becomes the external auditor, not the only check. Turns one-shot dispatch into an iterate-until-correct loop.

4. Stateless harness recovery

Depends on #1. Once events exist, recovery is: new worker reads events, reconstructs state, continues. “Stuck task” goes from a 2-hour timeout to an auto-recovery.
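Recovery then reduces to a pure fold over the log. The event and state shapes below are hypothetical, but the mechanism is the whole pattern — a replacement worker replays events into state and continues from there:

```typescript
// Hypothetical task events; in practice these come from the durable log.
type TaskEvent =
  | { type: "task_started"; spec: string }
  | { type: "file_written"; path: string }
  | { type: "checkpoint"; step: number };

interface TaskState { spec: string; files: string[]; step: number }

// A replacement worker reconstructs state by replaying events.
function recover(events: TaskEvent[]): TaskState {
  return events.reduce<TaskState>(
    (state, e) => {
      switch (e.type) {
        case "task_started": return { ...state, spec: e.spec };
        case "file_written": return { ...state, files: [...state.files, e.path] };
        case "checkpoint":   return { ...state, step: e.step };
        default:             return state;
      }
    },
    { spec: "", files: [], step: 0 },
  );
}
```

Because `recover` is deterministic and side-effect-free, it can also power debugging: replay to any prefix of the log to see exactly what the worker knew at that moment.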

5. Credential baking at init time

Git tokens wired into local remotes during provisioning, then removed from the environment. The agent uses git push without ever seeing a token. Smaller attack surface than env var injection.

The business signal

The architecture is interesting. The business move is more interesting.

Three actions in three days: cut off third-party subscription arbitrage (OpenClaw, April 4), ship the strongest model (Opus 4.6, April 7), launch the agent platform (April 8). The $0.08/session-hour pricing means Anthropic now has revenue decoupled from token volume. They’re selling compute time, not just intelligence. That’s a cloud infrastructure business model, not an API business model.

For solo builders, the implication is clear: the agent platform layer is consolidating upward. Build the parts that are specific to your workflow. Don’t build generic sandboxing, session management, or credential vaults — those are becoming commodity infrastructure. Build the judgment layer: which tasks to dispatch, what acceptance criteria to set, how to route between models and compute targets.

That’s where the leverage is.