skip to content

Recovery Is Not Control


The easiest mistake in software is to confuse faster repair with stronger control. If a team can detect a failure quickly, roll back quickly, patch quickly, and ship again quickly, the system starts to feel safer than it is. Speed creates confidence. Confidence becomes permission. Permission becomes architectural debt with better dashboards.

AI makes that mistake more tempting because it lowers the cost of recovery. A coding agent can write the fix. Another agent can add the test. A third can scan for similar patterns. The incident queue shrinks, the bug count improves, and the team can point to more automation than it had last quarter. None of that is worthless. Recovery matters. But recovery is not the same thing as control.

Control is partly about what happens before the failure. Is the blast radius bounded. Is the dependency graph still legible. Does anyone understand why the change is safe. Can the system be reasoned about without replaying a hundred local patches. Are critical paths owned by people who can explain them, not just agents that can modify them. A system can have excellent recovery mechanics and still be decaying as a system.

This is the old infrastructure lesson in a new costume. Mean time to recovery became a powerful idea because it accepted reality: failures happen, and resilient systems must recover. But the mature lesson was never that mean time between failures no longer mattered. It was that reliability needed both sides. Design so that failure is survivable, and design so that the system is not constantly inviting failure in the first place.

AI engineering culture can forget the second half because the first half has become so cheap. If agents can repair bugs quickly, bugs feel less expensive. If tests can be generated quickly, test coverage feels like understanding. If incidents can be triaged quickly, incident volume feels less threatening. The organisation starts measuring the places where automation is strongest and stops noticing the places where architecture is getting weaker.

The dangerous failures are not always the visible ones. A visible bug creates a ticket. A failing test creates a red mark. A broken deployment creates an alarm. The quieter failure is semantic: the code still runs, the tests still pass, the dashboard still looks healthy, but fewer people understand the shape of the system. Changes become locally correct and globally unintelligible. The machine recovers faster from the wounds it can see while becoming less able to notice the disease it is growing.

This matters most in systems where the cost of being wrong is not just engineering time. In high-trust environments, recovery speed cannot be the argument for weakening prevention. A system that moves money, grants access, produces evidence, changes records, routes risk, or supports a decision needs more than a fast repair loop. It needs tiering, review, rollback authority, provenance, entitlement boundaries, and a design that keeps human accountability attached to the parts that matter.

The same point applies inside ordinary software teams. The question is not whether agents should be used. They should be. The question is what they are allowed to substitute for. Agents can substitute for repetitive implementation, search, translation, scaffolding, test generation, and first-pass repair. They should not substitute for architectural ownership. They should not become an excuse to stop asking whether the system is still coherent.

Good AI engineering will probably look less like removing gates and more like moving them. Put automation where it compresses toil. Put checks where mistakes become expensive. Let agents repair the small things quickly, but keep prevention around the places where local correctness can still produce global damage. Treat test coverage and bug counts as signals, not absolution.

The real control question is not “how fast can we fix it.” It is “what are we becoming willing to break.”