Terry Li

The new generation of personal AI agents sells a seductive promise: the agent improves itself. It watches what you do, auto-generates skills from complex tasks, curates its own memory, and gets better the longer it runs. Zero maintenance. You install it, configure your messaging accounts, and walk away. The learning loop handles the rest.

I have been running a system like this for fourteen months, and I can tell you exactly where that loop breaks down.

The auto-generated skill captures the procedure. It records that you ran five commands in sequence and got a result. What it cannot capture is judgment — when to skip step three, what the error in step four actually means, why this procedure fails silently on Thursdays. Procedure without judgment is a script. Scripts are fine until the environment shifts, and environments always shift. The skill that was auto-generated from one successful run becomes a confident, wrong playbook the moment the underlying API changes or the edge case finally appears.
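
To make that concrete, here is a hypothetical sketch of what an auto-generated skill amounts to once recorded: a command sequence replayed verbatim. Every command and name below is invented for illustration; the point is that the structure has nowhere to put judgment.

```python
import subprocess

# A hypothetical auto-generated "skill": the five commands from one
# successful run, replayed in order. Nothing here records why a step
# exists or when it should be skipped.
DEPLOY_SKILL = [
    "git fetch origin",
    "git rebase origin/main",
    "./run_migrations.sh",  # the step that fails silently during the backup window
    "make build",
    "make deploy",
]

def run_skill(steps: list[str]) -> None:
    for step in steps:
        # check=True catches nonzero exit codes, but a step that exits 0
        # while doing the wrong thing sails straight through. There is no
        # slot for "skip step three" or "what this error actually means".
        subprocess.run(step, shell=True, check=True)

run_skill(DEPLOY_SKILL)
```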

The self-curated memory is worse. One system I evaluated limits memory to 2,200 characters, roughly 550 tokens at the usual four characters per token. The agent decides what to remember and what to forget. This sounds efficient until you realise that the hardest-won corrections are the ones that feel least important in the moment. “Never run this migration without checking the merge base first” looks like a minor note until the day you lose a branch. An agent optimising for relevance will eventually consolidate away the correction that prevents the catastrophe, because catastrophes are rare and rare things look irrelevant to a system that optimises for frequency.
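
A minimal sketch of that dynamic, with invented entries and an invented scoring rule. Any curation policy keyed to reference frequency ranks the once-in-fourteen-months correction last, so it is the first entry pruned when the budget bites.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    times_referenced: int  # how often the agent has recalled this entry

memories = [
    Memory("User's name is Terry; timezone is UTC+8", times_referenced=212),
    Memory("Prefers squash merges; never force-push shared branches",
           times_referenced=57),
    Memory("Never run this migration without checking the merge base first",
           times_referenced=1),  # learned once, from the incident
]

def prune_by_relevance(items: list[Memory], budget: int) -> list[Memory]:
    # Keep the most-referenced entries until the character budget is spent.
    kept, used = [], 0
    for m in sorted(items, key=lambda m: m.times_referenced, reverse=True):
        if used + len(m.text) <= budget:
            kept.append(m)
            used += len(m.text)
    return kept

for m in prune_by_relevance(memories, budget=100):  # budget tightened for the demo
    print(m.text)
# The merge-base correction does not survive: referenced once, ranked last.
```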

My system has more than 150 typed correction files, accumulated across fourteen months. Each one carries metadata: where the correction came from, how durable it is, whether subsequent incidents have confirmed it. Some are protected and cannot be auto-pruned. The memory is not a flat scratchpad the agent manages. It is a curated knowledge base with explicit lifecycle rules, and the human decides what gets protected status. This is more maintenance. It is also why the system has not made the same mistake twice since March.
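
The files themselves are plain text, but the metadata amounts to something like the following schema. The field and type names are paraphrases for illustration, not the literal on-disk format.

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    HUMAN_CORRECTION = "human"   # I told the agent it was wrong
    INCIDENT = "incident"        # learned from a failure
    OBSERVATION = "observation"  # inferred from watching a session

class Durability(Enum):
    EPHEMERAL = "ephemeral"  # safe to prune when stale
    DURABLE = "durable"      # prune only if contradicted later
    PERMANENT = "permanent"  # never auto-pruned

@dataclass
class Correction:
    text: str
    source: Source
    durability: Durability
    confirmations: int       # subsequent incidents that re-proved it
    protected: bool = False  # set by the human, never by the agent

def can_auto_prune(c: Correction) -> bool:
    # The lifecycle rule: the agent may only prune what is unprotected
    # and explicitly marked ephemeral. Everything else waits for a human.
    return not c.protected and c.durability is Durability.EPHEMERAL
```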

The real difference is not feature depth. It is the difference between a garden and a weed patch. Both grow. A weed patch grows faster because nobody tends it. A garden grows slower because someone decides what stays and what gets pulled. But a weed patch never produces a harvest. It produces volume that looks like growth until you try to find something useful in it.

Auto-improvement is a local maximum. The system gets better quickly at first — it learns your name, your timezone, your project structure. Then it plateaus, because the remaining improvements require judgment that the system cannot make about itself. Is this correction important enough to protect? Is that skill still valid after last week’s refactor? Should this memory be promoted to a permanent rule or left to decay? These are editorial decisions, and editorial decisions require understanding the consequences of being wrong, which requires experience the agent does not have.

The agents that compound over months and years are the ones someone tends. Not constantly — the maintenance should shrink over time as patterns get crystallised into deterministic checks. But the human in the loop is not a weakness to be engineered away. The human is the editorial function that prevents the knowledge base from silting up with confident, stale, auto-generated noise.
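
Crystallising a pattern into a deterministic check looks something like this hedged sketch, which turns the merge-base correction quoted earlier into a gate that runs before every migration, so no memory system has to surface it at the right moment.

```python
import subprocess
import sys

def require_merge_base(mainline: str = "origin/main") -> None:
    # The hard-won correction, compiled into code: `git merge-base
    # --is-ancestor A B` exits 0 only if A is an ancestor of B, so this
    # verifies HEAD is up to date with the mainline before migrating.
    result = subprocess.run(["git", "merge-base", "--is-ancestor", mainline, "HEAD"])
    if result.returncode != 0:
        sys.exit(f"Refusing to migrate: HEAD has diverged from {mainline}.")

require_merge_base()
# ...only now does the migration itself run.
```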

If you are choosing an AI agent for a weekend project, pick the one with the learning loop. It will be great for a month. If you are choosing one to run for a year, pick the one that lets you curate. The loop is a feature. The curation is the architecture.
