What 16,000 Simon Willison posts reveal about the state of AI coding agents
I scraped all 16,181 of Simon Willison’s blog posts into a JSONL file and ran my coding agent dispatch system against the corpus. Here’s what 395 posts from 2026 tell us about where AI coding agents are heading — and what it means for enterprise adoption.
The pivot
Simon has effectively rebranded from “AI observer” to “agentic engineering evangelist.” 30 of his 31 substantive 2026 posts touch on coding agents. He’s launched a book-length patterns guide, expanded into Swift and Go via agent-assisted coding, and coined “Deep Blue” for the psychological toll on developers watching their skills get automated.
The shift has a precise timestamp: November 2025, when Claude Opus 4.5 and GPT-5.2 crossed a reliability threshold. Simon calls it “the inflection point” — the moment coding agents went from “mostly works, watch carefully” to “almost always correct.”
GLM-5 is closer than you think
SWE-bench Verified (February 2026, independently run) puts GLM-5 at 72.8% — tied with GPT-5.2 and only 4 points behind Claude Opus 4.6. For a free, MIT-licensed model, that’s remarkable.
But the benchmark undersells the story. SWE-bench uses the same system prompt for every model. It doesn’t measure harness optimization — coaching files, stall detection, TDD preambles, structured prompting. My own system uses GLM-5.1 with extensive coaching injection, and the results on well-specified tasks are indistinguishable from Opus.
The competitive advantage isn’t the model. It’s the harness.
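As a sketch of what coaching injection can look like in practice (the directory layout and function name here are illustrative, not the actual harness API):

```python
from pathlib import Path

def build_system_prompt(task: str, coaching_dir: str = "coaching") -> str:
    """Assemble a dispatch prompt by injecting coaching files ahead of the task.

    Coaching files are plain markdown notes (TDD preambles, style rules,
    known failure modes). The harness prepends them to every dispatch,
    which is where much of the model-leveling effect lives.
    """
    sections = [
        f"## {path.stem}\n{path.read_text().strip()}"
        for path in sorted(Path(coaching_dir).glob("*.md"))
    ]
    return "\n\n".join(sections + [f"## Task\n{task.strip()}"])
```

The point of keeping coaching in files rather than in the prompt template is that the harness improves independently of the model: swap GLM for Opus and the same accumulated coaching still applies.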
The StrongDM signal
The most provocative data point is StrongDM’s “Dark Factory” — a 3-person team building security software where no human writes or reads the code. Their rules:
- Code must not be written by humans
- Code must not be reviewed by humans
- If you haven’t spent $1,000 on tokens per engineer per day, your factory has room for improvement
The $1,000/day number is eye-catching, but the real innovation is their testing approach. They use “scenario testing” with holdout sets — test scripts stored outside the codebase where agents can’t see them, like holdout sets in ML training. This prevents the agent from gaming its own tests.
They also built a “Digital Twin Universe” — behavioral clones of Okta, Jira, and Slack as self-contained Go binaries, generated by dumping API docs into agents. They test against these clones at volume without rate limits.
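StrongDM’s twins are Go binaries; the same idea can be sketched in a few dozen lines of Python — a stateful, rate-limit-free behavioral clone of one Jira-style endpoint (the endpoint shape is illustrative, not a faithful reproduction of Jira’s API):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory state: the twin behaves like the real service within a session.
ISSUES = {}

class TwinHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/rest/api/2/issue":
            length = int(self.headers["Content-Length"])
            fields = json.loads(self.rfile.read(length))
            key = f"PROJ-{len(ISSUES) + 1}"
            ISSUES[key] = fields
            self._reply(201, {"key": key})
        else:
            self._reply(404, {"error": "not found"})

    def do_GET(self):
        key = self.path.rsplit("/", 1)[-1]
        if key in ISSUES:
            self._reply(200, {"key": key, "fields": ISSUES[key]})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, status, payload):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep integration-test output quiet
        pass

def serve(port: int = 8765):
    """Blocking server loop; point the agent's integration tests at it."""
    HTTPServer(("127.0.0.1", port), TwinHandler).serve_forever()
```

Because the twin is local and stateful, agents can hammer it with thousands of requests per test run — exactly what a real Okta or Slack tenant would throttle.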
What banks should steal
I mapped Simon’s 2026 patterns to enterprise financial services adoption. The surprising finding: banks are better positioned than startups for agent adoption, because they already have the testing infrastructure.
| Bank practice | Agent equivalent |
|---|---|
| Model validation (independent team) | Holdout scenario tests |
| Stress testing | Digital Twin Universe |
| Regulatory reporting | Structured proof-of-work artifacts |
| Audit trail | Immutable DAG logs |
| Backtesting | Conformance-driven development |
The maturity model maps cleanly to a 5-level assessment:
1. Spicy autocomplete — Copilot suggestions
2. Agent-assisted — agents write, humans review every line
3. Agent-driven — agents write and test, humans review selectively
4. Software factory — agents write, test, and verify; humans design specs
5. Dark factory — no human writes or reads code
Most banks sit at levels 1 or 2. The question isn’t whether to move up — it’s what controls you need at each level, and that’s where existing bank QA culture becomes an asset rather than a drag.
The three threats
Three real supply chain attacks in Q1 2026 should be on every bank CISO’s radar:
- LiteLLM credential stealer (March) — malware hidden in a `.pth` file, activated on install without ever being imported. Any bank using LiteLLM for LLM routing had credentials exposed.
- Axios RAT (April) — npm package with 101M weekly downloads compromised via a stolen token and individually targeted social engineering of maintainers.
- Snowflake Cortex sandbox escape (March) — prompt injection in a GitHub README caused an agent to escape Snowflake’s sandbox and execute malware.
Meanwhile, Thomas Ptacek writes that “within months, coding agents will drastically alter exploit development economics.” The asymmetry is stark: defenders are debating adoption frameworks while attackers are already running “find me zero days” against source trees.
The mid-career squeeze
ThoughtWorks ran a retreat on the future of software engineering. Their finding: “Juniors are more profitable than ever — AI gets them past the net-negative phase faster. The real concern is mid-level engineers who came up during the hiring boom and may not have developed the fundamentals.”
Simon echoes this: the technology “is really good for experienced engineers — it amplifies their skills. It’s really good for new engineers — it solves onboarding problems. The problem is the people in the middle.”
For banks with large legacy development teams, this is the workforce planning challenge of the next 3 years.
The meta-lesson
I ran this analysis two ways: Claude Opus read the 5 richest posts in depth and synthesized with consulting judgment. GLM-5.1 processed all 16,181 posts programmatically via keyword extraction.
Opus found the frameworks. GLM found the data points. Neither analysis alone would have been complete. The architect-implementer split applies to analysis, not just code — judgment from the expensive model, volume from the cheap one.
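The cheap volume pass is nothing exotic. A sketch of the GLM-side triage over the JSONL corpus (the field names and keyword list are illustrative, not the actual pipeline):

```python
import json

# Illustrative keyword list; the real triage criteria were richer than this.
KEYWORDS = {"agent", "agents", "harness", "holdout", "sandbox", "benchmark"}

def triage(jsonl_path: str, top_k: int = 5) -> list[tuple[int, str]]:
    """Volume layer: score every post by keyword hits, surface the richest few.

    Only the top_k survivors go to the expensive model for deep synthesis.
    """
    scored = []
    with open(jsonl_path) as f:
        for line in f:
            post = json.loads(line)
            words = post.get("body", "").lower().split()
            hits = sum(1 for w in words if w.strip(".,;:!?\"'()") in KEYWORDS)
            scored.append((hits, post.get("title", "")))
    scored.sort(reverse=True)
    return scored[:top_k]
```

The triage layer is deliberately dumb: its only job is to make 16,181 posts cheap to rank so the expensive model reads five instead of all of them.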
That’s probably the most transferable insight here: the question isn’t “which AI model should we use?” It’s “which model for which layer of the problem?”