Shopify open-sourced pi-autoresearch this week. The results are impressive: unit tests 300x faster, CI build times down 65 percent, React component mounting 20 percent faster. The pattern is simple — propose a change, benchmark it, keep what improves the metric, revert what does not, repeat forever. Everyone is talking about the performance numbers. I think they are missing the point.

The loop is: hypothesise, run, measure, keep or discard, log, repeat. Nothing in that sequence requires code. Nothing requires a compiler, a test suite, or a benchmark harness. It requires exactly one thing: a number that goes up or down. I have been running this same loop on sleep quality. The benchmark is an Oura readiness score. The experiments are bedtime windows — 10:30pm versus 11:30pm versus midnight, one week each. The metric is a seven-day rolling average. The decision rule is identical: better than current best, keep; worse, discard. I have also run it on consulting slide structure, where the benchmark is a clarity rating after a cold read and the experiments are section orderings.
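
To make the shape of the loop concrete, here is a minimal sketch in Python. None of this is pi-autoresearch's code; optimise and run_experiment are placeholder names for whatever produces your number, whether that is a benchmark run, a week of readiness scores, or a cold read of a deck. It assumes higher is better.

```python
def optimise(candidates, run_experiment, baseline_score, budget=10):
    """Generic loop: hypothesise, run, measure, keep or discard, log, repeat."""
    best, best_score = None, baseline_score
    log = []  # append-only: failures are recorded alongside wins
    for change in candidates[:budget]:  # hard budget: stop after N experiments
        score = run_experiment(change)  # one variable changed at a time
        kept = score > best_score       # one locked metric, one decision rule
        if kept:
            best, best_score = change, score
        log.append({"change": change, "score": score, "kept": kept})
    return best, best_score, log
```

For the sleep version, candidates is the three bedtime windows, run_experiment is a week at that bedtime, and the score is the seven-day rolling readiness average.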

What makes it work is not the automation. Pi-autoresearch runs autonomously — the AI agent loops without human intervention. That is useful for code, where the feedback cycle is seconds. But the mechanism that actually produces results is the discipline. Four rules:

- One variable per experiment. Multi-variable changes produce uninterpretable results. This is obvious in a lab and ignored everywhere else. Most people optimising their morning routine change three things simultaneously and cannot tell what helped.
- A locked metric, picked before you start and never changed. The temptation to add a second metric when the first one looks bad is the single most common way experiments fail.
- An append-only log where every experiment gets recorded, including the failures (a minimal sketch follows this list). The log is not a side effect — it is the deliverable. After ten experiments, the log tells you more than the winning configuration does, because it also contains the seven things that did not work and the three that were ambiguous.
- A hard budget. You decide upfront how many experiments you will run and you stop when you hit the number. Exploration without a stopping rule is not science. It is gambling.
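
The log rule is the cheapest one to follow. A minimal sketch, assuming a JSON Lines file; the field names and the review format are mine, not pi-autoresearch's:

```python
import json
import time

LOG = "experiments.jsonl"

def record(change, score, kept):
    # Mode "a" makes the log append-only by construction: failures are
    # written with kept=False, never deleted or edited.
    entry = {"when": time.strftime("%Y-%m-%d"), "change": change,
             "score": score, "kept": kept}
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def review():
    # The deliverable: everything tried, including what did not work.
    for line in open(LOG):
        e = json.loads(line)
        status = "kept" if e["kept"] else "discarded"
        print(status, e["change"], e["score"])
```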

The real contribution from pi-autoresearch is the confidence scoring. It computes the score using Median Absolute Deviation (MAD). After three experiments, it tells you whether your best improvement is signal or noise. Green means the improvement is at least twice the noise floor. Red means it is within noise — you might be celebrating benchmark jitter. This matters far more outside code than inside it. A benchmark that runs in 0.3 seconds has low noise. An Oura readiness score has high noise. A human clarity rating on a scale of one to five has enormous noise. Knowing whether your improvement is real or just random variation is the difference between learning and self-deception.
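
The check is simple enough to reimplement for any metric. A sketch of the idea, assuming the noise floor is the MAD of repeated measurements of the current best configuration and that higher is better; the exact scoring pi-autoresearch uses may differ:

```python
from statistics import median

def confidence(baseline_runs, challenger_score):
    # Noise floor: median absolute deviation of repeated baseline runs.
    base = median(baseline_runs)
    mad = median(abs(x - base) for x in baseline_runs)
    improvement = challenger_score - base
    if improvement >= 2 * mad:
        return "green"  # improvement clears twice the noise floor
    return "red"        # too close to the noise floor to trust
```

Feed it a week of readiness scores under the current bedtime and the average from the challenger week, and it tells you whether to believe the difference.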

The interesting future is not AI optimising your code while you sleep. It is the experiment loop becoming a standard tool for any decision where you can measure the outcome. Hiring criteria. Pricing tiers. Email subject lines. Meeting formats. The infrastructure is one script that runs a command and records a number.
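
That script really is the whole infrastructure. A sketch, assuming the command prints its metric as the last line of its output; the filename and the parsing are mine:

```python
import json
import subprocess
import sys
import time

# Usage: python record.py <command ...>
# Runs the command, reads one number from its output, appends it to the log.
cmd = sys.argv[1:]
out = subprocess.run(cmd, capture_output=True, text=True, check=True)
score = float(out.stdout.strip().splitlines()[-1])  # last line is the metric
with open("results.jsonl", "a") as f:
    f.write(json.dumps({"when": time.strftime("%Y-%m-%dT%H:%M:%S"),
                        "cmd": " ".join(cmd), "score": score}) + "\n")
print(score)
```

The discipline is four rules that fit on a card. Shopify built a performance tool. What they actually shipped is a decision-making protocol.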
