I Built 200 CLIs for My AI. Here's What Actually Matters.


A Chinese article by 归藏 went around this week arguing that CLI is becoming the de facto AI plugin format. It surveys Lark CLI, Google Workspace CLI, Stripe CLI, and ElevenLabs CLI, and concludes that vendors should bundle execution, MCP support, and docs in one cross-platform package. Ship a binary, not a server.

I read it and felt validated. I run a personal AI system with 442 capabilities: 201 CLI effectors, 39 MCP tools, 202 skills. CLI-first is not a trend I am following. It is a constraint I have been building under for months, with a written decision tree that determines when CLI wins and when it does not. The article gets the direction right. But “CLI is good” is the easy part. The hard part is knowing exactly when CLI is wrong, how to make CLI output work for agents, and what the article does not mention at all: the skill layer.

My system runs every new tool through a three-step decision tree. First: does the tool need cross-invocation mutable state that cannot live on the filesystem, such as in-memory sessions, persistent connections, open browser tabs, or streaming channels? If yes: MCP. Second: does the input schema have nested objects or arrays of structured records, meaning JSON objects inside JSON objects rather than strings and flags? If yes: MCP. Otherwise: CLI plus skill. Always. Two yes-or-no questions and a default, no judgment calls. In practice, 201 of my 240 tools are CLIs. The 39 that are MCP are genuinely stateful. The ratio is not accidental. The tiebreaker: CLI wraps into MCP with three lines of subprocess.run. MCP does not unwrap into CLI.
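
The decision tree above is small enough to write down as code. What follows is a minimal sketch, not the actual implementation: the ToolSpec fields and the function names are hypothetical, chosen to mirror the two questions. The call_cli helper shows the kind of body an MCP tool handler could have when it wraps an existing CLI with subprocess.run.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolSpec:
    # Hypothetical fields; each one answers a question from the decision tree.
    needs_in_memory_state: bool  # sessions, persistent connections, open tabs, streams
    has_nested_input: bool       # JSON objects inside JSON objects, arrays of records


def choose_interface(spec: ToolSpec) -> str:
    """Apply the two yes-or-no questions in order, then fall through to the default."""
    if spec.needs_in_memory_state:
        return "mcp"
    if spec.has_nested_input:
        return "mcp"
    return "cli+skill"


def call_cli(argv: list[str]) -> str:
    """Run a CLI and return its stdout; an MCP tool handler can simply delegate here."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return proc.stdout


if __name__ == "__main__":
    print(choose_interface(ToolSpec(needs_in_memory_state=False, has_nested_input=False)))
    # Stand-in for a real effector: the Python interpreter itself.
    print(call_cli([sys.executable, "-c", "print('hello from a CLI')"]).strip())
```

The asymmetry is visible in call_cli: wrapping goes one way with a single subprocess call, while a stateful MCP server has no equivalent one-shot form.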

Having 200 CLIs is not inherently useful. Most of them would be garbage if they just printed text. My CLIs use porin, a library I built for structured agent-facing output. Every response is a JSON envelope with ok, result, error, fix, and next_actions. The next_actions array is the key: each tool response tells the agent exactly what commands to run next. The agent never constructs a command from memory. It follows suggestions. This kills the entire class of “agent does not know the right flags” failures. The honest number, though: only 51 of 201 effectors have the full porin envelope today. That is about 25 percent. The rest work because the model is good at parsing text. But “the model compensates” is not an architecture. The gap is real, and closing it is ongoing work.
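
To make the envelope concrete, here is a rough sketch built with the standard json module. It does not use porin and makes no claim about porin's real API. The five field names come from the description above; the helper names, the imaginary `notes` CLI, and the choice to model next_actions as a list of command strings are all assumptions for illustration.

```python
import json


def make_envelope(ok, result=None, error=None, fix=None, next_actions=None):
    """Build the five-field envelope as a JSON string, ready to print to stdout."""
    return json.dumps({
        "ok": ok,
        "result": result,
        "error": error,
        "fix": fix,
        # Assumed shape: a list of ready-to-run command strings.
        "next_actions": next_actions or [],
    })


def suggested_commands(raw: str) -> list[str]:
    """Agent side: parse the envelope and return the commands it suggests."""
    payload = json.loads(raw)
    return list(payload.get("next_actions", []))


# Hypothetical failure from an imaginary `notes` CLI.
failure = make_envelope(
    ok=False,
    error="notebook 'work' not found",
    fix="create the notebook first",
    next_actions=["notes create work", "notes list"],
)

if __name__ == "__main__":
    print(failure)
    print(suggested_commands(failure))
```

The point of the shape is that the agent reads next_actions instead of recalling flags, so a failed call still hands it a valid next step.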

What the article misses entirely is skills as auto-triggering documentation. Each of my 202 skills is a SKILL.md file with a description in its frontmatter. When a user asks for something, the agent matches the request against skill descriptions and auto-loads the relevant ones. No manual configuration. The user says “search the web” and the agent loads the rheotaxis skill, which tells it to use my rheotaxis CLI with specific flags and patterns. Skills are not static documentation. They are runtime behaviour. A skill can tell the agent to always pass certain flags, retry once on rate limits, pipe through jq before returning, or never use a particular flag in production. The agent reads the skill at invocation time and follows it. The skill changes, the behaviour changes. No code deployment required. This is the layer that turns 200 independent CLIs into a coherent system. Without it, you have a toolbox. With it, you have judgment about when and how to use each tool.
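
A toy version of description matching shows the mechanism. This is a sketch under loud assumptions: the real matching is done by the agent, not by word overlap; the frontmatter text for both skills is invented; and the regex parser below handles only flat `key: value` lines, not full YAML.

```python
import re

# Hypothetical SKILL.md contents; the descriptions are made up for this example.
SKILLS = {
    "rheotaxis": "---\nname: rheotaxis\ndescription: Search the web and fetch pages\n---\nUse the rheotaxis CLI.\n",
    "calendar": "---\nname: calendar\ndescription: Create and list calendar events\n---\nUse the calendar CLI.\n",
}


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read flat key: value pairs from a leading --- block."""
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_skills(request: str, skills: dict[str, str], threshold: float = 0.5) -> list[str]:
    """Return skill names whose description covers enough of the request's words."""
    wanted = tokens(request)
    if not wanted:
        return []
    hits = []
    for name, text in skills.items():
        described = tokens(parse_frontmatter(text).get("description", ""))
        score = len(wanted & described) / len(wanted)
        if score >= threshold:
            hits.append((score, name))
    return [name for _, name in sorted(hits, reverse=True)]


if __name__ == "__main__":
    print(match_skills("search the web", SKILLS))
```

Because behaviour lives in the file the agent loads, editing the SKILL.md body changes what the agent does on the next invocation without touching the CLI.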

The article celebrates companies packaging their APIs as CLIs. That is correct for companies. What it implies but does not say: most personal tools should not be packages. I designed an elegant brew install flow for my system once. Then I listed the hard problems and none of them had good answers because none of them needed to exist. My distribution test has four binary criteria. A tool gets its own package only if strangers would actually install it, it has an independent release cycle, different dependencies from the host system, and a clean import boundary. Fail any one and it is a directory in the monorepo. Of my 201 effectors, exactly three are standalone packages on PyPI. The other 198 are folders. That ratio is correct.
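
The distribution test reduces to a single all-or-nothing check. A minimal sketch, with hypothetical field and function names that mirror the four criteria:

```python
from dataclasses import dataclass, astuple


@dataclass
class DistributionCheck:
    # One boolean per criterion; the names are illustrative, not from a real tool.
    strangers_would_install: bool
    independent_release_cycle: bool
    different_dependencies: bool
    clean_import_boundary: bool


def placement(check: DistributionCheck) -> str:
    """A tool earns its own package only if every criterion holds."""
    return "standalone package" if all(astuple(check)) else "monorepo directory"


if __name__ == "__main__":
    # Fails one criterion, so it stays a folder.
    print(placement(DistributionCheck(True, True, False, True)))
```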

Three things are still actively painful. Discovery is grep: at 200 tools it works, at 500 it will not. Structured output coverage is low, at about 25 percent. And skill trigger matching is fragile: it relies on fuzzy description matching that sometimes loads the wrong skill.

If you are building CLI tooling for AI agents, the output contract matters more than the execution. Skills are the missing layer between tools and judgment. Do not package what does not need packaging. And pick the reversible direction: CLI wraps to MCP cheaply, while the reverse is a rewrite.