Terry Li

A Chinese article by 归藏 (归藏的AI工具箱) went around this week: "为什么一夜之间大家都在做CLI?" ("Why is everyone suddenly building CLIs?"). It surveys Lark CLI, Google Workspace CLI, Stripe CLI, and ElevenLabs CLI, and concludes that CLI is becoming the de facto AI plugin format. Bundle execution, MCP protocol, and docs in one cross-platform package. Ship a binary, not a server.

I read it and felt validated. I run a personal AI system with 442 capabilities: 201 CLI effectors, 39 MCP tools, 202 skills. CLI-first is not a trend I’m following. It’s a constraint I’ve been building under for months, with a written decision tree that determines when CLI wins and when it doesn’t.

The article gets the direction right. But “CLI is good” is the easy part. The hard part is knowing exactly when CLI is wrong, how to make CLI output work for agents, and what the article doesn’t mention at all: the skill layer.

The decision tree

My system has a three-step binary test for every new tool. No spectrums, no “it depends.”

Step 1. Does the tool need cross-invocation mutable state that can’t live on the filesystem? In-memory sessions, persistent connections, open browser tabs, streaming channels. If yes: MCP. A persistent browser tab dies if the process dies. You need a long-running server.

Step 2. Does the input schema have nested objects or arrays of structured records? Not strings and flags, but JSON objects inside JSON objects. If --operations='[{"type":"transfer","amount":100,"metadata":{...}}]' is the natural input shape: MCP. The typed schema earns its keep.

Step 3. Otherwise: CLI + skill. Always.
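The three steps collapse into a tiny function. A minimal sketch (function and parameter names are mine, not from any real codebase):

```python
def choose_interface(needs_live_state: bool, has_nested_schema: bool) -> str:
    """Apply the three binary tests from the decision tree.

    needs_live_state: cross-invocation mutable state that can't live on
        the filesystem (sessions, open tabs, streaming channels).
    has_nested_schema: input is JSON objects inside JSON objects, not
        strings and flags.
    """
    if needs_live_state:
        return "mcp"  # Step 1: a long-running server owns the state
    if has_nested_schema:
        return "mcp"  # Step 2: the typed schema earns its keep
    return "cli"      # Step 3: CLI + skill, always


# A browser-automation tool holds open tabs, so it lands on MCP;
# a plain search tool with string flags lands on CLI.
print(choose_interface(needs_live_state=True, has_nested_schema=False))
print(choose_interface(needs_live_state=False, has_nested_schema=False))
```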

That’s it. Three binary tests. No judgment calls. In practice, 201 of my 240 tools are CLIs. The 39 that are MCP are genuinely stateful (browser automation, live log streaming, daemon processes). The ratio is not accidental.

The tiebreaker rule: CLI wraps into MCP with three lines of subprocess.run. MCP does not unwrap into CLI. That asymmetry is the whole argument. I wrote about this in The reversible direction.
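The wrapping direction can be sketched like this (a generic illustration, not the author's actual MCP server code; the wrapped CLI is assumed to emit JSON on stdout):

```python
import json
import subprocess


def run_cli_tool(cmd: list[str]) -> dict:
    """Expose any JSON-emitting CLI as a function an MCP server can register.

    This is the cheap direction: the CLI stays the source of truth and the
    MCP layer is a thin shell around it. The reverse (extracting a CLI from
    a stateful MCP server) has no equivalent three-liner.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr.strip()}
    return json.loads(proc.stdout)
```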

What makes CLI actually work for agents

Having 200 CLIs is not inherently useful. Most of them would be garbage if they just printed text. The article covers execution and packaging but doesn’t address output.

My CLIs use porin, a library I built for structured agent-facing output. Every response is a JSON envelope: ok, result, error, fix, next_actions. The next_actions array is the key. Each tool response tells the agent exactly what commands to run next. The agent never constructs a command from memory. It follows suggestions. This kills the entire class of “agent doesn’t know the right flags” failures.
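For concreteness, an envelope in the shape described above (porin is a private library, so this is the field layout from the post rendered as plain JSON, with hypothetical command strings in `next_actions`):

```python
import json

# Illustrative response from a hypothetical `mytool search` invocation.
# The five fields (ok, result, error, fix, next_actions) come from the post;
# everything inside them is made up for the example.
envelope = {
    "ok": True,
    "result": {"matches": 2},
    "error": None,
    "fix": None,
    "next_actions": [
        "mytool show --id 1 --json",
        "mytool show --id 2 --json",
    ],
}
print(json.dumps(envelope, indent=2))
```

The agent never has to remember that `show` takes `--id` and `--json`; the previous response handed it the exact strings to run.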

Here’s the honest number though: only 51 of 201 effectors have porin structured output today. That’s 25%. The rest still emit text that the agent parses fuzzily. It works because the model is good at parsing text. But “the model compensates” is not an architecture. The gap between my aspirational design and the actual system is real, and closing it is ongoing work.

What the article misses: skills as auto-triggering documentation

The article frames CLI as “execution + MCP + docs.” The docs part gets one sentence. In my system, the docs are the most important layer.

Each of my 202 skills is a SKILL.md file with a description in its frontmatter. When a user asks for something, the agent matches the request against skill descriptions and auto-loads the relevant ones. No manual configuration. No “install this plugin.” The user says “search the web” and the agent loads the rheotaxis skill, which tells it to use my rheotaxis CLI with specific flags and patterns.

Skills are not static documentation. They’re runtime behavior. A skill can tell the agent “always pass --json”, “retry once on 429”, “pipe through jq before returning”, “never use this flag in production.” The agent reads the skill at invocation time and follows it. The skill changes, the behavior changes. No code deployment required.
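Putting the two paragraphs above together, a skill file might look like this (a sketch assembled from the behaviors the post describes, not a real SKILL.md from the system):

```markdown
---
description: Search the web for current information, news, or documentation
---
# rheotaxis — web search

Use the `rheotaxis` CLI for all web search requests.

- Always pass `--json`.
- Retry once on HTTP 429.
- Pipe results through `jq` before returning.
- Never use experimental flags in production.
```

The frontmatter `description` is what the agent matches the user's request against; the body is the runtime behavior it follows once the skill loads.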

This is the layer that turns 200 independent CLIs into a coherent system. Without it, you have a toolbox. With it, you have judgment about when and how to use each tool.

The distribution test nobody applies

The article celebrates companies packaging their APIs as CLIs. That’s correct for companies. What it implies but doesn’t say: most personal tools should NOT be packages.

I designed an elegant brew install flow for my system once. Then I listed the hard problems: polyglot packaging across Python/Rust/Go, absolute hook paths that break on install, bootstrapping private repos for strangers, merge conflicts on upgrade. None of these had good answers because none of them needed to exist. No one had asked for it. I wrote about this in Why I didn’t package my AI organism.

My distribution test has four binary criteria. A tool gets its own package ONLY if: (1) strangers would actually install it, not just theoretically could; (2) it has an independent release cycle; (3) different dependencies from the host system; (4) clean import boundary. Fail any one and it’s a directory in the monorepo.
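The test is conjunctive, which is the point: one failed criterion ends the discussion. As a sketch (parameter names are my paraphrase of the four criteria):

```python
def should_package(strangers_will_install: bool,
                   independent_release_cycle: bool,
                   different_dependencies: bool,
                   clean_import_boundary: bool) -> bool:
    """All four criteria must hold; failing any one means:
    the tool stays a directory in the monorepo."""
    return all([
        strangers_will_install,
        independent_release_cycle,
        different_dependencies,
        clean_import_boundary,
    ])
```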

Of my 201 effectors, exactly 3 are standalone packages on PyPI. The other 198 are folders. That ratio is correct.

What’s still broken

I don’t want to paint this as a solved problem. Three things are actively painful.

Discovery is grep. When I need to find a tool, I grep. There’s a proteome search command that scans effectors, but the underlying mechanism is string matching against file metadata. No semantic search, no usage-based ranking, no “tools similar to X.” At 200+ tools, grep still works. At 500, it won’t.

Structured output coverage is low. 25% of effectors have the full porin envelope. The rest work because Claude Code is forgiving with text parsing. This is technical debt with a clear paydown path (add porin to each CLI as I touch it) but the gap is real.

Skill trigger matching is fragile. Skills auto-load based on description matching against the user’s request. This is fundamentally fuzzy. Sometimes the wrong skill loads. Sometimes the right skill doesn’t. I’ve added a hook-based nudge layer to compensate, but it’s a patch on an imprecise foundation.

What actually matters

If you’re building CLI tooling for AI agents, the article is right that CLI is the format. Here’s what I’d add from building 200 of them:

The output contract matters more than the execution. A CLI that prints text works today because models are good at parsing. A CLI that returns a JSON envelope with next_actions works tomorrow because the contract is explicit. Build the envelope.

Skills are the missing layer. A CLI without a skill is a tool without judgment. The skill tells the agent when to use it, how to use it, and what to do with the result. This is not documentation. It’s runtime behavior that auto-loads on context match.

Don’t package what doesn’t need packaging. The urge to share is strong. Apply the distribution test first. Most tools are personal infrastructure. That’s fine. A folder in a monorepo is the right answer 98% of the time.

Pick the reversible direction. If you’re choosing between CLI and MCP and you genuinely can’t decide, CLI wins. Not because it’s better. Because you can change your mind.