Terry Li

Note (same-day correction): An earlier version of this post framed “knowledge as code” as a paradigm shift and claimed nobody treats skills with engineering rigor. Both framings were wrong. The prompt engineering industry has been doing this for years, and the skill-specific tooling is already comprehensive. I left the original observations in place but rewrote the conclusions to match what actually exists. The embarrassment is deserved.

Agent skills are having a moment. Repositories are clearing 100k stars. Every coding agent supports the SKILL.md format. I spent a day cloning the top repositories and reading their implementations. Here is what I took away.

Stars do not correlate with quality. The repository with 138k stars had a flagship feature that was a print statement logging a suggestion to stderr. The repository with 33k stars had a production-grade evaluation framework with proper statistical confidence intervals. The two numbers are measuring different things. Always read the implementation.

Structure beats volume. Sixty-nine rules with YAML frontmatter, a build pipeline, and a structural validator are worth more than 248 unvalidated markdown files. The difference is not aesthetic. Structured rules can be filtered by impact, validated in CI, diffed individually in git, and compiled into different formats for different consumers. Unstructured files can only be read top to bottom.

Atomic rule files with frontmatter are a genuinely useful pattern at scale. Vercel’s React best-practices skill has 69 rules, each in its own markdown file with impact tags, compiled into a flat document by a TypeScript build script. This enables selective inclusion, structural validation, and test-case extraction. I adopted this pattern for my own coaching rules and it works. But it is one specific pattern for one specific problem — rule-heavy reference skills — not a paradigm shift.
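To make the pattern concrete, here is a minimal sketch of a rules-to-document compiler in the spirit of that build script. The frontmatter schema (`title`, `impact`) and the filtering logic are illustrative assumptions, not Vercel's actual implementation; a real pipeline would read the rule files from disk and use a proper YAML parser.

```typescript
// Sketch: compile atomic rule files (frontmatter + body) into one flat
// document, filtered by impact. Field names are illustrative, not the
// actual Vercel schema.
interface Rule {
  title: string;
  impact: "high" | "medium" | "low";
  body: string;
}

// Minimal frontmatter parser: expects "---\nkey: value\n---\nbody".
function parseRule(source: string): Rule {
  const match = source.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/);
  if (!match) throw new Error("missing frontmatter block");
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const i = line.indexOf(":");
    if (i > 0) fields[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return {
    title: fields["title"] ?? "(untitled)",
    impact: (fields["impact"] as Rule["impact"]) ?? "low",
    body: match[2].trim(),
  };
}

// Compile: keep only rules at or above a minimum impact, emit flat markdown.
function compile(sources: string[], minImpact: Rule["impact"]): string {
  const rank = { high: 2, medium: 1, low: 0 };
  return sources
    .map(parseRule)
    .filter((r) => rank[r.impact] >= rank[minImpact])
    .map((r) => `## ${r.title}\n\n${r.body}`)
    .join("\n\n");
}

const rules = [
  "---\ntitle: Memoize derived values\nimpact: high\n---\nUse useMemo for expensive derivations.",
  "---\ntitle: Prefer fragments\nimpact: low\n---\nAvoid wrapper divs.",
];
console.log(compile(rules, "medium"));
```

The point of the structure shows up in the last line: because impact lives in frontmatter rather than prose, "give me only the high-impact rules" is a one-argument filter instead of a manual read-through.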

The ecosystem for treating prompts as engineering artifacts is already mature. I thought I was writing about a missing tooling layer. I was not. Promptfoo is an open-source CLI used by OpenAI and Anthropic for prompt evaluation with CI/CD integration. Braintrust blocks merges when eval metrics fail. Langfuse provides OpenTelemetry-based LLM observability. DeepEval covers the evaluation framework space. The “prompts as pull requests” workflow is standard practice in 2026. PromptOps is a named discipline. Seventy-five percent of enterprises are expected to integrate generative AI this year.
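For a sense of what that workflow looks like, here is a minimal Promptfoo config sketch. The provider ID, file paths, and assertion values are placeholders I chose for illustration; consult Promptfoo's own documentation for the full schema.

```yaml
# promptfooconfig.yaml (sketch; paths and provider are placeholders)
description: Regression evals for a code-review skill
prompts:
  - file://prompts/apply-skill.txt
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      code: "const total = items.map(expensiveTransform).reduce(sum, 0);"
    assert:
      - type: contains
        value: useMemo
```

Run in CI, a config like this turns "did my prompt change break anything" from a vibe check into a failing build.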

Skill-specific tooling also exists. Anthropic’s skill-creator is the official tool for writing and evaluating skills. Minko Gechev’s skillgrade provides “unit tests for your agent skills.” Phil Schmid has a practical guide to testing methodology. OpenAI’s blog covers testing agent skills systematically with evals. The Anthropic-managed claude-plugins-official directory and the Claude Code Plugin Marketplace with plugin.json manifests are the distribution layer. SkillsMP claims 700,000+ agent skills.

Linters already exist. agnix has 385 rules covering CLAUDE.md, SKILL.md, AGENTS.md, hooks, and MCP configs, with auto-fix and LSP servers for VS Code, JetBrains, Neovim, and Zed. claudelint covers similar ground as an npm package. SkillCheck-Free has 30+ structural checks. These are real tools. I ran agnix against my 188-skill collection and it found 71 errors that my custom linter missed — unclosed XML tags, YAML parse errors, missing agent fields in frontmatter, shell-escaped backslashes flagged as Windows path separators. An hour of config tuning dropped it to zero.
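As a toy sketch of the class of checks these linters run, the following catches two of the error types agnix flagged in my collection: missing frontmatter fields and unclosed XML-style tags. This is my own simplification for illustration, not agnix's implementation, and it ignores self-closing tags and nesting order.

```typescript
// Toy linter: structural checks over a SKILL.md source string.
// Mirrors the class of errors agnix catches; not its implementation.
function lint(source: string, requiredFields: string[]): string[] {
  const errors: string[] = [];

  // Check 1: required fields inside the frontmatter block.
  const fm = source.match(/^---\n([\s\S]*?)\n---/);
  if (!fm) {
    errors.push("missing frontmatter block");
  } else {
    for (const field of requiredFields) {
      if (!new RegExp(`^${field}\\s*:`, "m").test(fm[1])) {
        errors.push(`frontmatter missing field: ${field}`);
      }
    }
  }

  // Check 2: every opened <tag> has a matching </tag>.
  const opens = [...source.matchAll(/<([a-zA-Z][\w-]*)>/g)].map((m) => m[1]);
  for (const tag of opens) {
    const opened = (source.match(new RegExp(`<${tag}>`, "g")) ?? []).length;
    const closed = (source.match(new RegExp(`</${tag}>`, "g")) ?? []).length;
    if (opened !== closed) errors.push(`unclosed tag: <${tag}>`);
  }
  return [...new Set(errors)];
}

const skill = "---\nname: review\n---\n<instructions>Do the thing.";
console.log(lint(skill, ["name", "agent"]));
```

Even this thirty-line version would have caught two of the error categories my hand-rolled linter missed; the real tools add hundreds more rules plus auto-fix, which is the argument for adopting them instead of building.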

Generators exist too. SkillForge provides wizard-based skill creation with template processing, documentation fetching via Context7, pattern learning from usage, and unused skill detection. It covers the “create new skills from your preferences” space comprehensively.

The right question is not “how do I treat skills with engineering rigor” — that question has been answered by an entire industry. The right question is “which existing tools do I use and how do they compose for my specific workflow.” For me that meant: agnix for validation (immediately useful, caught real bugs), Promptfoo for eval pipelines (future work when I add Incorrect/Correct code examples to my rules), and the official plugin marketplace format for anything I might want to share. The specific tools I built for myself — a rules-to-document compiler and a shared motif layer — are narrow conveniences for my particular workflow, not contributions to the ecosystem. That was a useful correction.

The meta-lesson is about the danger of writing confidently from partial information. I read five skill repositories carefully and thought I had the landscape. I did not. The landscape was one web search deeper, and it was a search I should have done earlier. Before claiming that a tooling layer is missing, check that it is actually missing. Before framing something as a paradigm shift, check whether the paradigm already shifted. The embarrassment of having to correct a post same-day is cheaper than the embarrassment of letting the post stand.