What I Found Evaluating 5 Agent Skill Repos
5 min read
Note (same-day correction): The original version of this post concluded that skill repositories needed more engineering rigor and that the tooling to provide it was largely missing. I evaluated five repositories and missed the broader ecosystem that surrounds them. Prompt engineering as a discipline already has mature testing frameworks, linters, eval platforms, and distribution infrastructure. The paragraphs below describe what I found in the repositories I actually read. The concluding sections acknowledge what I should have known before writing.
Agent skills are having a moment. The agentskills.io spec launched, repos are clearing 100k stars, every AI coding tool supports SKILL.md files. I spent a day cloning the top repositories and reading their implementations.
Stars do not correlate with quality. The repo with the best architecture had 33k stars. The one with the most marketing had a flagship feature that was a print statement logging a suggestion to stderr.
Vercel’s atomic rule files
The most useful pattern I found was in Vercel’s agent-skills repo. Their React best-practices skill has 69 rules, each in its own markdown file with YAML frontmatter declaring a title, impact level, and tags. A TypeScript build script compiles these into a flat document.
This lets you filter rules by impact, validate structure automatically, extract test cases from Incorrect/Correct code blocks, and diff individual rules in git. I adopted this pattern for my own coaching rules and it worked well for that specific workflow. But as I discovered later, this is one useful pattern for one specific problem — rule-heavy reference skills — not a general answer to how skills should be engineered.
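The compile step is simple enough to sketch. Below is a minimal, hypothetical Python version rather than Vercel's actual TypeScript build script: the frontmatter keys (`title`, `impact`) follow the pattern described above, but the file layout, parsing, and impact ordering are my assumptions.

```python
# Hypothetical sketch: compile atomic rule files into one flat document.
# Frontmatter keys and directory layout are assumptions, not Vercel's code.
from pathlib import Path


def parse_rule(text: str) -> tuple[dict, str]:
    """Split a rule file into its frontmatter dict and markdown body."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


def compile_rules(rule_dir: Path, min_impact: str = "low") -> str:
    """Concatenate rules whose impact meets the threshold, one section each."""
    order = {"low": 0, "medium": 1, "high": 2}
    sections = []
    for path in sorted(rule_dir.glob("*.md")):
        meta, body = parse_rule(path.read_text())
        if order.get(meta.get("impact", "low"), 0) >= order[min_impact]:
            sections.append(f"## {meta.get('title', path.stem)}\n\n{body}")
    return "\n\n".join(sections)
```

Because each rule is its own file, the same parse step that drives the build can drive validation and test extraction for free.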
The deploy skill is a design masterclass
Vercel’s deploy-to-vercel skill does something that should be standard: it gathers state before deciding what to do. Step 1 runs four checks in parallel. Is there a git remote? Is the project linked? Is the CLI authenticated? Which teams exist? Step 2 is a decision matrix mapping every combination of results to a specific deploy method, each self-contained with exact commands.
No ambiguity about which path to take. No “if you have X try Y but if that doesn’t work…” prose. Just exhaustive branching on known state.
PluginEval
The standout innovation across the five repos came from wshobson/agents, which includes PluginEval — a three-layer quality framework for measuring whether skills actually work. Layer 1 is static analysis. Layer 2 is an LLM judge evaluating trigger accuracy, orchestration fitness, output quality, and scope calibration. Layer 3 is Monte Carlo testing across 50-100 varied prompts with proper confidence intervals.
The composite score combines all three layers. The implementation is clean Python with proper statistics. This is the most rigorous skill quality assessment framework I found in any of the five repos. It is also, I later realized, one of several such frameworks in the broader ecosystem.
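Layer 3 is the statistically interesting part. A minimal sketch of Monte Carlo pass-rate scoring follows; the function names are mine, and the choice of a Wilson score interval is my assumption about what "proper confidence intervals" means here, not a claim about PluginEval's actual implementation.

```python
# Hedged sketch of Monte Carlo skill scoring with a Wilson score interval.
# Names and interval choice are assumptions, not PluginEval's actual code.
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin, center + margin)


def monte_carlo_score(skill, prompts, judge):
    """Run the skill on each varied prompt; judge() returns True on a pass."""
    passes = sum(judge(skill(p)) for p in prompts)
    return passes / len(prompts), wilson_interval(passes, len(prompts))
```

At 50-100 trials the interval is wide (80/100 passes gives roughly 0.71-0.87), which is exactly why reporting it matters: a bare pass rate at that sample size overstates the precision.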
What’s oversold
Everything-claude-code has 138k stars and markets a feature called /evolve that supposedly auto-extracts recurring session patterns into new skills. The implementation counts user messages and logs a suggestion to stderr. No extraction. No skill generation. No learning loop.
The same repo has a beautifully designed SQLite schema for tracking session state, skill runs, decisions, and governance events. The schema exists. The hooks that should populate the database write to markdown temp files instead.
Microsoft’s skills repo claims 132 skills with language-suffix naming. Only ten actually exist in the repository. The rest are broken symlinks to plugins.
Star counts measure discovery. Implementation quality measures utility.
What I missed
The five repositories I evaluated are not the ecosystem. They are a slice of it. The ecosystem includes:
Promptfoo, an open-source CLI used by OpenAI and Anthropic for prompt evaluation with full CI/CD integration. Braintrust, a SaaS platform that blocks pull request merges when evaluation metrics fail. Langfuse, open-source LLM observability built on OpenTelemetry. DeepEval, a comprehensive evaluation framework. These exist in the broader prompt engineering space. PromptOps is a named discipline in 2026.
For skills specifically: agnix has 385 rules across CLAUDE.md, SKILL.md, AGENTS.md, hooks, and MCP configs, with auto-fix and LSP servers for every major editor. claudelint provides similar validation as an npm package. Anthropic ships skill-creator as the official tool for writing and evaluating skills with a built-in eval framework. skillgrade provides unit tests for agent skills. The Claude Code Plugin Marketplace with plugin.json manifests is the official distribution layer. SkillsMP claims 700,000+ agent skills.
The meta-lesson
I read the repositories carefully. I did not read the ecosystem. Those are different things, and I conflated them.
The useful observations from this evaluation stand. Atomic rule files with impact frontmatter are a good pattern for knowledge-heavy skills. State-gather-then-branch is a good pattern for multi-path execution. PluginEval’s three-layer approach is a rigorous quality framework. The specific repositories I evaluated vary widely in implementation quality and some of their marketing exceeds what the code delivers.
What I should have concluded is different from what I originally wrote. The prompt engineering ecosystem has mature tooling for testing, validation, observability, and distribution. The skill-specific tooling is comprehensive and growing. The right question for a practitioner is not “how do I build engineering rigor into skills” — that question has been answered by an entire industry. The right question is which existing tools compose well for your specific workflow.
The direct lesson for me was an hour of running agnix against my own 188-skill collection, catching 71 errors I had not seen, and fixing them. That was the highest-leverage thing I did that day, and I almost missed it because I was writing about a problem I thought was unsolved. Read the landscape before you describe it.