What I Found Evaluating 5 Agent Skill Repos

5 Apr 2026 · 3 min read

A same-day correction before we begin. The original version of this post concluded that skill repositories needed more engineering rigour and that the tooling to provide it was largely missing. I evaluated five repositories and missed the broader ecosystem that surrounds them. Prompt engineering as a discipline already has mature testing frameworks, linters, eval platforms, and distribution infrastructure. What follows describes what I found in the repositories I actually read. The concluding sections acknowledge what I should have known before writing.

Agent skills are having a moment. The agentskills.io spec launched, repos are clearing 100k stars, and every AI coding tool supports SKILL.md files. I spent a day cloning the top repositories and reading their implementations. In this sample, stars did not correlate with quality. The repo with the best architecture had 33k stars. The one with the most marketing had a flagship feature that was a print statement logging a suggestion to stderr.

The most useful pattern I found was in Vercel’s agent-skills repo. Their React best-practices skill has sixty-nine rules, each in its own markdown file with YAML frontmatter declaring a title, impact level, and tags. A TypeScript build script compiles these into a flat document. This lets you filter rules by impact, validate structure automatically, extract test cases from code blocks, and diff individual rules in git. I adopted this pattern for my own coaching rules and it worked well for that workflow. As I discovered later, though, it is one useful pattern for one specific problem (rule-heavy reference skills), not a general answer.
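To make the rule-per-file idea concrete, here is a minimal sketch in Python. It is not Vercel's TypeScript build script. The three frontmatter fields come from the description above; the `Rule` class, the function names, the comma-separated tag format, and the hand-rolled `key: value` parsing are all my assumptions, and a real build would use a proper YAML parser.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rule:
    title: str
    impact: str
    tags: list[str]
    body: str


def parse_rule(text: str) -> Rule:
    """Split one rule file into frontmatter fields and a markdown body.

    Assumption: frontmatter is flat `key: value` lines between two `---`
    fences, and tags are comma-separated. Real YAML allows much more.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("rule file must start with a '---' frontmatter fence")
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("frontmatter fence is never closed") from None

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed frontmatter line: {line!r}")
        meta[key.strip()] = value.strip()

    # Structural validation: every rule must declare all three fields.
    missing = {"title", "impact", "tags"} - meta.keys()
    if missing:
        raise ValueError(f"missing frontmatter fields: {sorted(missing)}")

    tags = [t.strip() for t in meta["tags"].split(",") if t.strip()]
    body = "\n".join(lines[end + 1:]).strip()
    return Rule(meta["title"], meta["impact"], tags, body)


def compile_rules(sources: list[str], impact: str | None = None) -> str:
    """Compile rule files into one flat document, optionally filtered by impact."""
    rules = [parse_rule(source) for source in sources]
    if impact is not None:
        rules = [rule for rule in rules if rule.impact == impact]
    return "\n\n".join(f"## {r.title} ({r.impact})\n\n{r.body}" for r in rules)
```

Because each rule is parsed on its own, a malformed file fails the build with a specific error rather than silently corrupting the compiled document.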

Vercel’s deploy skill does something that should be standard: it gathers state before deciding what to do. Step one runs four checks in parallel — is there a git remote, is the project linked, is the CLI authenticated, which teams exist. Step two is a decision matrix mapping every combination of results to a specific deploy method, each self-contained with exact commands. No ambiguity about which path to take. No prose fallback chains. Just exhaustive branching on known state.
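The gather-then-branch shape can be sketched in a few lines of Python. This is an illustration of the pattern only: the method names in the table are hypothetical placeholders, not the commands in Vercel's skill, and I have left out the fourth check (which teams exist) to keep the table at eight rows.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Key order: (has_git_remote, is_linked, is_authenticated).
# Values are HYPOTHETICAL method names chosen for illustration.
DECISION_MATRIX = {
    (True, True, True): "push-to-git",
    (True, True, False): "push-to-git",
    (True, False, True): "link-then-push",
    (True, False, False): "login-then-link",
    (False, True, True): "cli-deploy",
    (False, True, False): "login-then-cli-deploy",
    (False, False, True): "link-then-cli-deploy",
    (False, False, False): "login-then-link",
}

# Exhaustiveness check: fail at import time if any combination is unmapped.
_unmapped = set(product([True, False], repeat=3)) - set(DECISION_MATRIX)
assert not _unmapped, f"unmapped states: {_unmapped}"


def gather_state(checks):
    """Run every check concurrently; return results in the order given."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check) for check in checks]
        return tuple(bool(future.result()) for future in futures)


def choose_method(checks):
    """Step one: gather state. Step two: look up the single matching method."""
    return DECISION_MATRIX[gather_state(checks)]
```

With stub checks, `choose_method([lambda: True, lambda: False, lambda: True])` returns `"link-then-push"`. The point of the table is that adding a new check forces you to fill in every new row, which prose instructions never do.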

The standout innovation across the five repos came from wshobson/agents, which includes PluginEval — a three-layer quality framework for measuring whether skills actually work. Static analysis, then an LLM judge evaluating trigger accuracy and output quality, then Monte Carlo testing across fifty to a hundred varied prompts with proper confidence intervals. The composite score combines all three layers. The implementation is clean Python with proper statistics.
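For a sense of what the third layer and the composite involve, here is a rough Python sketch. It is not PluginEval's code. I am assuming each trial reduces to pass or fail, I chose a Wilson score interval as one reasonable way to get a confidence interval on a pass rate, and the layer weights are made up for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def monte_carlo_pass_rate(run_trial, prompts):
    """Run one trial per prompt; return (pass_rate, ci_low, ci_high).

    `run_trial` is a hypothetical callable that takes a prompt and
    returns a truthy value when the skill behaved correctly.
    """
    results = [bool(run_trial(prompt)) for prompt in prompts]
    passes = sum(results)
    low, high = wilson_interval(passes, len(results))
    return passes / len(results), low, high


def composite_score(static, judge, monte_carlo, weights=(0.2, 0.3, 0.5)):
    """Weighted mean of the three layer scores, each in [0, 1].

    The default weights are an assumption, not PluginEval's values.
    """
    scores = (static, judge, monte_carlo)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("layer scores must be in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))
```

The interval is the part that matters: eighty passes out of a hundred prompts is a pass rate of 0.8, but the 95% Wilson interval runs from about 0.71 to 0.87, which is a more honest statement of what a hundred trials can tell you.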

What is oversold: everything-claude-code has 138k stars and markets a feature called /evolve that supposedly auto-extracts recurring session patterns into new skills. The implementation counts user messages and logs a suggestion to stderr. No extraction. No skill generation. No learning loop. The same repo has a beautifully designed SQLite schema for tracking session state that is never populated — the hooks write to markdown temp files instead. Microsoft’s skills repo claims 132 skills with language-suffix naming. Only ten actually exist in the repository. The rest are broken symlinks to plugins. Star counts measure discovery. Implementation quality measures utility.

The five repositories I evaluated are not the ecosystem. They are a slice of it. The ecosystem includes Promptfoo, Braintrust, Langfuse, DeepEval, agnix, claudelint, Anthropic’s official skill-creator, skillgrade, and the Claude Code Plugin Marketplace. Mature tooling exists for testing, validation, observability, and distribution.

The direct lesson for me came from an hour spent running agnix against my own 188-skill collection. It caught seventy-one errors I had not seen, and I fixed them. That was the highest-leverage thing I did that day, and I almost missed it because I was writing about a problem I thought was unsolved. Read the landscape before you describe it.