Claude Skills tutorial series — Part 3: Skill evals — measuring whether your skill actually works
From a skill that "feels right" to a skill that measurably works. Writing test scenarios, baseline measurement, trigger tests, and the eval runner inside /skill-creator. Third of four short modules.
You now have a working skill — but does it actually work well? And does it keep working when Claude itself gets upgraded or when you tweak the skill? This third, optional part shows how to evaluate a skill systematically: test scenarios, trigger tests, a baseline measurement, and the eval runner that /skill-creator sets up for you. This is the step from Maturity Level 4 ("feels right") to Level 5 ("measurably right").
Why measure? The 7 maturity levels, in brief
Skills move through seven maturity levels, each one stacked on the previous. You don't have to build everything — a simple helper doesn't need to go past L3. But for skills that are critical (full pipelines, gates, daily workflows) you want to go beyond "feels right":
| Level | What it adds | For whom |
|---|---|---|
| L1 Anatomy | SKILL.md with frontmatter, numbered steps, guardrails | Everyone |
| L2 Golden Rule | Optimal description, progressive disclosure, <500 lines | Anyone who wants the skill to trigger well |
| L3 Foundations | Built on proven patterns, with examples/ or references/ | Skills that are more than a quick utility |
| L4 Personalisation | Contains business-specific knowledge (your ADRs, standards, domain) | Conduction-specific skills |
| L5 Measurement | Has 3+ evals, trigger tests, and last_validated is set | This part lives here |
| L6 Self-improvement | Maintains learnings.md, consolidates periodically | Skills you use intensively |
| L7 Workforce | Orchestrates sub-agents or sits in a workflow chain | Hydra-style pipelines |
This part takes you from L4 to L5. Research shows that ~80% of community skills make the output worse; the 20% that work were built by domain experts with iterative evaluation. Measurement is what makes the difference.
The problem: what you don't measure, you don't know
A skill that "feels right" but has never been measured can have three hidden problems:
- No baseline: you don't know whether the skill is better than Claude without the skill. Maybe it adds nothing.
- Unknown auto-triggering: you've invoked it successfully twice with /, but Claude never picks it up in normal conversation. Or it picks it up far too often.
- Regression after an edit: you tweak the description to fix scenario A and accidentally break scenario B.
Evals solve all three.
The three kinds of measurement
| Measurement | Answers the question | Output |
|---|---|---|
| Eval scenarios | Does the skill do what it's supposed to do on realistic prompts? | grading.json — pass/fail per scenario |
| Trigger tests | Does it auto-trigger when it should, and stay quiet when it shouldn't? | should_trigger / should_not_trigger pass percentage |
| Baseline | Is the skill better than Claude without the skill? | A side-by-side comparison |
You need all three for a clean L5 claim.
The evals.json format
In your skill folder, create a subfolder evals/ and put evals.json inside it:
{
  "skill_name": "git-status-summary",
  "version": "1.0.0",
  "created": "2026-05-15",
  "last_validated": null,
  "evals": [
    {
      "id": 1,
      "prompt": "Give me an overview of what has changed in my working tree",
      "expected_output": "Summary grouped by staged / unstaged / untracked with file paths",
      "files": [],
      "expectations": [
        "uses the three-group format from SKILL.md",
        "calls git status --porcelain, not the plain git status",
        "shows file path AND change type per line",
        "omits empty groups"
      ]
    },
    {
      "id": 2,
      "prompt": "What's currently staged for commit?",
      "expected_output": "Only the Staged group is shown when nothing else has changes",
      "files": [],
      "expectations": [
        "shows Staged group with correct file count",
        "does not invent files not in git status",
        "does not run git add or any write commands"
      ]
    },
    {
      "id": 3,
      "prompt": "Show me my git status",
      "expected_output": "Triggers the skill and produces the canonical three-group summary",
      "files": [],
      "expectations": [
        "skill auto-activates on this prompt",
        "uses the three-group format and omits any group that is empty"
      ]
    }
  ],
  "trigger_tests": {
    "should_trigger": [
      "What has changed in my working tree?",
      "Show me my git status",
      "Give me an overview of staged and unstaged",
      "What files are currently modified?",
      "Summarise my working tree",
      "Is there anything changed that I forgot to commit?",
      "What's the state of my repo right now?",
      "Which files are ready to commit?",
      "Give me a working tree status overview",
      "What does my git tree look like right now?"
    ],
    "should_not_trigger": [
      "Show me the diff for src/App.vue",
      "Commit my changes",
      "Push to the remote branch",
      "What does git rebase do?",
      "Explain the difference between git merge and rebase",
      "Help me write a commit message",
      "Run git log for me",
      "How do I resolve this merge conflict?",
      "Pull the latest changes from main",
      "What is a feature branch?"
    ]
  }
}
Three things to notice:
- 3+ evals is the minimum for L5. Write them from realistic use — what would a teammate actually type?
- 10+ should-trigger and 10+ should-not-trigger prompts. Duller to write than you'd expect, but indispensable — this is where over-eagerness and blind spots become visible.
- last_validated: null means the skill has not been measured yet. The runner fills this field in itself when it runs; as long as it's null, your skill doesn't count as measured.
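The minimums above can be checked mechanically before you start a run. The following is a minimal sketch, not part of /skill-creator: the function name check_evals is hypothetical, and it assumes only the field names shown in the example evals.json.

```python
MIN_EVALS = 3             # "3+ evals" from the text
MIN_TRIGGER_PROMPTS = 10  # "10+" per trigger list from the text


def check_evals(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the minimums are met.

    `data` is the parsed content of evals.json (hypothetical helper,
    field names taken from the example file).
    """
    problems = []

    evals = data.get("evals", [])
    if len(evals) < MIN_EVALS:
        problems.append(f"only {len(evals)} evals, need {MIN_EVALS}+")
    for scenario in evals:
        if not scenario.get("expectations"):
            problems.append(f"eval {scenario.get('id')} has no expectations")

    triggers = data.get("trigger_tests", {})
    for key in ("should_trigger", "should_not_trigger"):
        count = len(triggers.get(key, []))
        if count < MIN_TRIGGER_PROMPTS:
            problems.append(f"only {count} {key} prompts, need {MIN_TRIGGER_PROMPTS}+")

    if data.get("last_validated") is None:
        problems.append("last_validated is null: not measured yet")

    return problems
```

To use it, load the file with json.load and pass the resulting dict in. For the example file as written, the only reported problem would be the null last_validated.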
Running the eval runner
/skill-creator has a built-in eval runner. Open a Claude session in a repo containing your skill and type:
/skill-creator
In your first prompt, say something like: "I want to run the evals for my git-status-summary skill." /skill-creator recognises that you're in the evaluation phase and does four things:
- Spawns two subagents in parallel — one with access to your skill, one without (baseline).
- Runs every eval prompt in both contexts, recording tokens and duration in timing.json.
- Checks the output against your expectations and writes pass/fail to grading.json.
- Opens a local benchmark viewer where you can compare both sides visually.
After the run, you'll find inside evals/:
evals/
├── evals.json ← your scenarios
├── timing.json ← per scenario: tokens + duration, with and without skill
├── grading.json ← per scenario: assertion pass/fail + evidence
└── (optionally) trigger-results.json
For the L5 claim you need all three: evals.json (written by you), timing.json (proof that the runner actually executed, rather than a static read-only simulation), and grading.json (proof that the assertions were scored). Also make sure that last_validated inside evals.json itself has been updated.
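A quick way to confirm that a run left all three files behind is sketched below. The file names come from the text; the helper name missing_run_artifacts is an assumption for illustration.

```python
from pathlib import Path

# File names come from the text; the helper itself is hypothetical.
RUN_ARTIFACTS = ("evals.json", "timing.json", "grading.json")


def missing_run_artifacts(evals_dir: str) -> list[str]:
    """Return the names of expected files that are absent from evals_dir."""
    folder = Path(evals_dir)
    return [name for name in RUN_ARTIFACTS if not (folder / name).is_file()]
```

Calling missing_run_artifacts("evals") from inside the skill folder returns an empty list when everything is in place.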
Reading the results
Three questions to ask yourself for every eval report:
1. Does the skill pass all expectations?
Open grading.json and look at which expectations come back pass: false. Every failed item is a concrete improvement opportunity: either your SKILL.md is missing an instruction, or the steps aren't specific enough, or an edge case is triggering that you didn't describe.
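If you would rather list the failures than scan the file by eye, something like the sketch below could help. The exact layout of grading.json is not specified here, so the structure is an assumption: a list of scenarios, each with an "id" and a list of expectations carrying "text" and "pass" keys. Adapt the key names to what your runner actually writes.

```python
def failed_expectations(scenarios: list[dict]) -> list[tuple[int, str]]:
    """Return (scenario id, expectation text) for every expectation that did not pass.

    Assumed (hypothetical) shape per scenario:
    {"id": 1, "expectations": [{"text": "...", "pass": False}]}
    """
    failures = []
    for scenario in scenarios:
        for expectation in scenario.get("expectations", []):
            # A missing "pass" key is treated as a failure, to be on the safe side.
            if not expectation.get("pass", False):
                failures.append((scenario["id"], expectation["text"]))
    return failures
```

Each tuple in the result is one candidate for the next iteration round.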
2. Is the skill better than baseline?
Compare the with-skill output next to the without-skill output in the benchmark viewer. Two outcomes are red flags:
- Skill = baseline — both produce roughly the same thing. Your skill adds nothing, except maybe some formatting. Consider whether it's worth having.
- Skill < baseline — Claude without the skill is better. This happens more often than you'd think; the skill can be over-prescriptive, or conflict with what Claude is already good at. Simplify the skill or pull it.
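The benchmark viewer gives you the visual comparison. If you also want a rough number, one option is to compare pass rates, assuming both the with-skill and the without-skill output were graded against the same expectations. The function names and the 5-point tolerance below are illustrative choices, not something the runner defines.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of expectations that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


def compare_to_baseline(with_skill: list[bool], without_skill: list[bool],
                        tolerance: float = 0.05) -> str:
    """Classify the skill as better than, roughly equal to, or worse than baseline.

    Each list holds one bool per expectation (True = passed).
    `tolerance` is an arbitrary band within which the two count as equal.
    """
    delta = pass_rate(with_skill) - pass_rate(without_skill)
    if delta > tolerance:
        return "skill > baseline"
    if delta < -tolerance:
        return "skill < baseline"
    return "skill = baseline"
```

A pass-rate comparison is cruder than reading both outputs side by side, so treat it as a first filter rather than a verdict.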
3. How do your trigger tests score?
The runner runs each should_trigger and should_not_trigger prompt in a fresh context and checks whether the skill auto-loads. Aim for 90%+ on both sides. Below 80%? Iterate on the description.
Reality check: multiple sources report ~50% auto-activation even for good skills, because Claude has a fixed context budget for skill descriptions. With large skill libraries, auto-trigger is inherently less reliable than explicitly typing /<name>. A low score doesn't always mean "skill is bad"; sometimes it means "skill is just one of many". Read eval results in context.
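The arithmetic behind the two trigger percentages is sketched below. It assumes you have one boolean per prompt saying whether the skill loaded; the function names and the label for the 80 to 90 band are assumptions, since the text only names the 90% target and the 80% floor.

```python
def trigger_scores(loaded_on_should: list[bool],
                   loaded_on_should_not: list[bool]) -> dict[str, float]:
    """Pass percentage for both trigger lists (both lists must be non-empty).

    loaded_on_should: one bool per should_trigger prompt (True = skill loaded).
    loaded_on_should_not: one bool per should_not_trigger prompt (True = skill loaded).
    """
    should = 100 * sum(loaded_on_should) / len(loaded_on_should)
    # For should_not_trigger, staying quiet is the pass, so count the False values.
    should_not = 100 * loaded_on_should_not.count(False) / len(loaded_on_should_not)
    return {"should_trigger": should, "should_not_trigger": should_not}


def verdict(score: float) -> str:
    """Map a pass percentage onto the thresholds named in the text."""
    if score >= 90:
        return "on target"
    if score < 80:
        return "iterate on the description"
    # The text does not name this middle band; the label is an assumption.
    return "acceptable, room to improve"
```

With 10 prompts per list, each prompt is worth 10 percentage points, so a single miss already puts you exactly on the 90% line.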
Iteration cycle
Running one round of evals is just the start. The pattern:
write evals → run → read results → identify 1 weakness → fix → re-run
One improvement per round — not four at once. Otherwise you won't know afterwards which fix made which difference.
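To see what a single fix changed, you can diff the pass/fail results of two rounds. The sketch below assumes you have reduced each round to a mapping from expectation text to pass/fail; diff_rounds is a hypothetical helper, not part of the runner.

```python
def diff_rounds(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval rounds keyed by expectation text (True = passed).

    Only expectations present in both rounds are compared.
    """
    shared = before.keys() & after.keys()
    return {
        "fixed": sorted(k for k in shared if not before[k] and after[k]),
        "regressed": sorted(k for k in shared if before[k] and not after[k]),
    }
```

An entry under "regressed" is the scenario-B breakage described earlier: the fix for one weakness broke something that used to pass.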
When is a skill "measured enough"?
For an L5 claim you need at minimum:
- ✅ 3+ evals in evals.json
- ✅ 10+ should-trigger + 10+ should-not-trigger prompts
- ✅ last_validated is filled in (not null)
- ✅ Both timing.json and grading.json present (proof that the runner actually ran, not just a read-only review)
- ✅ A baseline measurement (skill vs. no skill)
- ✅ At least one iteration cycle completed based on the eval results
All of that checks out? Then you can label the skill as L5-mature. Want to go further — and for skills you use daily or that steer other skills, you do — then L6 (learnings.md with a capture loop) is next, and eventually L7 (multi-agent orchestration). Part 4 of the track walks through those last two levels, plus the dashboard that lets you monitor your whole skill library for maturity at once. For ~80% of skills, L5 is the natural endpoint; part 4 is for the skills that aren't.
Test yourself
Four short questions to check whether you got this part. Each one comes with a hint and an answer.
1. Why is a baseline measurement important in skill evals?
Hint
A skill can pass all your expectations and still add no value. What would you need to know to be sure?
Answer
Because a skill that hits every expectation can still add nothing relative to Claude without the skill. Or worse: Claude without the skill can be better (over-prescriptive skills squeeze Claude's own judgement).
The baseline runs every eval prompt twice: once with access to your skill, once without. Only then can you say "this skill makes difference X". Without a baseline, you're only measuring whether the skill is consistent — not whether it's valuable.
Three possible outcomes from a baseline comparison:
- Skill > baseline → the skill earns its place.
- Skill ≈ baseline → borderline case. Consider whether the gain (formatting, consistency) outweighs the context cost.
- Skill < baseline → remove the skill or fundamentally rethink it.
2. What's the difference between evals and trigger_tests in evals.json?
Hint
One measures whether the skill delivers good work. The other measures whether it shows up at the right moment at all.
Answer
- evals measure quality of execution: given that the skill is loaded, does it do what it should? Each scenario lists a few expectations that are checked against the output. Score: how many expectations pass?
- trigger_tests measure auto-activation: does Claude pick up the skill automatically on relevant prompts (should_trigger), and stay away on non-relevant prompts (should_not_trigger)?
A skill can score perfectly on evals but poorly on trigger_tests — then it works well when you type /<name>, but Claude never picks it up on its own. The reverse: a skill can always trigger but deliver poor output — then it pulls up a chair uninvited without adding value.
You need both for L5.
3. Which two files prove that the eval runner actually ran, and why both?
Hint
A static review of your evals.json isn't enough — you need proof that code was executed and that results were scored.
Answer
- timing.json proves execution: per scenario it records how many tokens were used and how long the run took. That can only happen after real execution; a read-only simulation can't fill this field.
- grading.json proves scoring: pass/fail per expectation, with evidence. This is the actual score of your skill.
grading.json alone isn't enough — you could theoretically generate it through a static read of SKILL.md and evals.json. Only with timing.json alongside it do you have hard evidence that the runner actually ran.
In the Conduction skill check, this is exactly what the script validates for L5: both files must exist, and last_validated must not be null.
4. You've run evals and one expectation fails consistently. What's your next step, and what should you explicitly not do?
Hint
Don't shotgun fixes. What's the most informative way to iterate?
Answer
Do: identify that one failed expectation, read in the benchmark viewer what the skill did instead, and touch one specific piece of SKILL.md (a step, a guardrail, or an output example). Then re-run — only then move on to the next weakness.
Don't: fix four at once, or immediately overhaul the description because "something's not right". With multiple changes at once you can't tell which one contributed. Overhauling a description shifts auto-triggering and execution behaviour at the same time — impossible to pull apart.
Rule of thumb: one eval cycle = one hypothesis = one fix. Boring? Maybe. But you learn faster from it than from "everything at once" attempts.
Next step
That was part 3 — you now know how to systematically measure and improve a skill. For the skills you really use intensively, there's one more part.
