Claude Skill Creator Moves from Vibes to Evidence

Until recently, creating skills for Claude felt a bit like trial and error. You write a skill. Run it a few times. If the output looks right, you keep it. If it does not, you tweak the prompt and try again. That approach works, but it is informal. There is no real structure for verifying whether a skill is actually improving the result.

Anthropic recently changed that.

The latest updates to the Skill Creator introduce something that will feel familiar to anyone who builds software: testing, benchmarking, and iteration. The difference is that this workflow now exists directly inside the Skill Creator.

Testing Skills Instead of Guessing

One of the most useful additions is built-in evaluations. Instead of manually checking a few outputs, I can now define tests that verify whether a skill behaves the way I expect. These tests measure things like success rate, execution time, and token usage.

That changes the workflow. A skill is no longer something that just feels helpful. It either improves the outcome or it does not.
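Anthropic has not published the internal shape of these evaluations, but the idea is easy to sketch. Below is a minimal, hypothetical harness (all names are illustrative, not the Skill Creator's API): each test case pairs a prompt with a check, and running the suite tallies pass count, wall-clock time, and a crude token proxy.

```python
import time
from dataclasses import dataclass

# Hypothetical eval shapes; the real Skill Creator defines these internally.
@dataclass
class EvalCase:
    prompt: str
    check: callable  # returns True if the output meets expectations

@dataclass
class EvalResult:
    passed: int = 0
    total: int = 0
    seconds: float = 0.0
    tokens: int = 0

def run_eval(skill, cases):
    """Run each test case through the skill and tally the metrics."""
    result = EvalResult(total=len(cases))
    for case in cases:
        start = time.perf_counter()
        output = skill(case.prompt)
        result.seconds += time.perf_counter() - start
        result.tokens += len(output.split())  # crude token proxy
        if case.check(output):
            result.passed += 1
    return result

# Toy "skill": uppercase a report title.
def skill(prompt):
    return prompt.upper()

cases = [EvalCase("quarterly report", lambda o: o.isupper())]
report = run_eval(skill, cases)
print(f"pass rate: {report.passed}/{report.total}")
```

The point is the shape of the loop, not the metrics themselves: once a skill has a pass rate, it can be compared against a baseline.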

A/B Testing Skills

Another interesting addition is the comparator agent. This agent acts as a judge. It compares outputs from different runs without knowing which one used the skill. For example, you can run the same prompt with and without a skill and let the comparator evaluate which output is better.

Before this update, most of us assumed our skills were helping. Now we can actually measure it.
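The key trick in the comparator is blinding: the judge sees both outputs but not which one used the skill. A minimal sketch of that idea, with the LLM judge stood in for by a plain scoring function (everything here is illustrative, not Anthropic's implementation):

```python
import random

def blind_compare(with_skill, without_skill, judge, rng=random):
    """Score two outputs in shuffled order so the judge cannot
    tell which run used the skill; return the winning label."""
    pair = [("with_skill", with_skill), ("without_skill", without_skill)]
    rng.shuffle(pair)  # hide provenance from the judge
    scores = {label: judge(text) for label, text in pair}
    return max(scores, key=scores.get)

# Toy judge: prefers the longer, more detailed answer.
judge = len
winner = blind_compare("detailed answer with steps", "short", judge)
print(winner)
```

Because the scores are keyed by label, shuffling changes only what the judge sees, not how the verdict is reported.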

Improving Skill Triggers

Skills only help if they activate at the right time. The new system analyzes the skill description and tests how reliably it triggers. If the trigger is weak or inconsistent, the Skill Creator can rewrite and refine the description automatically.

Under the hood it generates realistic prompts, runs the skill repeatedly, and separates those prompts into training and testing sets. It then iterates until the trigger becomes more reliable.

This helps solve two common problems: skills firing when they should not, and skills failing to activate when they should.
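The train/test split is the same held-out evaluation idea used everywhere in machine learning: tune the description against one set of prompts, then measure trigger reliability on prompts it never saw. A small sketch under that assumption (the prompt labels and the `triggers` stand-in for the model's routing decision are hypothetical):

```python
import random

def split_prompts(labeled, train_frac=0.8, seed=0):
    """Shuffle labeled prompts and split them into train/test sets."""
    rng = random.Random(seed)
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def trigger_rate(labeled, triggers):
    """Fraction of prompts where the skill fired exactly when it should.
    `triggers(prompt)` stands in for the model's routing decision."""
    hits = sum(triggers(p) == want for p, want in labeled)
    return hits / len(labeled)

# True = the skill should fire on this prompt.
labeled = [("make a pdf report", True), ("tell me a joke", False),
           ("export this as pdf", True), ("what's the weather", False)]
train, test = split_prompts(labeled)
rate = trigger_rate(test, lambda p: "pdf" in p)
```

Iterating on the description against `train` while watching `rate` on `test` guards against a description that merely memorizes the generated prompts.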

What Happens to Old Skills

Another thing worth thinking about is what happens to older skills.

As the underlying models improve, they often learn how to perform tasks that previously required a skill. When that happens, the skill can actually start holding the model back. Instead of helping, it may restrict what the model can naturally do.

Evaluations make this visible. If the base model starts passing your tests without the skill, it is probably time to remove it.
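That retirement decision reduces to a one-line comparison once both pass rates exist. A hypothetical helper (the margin and its default are my assumption, not anything Anthropic specifies):

```python
def should_retire(baseline_pass_rate, skill_pass_rate, margin=0.02):
    """Retire the skill when the base model matches or beats it
    on the same evals, within a small tolerance margin."""
    return baseline_pass_rate >= skill_pass_rate - margin

# Base model now passes 95% of the tests; the skill passes 94%.
print(should_retire(0.95, 0.94))
```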

How the Evaluation System Works

The evaluation workflow itself is fairly straightforward. First you define test prompts and specify what a successful output should look like. You can also include files and additional context if needed.

The system then launches several agents to run the tests. Some runs use the skill. Others do not. A comparator agent evaluates the results and checks how well each output matches the expectations you defined.

At the end you get a report showing pass rates, completion time, and token usage. There is also an HTML report that makes it easier to review the outputs and give feedback.
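Aggregating per-run records into those headline numbers is straightforward; a sketch with illustrative field names (not the actual report format):

```python
from statistics import mean

def summarize(runs):
    """Collapse per-run records into the report's headline metrics:
    pass rate, average completion time, average token usage."""
    return {
        "pass_rate": mean(r["passed"] for r in runs),  # bools average to a rate
        "avg_seconds": mean(r["seconds"] for r in runs),
        "avg_tokens": mean(r["tokens"] for r in runs),
    }

runs = [
    {"passed": True, "seconds": 2.1, "tokens": 480},
    {"passed": False, "seconds": 3.4, "tokens": 910},
]
print(summarize(runs))
```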

A More Mature Workflow

The biggest change here is not a single feature. It is the workflow. Skill development is moving away from intuition and toward testing and evidence.

Write the skill. Test it. Measure the result. Improve it or delete it.

That loop will feel very familiar to anyone who builds software.