AI Skill Creation Tool | Automated Testing & Benchmarking

⚡️ Anthropic has expanded the functionality of its skill creation tool.

Now, the skill development process incorporates elements of engineering culture—testing, benchmarking, and iterative work—without the need to write code.

The Skill-creator has been enhanced with automated tests, comparison metrics, and A/B experiments, enabling creators to assess how well a skill performs before deploying it in real-world scenarios.

🟡 The main tool is evals (automatic quality assessments).

The creator defines test queries and describes the desired outcome. The Skill-creator then runs each test twice in parallel: once with the skill enabled and once with it disabled.

An independent comparison agent then grades the responses blind, without knowing which one came from the skill-enabled run, and reports right away whether the skill gives a noticeable improvement.
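
To make the blind comparison idea concrete, here is a minimal Python sketch. It is not Anthropic's implementation: the names blind_compare, Verdict, and longer_answer_wins are hypothetical, and the toy judge (which simply prefers the longer answer) stands in for a real grading agent. The point it illustrates is that the judge only ever sees the neutral labels A and B, assigned in random order.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


@dataclass
class Verdict:
    prompt: str
    winner: str  # "with_skill", "without_skill", or "tie"


def blind_compare(prompt: str, with_skill: str, without_skill: str,
                  judge: Judge, rng: random.Random) -> Verdict:
    """Show both answers to the judge under shuffled A/B labels."""
    candidates = [("with_skill", with_skill), ("without_skill", without_skill)]
    rng.shuffle(candidates)  # the judge cannot infer the source from position
    (label_a, answer_a), (label_b, answer_b) = candidates
    choice = judge(prompt, answer_a, answer_b)
    # Map the judge's neutral label back to the real source; anything else is a tie.
    winner = {"A": label_a, "B": label_b}.get(choice, "tie")
    return Verdict(prompt, winner)


def longer_answer_wins(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for a grading agent: prefers the longer answer."""
    if len(answer_a) == len(answer_b):
        return "tie"
    return "A" if len(answer_a) > len(answer_b) else "B"


verdict = blind_compare(
    "Extract the table from page 3",
    with_skill="Full table with all twelve rows and headers",
    without_skill="Partial table",
    judge=longer_answer_wins,
    rng=random.Random(0),
)
print(verdict.winner)  # with_skill
```

Shuffling the labels before each comparison is what keeps the grading blind: whichever slot the skill-enabled answer lands in, the verdict is mapped back to its source only after the judge has chosen.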

Internal testing by Anthropic showed an increase in PDF skill accuracy from 6/8 to 7/8, and Excel skill accuracy from 6/8 to the maximum possible 8/8.

For a more detailed evaluation of each run, there is a separate benchmarking mode: it displays the percentage of successful cases, task completion time, and token consumption.

For example, with the PDF skill for handling incomplete forms and multi-page tables, success rate increased from 40% to 100%, while time spent remained unchanged.
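
The three benchmark metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Run record and summarize function are hypothetical names, and the individual runs below are made-up sample data chosen so the pass rates match the 40% and 100% figures quoted above. The real number of test cases, timings, and token counts are not given in the announcement.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Run:
    passed: bool    # did the output meet the described outcome?
    seconds: float  # task completion time
    tokens: int     # tokens consumed by the run


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Aggregate runs into pass rate, mean time, and mean token use."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_seconds": mean(r.seconds for r in runs),
        "mean_tokens": mean(r.tokens for r in runs),
    }


# Made-up sample data: 2 of 5 pass without the skill, 5 of 5 pass with it.
without_skill = [Run(True, 30.0, 4000), Run(True, 30.0, 4000),
                 Run(False, 30.0, 4000), Run(False, 30.0, 4000),
                 Run(False, 30.0, 4000)]
with_skill = [Run(True, 30.0, 4500) for _ in range(5)]

print(summarize(without_skill)["pass_rate"])  # 0.4
print(summarize(with_skill)["pass_rate"])     # 1.0
```

Reporting time and tokens next to the pass rate shows what an accuracy gain costs, which a pass rate alone would hide.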

🟡 Evals are especially useful in the long term.

If the base model starts passing the tests with the skill turned off, that signals the model has absorbed the capability and the skill can be retired. All results are stored locally and can be fed into a continuous integration (CI) pipeline.
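
As a rough sketch of how such a check might look in a CI job, the Python below reads a stored result and recommends whether to keep, retire, or revise a skill. The JSON layout, the field names, and the recommend function are all assumptions made for illustration; the announcement does not describe the actual on-disk format.

```python
import json


def recommend(results_json: str, retire_threshold: float = 1.0) -> str:
    """Return "retire", "keep", or "revise" from a stored eval result.

    Assumes a hypothetical layout:
    {"without_skill": {"pass_rate": ...}, "with_skill": {"pass_rate": ...}}
    """
    data = json.loads(results_json)
    baseline = data["without_skill"]["pass_rate"]
    skilled = data["with_skill"]["pass_rate"]
    if baseline >= retire_threshold:
        return "retire"  # the base model already passes without the skill
    if skilled > baseline:
        return "keep"    # the skill still adds measurable value
    return "revise"      # the skill does not help; rework it


stored = '{"without_skill": {"pass_rate": 0.4}, "with_skill": {"pass_rate": 1.0}}'
print(recommend(stored))  # keep
```

A CI step could fail the build on "revise" and open a reminder on "retire", so the decision does not depend on someone rereading old benchmark output.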

🟡 The update also improved the trigger mechanism.

Claude now activates skills solely based on a short textual description in the system prompt.

The Skill-creator analyzes these descriptions for matches with test prompts and suggests adjustments to reduce false positives or misses.

In internal testing, trigger accuracy improved for most public skills: 5 out of 6 cases.
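
False positives and misses can be counted with a small Python helper like the one below. It is a sketch, not the Skill-creator's own scoring: trigger_report is a hypothetical name, and each case is assumed to be a pair of booleans recording whether the skill should have triggered on a test prompt and whether it actually did.

```python
from typing import Dict, List, Tuple


def trigger_report(cases: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Score trigger decisions given (should_trigger, did_trigger) pairs."""
    if not cases:
        raise ValueError("need at least one case")
    # False positive: the skill fired on a prompt it should have ignored.
    false_positives = sum(1 for should, did in cases if did and not should)
    # Miss: the skill stayed silent on a prompt it should have handled.
    misses = sum(1 for should, did in cases if should and not did)
    correct = len(cases) - false_positives - misses
    return {
        "accuracy": correct / len(cases),
        "false_positives": false_positives,
        "misses": misses,
    }


# Made-up sample: two correct decisions, one miss, one false positive.
sample = [(True, True), (True, False), (False, False), (False, True)]
print(trigger_report(sample))
# {'accuracy': 0.5, 'false_positives': 1, 'misses': 1}
```

Keeping the two error types separate matters when rewording a description: broadening it tends to reduce misses but raise false positives, and narrowing it does the opposite.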

All of these features are now available in the web interface and on the Cowork platform. Claude Code users can update the plugin or install it manually from the repository.

