# claude-agent-sdk/skill-comparison (Claude Skill Comparison) You can run this example with: ```bash npx promptfoo@latest init --example claude-agent-sdk/skill-comparison cd claude-agent-sdk/skill-comparison ``` ## Overview This example is the Claude Agent SDK companion to [the agent-skill testing guide](https://www.promptfoo.dev/docs/guides/test-agent-skills). It compares two versions of a `review-standards` skill against the same authentication review tasks. - `fixtures/v1` and `fixtures/v2` each carry their own `.claude/skills/review-standards/SKILL.md` and an identical `src/auth.ts` so any score difference is attributable to the skill text. - The `skills:` filter (Claude Agent SDK 0.2.120+) scopes the session to a single skill and auto-allows the `Skill` tool. - A YAML anchor (`&reviewSchema`) feeds the same `output_format` JSON schema into both providers, which is the Claude SDK's equivalent of Codex's `output_schema` and is what stops the model from wrapping JSON in Markdown fences. - The eval verifies `skill-used` and scores both issue recall and precision via a JavaScript assertion. Precision is what penalizes a skill that ignores the user's requested scope and reports unrelated findings. ## Run From this directory: ```bash ANTHROPIC_API_KEY=… promptfoo eval --no-cache ``` `defaultTest.options.disableVarExpansion: true` keeps each `expectedIssues` array intact instead of [fanning it into one test case per element](https://www.promptfoo.dev/docs/configuration/test-cases#passing-arrays-to-assertions). Without it the example would run 6 cases instead of 4. The two skill versions are intentionally asymmetric: v1 caps reviews at one issue and only knows password hashing, while v2 also flags timing-unsafe token comparison and is told to respect the user's requested scope. The expected outcome is v1 passing only the narrower password-handling test (it lacks the timing-unsafe rule needed for the broad test) and v2 passing both tests because its scope rule keeps it accurate on the narrower one too.