---
sidebar_position: 66
title: Test Agent Skills
description: Compare Claude and Codex skill versions with Promptfoo evals that measure invocation, task quality, cost, latency, and trace evidence to choose a winner.
---

# Test Agent Skills

Skills are local instructions that teach an agent when and how to use a capability. When you have two versions of the same skill, you usually want to know two things:

1. Does the agent use the skill when the task calls for it?
2. Does that skill version lead to better work?

Promptfoo can answer both questions by running the same tasks against each version side by side. Keep the model, task files, and permissions the same, swap only the `SKILL.md`, and compare the results.

## Start With One Comparison

In this example, we're comparing two versions of a `review-standards` skill. Each version gets its own fixture directory with the same source file and a different copy of the skill:

```text
skill-eval/
├── promptfooconfig.yaml
└── fixtures/
    ├── v1/
    │   ├── .claude/skills/review-standards/SKILL.md
    │   └── src/auth.ts
    └── v2/
        ├── .claude/skills/review-standards/SKILL.md
        └── src/auth.ts
```

For Codex, use `.agents/skills/review-standards/SKILL.md` instead of `.claude/skills/...`. The rest of the comparison can stay the same.

## Compare Two Claude Skill Versions

We'll start with the [Claude Agent SDK provider](/docs/providers/claude-agent-sdk). The prompt asks for a short JSON review so the outputs are easy to score, and the two providers differ only in `working_dir`:

```yaml title="promptfooconfig.yaml"
description: Compare Claude skill versions

prompts:
  - '{{request}}'

# YAML anchor for the schema both providers enforce. The Claude Agent SDK's
# `output_format` is the equivalent of Codex's `output_schema` — it returns
# valid JSON without you having to ask for "JSON only" in the prompt or
# strip Markdown fences from the model's reply.
x-review-schema: &reviewSchema
  type: json_schema
  schema:
    type: object
    required: [summary, issues]
    additionalProperties: false
    properties:
      summary: { type: string }
      issues:
        type: array
        items:
          type: object
          required: [id, severity]
          additionalProperties: false
          properties:
            id: { type: string }
            severity: { type: string, enum: [high, medium, low] }

providers:
  - id: anthropic:claude-agent-sdk
    label: review-standards-v1
    config:
      model: claude-sonnet-4-6
      working_dir: ./fixtures/v1
      setting_sources: ['project']
      skills: ['review-standards']
      append_allowed_tools: ['Read', 'Grep', 'Glob']
      output_format: *reviewSchema
  - id: anthropic:claude-agent-sdk
    label: review-standards-v2
    config:
      model: claude-sonnet-4-6
      working_dir: ./fixtures/v2
      setting_sources: ['project']
      skills: ['review-standards']
      append_allowed_tools: ['Read', 'Grep', 'Glob']
      output_format: *reviewSchema
```

`setting_sources: ['project']` discovers `SKILL.md` files under `.claude/skills/`. The `skills:` filter (added in `@anthropic-ai/claude-agent-sdk` 0.2.120) narrows the session to a single skill and auto-allows the `Skill` tool, so it no longer needs to appear in `append_allowed_tools`. Pass `skills: 'all'` if you want every discovered skill enabled. On older SDK versions, drop `skills:` and add `'Skill'` back to `append_allowed_tools`.

Without `output_format`, Claude usually wraps short JSON answers in Markdown fences or a leading sentence, which makes downstream `JSON.parse()` brittle. Setting it on both providers — once via the `&reviewSchema` anchor — gives you reliable structured output without prompt gymnastics.
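The only thing that differs between the two fixtures is the skill file itself. Each `SKILL.md` opens with YAML frontmatter that names and describes the skill, followed by the review instructions in Markdown. A minimal sketch of that frontmatter, with a hypothetical description, looks like this:

```yaml title="fixtures/v1/.claude/skills/review-standards/SKILL.md (frontmatter)"
# Hypothetical frontmatter for the v1 fixture. The Markdown body that follows
# the closing --- carries the actual review instructions, and that body is the
# only part that differs between v1 and v2.
name: review-standards
description: Review source files against the team's security and code-quality standards.
```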
Because everything else is held constant, any meaningful difference in the results should come from the skill text itself.

Next, add a few tasks that represent how the skill will really be used:

```yaml
defaultTest:
  # By default, Promptfoo fans each YAML-list var into one test case per
  # element (matrix expansion), which would split the broad review test into
  # one case per issue. Disable that here so `expectedIssues` reaches the
  # assertion intact.
  options:
    disableVarExpansion: true

tests:
  - description: Finds both auth issues
    vars:
      request: Review src/auth.ts for password handling and token comparison issues.
      expectedIssues:
        - weak-password-hash
        - timing-unsafe-compare
  - description: Focuses on password handling only
    vars:
      request: Review src/auth.ts only for password handling issues.
      expectedIssues:
        - weak-password-hash
```

The first case asks for a broad review covering both topics. The second is narrower, which helps catch a skill that finds the right problems but ignores the user's requested scope. Together, the two cases let the eval reward a skill version that both knows the full set of issues _and_ respects the user's scope when asked. See [Passing Arrays to Assertions](/docs/configuration/test-cases#passing-arrays-to-assertions) for the matrix-expansion behavior `disableVarExpansion` opts out of.

## Add Assertions

Assertions tell Promptfoo what "better" means for this comparison. It helps to add them one layer at a time.

### Check That the Skill Was Used

Start by verifying that Claude actually invoked the skill:

```yaml
defaultTest:
  assert:
    - type: skill-used
      value: review-standards
```

Claude exposes `Skill` tool calls directly, and Promptfoo normalizes them into the [`skill-used`](/docs/configuration/expected-outputs/deterministic/#skill-used) assertion. This distinguishes a good answer that happened without the skill from one that was produced through the workflow you intended to test.

### Score the Output

Then score whether the review found the expected issues:

```yaml
defaultTest:
  assert:
    - type: javascript
      threshold: 0.7
      value: |
        // `output_format` hands us a parsed object; on Codex `output_schema`
        // returns a string, so wrap with `JSON.parse(output)` there.
        const expected = context.vars.expectedIssues;
        const found = (output.issues || []).map((issue) => issue.id);
        const hits = expected.filter((id) => found.includes(id));
        const recall = hits.length / expected.length;
        return {
          pass: recall >= 0.7,
          score: recall,
          reason: `matched ${hits.length}/${expected.length} expected issues`,
        };
```

The JavaScript assertion gives each response a score based on issue recall. In your own eval, replace this with the signal that matters for the skill: tests passed, required edits were made, policy checks were followed, or a rubric was satisfied.

### Add Secondary Signals

Once correctness is working, you can add supporting signals:

```yaml
defaultTest:
  assert:
    - type: cost
      threshold: 0.50
    - type: latency
      threshold: 120000
```

These checks are useful when two skill versions are both correct but one is much more expensive or slower.

If you want Promptfoo to mark the strongest output for each test case, add [`max-score`](/docs/configuration/expected-outputs/model-graded/max-score):

```yaml
defaultTest:
  assert:
    - type: max-score
      value:
        method: average
        threshold: 0.7
        weights:
          javascript: 4
          skill-used: 2
          cost: 0.5
          latency: 0.5
```

That weighting makes task quality the main signal, while still rewarding the version that routes correctly and stays within practical limits.
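Stacked together, the layers above form a single `defaultTest` block. Nothing here is new — it is the same options and assertions collected in one place so you can drop them into the config as a unit:

```yaml
defaultTest:
  options:
    disableVarExpansion: true
  assert:
    # Routing: did the agent invoke the skill at all?
    - type: skill-used
      value: review-standards
    # Quality: issue recall against the expected list for this test case.
    - type: javascript
      threshold: 0.7
      value: |
        const expected = context.vars.expectedIssues;
        const found = (output.issues || []).map((issue) => issue.id);
        const hits = expected.filter((id) => found.includes(id));
        const recall = hits.length / expected.length;
        return {
          pass: recall >= 0.7,
          score: recall,
          reason: `matched ${hits.length}/${expected.length} expected issues`,
        };
    # Practical limits: budget and responsiveness.
    - type: cost
      threshold: 0.50
    - type: latency
      threshold: 120000
    # Pick the strongest output per test case, weighted toward task quality.
    - type: max-score
      value:
        method: average
        threshold: 0.7
        weights:
          javascript: 4
          skill-used: 2
          cost: 0.5
          latency: 0.5
```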
## Use the Same Eval With Codex

To run the same comparison with [OpenAI Codex SDK](/docs/providers/openai-codex-sdk), keep the tests and scoring logic, and swap in provider blocks that use Codex's bare `output_schema` shape:

```yaml
x-review-schema: &reviewSchema
  type: object
  required: [summary, issues]
  additionalProperties: false
  properties:
    summary: { type: string }
    issues:
      type: array
      items:
        type: object
        required: [id, severity]
        additionalProperties: false
        properties:
          id: { type: string }
          severity: { type: string, enum: [high, medium, low] }

providers:
  - id: openai:codex-sdk
    label: review-standards-v1
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v1
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      output_schema: *reviewSchema
      cli_env:
        CODEX_HOME: ./codex-home
  - id: openai:codex-sdk
    label: review-standards-v2
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v2
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      output_schema: *reviewSchema
      cli_env:
        CODEX_HOME: ./codex-home
```

Codex discovers project skills from `.agents/skills/` under each `working_dir`. Its `skill-used` signal is inferred from successful reads of the matching `SKILL.md`, so keep `enable_streaming: true` while developing the eval if you want Promptfoo to collect that evidence.

The earlier JavaScript assertion should use `JSON.parse(output)` with Codex because `output_schema` keeps `output` as a JSON string rather than a parsed object.

The runnable [`skill-comparison` example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-codex-sdk/skill-comparison) uses the same approach, with a YAML anchor to share the schema between v1 and v2. There is a [matching Claude example](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk/skill-comparison) that uses `output_format` and the `skills:` filter so you can run the same comparison against either provider.

## Add Trace Evidence When Needed

For Claude Agent SDK, `skill-used` is usually enough to prove invocation because it is based on first-class `Skill` tool calls. If you need the raw call details, inspect the recorded tool calls directly:

```yaml
assert:
  - type: javascript
    value: |
      const calls = context.providerResponse?.metadata?.toolCalls || [];
      return calls.some((call) => call.name === 'Skill');
```

For Codex SDK, trace evidence can be useful when you want to see the workflow behind the final answer. Enable tracing and assert that Codex read the skill file:

```yaml
providers:
  - id: openai:codex-sdk
    config:
      model: gpt-5.5
      working_dir: ./fixtures/v2
      skip_git_repo_check: true
      sandbox_mode: read-only
      enable_streaming: true
      deep_tracing: true

tracing:
  enabled: true

tests:
  - assert:
      - type: trajectory:step-count
        value:
          type: command
          pattern: '*review-standards/SKILL.md*'
          min: 1
```

Use [Codex app-server](/docs/providers/openai-codex-app-server) when you need to test app-server-specific behavior such as explicit skill input items, approvals, or plugin metadata rather than just the skill outcome.

## Run the Eval

Run the config the same way you would any other Promptfoo eval:

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml
```

From there, add only the options that help answer your current question:

- Add `--repeat 3` when you want a better sample of nondeterministic agent behavior.
- Add `--no-cache` while iterating on the skill text so each run is fresh.
- Add `-o results.json` when you want to inspect or compare results outside the terminal.
- Run `npx promptfoo@latest view` when the web view is more useful than the CLI table.

In the web view, the comparison is easy to inspect side by side:

![Promptfoo web UI comparing two Codex skill versions](/img/docs/codex-skill-comparison.png)

## Decide Which Skill Wins

Start with the comparison that matters most for the skill. For a review skill, that might be "which version finds the right issues with the least noise?" For a code-writing skill, it might be "which version passes the tests most often?"

Then use supporting checks where they add value:

```yaml
# Did the agent use the intended skill?
- type: skill-used
  value: review-standards

# Did the output stay within a reasonable budget?
- type: cost
  threshold: 0.50

# Was the answer fast enough for the workflow?
- type: latency
  threshold: 120000
```

If two versions are close, rerun the eval with repeats and inspect the failures before choosing one. The better skill is the one that holds up across the tasks you care about, not the one that wins a single lucky run.
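If you would rather bake the repeats into the config than pass `--repeat` on every run, Promptfoo's `evaluateOptions` block can carry them. This is a minimal sketch, assuming your Promptfoo version supports `repeat` there; check the configuration reference for the exact fields available:

```yaml
# Repeat each test case so a single lucky or unlucky run doesn't pick the winner.
evaluateOptions:
  repeat: 3
```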