--- sidebar_position: 3 title: Claude Agent SDK description: 'Use Claude Agent SDK for evals with configurable tools, permissions, MCP servers, and more' --- # Claude Agent SDK This provider makes [Claude Agent SDK](https://docs.claude.com/en/api/agent-sdk/overview) available for evals through its [TypeScript SDK](https://docs.claude.com/en/api/agent-sdk/typescript). :::info The Claude Agent SDK was formerly known as the Claude Code SDK. It's still built on top of Claude Code and exposes all its functionality. ::: ## Provider IDs You can reference this provider using either: - `anthropic:claude-agent-sdk` (full name) - `anthropic:claude-code` (alias) ## Installation The Claude Agent SDK provider requires the `@anthropic-ai/claude-agent-sdk` package to be installed separately: ```bash npm install @anthropic-ai/claude-agent-sdk ``` :::note This is an optional dependency and only needs to be installed if you want to use the Claude Agent SDK provider. Note that Anthropic has released the claude-agent-sdk library with a [proprietary license](https://github.com/anthropics/claude-agent-sdk-typescript/blob/9f51899c3e04f15951949ceac81849265d545579/LICENSE.md). ::: ## Setup The easiest way to get started is with an Anthropic API key. You can set it with the `ANTHROPIC_API_KEY` environment variable or specify the `apiKey` in the provider configuration. Create Anthropic API keys [here](https://console.anthropic.com/settings/keys). Example of setting the environment variable: ```sh export ANTHROPIC_API_KEY=your_api_key_here ``` If Claude Agent SDK will authenticate through an existing local Claude Code session instead of `ANTHROPIC_API_KEY`, disable Promptfoo's upfront API key check: ```yaml providers: - id: anthropic:claude-agent-sdk config: apiKeyRequired: false ``` This is useful when you're using a local Claude Code binary with an active session, such as Claude Code monthly plans. 
Promptfoo will skip its preflight API key validation, but the SDK still needs to be able to authenticate on its own.

## Other Model Providers

Besides the Anthropic API, you can use AWS Bedrock and Google Vertex AI.

For AWS Bedrock:

- Set the `CLAUDE_CODE_USE_BEDROCK` environment variable to `true`:

  ```sh
  export CLAUDE_CODE_USE_BEDROCK=true
  ```

- Follow the [Claude Code Bedrock documentation](https://docs.claude.com/en/docs/claude-code/amazon-bedrock) to make credentials available to Claude Agent SDK.

For Google Vertex AI:

- Set the `CLAUDE_CODE_USE_VERTEX` environment variable to `true`:

  ```sh
  export CLAUDE_CODE_USE_VERTEX=true
  ```

- Follow the [Claude Code Vertex documentation](https://docs.claude.com/en/docs/claude-code/google-vertex-ai) to make credentials available to Claude Agent SDK.

## Quick Start

### Basic Usage

By default, Claude Agent SDK runs in a temporary directory with no tools enabled, using the `default` permission mode. This makes it behave similarly to the standard [Anthropic provider](/docs/providers/anthropic/). It has no access to the file system (read or write) and can't run system commands.

```yaml title="promptfooconfig.yaml"
providers:
  - anthropic:claude-agent-sdk

prompts:
  - 'Output a python function that prints the first 10 numbers in the Fibonacci sequence'
```

When your test cases finish, the temporary directory is deleted.

### With Working Directory

You can specify a working directory for Claude Agent SDK to run in:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./src

prompts:
  - 'Review the TypeScript files and identify potential bugs'
```

This lets you prepare a directory with files or sub-directories before running your tests. Relative `working_dir` values are resolved from the directory containing the config file.

By default, when you specify a working directory, Claude Agent SDK is given read-only access to the directory.
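The resolution rule for relative `working_dir` values can be sketched in a few lines. This is a minimal illustration of the rule as stated above, not Promptfoo's actual implementation; the `resolveWorkingDir` helper and the example paths are hypothetical.

```typescript
import * as path from 'node:path';

// Hypothetical helper (not a Promptfoo API): models how a relative
// `working_dir` is resolved against the directory containing the config file.
function resolveWorkingDir(configFilePath: string, workingDir: string): string {
  // If `workingDir` is already absolute, path.resolve ignores the base directory.
  return path.resolve(path.dirname(configFilePath), workingDir);
}

// Example paths are made up for illustration.
const resolved = resolveWorkingDir('/repo/evals/promptfooconfig.yaml', './src');
console.log(resolved);
```

Under this reading, `./src` is anchored to the config file's folder rather than to the directory the eval command is launched from, so a config keeps working when invoked from elsewhere in the repository.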
### With Side Effects You can also allow Claude Agent SDK to write to files, run system commands, call MCP servers, and more. Here's an example that will allow Claude Agent SDK to both read from and write to files in the working directory. It uses `append_allowed_tools` to add tools for writing and editing files to the default set of read-only tools. It also sets `permission_mode` to `acceptEdits` so Claude Agent SDK can modify files without asking for confirmation. ```yaml providers: - id: anthropic:claude-agent-sdk config: working_dir: ./my-project append_allowed_tools: ['Write', 'Edit', 'MultiEdit'] permission_mode: 'acceptEdits' prompts: - 'Refactor the authentication module to use async/await' ``` > **Note:** when using `acceptEdits` and tools that allow side effects like writing to files, you'll need to consider how you will reset the files after each test run. See the [Managing Side Effects](#managing-side-effects) section for more information. ## Supported Parameters | Parameter | Type | Description | Default | | ------------------------------------ | ---------------- | ------------------------------------------------------------------------------------------------------------ | ------------------------ | | `apiKey` | string | Anthropic API key | Environment variable | | `apiKeyRequired` | boolean | Require Promptfoo to find an Anthropic API key before calling the SDK. Set to `false` for local SDK auth. 
| `true` | | `working_dir` | string | Directory for file operations | Temporary directory | | `model` | string | Primary model to use (passed to Claude Agent SDK) | Claude Agent SDK default | | `fallback_model` | string | Fallback model if primary fails | Claude Agent SDK default | | `max_turns` | number | Maximum conversation turns | Claude Agent SDK default | | `max_thinking_tokens` | number | Maximum tokens for thinking | Claude Agent SDK default | | `max_budget_usd` | number | Maximum cost budget in USD for the agent execution | None | | `task_budget` | object | Token budget for pacing tool use: `{total: N}` | None | | `permission_mode` | string | Permission mode: `default`, `plan`, `acceptEdits`, `bypassPermissions`, `dontAsk`, `auto` | `default` | | `allow_dangerously_skip_permissions` | boolean | Required safety flag when using `bypassPermissions` mode | false | | `thinking` | object | Thinking config: `{type: 'adaptive'}`, `{type: 'enabled', budgetTokens: N}`, or `{type: 'disabled'}` | Model default | | `effort` | string | Response effort level: `low`, `medium`, `high`, `xhigh` (Opus 4.7+), `max` | `high` | | `agent` | string | Named agent for the main thread (must be defined in `agents` or settings) | None | | `session_id` | string | Custom session UUID (cannot be used with `continue`/`resume` unless `fork_session` is set) | Auto-generated | | `title` | string | Custom title for a new session (skips auto-generation from the first message) | Auto-generated | | `debug` | boolean | Enable verbose debug logging | false | | `debug_file` | string | Write debug logs to this file path (implicitly enables debug) | None | | `betas` | string[] | Enable beta features (e.g., `['context-1m-2025-08-07']` for 1M context) | None | | `custom_system_prompt` | string | Replace default system prompt | None | | `append_system_prompt` | string | Append to default system prompt | None | | `exclude_dynamic_sections` | boolean | Strip per-user dynamic sections from the preset 
prompt so it stays cacheable across runs | false | | `tools` | array/object | Base set of built-in tools (array of names or `{type: 'preset', preset: 'claude_code'}`) | None | | `custom_allowed_tools` | string[] | Replace default allowed tools | None | | `append_allowed_tools` | string[] | Add to default allowed tools | None | | `allow_all_tools` | boolean | Allow all available tools | false | | `disallowed_tools` | string[] | Tools to explicitly block (overrides allowed) | None | | `additional_directories` | string[] | Additional directories the agent can access (beyond working_dir) | None | | `ask_user_question` | object | Automated handling for AskUserQuestion tool (see [Handling AskUserQuestion](#handling-askuserquestion-tool)) | None | | `mcp` | object | MCP server configuration | None | | `strict_mcp_config` | boolean | Only allow configured MCP servers | true | | `cache_mcp` | boolean | Enable caching when MCP is configured (for deterministic MCP tools) | false | | `setting_sources` | string[] | Where SDK looks for settings, CLAUDE.md, and slash commands | None (disabled) | | `plugins` | array | Local [plugins](#plugins) to load for the session | None | | `skills` | string[]/`'all'` | Filter which discovered [skills](#testing-skills) load into the session (auto-allows the `Skill` tool) | None (no filtering) | | `plan_mode_instructions` | string | Custom workflow instructions when `permission_mode` is `plan` | None | | `output_format` | object | Structured output configuration with JSON schema | None | | `agents` | object | Programmatic agent definitions for custom subagents | None | | `hooks` | object | Event hooks for intercepting tool calls and other events | None | | `include_partial_messages` | boolean | Include partial/streaming messages in response | false | | `include_hook_events` | boolean | Include hook lifecycle events in output stream | false | | `forward_subagent_text` | boolean | Forward subagent text/thinking blocks (default: only 
tool_use/tool_result are emitted) | false | | `tool_config` | object | Per-tool configuration (e.g., `askUserQuestion.previewFormat`) | None | | `prompt_suggestions` | boolean | Enable AI-predicted next prompts after each turn | false | | `agent_progress_summaries` | boolean | Enable periodic AI progress summaries for subagents | false | | `settings` | string/object | Additional [settings](#settings) (file path or inline object) | None | | `managed_settings` | object | Policy-tier [settings](#settings) the SDK loads above user/project layers (for embedders enforcing policy) | None | | `can_use_tool` | function | Callback forwarded to Claude Agent SDK's `canUseTool` option (programmatic only) | None | | `on_elicitation` | function | Callback for MCP elicitation requests (programmatic only) | Auto-decline | | `resume` | string | Resume from a specific session ID | None | | `fork_session` | boolean | Fork from an existing session instead of continuing | false | | `continue` | boolean | Continue an existing session | false | | `enable_file_checkpointing` | boolean | Track file changes for rewinding to previous states | false | | `persist_session` | boolean | Save session to disk for later resumption | true | | `sandbox` | object | Sandbox settings for command execution isolation | None | | `permission_prompt_tool_name` | string | MCP tool name to use for permission prompts | None | | `executable` | string | JavaScript runtime: `node`, `bun`, or `deno` | Auto-detected | | `executable_args` | string[] | Arguments to pass to the JavaScript runtime | None | | `extra_args` | object | Additional CLI arguments (keys without `--`, values as strings or null for flags) | None | | `env` | object | Extra environment variables to forward to the SDK subprocess (e.g. 
`OTEL_*`, `CLAUDE_CODE_ENABLE_TELEMETRY`) | None | | `path_to_claude_code_executable` | string | Path to a custom Claude Code executable | Built-in | | `spawn_claude_code_process` | function | Custom spawn function for VMs/containers (programmatic only) | Default spawn | ## Models Model selection is optional, since Claude Agent SDK uses sensible defaults. When specified, models are passed directly to the Claude Agent SDK. ```yaml providers: - id: anthropic:claude-agent-sdk config: model: claude-opus-4-6 fallback_model: claude-sonnet-4-5-20250929 ``` Claude Agent SDK also supports a number of [model aliases](https://docs.claude.com/en/docs/claude-code/model-config#model-aliases), which can also be used in the configuration. ```yaml providers: - id: anthropic:claude-agent-sdk config: model: sonnet fallback_model: haiku ``` Claude Agent SDK also supports configuring models through [environment variables](https://docs.claude.com/en/docs/claude-code/model-config#environment-variables). When using this provider, any environment variables you set will be passed through to the Claude Agent SDK. ## System Prompt Unless you specify a `custom_system_prompt`, the default Claude Code system prompt will be used. You can append additional instructions to it with `append_system_prompt`. Set `exclude_dynamic_sections: true` to strip per-user context (working directory, auto-memory, git status) from the preset prompt. This keeps the prompt-caching prefix static across runs, which matters for high-volume evals. The stripped context is re-injected as the first user message. Has no effect when `custom_system_prompt` is set. :::info Note that this differs slightly from the Claude Agent SDK's behavior when used independently of Promptfoo. The Agent SDK will _not_ use the Claude Code system prompt by default unless it's specified—it will instead use an empty system prompt if none is provided. 
If you want to use an empty system prompt with this provider, set `custom_system_prompt` to an empty string. ::: ## Tools and Permissions ### Default Tools If no `working_dir` is specified, Claude Agent SDK runs in a temporary directory with no access to tools by default. By default, when a `working_dir` is specified, Claude Agent SDK has access to the following read-only tools: - `Read` - Read file contents - `Grep` - Search file contents - `Glob` - Find files by pattern - `LS` - List directory contents ### Permission Modes Control Claude Agent SDK's permissions for modifying files and running system commands: | Mode | Description | | ------------------- | --------------------------------------------------------------------- | | `default` | Standard permissions | | `plan` | Planning mode | | `acceptEdits` | Allow file modifications | | `bypassPermissions` | No restrictions (requires `allow_dangerously_skip_permissions: true`) | | `dontAsk` | Deny permissions that aren't pre-approved (no prompts) | | `auto` | Use a model classifier to approve or deny permission prompts | :::warning Using `bypassPermissions` requires setting `allow_dangerously_skip_permissions: true` as a safety measure: ```yaml providers: - id: anthropic:claude-agent-sdk config: permission_mode: bypassPermissions allow_dangerously_skip_permissions: true ``` ::: When using `permission_mode: plan`, you can replace the default code-implementation workflow body with custom instructions via `plan_mode_instructions`. The CLI keeps the read-only enforcement preamble and ExitPlanMode protocol footer; only the workflow body changes: ```yaml providers: - id: anthropic:claude-agent-sdk config: permission_mode: plan plan_mode_instructions: | Produce a step-by-step migration plan only. Do not propose code edits; each step should describe the action and the file path it affects. 
``` ### Tool Configuration Customize available tools for your use case: ```yaml # Use all default Claude Code tools via preset providers: - id: anthropic:claude-agent-sdk config: tools: type: preset preset: claude_code # Specify exact base tools providers: - id: anthropic:claude-agent-sdk config: tools: - Bash - Read - Edit - Write # Disable all built-in tools providers: - id: anthropic:claude-agent-sdk config: tools: [] # Add tools to defaults providers: - id: anthropic:claude-agent-sdk config: append_allowed_tools: ['Write', 'Edit'] # Replace default tools entirely providers: - id: anthropic:claude-agent-sdk config: custom_allowed_tools: ['Read', 'Grep', 'Glob', 'Write', 'Edit', 'MultiEdit', 'Bash', 'WebFetch', 'WebSearch'] # Block specific tools providers: - id: anthropic:claude-agent-sdk config: disallowed_tools: ['Delete', 'Run'] # Allow all tools (use with caution) providers: - id: anthropic:claude-agent-sdk config: allow_all_tools: true ``` The `tools` option specifies the base set of available built-in tools, while `custom_allowed_tools`/`append_allowed_tools` and `disallowed_tools` filter from that base. ⚠️ **Security Note**: Some tools allow Claude Agent SDK to modify files, run system commands, search the web, and more. Think carefully about security implications before using these tools. [Here's a full list of available tools.](https://docs.claude.com/en/docs/claude-code/settings#tools-available-to-claude) ## MCP Integration Unlike the standard Anthropic provider, Claude Agent SDK handles MCP (Model Context Protocol) connections directly. 
Configuration is forwarded to the Claude Agent SDK: ```yaml providers: - id: anthropic:claude-agent-sdk config: mcp: servers: # HTTP-based server - url: https://api.example.com/mcp name: api-server headers: Authorization: 'Bearer token' # Process-based server - command: node args: ['mcp-server.js'] name: local-server strict_mcp_config: true # Only use configured servers (true by default) ``` For detailed MCP configuration, see [Claude Code MCP documentation](https://docs.claude.com/en/docs/claude-code/mcp). ## Setting Sources By default, the Claude Agent SDK provider does not look for settings files, CLAUDE.md, or slash commands. You can enable this by specifying `setting_sources`: ```yaml providers: - id: anthropic:claude-agent-sdk config: setting_sources: ['project', 'local'] ``` Available values: - `user` - User-level settings - `project` - Project-level settings - `local` - Local directory settings ## Plugins [Plugins](https://code.claude.com/docs/en/plugins) extend the agent with additional skills, agents, hooks, and MCP servers. While `setting_sources` discovers skills from the standard settings hierarchy (project/local/user), plugins are self-contained directories that bundle capabilities together and namespace their skills—mirroring how marketplace-installed plugins work. ```yaml providers: - id: anthropic:claude-agent-sdk config: working_dir: ./my-project plugins: - type: local path: ./my-plugin append_allowed_tools: ['Skill', 'Read'] ``` :::note Only the `local` type is currently supported. Relative paths in `path` resolve against the config file's directory. 
::: ### Plugin Structure A plugin is a directory containing a `.claude-plugin/plugin.json` manifest: ```text my-plugin/ ├── .claude-plugin/ │ └── plugin.json └── skills/ └── code-review/ └── SKILL.md ``` The manifest defines the plugin's name and description: ```json title="my-plugin/.claude-plugin/plugin.json" { "name": "my-plugin", "description": "A plugin that provides code review skills" } ``` ### Skill Namespacing Skills from plugins are namespaced with the plugin name. For example, a `standards-check` skill in a plugin named `project-standards` becomes `project-standards:standards-check`. Use this namespaced name when asserting on skill invocations: ```yaml assert: - type: skill-used value: project-standards:standards-check ``` ### Plugins vs Setting Sources Both `plugins` and `setting_sources` can provide skills, but they serve different purposes: - **`setting_sources`**: Discovers skills from the standard settings hierarchy—project, local, and user-level `.claude/skills/` directories. Skills are not namespaced. - **`plugins`**: Loads self-contained plugin directories, mirroring how marketplace-installed plugins work. Skills are namespaced with the plugin name (`plugin:skill`). You can use both together — skills from both sources are available in the same session. ## Testing Skills [Agent Skills](https://platform.claude.com/docs/en/agent-sdk/skills) are reusable capabilities that extend Claude's functionality. They are defined as `SKILL.md` files and can be tested using the Claude Agent SDK provider. Skills can be loaded via `setting_sources` (from the standard settings hierarchy) or from [plugins](#plugins). ### Enabling Skills There are two steps to enable skills: **discovery** and **filtering**. **Discovery** brings skills into the session — load them either via `setting_sources` (scans the `.claude/skills/` directory hierarchy) or via [`plugins`](#plugins). Skills aren't discovered by default. **Filtering** narrows which discovered skills are usable. 
Use the `skills` option (added in SDK 0.2.120) — pass `'all'` to use every discovered skill, or a string array to allow only specific names. Setting `skills` also auto-allows the `Skill` tool, so you don't need to add it to `allowed_tools`/`append_allowed_tools` yourself. ```yaml providers: - id: anthropic:claude-agent-sdk config: working_dir: ./my-project setting_sources: ['project'] # Discover from .claude/skills/ skills: all # Or: ['code-review', 'test-generator'] ``` If you don't set `skills`, you must add `'Skill'` to allowed tools manually for the model to invoke them: ```yaml providers: - id: anthropic:claude-agent-sdk config: working_dir: ./my-project setting_sources: ['project'] append_allowed_tools: ['Skill'] ``` `skills` is a context filter, not a sandbox: unlisted skills are hidden from the model's listing and rejected by the `Skill` tool, but their files remain on disk and stay reachable via `Read`/`Bash`. Don't store secrets in skill files. ### How Skills Are Discovered Skills are automatically discovered at startup from the configured `setting_sources` directories. The SDK scans for `SKILL.md` files in subdirectories of `.claude/skills/`: ```text my-project/ └── .claude/ └── skills/ ├── code-review/ │ └── SKILL.md └── test-generator/ └── SKILL.md ``` Claude automatically invokes the relevant skill when a task matches the skill's description in its frontmatter. ### Testing Skill Invocation Promptfoo normalizes Claude `Skill` tool invocations into `response.metadata.skillCalls`, so skill evals can use the same `skill-used` assertion style as Codex. The underlying `Skill` tool calls are still available in [`response.metadata.toolCalls`](#tool-call-tracking) when you need the raw tool payload. 
```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project']
      skills: ['code-review'] # Only allow this skill; auto-allows the Skill tool
      append_allowed_tools: ['Read', 'Write']

prompts:
  - 'Review the authentication module for security issues'

tests:
  - assert:
      # Check that a specific skill was invoked
      - type: skill-used
        value: code-review
```

### Checking Available Skills

You can verify that skills are loaded by asking Claude to list them. This relies on Claude's free-text response, so use a flexible assertion:

```yaml
prompts:
  - 'List all available skills by name'

tests:
  - assert:
      - type: icontains
        value: 'code-review' # Expected skill name
```

:::note

Because the response is free text, substring assertions such as `icontains` can be fragile. For more reliable testing, assert on the skill invocation itself (see [Testing Skill Invocation](#testing-skill-invocation)).

:::

### Testing Restrictions for CI

For consistent testing in CI/CD environments, restrict discovery to project-level skills only:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project'] # Only team-shared skills, exclude personal
      append_allowed_tools: ['Skill', 'Read', 'Bash']
      permission_mode: 'acceptEdits'
```

This ensures tests don't depend on user-specific skills that may not be present in CI.
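The effect of restricting `setting_sources` can be modeled in a few lines. The sketch below illustrates the discovery and filtering rules described in this section under simplifying assumptions; it is not the SDK's code. The `DiscoveredSkill` type, the `visibleSkills` helper, and the `my-personal-notes` skill are hypothetical, and plugin-provided skills are left out.

```typescript
type SettingSource = 'user' | 'project' | 'local';

// Hypothetical shape: a skill found on disk, tagged with the settings
// layer it came from.
interface DiscoveredSkill {
  name: string;
  source: SettingSource;
}

// Models the two steps: discovery (by `setting_sources`) and then
// filtering (by `skills`). Returns the skill names the model can see.
// It does not model whether the `Skill` tool itself is allowed.
function visibleSkills(
  onDisk: DiscoveredSkill[],
  settingSources: SettingSource[],
  skills?: 'all' | string[],
): string[] {
  // Discovery: only skills from the enabled sources are loaded.
  const discovered = onDisk.filter((s) => settingSources.includes(s.source));

  // Filtering: unset or 'all' keeps every discovered skill.
  if (skills === undefined || skills === 'all') {
    return discovered.map((s) => s.name);
  }
  const allowList: string[] = skills;
  return discovered.filter((s) => allowList.includes(s.name)).map((s) => s.name);
}

const onDisk: DiscoveredSkill[] = [
  { name: 'code-review', source: 'project' },
  { name: 'test-generator', source: 'project' },
  { name: 'my-personal-notes', source: 'user' }, // present on a laptop, absent in CI
];

console.log(visibleSkills(onDisk, ['project']));
console.log(visibleSkills(onDisk, ['project'], ['code-review']));
```

With `setting_sources: ['project']`, the user-level skill never enters the session, so a run on a developer machine and a run in CI see the same set.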
### Example: Complete Skills Testing Configuration ```yaml title="promptfooconfig.yaml" providers: - id: anthropic:claude-agent-sdk config: working_dir: ./my-project setting_sources: ['project'] append_allowed_tools: ['Skill', 'Read', 'Write', 'Bash'] permission_mode: 'acceptEdits' prompts: - 'Generate unit tests for the UserService class' tests: - assert: # Verify the test-generator skill was invoked - type: skill-used value: test-generator # Verify tests were generated - type: icontains value: 'describe(' ``` For more information about creating skills, see the [Claude Code skills documentation](https://code.claude.com/docs/en/skills). ## Budget Control Limit the maximum cost of an agent execution with `max_budget_usd`: ```yaml providers: - id: anthropic:claude-agent-sdk config: max_budget_usd: 0.50 ``` The agent will stop execution if the cost exceeds the specified budget. ## Task Budget Control how the model paces its tool use within a token budget using `task_budget`: ```yaml providers: - id: anthropic:claude-agent-sdk config: task_budget: total: 50000 ``` The `total` field sets the token budget for the task. The model uses this to pace its tool use—for example, being more selective about which tools to invoke as the budget is consumed. ## Additional Directories Grant the agent access to directories beyond the working directory: ```yaml providers: - id: anthropic:claude-agent-sdk config: working_dir: ./project additional_directories: - /shared/libs - /data/models ``` ## Structured Output Get validated JSON responses by specifying an output schema: ```yaml providers: - id: anthropic:claude-agent-sdk config: output_format: type: json_schema schema: type: object properties: analysis: type: string confidence: type: number required: [analysis, confidence] ``` When `output_format` is configured, the response will include structured output that conforms to the schema. 
The structured output is available in:

- `output` - The parsed structured output (when available)
- `metadata.structuredOutput` - The raw structured output value

:::tip

For evals that depend on parsing JSON from the model's reply, prefer `output_format` over asking for JSON in the prompt and then running `is-json` / `JSON.parse()`. Without it, Claude commonly wraps short JSON answers in Markdown fences or a leading sentence, which makes downstream parsers brittle. With it, the response arrives as a parsed object, so a JavaScript assertion can read fields directly.

This is the Claude Agent SDK's analogue of the OpenAI Codex SDK's [`output_schema`](/docs/providers/openai-codex-sdk#structured-output). The idea is the same, but the wrapper shape differs slightly: `{type: 'json_schema', schema: {...}}` here versus a bare schema object on Codex. The [Test Agent Skills guide](/docs/guides/test-agent-skills) shows both side by side.

:::

## Session Management

Continue or fork existing sessions for multi-turn interactions. The two provider entries below are alternatives; use one or the other:

```yaml
providers:
  # Continue an existing session
  - id: anthropic:claude-agent-sdk
    config:
      resume: 'session-id-from-previous-run'
      continue: true

  # Or fork from an existing session
  - id: anthropic:claude-agent-sdk
    config:
      resume: 'session-id-to-fork'
      fork_session: true
```

Session IDs are returned in the response and can be used to continue conversations across eval runs.

### Disabling Session Persistence

By default, sessions are saved to disk (`~/.claude/projects/`) and can be resumed later.
For ephemeral or automated workflows where session history is not needed, disable persistence: ```yaml providers: - id: anthropic:claude-agent-sdk config: persist_session: false ``` ## File Checkpointing Track file changes during the session to enable rewinding to previous states: ```yaml providers: - id: anthropic:claude-agent-sdk config: enable_file_checkpointing: true working_dir: ./my-project append_allowed_tools: ['Write', 'Edit'] ``` When file checkpointing is enabled, the SDK creates backups of files before they are modified. This allows programmatic restoration to any previous state in the conversation. ## Beta Features Enable experimental features using the `betas` parameter: ```yaml providers: - id: anthropic:claude-agent-sdk config: betas: - context-1m-2025-08-07 ``` Currently available betas: | Beta | Description | | ----------------------- | -------------------------------------------------- | | `context-1m-2025-08-07` | Enable 1M token context window (Sonnet 4/4.5 only) | See the [Anthropic beta headers documentation](https://docs.anthropic.com/en/api/beta-headers) for more information. 
## Sandbox Configuration Run commands in an isolated sandbox environment for additional security: ```yaml providers: - id: anthropic:claude-agent-sdk config: sandbox: enabled: true autoAllowBashIfSandboxed: true network: allowLocalBinding: true allowedDomains: - api.example.com ``` Available sandbox options: | Option | Type | Description | | ----------------------------- | -------- | ---------------------------------------------------- | | `enabled` | boolean | Enable sandboxed execution | | `autoAllowBashIfSandboxed` | boolean | Auto-allow bash commands when sandboxed | | `allowUnsandboxedCommands` | boolean | Allow commands that can't be sandboxed | | `enableWeakerNestedSandbox` | boolean | Enable weaker sandbox for nested environments | | `excludedCommands` | string[] | Commands to exclude from sandboxing | | `failIfUnavailable` | boolean | Fail closed when sandbox dependencies are missing | | `ignoreViolations` | object | Map of command patterns to violation types to ignore | | `network.allowedDomains` | string[] | Domains allowed for network access | | `network.allowLocalBinding` | boolean | Allow binding to localhost | | `network.allowUnixSockets` | string[] | Specific Unix sockets to allow | | `network.allowAllUnixSockets` | boolean | Allow all Unix socket connections | | `network.httpProxyPort` | number | HTTP proxy port for network access | | `network.socksProxyPort` | number | SOCKS proxy port for network access | | `ripgrep.command` | string | Path to custom ripgrep executable | | `ripgrep.args` | string[] | Additional arguments for ripgrep | When `sandbox.enabled` is `true`, Claude Agent SDK defaults `failIfUnavailable` to `true`; set it to `false` only if you want the SDK to degrade gracefully when sandbox dependencies or platform support are missing. See the [Claude Code sandbox documentation](https://docs.anthropic.com/en/docs/claude-code/settings#sandbox-settings) for more details. 
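The fail-closed default described above can be modeled as a small decision function. This is a sketch of the documented default only, not the SDK's implementation: the `sandboxOutcome` helper and the `SandboxOutcome` labels are hypothetical, and treating "degrade gracefully" as "run without the sandbox" is an assumption.

```typescript
// Subset of the sandbox options from the table above.
interface SandboxConfig {
  enabled?: boolean;
  failIfUnavailable?: boolean;
}

// Hypothetical labels for what happens to a command.
type SandboxOutcome = 'run-sandboxed' | 'run-unsandboxed' | 'fail';

// Hypothetical model: decides the outcome given the config and whether the
// platform can actually provide a sandbox.
function sandboxOutcome(config: SandboxConfig, sandboxAvailable: boolean): SandboxOutcome {
  if (!config.enabled) {
    return 'run-unsandboxed'; // sandboxing was never requested
  }
  if (sandboxAvailable) {
    return 'run-sandboxed';
  }
  // Documented default: failIfUnavailable is true when the sandbox is enabled.
  const failClosed = config.failIfUnavailable ?? true;
  // Assumption: "degrading gracefully" means running without the sandbox.
  return failClosed ? 'fail' : 'run-unsandboxed';
}

console.log(sandboxOutcome({ enabled: true }, false));
console.log(sandboxOutcome({ enabled: true, failIfUnavailable: false }, false));
```

The practical consequence for evals is that an unsupported CI runner produces a visible failure by default instead of silently running commands outside the sandbox.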
## Settings Apply additional settings via a file path or inline object. These load into the "flag settings" layer, which has the highest priority among user-controlled settings: ```yaml providers: - id: anthropic:claude-agent-sdk config: settings: permissions: allow: - 'Bash(*)' - 'Read(*)' ``` Or reference a settings file: ```yaml providers: - id: anthropic:claude-agent-sdk config: settings: /path/to/settings.json ``` For embedders that derive lockdown configuration from an enterprise policy, use `managed_settings` to load policy-tier settings above user/project layers (so user/project settings cannot widen restrictions set here): ```yaml providers: - id: anthropic:claude-agent-sdk config: managed_settings: sandbox: network: allowManagedDomainsOnly: true ``` ## Per-Tool Configuration Customize built-in tool behavior with `tool_config`: ```yaml providers: - id: anthropic:claude-agent-sdk config: tool_config: askUserQuestion: previewFormat: html # 'markdown' (default) or 'html' ``` ## Progress Summaries and Prompt Suggestions Enable AI-generated progress summaries for running subagents and predicted next prompts: ```yaml providers: - id: anthropic:claude-agent-sdk config: agent_progress_summaries: true # periodic summaries for subagents prompt_suggestions: true # AI-predicted next prompts after each turn ``` ## Advanced Runtime Configuration ### JavaScript Runtime Specify which JavaScript runtime to use: ```yaml providers: - id: anthropic:claude-agent-sdk config: executable: bun # or 'node' or 'deno' executable_args: - '--smol' ``` ### Extra CLI Arguments Pass additional arguments to Claude Code: ```yaml providers: - id: anthropic:claude-agent-sdk config: extra_args: verbose: null # boolean flag (adds --verbose) timeout: '30' # adds --timeout 30 ``` ### Custom Executable Path Use a specific Claude Code installation: ```yaml providers: - id: anthropic:claude-agent-sdk config: path_to_claude_code_executable: /custom/path/to/claude-code ``` ### Custom Spawn Function 
(Programmatic Only) For running Claude Code in VMs, containers, or remote environments, you can provide a custom spawn function when using the provider programmatically: ```typescript import { loadApiProvider } from 'promptfoo'; const provider = await loadApiProvider('anthropic:claude-agent-sdk', { options: { config: { spawn_claude_code_process: (options) => { // Custom spawn logic for VM/container execution // options contains: command, args, cwd, env, signal return myVMProcess; // Must satisfy SpawnedProcess interface }, }, }, }); ``` This option is only available when using the provider programmatically, not via YAML configuration. ## Programmatic Agents Define custom subagents with specific tools and permissions: ```yaml providers: - id: anthropic:claude-agent-sdk config: agents: code-reviewer: name: Code Reviewer description: Reviews code for bugs and style issues tools: [Read, Grep, Glob] test-runner: name: Test Runner description: Runs tests and reports results tools: [Bash, Read] ``` ## Handling AskUserQuestion Tool The `AskUserQuestion` tool allows Claude to ask the user multiple-choice questions during execution. In automated evaluations, there's no human to answer these questions, so you need to configure how they should be handled. 
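To make the problem concrete, the sketch below models a question payload and one way to answer it without a human. The field names (`questions`, `question`, `options`, `label`, `description`, `answers`) follow the callback examples later in this section; the TypeScript types and the `answerWithFirstOption` helper are assumptions for illustration, not part of the SDK.

```typescript
// Assumed shapes, inferred from the callback examples in this section.
interface QuestionOption {
  label: string;
  description?: string;
}

interface Question {
  question: string;
  options: QuestionOption[];
}

interface AskUserQuestionInput {
  questions: Question[];
}

// Hypothetical helper: builds an `answers` map by picking the first option
// of every question, keyed by the question text.
function answerWithFirstOption(input: AskUserQuestionInput): Record<string, string> {
  const answers: Record<string, string> = {};
  for (const q of input.questions) {
    const first = q.options[0];
    if (first) {
      answers[q.question] = first.label;
    }
  }
  return answers;
}

const example: AskUserQuestionInput = {
  questions: [
    {
      question: 'Which environment?',
      options: [{ label: 'Staging' }, { label: 'Production' }],
    },
  ],
};

console.log(answerWithFirstOption(example));
```

An automated run needs some policy like this in place of the human, which is what the options in the following subsections provide.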
### Using the Convenience Option

The simplest approach is to use the `ask_user_question` configuration:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      append_allowed_tools: ['AskUserQuestion']
      ask_user_question:
        behavior: first_option
```

Available behaviors:

| Behavior       | Description                            |
| -------------- | -------------------------------------- |
| `first_option` | Always select the first option         |
| `random`       | Randomly select from available options |
| `deny`         | Deny the tool use                      |

### Programmatic Usage

For custom answer selection logic when using the provider programmatically, you can provide your own `canUseTool` callback:

```typescript
import { loadApiProvider } from 'promptfoo';

const provider = await loadApiProvider('anthropic:claude-agent-sdk', {
  options: {
    config: {
      append_allowed_tools: ['AskUserQuestion'],
      can_use_tool: async (toolName, input) => {
        if (toolName !== 'AskUserQuestion') {
          return { behavior: 'allow', updatedInput: input };
        }
        return {
          behavior: 'allow',
          updatedInput: {
            ...input,
            answers: { 'Which environment?': 'Staging' },
          },
        };
      },
    },
  },
});
```

The `canUseTool` callback receives the tool name and input, and returns an answer:

```typescript
async function canUseTool(toolName, input, options) {
  if (toolName !== 'AskUserQuestion') {
    return { behavior: 'allow', updatedInput: input };
  }

  const answers = {};
  for (const q of input.questions) {
    // Custom selection logic - prefer options marked as recommended
    const preferred = q.options.find((o) => o.description.toLowerCase().includes('recommended'));
    answers[q.question] = preferred?.label ?? q.options[0].label;
  }

  return {
    behavior: 'allow',
    updatedInput: {
      questions: input.questions,
      answers,
    },
  };
}
```

See the [Claude Agent SDK permissions documentation](https://platform.claude.com/docs/en/agent-sdk/permissions) for more details on `canUseTool`.

:::tip
If you're testing scenarios where the agent asks questions, consider what answer would lead to the most interesting test case.
Using `random` behavior can help discover edge cases.
:::

## Hooks

Promptfoo forwards the `hooks` option to the Claude Agent SDK unchanged, so callbacks receive the SDK's native input shape and return values are honored as documented upstream. Hooks are programmatic-only: define them in a JS/TS provider file rather than YAML.

The `PostToolUse` event lets you rewrite tool output before the model sees it. Return `updatedToolOutput` to replace the result for any tool (built-in or MCP):

```ts title="provider.mjs"
// Placeholder helper for this example: swap in your own scrubbing logic.
const redact = (output) =>
  typeof output === 'string' ? output.replace(/sk-[\w-]+/g, '[REDACTED]') : output;

export default {
  id: 'anthropic:claude-agent-sdk',
  config: {
    hooks: {
      PostToolUse: [
        {
          matcher: 'Bash',
          hooks: [
            async (input) => ({
              hookEventName: 'PostToolUse',
              updatedToolOutput: redact(input.tool_response),
            }),
          ],
        },
      ],
    },
  },
};
```

`updatedMCPToolOutput` (MCP-only) is deprecated in favor of `updatedToolOutput`, which works for every tool. See the [SDK hook reference](https://docs.claude.com/en/docs/claude-code/hooks) for the full list of events and return shapes.

## Tool Call Tracking

The Claude Agent SDK provider captures all tool calls made during the agentic session and exposes them in `response.metadata.toolCalls`. This allows you to assert on tool usage in your evaluations.

Each tool call entry contains:

| Field             | Type           | Description                                                  |
| ----------------- | -------------- | ------------------------------------------------------------ |
| `id`              | string         | Unique tool call ID                                          |
| `name`            | string         | Tool name (e.g., `Read`, `Bash`, `Grep`)                     |
| `input`           | unknown        | Arguments passed to the tool                                 |
| `output`          | unknown        | Tool result content (undefined if not available)             |
| `is_error`        | boolean        | Whether the tool call resulted in an error                   |
| `parentToolUseId` | string \| null | Parent tool use ID for sub-agent calls, `null` for top-level |

### Asserting on Tool Usage

Use JavaScript assertions to check which tools were called:

```yaml
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const readCalls = toolCalls.filter(t => t.name === 'Read');
      return readCalls.length > 0;
```

Check that a specific command was run:

```yaml
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const bashCalls = toolCalls.filter(t => t.name === 'Bash');
      return bashCalls.some(t => t.input?.command?.includes('npm test'));
```

Verify tool output content:

```yaml
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const grepCall = toolCalls.find(t => t.name === 'Grep');
      // Compare with true so the result is a boolean even when no Grep call exists
      return grepCall?.output?.includes('expected match') === true;
```

For skill evals specifically, prefer the deterministic [`skill-used`](/docs/configuration/expected-outputs/deterministic/#skill-used) assertion over raw JavaScript when possible. Promptfoo derives `metadata.skillCalls` from these `Skill` tool calls automatically.

By default, only subagent `tool_use` and `tool_result` blocks reach `metadata.toolCalls`; the subagent's text and thinking are summarised away.
Set `forward_subagent_text: true` to forward the full subagent transcript so consumers can render or assert against the nested conversation:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      forward_subagent_text: true
```

## Tracing

When [tracing](/docs/tracing/) is enabled, every provider call emits an OpenTelemetry span using the GenAI semantic conventions (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*`, `gen_ai.response.model`, `gen_ai.response.finish_reasons`, etc.) plus a child span per completed tool call (`tool {name}` with `tool.input`, `tool.output`, `tool.is_error`). Spans are parented to the evaluation trace so they appear grouped in the Traces tab.

The W3C `TRACEPARENT` environment variable is propagated to the SDK subprocess so telemetry it exports attaches to the same trace:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      env:
        CLAUDE_CODE_ENABLE_TELEMETRY: '1'
        OTEL_EXPORTER_OTLP_ENDPOINT: 'http://127.0.0.1:4318'
        OTEL_EXPORTER_OTLP_PROTOCOL: 'http/protobuf'

tracing:
  enabled: true
  otlp:
    http:
      enabled: true
      port: 4318
```

### Deep tracing (SDK-internal events)

To also capture Claude Code's internal events (API requests, tool decisions, tool results), set `OTEL_LOGS_EXPORTER=otlp` and use the JSON logs protocol. Each log record becomes a child span on the provider span.

```yaml
config:
  env:
    CLAUDE_CODE_ENABLE_TELEMETRY: '1'
    OTEL_LOGS_EXPORTER: otlp
    OTEL_EXPORTER_OTLP_ENDPOINT: 'http://127.0.0.1:4318'
    OTEL_EXPORTER_OTLP_PROTOCOL: 'http/json'
```

The receiver's `/v1/logs` endpoint accepts JSON only. The provider automatically injects `OTEL_RESOURCE_ATTRIBUTES=promptfoo.trace_id=...,promptfoo.parent_span_id=...` so logs link to the correct evaluation trace even though the SDK's logs signal doesn't natively inherit `TRACEPARENT`.
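
As a rough illustration of how this trace linkage fits together, the sketch below parses a W3C `traceparent` value and formats the two `promptfoo.*` resource attributes from it. `parseTraceparent` and `toResourceAttributes` are hypothetical helpers, and the idea that the attribute values are the hex IDs taken from `traceparent` is an assumption made for this example; it is not the provider's actual implementation.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

// Parses a W3C traceparent value of the form
// "<version>-<trace-id>-<parent-id>-<trace-flags>".
// Returns null when the value is malformed.
function parseTraceparent(value: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value.trim());
  if (!match) {
    return null;
  }
  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Assumption: attribute values are the raw hex IDs. The attribute keys come
// from this section; the value format is not documented here.
function toResourceAttributes(ctx: TraceContext): string {
  return `promptfoo.trace_id=${ctx.traceId},promptfoo.parent_span_id=${ctx.parentSpanId}`;
}

// Example value taken from the W3C Trace Context specification.
const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const attributes = parsed ? toResourceAttributes(parsed) : undefined;
```

A helper like this can be handy in a test harness for checking that spans and log records share a trace ID.
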
## Caching Behavior

This provider automatically caches responses, and will read from the cache if the prompt, configuration, and files in the working directory (if `working_dir` is set) are the same as a previous run.

When MCP servers are configured, caching is disabled by default because MCP tools typically interact with external state (APIs, file systems, databases), making cached responses unreliable. To opt back into caching for deterministic MCP tools (e.g., code search, static knowledge bases), set `cache_mcp: true`:

```yaml
providers:
  - id: anthropic:claude-agent-sdk
    config:
      cache_mcp: true
      mcp:
        servers:
          - command: npx
            args: ['-y', '@my/deterministic-mcp-server']
            name: my-server
```

To disable caching globally:

```bash
export PROMPTFOO_CACHE_ENABLED=false
```

You can also include `bustCache: true` in the configuration to prevent reading from the cache.

## Managing Side Effects

When using Claude Agent SDK with configurations that allow side effects, like writing to files, running system commands, or calling MCP servers, you'll need to consider:

- How to reset after each test run
- How to ensure tests don't interfere with each other (like writing to the same files concurrently)

This increases complexity, so first consider if you can achieve your goal with a read-only configuration.

If you do need to test with side effects, here are some strategies that can help:

- **Serial execution**: Set `evaluateOptions.maxConcurrency: 1` in your config or use `--max-concurrency 1` CLI flag
- **Hooks**: Use promptfoo [extension hooks](/docs/configuration/reference/#extension-hooks) to reset the environment after each test run
- **Wrapper scripts**: Handle setup/cleanup outside of promptfoo
- **Use git**: If you're using a custom working directory, you can use git to reset the files after each test run
- **Use a container**: Run tests that might run commands in a container to protect the host system

## Examples

Here are a few complete example implementations:

- [Basic usage](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#basic-usage) - Basic usage with no tools
- [Working directory](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#working-directory) - Read-only access to a working directory
- [Advanced editing](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#advanced-editing) - File edits and working directory reset in an extension hook
- [MCP integration](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#mcp-integration) - Read-only MCP server integration with weather API
- [Structured output](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#structured-output) - JSON schema validation for agent responses
- [Advanced options](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#advanced-options) - Sandbox, runtime configuration, and CLI arguments
- [AskUserQuestion handling](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#askuserquestion-handling) - Automated handling of user questions in evaluations
- [Skills testing](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#skills-testing) - Testing Agent Skills with the SDK
- [Plugins](https://github.com/promptfoo/promptfoo/tree/main/examples/claude-agent-sdk#plugins) - Loading plugins to extend agent capabilities

## See Also

- [Claude Agent SDK documentation](https://docs.claude.com/en/api/agent-sdk)
- [Agent Skills in the SDK](https://platform.claude.com/docs/en/agent-sdk/skills) - Testing and using skills with the SDK
- [Claude Code skills documentation](https://code.claude.com/docs/en/skills) - Creating custom skills
- [Claude Code plugins](https://code.claude.com/docs/en/plugins) - Creating and using plugins
- [Standard Anthropic provider](/docs/providers/anthropic/) - For text-only interactions