# Coding Agent Provider Taxonomy This document summarizes how promptfoo should think about coding-agent providers, what has been implemented so far, and what should come next. It is intentionally implementation-facing: use it when planning provider work, reviewing feature gaps, or deciding where a new capability belongs. ## Scope This taxonomy covers providers that run an agentic coding runtime, not ordinary single-turn model APIs. A coding-agent provider usually has some combination of: - A workspace or project directory. - Tool use for files, shell commands, MCP, search, or app connectors. - A session, thread, or server lifecycle. - Permission, sandbox, or approval controls. - Rich metadata beyond final assistant text. The main providers in this family today are: | Provider family | Provider IDs | Runtime boundary | | ----------------------- | ----------------------------------------------------- | ------------------------------------------ | | OpenAI Codex SDK | `openai:codex-sdk`, `openai:codex` | `@openai/codex-sdk` library | | OpenAI Codex app-server | `openai:codex-app-server`, `openai:codex-desktop` | Local `codex app-server` JSON-RPC process | | Claude Agent SDK | `anthropic:claude-agent-sdk`, `anthropic:claude-code` | `@anthropic-ai/claude-agent-sdk` library | | OpenCode SDK | `opencode:sdk`, `opencode` | OpenCode SDK plus local or existing server | Standard OpenAI, Anthropic, Bedrock, Azure, and other model providers still matter for grading and comparison, but they are outside this taxonomy unless they expose a stateful coding-agent runtime. ## Taxonomy Axes ### 1. Runtime Boundary The first question is where promptfoo stops and the agent runtime starts. | Boundary | Meaning | Current examples | | ------------------------ | ------------------------------------------------------------- | ---------------------------------------- | | In-process SDK | promptfoo calls a package API directly. | Codex SDK, Claude Agent SDK | | Managed local server | promptfoo starts a server, then talks to it through a client. | OpenCode when `baseUrl` is unset | | Existing server | promptfoo connects to a runtime it does not configure. | OpenCode with `baseUrl` | | Local app-server process | promptfoo starts a rich-client protocol server over stdio. | Codex app-server | | Desktop UI process | Human-facing native app process. | Codex Desktop app, not directly attached | This distinction matters because it controls what promptfoo can guarantee. If promptfoo starts the runtime, it can set env vars, working directories, sandbox options, tracing, and cleanup behavior. If promptfoo attaches to an existing server, that server owns authentication, installed tools, app connectors, and runtime state. ### 2. Session and Thread State Coding agents are rarely stateless. Each provider needs explicit semantics for: - Ephemeral sessions: safe default for independent eval rows. - Persistent sessions: useful for memory, multi-turn tasks, or regression suites. - User-supplied sessions: resume an existing thread/session by id. - Pooling: preserve concurrency without cross-contaminating rows. - Cleanup: unsubscribe, archive, delete temporary directories, or leave state alone. The important design rule is that session reuse must be opt-in or very clearly scoped. Reusing state silently makes eval results order-dependent. ### 3. Workspace and Side Effects Agent evals should separate filesystem access, network access, and shell access. Those are different risks. | Surface | Safe default | Higher-risk mode | | ----------- | ------------------------------------------ | ------------------------------------------------------- | | Filesystem | Temporary directory or read-only workspace | Workspace write or full filesystem access | | Shell | Disabled or approval-gated | Allowed command execution | | Network | Disabled unless explicitly requested | Host allow-lists, live web/search, package installs | | App/plugin | Not installed or not invoked by default | App connectors, plugin installs, config writes | | Environment | Minimal env | Inherited process env with secrets and local auth state | Provider docs should make clear that `danger-full-access` is not the same as network access, and read-only filesystem mode does not automatically sanitize env vars. Each surface should have its own option and its own tests. ### 4. Permission and Interaction Model Promptfoo evals are non-interactive by default. Agent runtimes often expect a human to answer approval prompts, permission requests, or clarification questions. Providers should convert those into deterministic policies. Common policy categories: - Command approval. - File-change approval. - Permission grants. - User-input or ask-user-question tools. - MCP elicitation. - Dynamic tool calls. - Plugin or app connector requests. Default policy should decline, cancel, or return empty answers unless a config opts into side effects. Every accepted side effect should be visible in metadata. ### 5. Inputs The baseline input is a prompt string. Coding-agent providers increasingly need structured inputs: - Text items. - Local images or image URLs. - Skills or plugin references. - Mentions/app connector references. - File, diff, or workspace context. Provider-specific JSON input arrays are acceptable when the underlying runtime has typed input items. Unknown JSON shapes should usually degrade to plain text rather than crashing an eval row, unless the provider docs promise strict input parsing. ### 6. Outputs and Metadata All coding-agent providers should return a normal promptfoo provider result: - `output`: final assistant-facing text. - `sessionId`: session/thread id when available. - `tokenUsage`: runtime usage when available. - `cost`: estimate when usage and model pricing are known. - `metadata`: normalized agent metadata. - `raw`: raw or summarized protocol data when useful and safe. Provider-specific metadata is still valuable, but consumers need a shared shape for cross-provider assertions and dashboards. A future shared schema should include: - Runtime family and version. - Workspace and sandbox settings. - Session/thread/turn identifiers. - Tool trajectories. - Approval decisions. - File changes and command executions. - MCP and dynamic tool calls. - Skill/plugin/app connector usage. - Trace ids and span links. ### 7. Observability Tracing should answer two questions: - What did the model decide? - What did the agent runtime do? Provider tracing should include the top-level `callApi` span, item/tool-level spans where possible, and sanitized attributes for prompts, commands, tool inputs, file paths, and outputs. Deep tracing should be opt-in when it requires injecting OpenTelemetry env vars into a child process. ## What Is Implemented So Far ### Shared Building Blocks The coding-agent providers already share several practical patterns: - Optional dependencies are loaded lazily so normal promptfoo installs do not need every agent SDK. - Working directories are validated or created before the agent call. - Temporary workspaces are cleaned up after evals. - Session or thread caches are keyed by provider config. - Provider-level config is stricter than prompt-level merged config. - Tool and skill usage are surfaced through metadata where possible. - Tracing is supported for Codex and is partially shared through OpenAI agent tracing helpers. Useful files: - `src/providers/agentic-utils.ts` - `src/providers/claude-agent-sdk.ts` - `src/providers/opencode-sdk.ts` - `src/providers/openai/codex-sdk.ts` - `src/providers/openai/codex-app-server.ts` - `src/providers/registry.ts` ### OpenAI Codex SDK Status: implemented and documented. Provider IDs: - `openai:codex-sdk` - `openai:codex-sdk:` - `openai:codex` - `openai:codex:` Implemented capabilities: - Lazy loading for `@openai/codex-sdk`. - API-key and local Codex login authentication paths. - Working directory and Git repository safety checks. - Sandbox, network, web search, approval, and reasoning controls. - Thread resume and persistent thread pooling. - JSON schema output. - Text and local-image input items. - Skill usage heuristics from `SKILL.md` reads. - Token usage and cost estimation for known Codex models. - Streaming aggregation for metadata and tracing. - Deep tracing into the Codex runtime. - Default-provider support for grading when Codex credentials are available. Important limits: - It is the right default for CI and automation, but it does not expose every rich app-server protocol event. - Skill detection is heuristic. - Promptfoo still receives a final provider response, not live partial output in assertions. Docs and examples: - `site/docs/providers/openai-codex-sdk.md` - `examples/openai-codex-sdk/` ### OpenAI Codex App Server Status: implemented, documented, and validated with mocked protocol tests plus a real local eval. Provider IDs: - `openai:codex-app-server` - `openai:codex-app-server:` - `openai:codex-desktop` - `openai:codex-desktop:` Implemented capabilities: - Spawns `codex app-server --listen stdio://`. - Drives the app-server JSON-RPC handshake. - Starts, resumes, unsubscribes, and archives threads according to config. - Starts turns with model, cwd, sandbox, approval, reasoning, personality, collaboration mode, service tier, output schema, and instructions. - Accepts plain text plus JSON arrays of app-server input items. - Aggregates streamed item notifications into final text and metadata. - Captures item counts, command/file/MCP/tool/web-search/reasoning trajectories. - Handles server requests deterministically through `server_request_policy`. - Uses safe defaults: read-only sandbox, no approvals, ephemeral threads, minimal env. - Supports thread pooling and persistent thread cache invalidation. - Supports raw events, token usage, cost estimation, request timeouts, turn timeouts, cleanup, aborts, and process failure handling. - Supports deep tracing by creating a fresh app-server process per row and injecting OpenTelemetry env vars. - Differentiates the app-server protocol from the Codex Desktop app: promptfoo starts its own app-server child process and does not attach to a running Desktop app UI process. Important limits: - WebSocket transport is not implemented; stdio is the supported transport. - Live partial output is not exposed to assertions. - Direct attachment to a running Codex Desktop app is not implemented. - High-risk protocol requests such as config writes, plugin installs, and arbitrary filesystem writes should remain unavailable or explicitly policy-gated. Docs and examples: - `site/docs/providers/openai-codex-app-server.md` - `docs/agents/codex-app-server-provider-notes.md` - `examples/openai-codex-app-server/` ### Claude Agent SDK Status: implemented and documented. Provider IDs: - `anthropic:claude-agent-sdk` - `anthropic:claude-code` Implemented capabilities: - Lazy loading for `@anthropic-ai/claude-agent-sdk`. - Anthropic API key, Bedrock, Vertex, and local Claude Code auth flows. - Temporary or configured working directories. - Built-in tool controls, allowed/disallowed tools, and permission modes. - Explicit unsafe permission skip flag. - MCP server configuration and optional MCP caching. - AskUserQuestion handling. - Plugin and local skill support. - Custom agents, hooks, system prompt overrides, betas, thinking, and effort options. - Session resume, fork, continue, persistence, and file checkpointing options. - Sandbox and executable configuration. - Usage/cost controls such as max budget. Important limits: - The SDK owns many semantics, so promptfoo must keep docs aligned with SDK changes. - Side-effectful modes require external workspace reset discipline. Docs and examples: - `site/docs/providers/claude-agent-sdk.md` - `examples/claude-agent-sdk/` ### OpenCode SDK Status: implemented and documented. Provider IDs: - `opencode:sdk` - `opencode` Implemented capabilities: - Lazy loading for `@opencode-ai/sdk` v1 or v2. - Starts an OpenCode server when `baseUrl` is unset. - Connects to an existing OpenCode server when `baseUrl` is provided. - Supports provider/model selection, variants, workspaces, and custom agents. - Supports temporary or configured working directories. - Uses read-only default tools for configured workspaces. - Supports write/edit/bash tools with explicit permission config. - Supports JSON Schema structured output through OpenCode `format`. - Supports sessions and persistent session caching. - Supports MCP configuration and optional MCP caching when promptfoo starts the server. Important limits: - When using `baseUrl`, the existing server owns auth, MCP setup, installed agents, and server-side configuration. - OpenCode model support is delegated to OpenCode/models.dev rather than promptfoo's normal provider model tables. Docs and examples: - `site/docs/providers/opencode-sdk.md` - `examples/provider-opencode-sdk/` ## Current Naming Guidance Use provider IDs that encode the runtime boundary, not just the model vendor. - `openai:codex` should continue to mean the Codex SDK alias because that is the best default for automation. - `openai:codex-app-server` should mean the app-server JSON-RPC protocol. - `openai:codex-desktop` should remain an alias for app-server behavior unless or until promptfoo can actually attach to the Desktop app process. - `anthropic:claude-code` should remain an alias for Claude Agent SDK because the SDK is still built on Claude Code. - `opencode` can remain a convenience alias for `opencode:sdk`. Avoid adding unscoped top-level aliases such as `codex:desktop` until the OpenAI scoped names are stable and the docs can clearly explain the difference between SDK, app-server, and Desktop UI attachment. ## What To Implement Next ### 1. Shared Agent Metadata Schema Create a cross-provider `metadata.agent` shape while preserving provider-specific metadata namespaces such as `metadata.codexAppServer`. Proposed fields: - `runtime`: `codex-sdk`, `codex-app-server`, `claude-agent-sdk`, `opencode`. - `runtimeVersion`: runtime-reported version when available. - `sessionId`, `threadId`, `turnId`. - `workingDir`, `sandbox`, `network`, `approvalPolicy`. - `tools`: normalized tool calls and outcomes. - `commands`: normalized shell command executions. - `fileChanges`: normalized file write/edit/delete attempts. - `approvals`: normalized approval prompts and decisions. - `mcp`: normalized MCP calls and elicitations. - `skills`: confirmed and attempted skill usage. - `trace`: trace ids and span ids when available. Acceptance criteria: - Existing provider-specific metadata remains backward compatible. - Cross-provider assertions can target the same metadata path. - Tests cover at least Codex SDK, Codex app-server, Claude Agent SDK, and OpenCode. ### 2. Shared Coding-Agent Provider Test Contract Add a reusable test contract for coding-agent providers. Each provider can implement the same scenarios with its own mocked runtime. Core scenarios: - Missing optional dependency. - API key/env precedence. - Working directory validation. - Prompt-level config merge. - Safe default sandbox and permissions. - Structured output. - Session persistence and cleanup. - Timeout and abort. - Runtime process/server failure. - Tool approval decline by default. - Metadata normalization. - Trace sanitization. Acceptance criteria: - New providers cannot skip lifecycle and safety cases. - Mock implementations reset hoisted mocks in `beforeEach`. - Concurrency tests prove one session/process failure cannot fail unrelated rows. ### 3. Provider Capability Matrix Add a docs page or generated table that compares coding-agent provider capabilities. Suggested columns: - Provider IDs. - Runtime boundary. - Optional dependency or CLI requirement. - Auth modes. - Workspace model. - Sandbox controls. - Shell/file/network controls. - MCP support. - Skills/plugins/custom agents. - Structured output. - Session persistence. - Tracing. - Raw protocol metadata. - Existing-server attachment. Acceptance criteria: - The matrix links to each provider doc. - It clearly says that Codex Desktop attachment is not currently supported. - It is tested in the docs build. ### 4. Real Eval QA Matrix Create a small set of real eval examples that can be run selectively by maintainers. Minimum scenarios: - Read-only repo review. - Structured JSON output. - Sandbox denial for attempted writes. - Workspace-write side effect in a disposable fixture. - Session persistence across two rows. - Skill or plugin invocation. - MCP tool call with a deterministic local MCP server. - Deep tracing smoke test. Acceptance criteria: - Each scenario records expected pass/fail/error behavior. - Each scenario can run with `--no-cache`. - Side-effectful scenarios use disposable workspaces. - CI can run a lightweight subset without requiring every optional agent runtime. ### 5. Security and Redteam Coverage Add adversarial tests for the risky parts of agent runtimes. High-value cases: - Prompt injection that asks the agent to reveal env vars. - Path traversal through input items, tool args, or MCP responses. - Approval bypass attempts. - Plugin install/config-write attempts. - Network enablement attempts when network should be off. - Symlink writes out of the workspace. - Malicious `SKILL.md` or plugin instructions. - Tool output containing secrets that must be redacted from traces. Acceptance criteria: - Providers decline or isolate side effects by default. - Redaction is tested for command output, MCP arguments, tool arguments, prompts, and final metadata. - Any accepted dangerous operation requires explicit config and is visible in metadata. ### 6. Codex App Server Follow-Up Features The app-server provider now covers the core eval path. The next app-server-specific work should focus on rare protocol features and product integration boundaries. Candidates: - `model/list`, `skills/list`, `plugin/list`, `plugin/read`, and `app/list` metadata discovery. - `review/start` support for native review flows. - `turn/steer` and `turn/interrupt` tests for cancellation and mid-turn control. - Optional WebSocket transport only if upstream stabilizes it. - Better raw event snapshots for protocol regression tests. - Stronger docs around Desktop alias semantics and why promptfoo starts a separate app-server process. Acceptance criteria: - Rare features are opt-in and deterministic. - High-risk operations are policy-gated. - Protocol additions have mocked tests and at least one documented example when user-facing. ### 7. Side-Effect Harness Build a reusable harness for tests and examples that need write access. It should provide: - Disposable workspace creation. - Git repository initialization when needed. - Snapshot before and after agent runs. - Symlink and path traversal fixtures. - Cleanup guarantees. - Helpers for expected file diffs. Acceptance criteria: - No side-effectful provider test mutates the source checkout. - Tests can assert exact changed files. - Harness works on macOS, Linux, and Windows CI. ### 8. Documentation Cleanup Improve the information architecture around coding-agent providers. Recommended docs: - A public provider comparison matrix. - A "Choosing a coding-agent provider" guide. - A "Managing side effects in agent evals" guide. - Provider-specific "SDK vs app-server vs Desktop app" sections for Codex. - Example READMEs that all share the same run, auth, and safety structure. Acceptance criteria: - The public docs answer which provider to use for CI, Desktop-like protocol evals, Claude Code compatibility, and OpenCode multi-provider setups. - Example configs validate locally. - Docs state safe defaults and side-effect responsibilities near every write-capable example. ## Review Checklist For New Coding-Agent Provider Work Before merging a provider in this family, verify: - Provider IDs are explicit about runtime boundary. - Optional dependencies fail with actionable install guidance. - API keys and local-auth modes are documented. - Env precedence is tested. - Working directory, sandbox, network, and shell controls are documented and tested. - Prompt-level config merges do not drop nested provider defaults. - Hoisted mocks with implementations reset in `beforeEach`. - Session/thread reuse is opt-in or explicitly scoped. - Runtime failures do not poison unrelated sessions or rows. - Approval/user-input/MCP requests have deterministic non-interactive defaults. - Metadata includes session ids, tool/command/file activity, and approval decisions. - Trace attributes are sanitized. - Examples validate and at least one real eval has been run when the runtime is locally available. ## Open Questions - Should promptfoo expose one public `metadata.agent` schema now, or keep it internal until at least two providers use it in docs examples? - Should Codex app-server discovery operations be exposed as provider metadata on every call, or only behind an explicit config flag? - Should any provider support a hard "no side effects" verifier that snapshots the workspace and fails if files changed? - Should remote/existing-server modes be marked as less reproducible in promptfoo output metadata? - Should top-level aliases like `codex:app-server` wait for a broader provider naming cleanup?