---
title: Evaluate OpenAI Agents (Python SDK)
description: Evaluate the Python openai-agents SDK with Promptfoo tracing, SandboxAgent workflows, trace assertions, and agent red teams.
sidebar_position: 26
---

# Evaluate OpenAI Agents (Python SDK)

Use the Python `openai-agents` SDK with Promptfoo by wrapping your agent as a Python provider. This gives you full control over agent code, tools, sessions, and framework-specific tracing, while still letting Promptfoo score outputs and assert on the traced workflow.

:::note
The built-in [`openai:agents:*` provider](/docs/providers/openai-agents) is for the JavaScript `@openai/agents` SDK. For the Python SDK, use the Python provider path described here.
:::

## Quick Start

```bash
npx promptfoo@latest init --example openai-agents
cd openai-agents

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export OPENAI_API_KEY=your_api_key_here

# Run the eval
npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache

# Optional: also emit a provider-level Python OpenTelemetry span
PROMPTFOO_ENABLE_OTEL=true npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache

npx promptfoo@latest view
```

## What The Example Covers

- multi-turn execution over a persistent `SQLiteSession`
- SDK 0.14 `SandboxAgent` execution over a staged Unix-local Python workspace
- local-shell skill mounting with `ShellTool(environment={"type": "local", "skills": [...]})`
- specialist handoffs between a triage agent, an FAQ agent, and a seat-booking agent
- Promptfoo trace ingestion of the SDK's internal spans
- assertions on tool usage, tool arguments, sandbox commands, agent spans, tool order, and overall task success

## How The Tracing Works

Promptfoo can only assert on tool paths if it receives the agent's internal spans. The example does that by installing a custom `TracingProcessor` for the OpenAI Agents SDK and exporting those spans to Promptfoo's OTLP receiver. At a high level:

1. Promptfoo enables tracing and injects a W3C `traceparent` into the Python provider context.
2. The example parses that trace context and configures a custom OpenAI Agents tracing processor.
3. The processor converts OpenAI Agents spans into OTLP JSON.
4. Promptfoo ingests those spans and makes them available in the Trace Timeline and `trajectory:*` assertions.

If you skip this exporter, Promptfoo will not see the SDK's tool and handoff spans, so `trajectory:*` assertions will not have the trace data they need.

If you also enable Promptfoo's Python OpenTelemetry wrapper instrumentation with `PROMPTFOO_ENABLE_OTEL=true`, the example will emit a provider-level Python span as well, and the custom SDK spans will inherit that active OpenTelemetry span as their parent. The example config accepts both OTLP JSON and OTLP/protobuf because the SDK bridge emits JSON while the wrapper exporter uses protobuf by default.

SDK 0.14 adds custom spans for sandbox lifecycle work, and the SandboxAgent's shell tool emits `exec_command` function-tool spans. The example bridge maps SDK custom spans into normal OTLP attributes such as `sandbox.operation`, `command`, and `process.exit.code`, while Promptfoo normalizes OpenAI Agents `exec_command` tool spans as command trajectory steps. The same mapping also exposes command spans emitted by the SDK's experimental Codex tool as `command` and `codex.command`.
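The bundled bridge does more than this, but a minimal sketch of the same idea, assuming the SDK's `TracingProcessor` interface and `add_trace_processor` hook from `agents.tracing`, looks roughly like the following. The OTLP endpoint, the way the `traceparent` reaches the provider, and the span payload are simplified for illustration; use the example's implementation for real runs.

```python
import json
import urllib.request

from agents.tracing import TracingProcessor, add_trace_processor

# Assumption: Promptfoo's OTLP/HTTP receiver on its default port. The bundled
# example reads the real endpoint and trace context at runtime.
OTLP_ENDPOINT = "http://localhost:4318/v1/traces"


class PromptfooSpanBridge(TracingProcessor):
    """Re-parent finished OpenAI Agents SDK spans under Promptfoo's trace ID."""

    def __init__(self, traceparent: str):
        # W3C traceparent: "00-<trace-id>-<parent-span-id>-<flags>"
        _, self.trace_id, self.parent_span_id, _ = traceparent.split("-")
        self.finished = []

    def on_trace_start(self, trace) -> None:
        pass

    def on_trace_end(self, trace) -> None:
        pass

    def on_span_start(self, span) -> None:
        pass

    def on_span_end(self, span) -> None:
        # Collect every finished SDK span: agent, handoff, generation, tool, custom.
        self.finished.append(span)

    def force_flush(self) -> None:
        # A real exporter also fills span IDs, timestamps, and attributes
        # (e.g. sandbox.operation, command, process.exit.code) before sending.
        spans = [
            {
                "traceId": self.trace_id,
                "parentSpanId": self.parent_span_id,
                "name": getattr(s.span_data, "name", None) or s.span_data.type,
            }
            for s in self.finished
        ]
        body = {"resourceSpans": [{"scopeSpans": [{"spans": spans}]}]}
        request = urllib.request.Request(
            OTLP_ENDPOINT,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)

    def shutdown(self) -> None:
        self.force_flush()


def install_promptfoo_bridge(traceparent: str) -> None:
    # Call this from the Python provider with the traceparent Promptfoo injected.
    add_trace_processor(PromptfooSpanBridge(traceparent))
```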
## Assertion Pattern

The example config asserts on the agent's actual behavior instead of only the final message:

```yaml
vars:
  steps_json: |
    [
      "My name is Ada Lovelace and my confirmation number is ABC123.",
      "Move me to seat 14C.",
      "Also, what is the baggage allowance?"
    ]
assert:
  - type: trajectory:tool-used
    value:
      - lookup_reservation
      - update_seat
      - faq_lookup
  - type: trajectory:tool-args-match
    value:
      name: update_seat
      args:
        confirmation_number: ABC123
        new_seat: 14C
      mode: partial
  - type: trajectory:tool-sequence
    value:
      steps:
        - lookup_reservation
        - update_seat
        - faq_lookup
  - type: trajectory:step-count
    value:
      type: span
      pattern: 'agent *'
      min: 3
  - type: trace-error-spans
    value:
      max_count: 0
```

Use `trajectory:goal-success` when you want a judge model to decide whether the traced workflow actually completed the task, not just whether it hit the right tool path.

## Long-Horizon Tasks

The example turns one eval row into a long-horizon task by passing a JSON-encoded list of user turns in `vars.steps_json`. The provider parses that JSON and executes the turns sequentially against a shared `SQLiteSession`, which lets the SDK preserve working memory across turns inside a single Promptfoo test case. The example also returns `tokenUsage.numRequests`, cached-input tokens, and reasoning-token detail from the SDK's raw model responses. That preserves the real multi-call footprint of handoffs and tool/model loops instead of collapsing every eval row to one request.

That pattern is useful when you want to evaluate:

- multi-step workflows that need memory
- agent handoffs over time
- task completion after several intermediate actions
- regressions in tool usage across longer trajectories

Promptfoo does not infer a dollar `cost` for this path automatically. A Python provider can mix models, hosted tools, and custom backends inside one agent graph, while the SDK's aggregate usage objects do not identify the priced model for each request. Return `cost` from your provider only when you can account for every billed model and hosted tool used by the run.
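A minimal sketch of that multi-turn provider loop, using the Promptfoo Python provider entry point (`call_api(prompt, options, context)`), looks like this. The agent, session ID, and usage bookkeeping here are illustrative rather than the bundled example's exact code.

```python
import json

from agents import Agent, Runner, SQLiteSession

triage_agent = Agent(
    name="Triage agent",
    instructions="Route airline requests to the right specialist.",
)


def call_api(prompt, options, context):
    """Run every turn from vars.steps_json against one shared session."""
    turns = json.loads(context["vars"].get("steps_json", "[]")) or [prompt]

    # One session per eval row, so the SDK keeps working memory across turns.
    session = SQLiteSession("promptfoo-eval-row")

    final_output = ""
    num_requests = 0
    for turn in turns:
        result = Runner.run_sync(triage_agent, turn, session=session)
        final_output = str(result.final_output)
        # Each raw model response carries its own usage; summing the request
        # count preserves the real multi-call footprint. The bundled example
        # also aggregates cached-input and reasoning tokens from the same
        # usage objects.
        num_requests += sum(r.usage.requests for r in result.raw_responses)

    return {
        "output": final_output,
        "tokenUsage": {"numRequests": num_requests},
    }
```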
## Sandbox Agents

OpenAI Agents SDK 0.14 introduced `SandboxAgent`, `Manifest`, and `SandboxRunConfig` for agents that need a live filesystem. Promptfoo does not need a special provider for this path: keep using a Python provider and pass a sandbox run config to the SDK.

The bundled example follows the same shape as the SDK's official sandbox coding examples: stage a small repo with a task file, source file, tests, and maintainer instructions; force the agent to inspect the workspace through shell commands; then assert on both the answer and the trace.

```python
from agents import ModelSettings, Runner
from agents.run import RunConfig
from agents.sandbox import Manifest, SandboxAgent, SandboxRunConfig
from agents.sandbox.entries import File
from agents.sandbox.sandboxes.unix_local import UnixLocalSandboxClient

agent = SandboxAgent(
    name="Workspace analyst",
    model="gpt-5.4-mini",
    instructions="Inspect the workspace with shell before answering.",
    default_manifest=Manifest(
        entries={
            "repo/task.md": File(content=b"Find the high-severity issue."),
        }
    ),
    model_settings=ModelSettings(include_usage=True),
)

result = Runner.run_sync(
    agent,
    "Inspect the staged repo and summarize the issue.",
    run_config=RunConfig(
        sandbox=SandboxRunConfig(client=UnixLocalSandboxClient()),
    ),
)
```

The bundled example includes a `sandbox-workflow` provider label and a sandbox test that asserts the agent reported the staged ticket, ran the requested unittest command, and emitted the expected sandbox trace shape:

```yaml
assert:
  - type: trace-span-count
    value:
      pattern: tool exec_command
      min: 2
  - type: trace-span-count
    value:
      pattern: sandbox.start
      min: 1
  - type: trace-span-count
    value:
      pattern: response *
      min: 2
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*unittest*'
      min: 1
```

Use `UnixLocalSandboxClient` for local development, `DockerSandboxClient` when you need container isolation, and hosted sandbox clients when your application already depends on managed execution. Keep credentials and secrets out of staged `Manifest` files unless the sandbox backend and trace redaction policy are appropriate for that data.

## Skills

The Python SDK exposes Agent Skills through shell environments rather than through Codex-style ambient discovery. For a local, reproducible eval, mount the skill on `ShellTool` explicitly. The bundled example also defines a small `SkillShellExecutor` that runs those local shell commands:

```python
from pathlib import Path

from agents import Agent, ShellTool

discount_review_skill = {
    "name": "discount-review",
    "description": "Inspect the discount policy fixture with the bundled checklist.",
    "path": "/path/to/skills/discount-review",
}

agent = Agent(
    name="Local Skill Analyst",
    instructions="Use the discount-review skill for discount-policy review tasks.",
    tools=[
        ShellTool(
            environment={
                "type": "local",
                "skills": [discount_review_skill],
            },
            executor=SkillShellExecutor(cwd=Path(__file__).parent),
        )
    ],
)
```

The bundled `skill-workflow` example keeps the task small on purpose: it mounts a `discount-review` skill, asks the agent to inspect a local fixture repo, and has the skill run a helper script before answering.

```yaml
assert:
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*discount-review/SKILL.md*'
      min: 1
  - type: trajectory:step-count
    value:
      type: command
      pattern: '*analyze_discount_policy.py*'
      min: 1
  - type: contains
    value: return discount_percent >= 20
  - type: not-contains
    value: 'stderr:'
```

Today, the Python SDK does not expose a first-class skill invocation event that Promptfoo can normalize into `skill-used`. For Python SDK skill evals, assert on the observable workflow instead: the skill file was read, the helper command ran cleanly, and the final answer reflects the skill's result. If your application already tracks selected skills, you can also return `metadata.skillCalls` from the Python provider yourself and use Promptfoo's [`skill-used`](/docs/configuration/expected-outputs/deterministic/#skill-used) assertion on top of that.
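If you go that route, the provider return value carries the skill calls next to the output. The sketch below is a hedged illustration rather than the bundled example's code: it reuses the `agent` defined above, hard-wires the single mounted skill, and assumes a list of `{"name": ...}` records; check the `skill-used` docs for the exact metadata shape Promptfoo expects.

```python
from agents import Runner


def call_api(prompt, options, context):
    """Promptfoo Python provider that also reports which skills were selected."""
    result = Runner.run_sync(agent, prompt)  # `agent` is the skill agent defined above

    # Your application's own record of the skills it chose for this request.
    # Hard-wired here because the example mounts exactly one skill.
    skill_calls = [{"name": "discount-review"}]

    return {
        "output": str(result.final_output),
        "metadata": {
            # Read by the `skill-used` assertion; see the linked docs for the
            # exact record shape it expects.
            "skillCalls": skill_calls,
        },
    }
```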
Hosted shell follows the same eval idea, but the attachment shape changes from a local path to a hosted `skill_reference`. Keep local shell for examples you want users to run from a fresh clone; use hosted shell when your product already depends on uploaded, versioned skills.

## Experimental Codex Tool

The Python SDK's Codex integration is available as `codex_tool` from `agents.extensions.experimental.codex`. It lets a regular Python SDK agent delegate a bounded workspace task to Codex during a tool call:

```python
from agents import Agent
from agents.extensions.experimental.codex import ThreadOptions, TurnOptions, codex_tool

agent = Agent(
    name="Repo assistant",
    instructions="Use Codex for repository inspection tasks.",
    tools=[
        codex_tool(
            sandbox_mode="workspace-write",
            working_directory="/path/to/repo",
            default_thread_options=ThreadOptions(
                model="gpt-5.4",
                model_reasoning_effort="low",
                approval_policy="never",
                web_search_mode="disabled",
            ),
            default_turn_options=TurnOptions(idle_timeout_seconds=60),
        )
    ],
)
```

Evaluate that agent through the same Python provider pattern. The example tracing bridge exposes Codex command execution spans as `command` and `codex.command`, so Promptfoo's trajectory assertions can verify that Codex actually inspected files or ran commands.

If Codex itself is the system under test, prefer Promptfoo's dedicated [`openai:codex-sdk`](/docs/providers/openai-codex-sdk) or [`openai:codex-app-server`](/docs/providers/openai-codex-app-server) providers. The app-server provider supports `approvals_reviewer: guardian_subagent`; the Python `openai-agents` SDK 0.14.1 package does not expose a public `Guardian`/`guardian` API.

## Red Team The Agent

The example includes two red-team configs. `promptfooconfig.redteam.yaml` targets the Python SDK airline agent with trace capture enabled. `promptfooconfig.redteam.coding.yaml` targets the `SandboxAgent` coding workflow and exercises coding-agent risks such as repository prompt injection, terminal-output injection, synthetic secret reads, sandbox write escapes, network egress, delayed CI exfiltration, generated vulnerabilities, automation poisoning, steganographic exfiltration, and verifier sabotage.

```bash
npx promptfoo@latest redteam generate -c promptfooconfig.redteam.yaml -o redteam.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.generated.yaml --no-cache --no-share -j 1 -o redteam-results.json

npx promptfoo@latest redteam generate -c promptfooconfig.redteam.coding.yaml -o redteam.coding.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.coding.generated.yaml --no-cache --no-share -j 1 -o redteam-coding-results.json
```

Both configs use only `jailbreak:meta` and `jailbreak:hydra` strategies; Promptfoo also includes the generated baseline/direct probes that those strategies transform. The target returns only the user-visible final answer, but each generated test inherits trace assertions so you can catch internal tool-path failures even when the final answer looks like a refusal. For example, the airline red team forbids traced `update_seat` calls during adversarial probes.

Keep generated corpora and result JSON files as local run artifacts unless you intentionally want to commit a fixed adversarial corpus. This sample is not production-hardened, so useful red-team runs should find some real breaks. Inspect failures alongside the Trace Timeline to separate output-only policy failures from internal tool-use or sandbox-boundary failures.
## Multimodal Input

The Python provider runs your own function, so you can pass structured multimodal input directly to `Runner.run_sync()` instead of a plain string:

```python
result = Runner.run_sync(
    agent,
    [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image?"},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"},
            ],
        }
    ],
)
```

Python SDK image input items use `image_url`; the JavaScript SDK examples use `image`.

## Telemetry

After the eval finishes, open the web UI and inspect the **Trace Timeline** for any row. You should see:

- a provider-level Python span when `PROMPTFOO_ENABLE_OTEL=true`
- agent spans
- handoff spans
- generation spans
- function-tool spans with tool names and arguments
- sandbox lifecycle spans such as `sandbox.start` and `sandbox.running` when using `SandboxAgent`
- shell command spans such as `tool exec_command`, normalized as command trajectory steps
- Codex command custom spans when using the SDK's experimental `codex_tool`

That same trace data powers `trace-span-*` and `trajectory:*` assertions.

## Related Docs

- [Python Provider](/docs/providers/python)
- [Tracing](/docs/tracing)
- [OpenAI Agents (JavaScript SDK)](/docs/providers/openai-agents)