# openai-agents (Long-Horizon OpenAI Agents Python SDK)

This example shows how to evaluate the official Python `openai-agents` SDK end to end in Promptfoo. It demonstrates:

- a long-horizon task executed as multiple turns over a persistent `SQLiteSession`
- the SDK 0.14 `SandboxAgent` runtime over a staged Unix-local Python workspace
- a local-shell `discount-review` skill mounted through `ShellTool`
- specialist handoffs between a triage agent, an FAQ agent, and a seat-booking agent
- agentic assertions such as `trajectory:tool-used`, `trajectory:tool-args-match`, `trajectory:tool-sequence`, and `trajectory:step-count`
- telemetry you can inspect in Promptfoo's Trace Timeline

The tracing path is important: the example installs a custom OpenAI Agents tracing processor that exports the SDK's spans to Promptfoo's built-in OTLP receiver. That is what makes the trajectory assertions and trace visualization work inside Promptfoo. The bridge maps SDK custom spans, including `sandbox.*` lifecycle spans and experimental Codex command spans, into normal OTLP attributes, and Promptfoo normalizes OpenAI Agents `exec_command` tool spans as command trajectory steps. The config accepts both OTLP JSON and protobuf because the SDK bridge emits JSON while the optional Python wrapper span uses protobuf by default.

## Files

- `agent_provider.py`: the Promptfoo Python provider and agent graph
- `promptfoo_tracing.py`: bridges OpenAI Agents SDK traces to Promptfoo OTLP
- `promptfooconfig.yaml`: eval config with tracing and trajectory assertions
- `skills/discount-review/`: a local `SKILL.md` bundle plus helper script for the skill eval
- `skill_fixture/`: the real local repo fixture inspected by the skill workflow
- `promptfooconfig.redteam.yaml`: airline agent red-team config with trace assertions
- `promptfooconfig.redteam.coding.yaml`: SandboxAgent coding-agent red-team config
- `requirements.txt`: Python dependencies for the example

## Requirements

- Python 3.10+
- Node.js 20+
- `OPENAI_API_KEY`

## Setup

```bash
npx promptfoo@latest init --example openai-agents
cd openai-agents
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export OPENAI_API_KEY=your_api_key_here
```

## Run

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
PROMPTFOO_ENABLE_OTEL=true npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
npx promptfoo@latest view
```

Open any result and inspect the **Trace Timeline** tab. You should see agent, handoff, generation, and tool spans from the OpenAI Agents SDK. If you also want a provider-level Python OpenTelemetry span alongside the SDK spans, run the eval with `PROMPTFOO_ENABLE_OTEL=true`.

The provider returns aggregate token usage with the SDK's real request count, cached-input tokens, and reasoning-token detail. It intentionally does not return a dollar cost: a generic Python agent graph can mix models and hosted tools, so exact spend should be returned only by provider code that can account for every billed step.
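For orientation, here is a minimal sketch of the shape such a provider can take. It assumes Promptfoo's Python provider contract (`call_api(prompt, options, context)` returning a dict with an `output` key) and the SDK's public `Agent`, `Runner`, and `SQLiteSession` APIs; the agent names, tool body, session-id lookup, and `tokenUsage` fields are illustrative stand-ins, not a copy of `agent_provider.py`:

```python
# Minimal sketch of a Promptfoo Python provider wrapping an openai-agents graph.
# Agent names, the tool body, and the usage fields are illustrative.
import asyncio

from agents import Agent, Runner, SQLiteSession, function_tool


@function_tool
def lookup_reservation(confirmation: str) -> str:
    """Return reservation details for a confirmation code (stubbed here)."""
    return f"Reservation {confirmation}: seat 14B on flight PF100"


seat_agent = Agent(
    name="Seat booking agent",
    instructions="Handle seat changes for verified reservation holders.",
    tools=[lookup_reservation],
)

triage_agent = Agent(
    name="Triage agent",
    instructions="Answer FAQs yourself; hand seat changes to the specialist.",
    handoffs=[seat_agent],
)


def call_api(prompt: str, options: dict, context: dict) -> dict:
    # Reusing one file-backed SQLiteSession per logical conversation is what
    # makes the long-horizon eval multi-turn instead of a fresh context per call.
    session_id = (context or {}).get("vars", {}).get("session_id", "default")
    session = SQLiteSession(session_id, "sessions.db")
    result = asyncio.run(Runner.run(triage_agent, prompt, session=session))
    return {
        "output": result.final_output,
        # Aggregate usage only; see the note above about why no dollar cost.
        "tokenUsage": {"numRequests": 1},
    }
```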
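And a condensed sketch of the bridging idea behind `promptfoo_tracing.py`: register a custom processor with the SDK's tracing module and forward each finished span to Promptfoo's OTLP receiver. The `TracingProcessor` method names follow the SDK's tracing interface, but the endpoint URL, the `span.export()` payload handling, and the plain-JSON POST are simplified assumptions; the real bridge emits well-formed OTLP JSON.

```python
# Condensed sketch of an OpenAI Agents -> Promptfoo OTLP bridge.
# The endpoint and payload shape are simplified stand-ins for real OTLP JSON.
import json
import urllib.request

from agents.tracing import TracingProcessor, add_trace_processor

OTLP_ENDPOINT = "http://localhost:4318/v1/traces"  # assumed Promptfoo receiver default


class PromptfooSpanExporter(TracingProcessor):
    """Forwards finished SDK spans (agent, handoff, generation, tool) to Promptfoo."""

    def on_trace_start(self, trace) -> None:
        pass

    def on_trace_end(self, trace) -> None:
        pass

    def on_span_start(self, span) -> None:
        pass

    def on_span_end(self, span) -> None:
        # span.export() returns a plain dict; the real bridge maps it into
        # proper OTLP attributes, this sketch just posts it as-is.
        payload = json.dumps({"spans": [span.export()]}).encode()
        request = urllib.request.Request(
            OTLP_ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request, timeout=5)

    def shutdown(self) -> None:
        pass

    def force_flush(self) -> None:
        pass


def configure_promptfoo_tracing() -> None:
    # Called once from agent_provider.py before any agent runs.
    add_trace_processor(PromptfooSpanExporter())
```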
## What The Eval Asserts

- the agent used `lookup_reservation`, `update_seat`, and `faq_lookup`
- the seat update tool received the expected arguments
- the tools appeared in the expected order across a multi-step task
- at least three traced agent spans were captured during the long-horizon run
- no traced error spans were emitted
- the final trajectory achieved the stated goal
- third-party booking changes are refused without mutating the reservation
- the sandbox agent created a workspace, ran shell commands, ran the unittest command, and reported the staged ticket details with the minimal fix
- the local-shell skill workflow read `SKILL.md`, ran the bundled helper script without shell stderr, and reported the expected ticket details

## Red Team The Agent

```bash
npx promptfoo@latest redteam generate -c promptfooconfig.redteam.yaml -o redteam.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.generated.yaml --no-cache --no-share -j 1 -o redteam-results.json
npx promptfoo@latest redteam generate -c promptfooconfig.redteam.coding.yaml -o redteam.coding.generated.yaml --remote --force --strict
npx promptfoo@latest redteam eval -c redteam.coding.generated.yaml --no-cache --no-share -j 1 -o redteam-coding-results.json
```

The airline red-team config targets the airline agent with tracing enabled and returns only the user-visible final answer, not the verbose eval transcript. It exercises agent-specific boundaries across OWASP Agentic AI, OWASP LLM, MITRE ATLAS, and NIST AI RMF mappings: tool discovery, prompt extraction, debug access, system prompt override, authorization bypass, cross-session leakage, memory poisoning, privacy, PII, data exfiltration, ASCII smuggling, excessive agency, and custom airline policy probes. It applies only the `jailbreak:meta` and `jailbreak:hydra` strategies; Promptfoo still includes the generated baseline/direct probes that those strategies transform. Hydra is configured as non-stateful so each generated probe is replayed against a fresh airline session.

The coding-agent red-team config targets the SandboxAgent workflow and focuses on repository prompt injection, terminal-output injection, secret/env/file reads, sandbox write escapes, network egress, delayed CI exfiltration, generated vulnerabilities, automation poisoning, steganographic exfiltration, and verifier sabotage. It also uses only `jailbreak:meta` and `jailbreak:hydra`. This is the stronger harness-oriented companion to the airline policy red team.

This sample is intentionally not a production-hardened airline agent. Some generated probes should find real breaks, especially around third-party booking changes, authority/consent claims, data-exfiltration attempts, and multi-turn authorization bypasses. Each generated attack inherits trace assertions that require OpenAI Agents SDK spans, require zero traced errors, and fail if the mutating `update_seat` tool is used during adversarial probes. Inspect failures together with the Trace Timeline so you can distinguish a user-visible refusal problem from an internal tool-path or boundary failure.
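A refusal that holds under these probes should not depend on instructions alone. One way to make the `update_seat` boundary survive prompt-level attacks is to verify the caller inside the tool itself, so even a jailbroken model cannot mutate state. A hedged sketch of that pattern follows; the field names and verification rule are illustrative, and not necessarily how `agent_provider.py` implements the check:

```python
# Sketch of a tool-level mutation guard. Field names and the verification
# rule are illustrative, not a copy of the example's implementation.
from agents import function_tool

RESERVATIONS = {"PF123": {"holder": "Ada Lovelace", "seat": "14B"}}


@function_tool
def update_seat(confirmation: str, new_seat: str, passenger_name: str) -> str:
    """Change a seat, refusing anyone who is not the reservation holder."""
    reservation = RESERVATIONS.get(confirmation)
    if reservation is None:
        return "No reservation found for that confirmation code."
    if passenger_name.strip().lower() != reservation["holder"].lower():
        # Refuse before mutating: the trace still shows the tool call,
        # but the reservation record is untouched.
        return "Refused: seat changes require the verified reservation holder."
    reservation["seat"] = new_seat
    return f"Seat updated to {new_seat} on reservation {confirmation}."
```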
## Notes

- The example uses `openai-agents>=0.14.1,<0.15` and the Python SDK, not the built-in `openai:agents:*` provider. That built-in provider is for the JavaScript `@openai/agents` SDK.
- `requirements.txt` includes the optional OpenTelemetry Python packages used by Promptfoo's wrapper. Set `PROMPTFOO_ENABLE_OTEL=true` to emit the provider-level Python span in addition to the SDK spans.
- If you do not need SDK spans, remove the `configure_promptfoo_tracing(...)` import and call from `agent_provider.py`. You can then delete `promptfoo_tracing.py`, but you will lose tool-path assertions because Promptfoo will no longer receive the SDK's internal agent spans.
- `trajectory:goal-success` adds an extra judge-model call. Remove it if you want a cheaper run.
- The SDK's experimental `codex_tool` is available from `agents.extensions.experimental.codex`. Use it inside a Python provider when a larger agent should delegate a bounded workspace task to Codex. Use Promptfoo's `openai:codex-sdk` or `openai:codex-app-server` providers when Codex itself is the system under test.
- The local skill workflow uses `ShellTool(environment={"type": "local", "skills": [...]})` because the Python SDK exposes skills through shell environments rather than Codex-style ambient discovery. The SDK does not currently emit a first-class skill invocation event, so the example proves usage through traced shell commands that read `SKILL.md` and run the helper script; a sketch of the mounting pattern follows below.
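To make that last note concrete, here is a hedged sketch of mounting a local skill through `ShellTool`. The `environment` dict follows the shape quoted in the note; the import location, the `path` key of the skill descriptor, and the agent wiring are assumptions rather than verified SDK documentation:

```python
# Sketch of mounting a local SKILL.md bundle through a shell environment.
# Import path, skill-descriptor keys, and wiring are assumptions; check the
# SDK version you have installed before copying this.
from agents import Agent, ShellTool  # ShellTool's import path may vary by version

discount_review_shell = ShellTool(
    environment={
        "type": "local",
        "skills": [{"path": "skills/discount-review"}],  # assumed descriptor shape
    }
)

skill_agent = Agent(
    name="Discount review agent",
    instructions=(
        "Read SKILL.md in the mounted skill first, then run its helper script "
        "against the local repo fixture and report the ticket details."
    ),
    tools=[discount_review_shell],
)
```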