---
sidebar_position: 1
description: 'Configure OpenAI models including GPT-5.5, GPT-5.4, GPT-4.1, o-series reasoning, embeddings, and assistants for comprehensive AI evals'
---

# OpenAI

To use the OpenAI API, set the `OPENAI_API_KEY` environment variable, specify the `apiKey` field in the configuration file, or pass the API key as an argument to the constructor.

Example:

```sh
export OPENAI_API_KEY=your_api_key_here
```

The OpenAI provider supports the following model formats:

- `openai:` - auto-routes known OpenAI model IDs to their supported promptfoo provider (chat, realtime, or responses); unknown model names default to Chat Completions. Use an explicit endpoint prefix when you need deterministic routing.
- `openai:chat:` - uses chat models against the `/v1/chat/completions` endpoint
- `openai:responses:` - uses responses API models over HTTP connections
- `openai:assistant:` - uses an assistant
- `openai:chat` - defaults to `gpt-4.1-2025-04-14`
- `openai:responses` - defaults to `gpt-4.1-2025-04-14`
- `openai:chat:ft:gpt-5-mini:company-name:ID` - example of a fine-tuned chat completion model
- `openai:completion` - defaults to `gpt-3.5-turbo-instruct`
- `openai:completion:` - uses any model name against the `/v1/completions` endpoint
- `openai:embedding:` / `openai:embeddings:` - uses any model name against the `/v1/embeddings` endpoint
- `openai:moderation:` - uses moderation models (default: `omni-moderation-latest`)
- `openai:image:` - uses image generation models
- `openai:transcription:` - uses audio transcription models
- `openai:realtime:` - uses realtime API models over WebSocket connections
- `openai:video:` - uses Sora video generation models
- `openai:agents:` - runs agentic workflows via OpenAI Agents SDK
- `openai:chatkit:` - runs ChatKit workflows
- `openai:codex-sdk` / `openai:codex` - runs agentic coding workflows via OpenAI Codex SDK, with optional inline model selection like `openai:codex:gpt-5.5`
- `openai:codex-app-server` / `openai:codex-desktop` - runs the experimental Codex app-server protocol for rich-client event, approval, sandbox, skill, plugin, and thread lifecycle evals

The `openai:<endpoint>:<model name>` construction is useful if OpenAI releases a new model, or if you have a custom model. For example, if OpenAI releases a `gpt-5` chat completion model, you could begin using it immediately with `openai:chat:gpt-5`.

```yaml title="GPT-5 only: verbosity and lowest reasoning"
providers:
  - id: openai:chat:gpt-5
    config:
      verbosity: high # low | medium | high
      reasoning_effort: minimal # GPT-5.5 uses none instead

  # For the Responses API, use a nested reasoning object:
  - id: openai:responses:gpt-5
    config:
      reasoning:
        effort: minimal # GPT-5.5 uses none instead
```

The OpenAI provider supports a handful of [configuration options](https://github.com/promptfoo/promptfoo/blob/main/src/providers/openai/types.ts#L112-L185), such as `temperature`, `max_tokens`, `max_completion_tokens`, `functions`, and `tools`, which can be used to customize model behavior like so:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      temperature: 0
      max_tokens: 1024
  - id: openai:chat:gpt-5.5
    config:
      max_completion_tokens: 1024
```

> **Note:** OpenAI models can also be accessed through [Azure OpenAI](/docs/providers/azure/), which offers additional enterprise features, compliance options, and regional availability.

## Formatting chat messages

For information on setting up chat conversations, see [chat threads](/docs/configuration/chat).
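As a quick illustration (a minimal sketch; the file name and `question` variable are placeholders), a JSON prompt file can supply an OpenAI-style message array, with Nunjucks variables filled in per test case:

```json title="chat_prompt.json"
[
  { "role": "system", "content": "You are a concise assistant." },
  { "role": "user", "content": "{{question}}" }
]
```

Reference it from your config with `prompts: [file://chat_prompt.json]`.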
## Configuring parameters

The `providers` list takes a `config` key that allows you to set parameters such as `temperature` (for non-reasoning models), `max_tokens`, `max_completion_tokens` (for GPT-5 family chat models), and [others](https://platform.openai.com/docs/api-reference/chat/create). For example:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      temperature: 0
      max_tokens: 128
  - id: openai:chat:gpt-5.5
    config:
      max_completion_tokens: 128
      apiKey: sk-abc123
```

Supported parameters include:

| Parameter | Description |
| --- | --- |
| `apiBaseUrl` | The base URL of the OpenAI API; see also `OPENAI_BASE_URL` below. |
| `apiHost` | The hostname of the OpenAI API; see also `OPENAI_API_HOST` below. |
| `apiKey` | Your OpenAI API key, equivalent to the `OPENAI_API_KEY` environment variable. |
| `apiKeyEnvar` | An environment variable that contains the API key. |
| `best_of` | Controls the number of alternative outputs to generate and select from. |
| `frequency_penalty` | Applies a penalty to frequent tokens, making them less likely to appear in the output. |
| `function_call` | Controls whether the AI should call functions. Can be either 'none', 'auto', or an object with a `name` that specifies the function to call. |
| `functions` | Allows you to define custom functions. Each function should be an object with a `name`, optional `description`, and `parameters`. |
| `functionToolCallbacks` | A map of function tool names to function callbacks. Each callback should accept a string and return a string or a `Promise`. |
| `headers` | Additional headers to include in the request. |
| `cost` | Legacy per-token override applied to both input and output pricing in promptfoo cost estimates. |
| `inputCost` | Override input token pricing in promptfoo cost estimates. |
| `outputCost` | Override output token pricing in promptfoo cost estimates. |
| `audioCost` | Legacy per-token override applied to both audio input and audio output pricing in promptfoo cost estimates. |
| `audioInputCost` | Override audio input token pricing in promptfoo cost estimates. |
| `audioOutputCost` | Override audio output token pricing in promptfoo cost estimates. |
| `max_tokens` | Controls maximum output length for non-reasoning requests. Not used by reasoning-capable models (o-series, `codex-mini-latest`, and GPT-5 family). Use `max_completion_tokens` (Chat Completions) or `max_output_tokens` (Responses API) instead. |
| `maxRetries` | Maximum number of retry attempts for failed API requests. Defaults to 4. Set to 0 to disable retries. Hard-quota responses (`insufficient_quota`, `billing_hard_limit_reached`, `access_terminated`, etc.) are never retried regardless of this setting, since retrying an exhausted account only amplifies load. |
| `metadata` | Key-value pairs for request tagging and organization. |
| `omitDefaults` | Omits hardcoded defaults for `temperature` and `max_tokens`/`max_output_tokens` unless values are explicitly set via config or environment variables. Supported by `openai:chat` and `openai:responses`. |
| `organization` | Your OpenAI organization key. |
| `passthrough` | A flexible object that allows passing arbitrary parameters directly to the OpenAI API request body. Useful for experimental, new, or provider-specific parameters not yet explicitly supported in promptfoo. This parameter is merged into the final API request and can override other settings. |
| `presence_penalty` | Applies a penalty to new tokens (tokens that haven't appeared in the input), making them less likely to appear in the output. |
| `prompt_cache_key` | Stable key for repeated prompts with shared prefixes. Use it consistently to improve prompt-cache hit rates. Supported by Chat Completions and Responses. |
| `prompt_cache_retention` | Prompt-cache retention policy. Use `24h` for extended retention or `in_memory` for default in-memory retention where supported. GPT-5.5, GPT-5.5 Pro, and future Responses models require extended retention, so `in_memory` will be rejected there. |
| `reasoning` | Reasoning configuration object for reasoning-capable models. In practice, use this with the Responses API (`openai:responses:*`) for o-series and GPT-5 family models. `effort` supports `none`, `low`, `medium`, `high`, and model-specific values such as `xhigh` or `minimal`, with optional `summary`. |
| `response_format` | Specifies the desired output format, including `json_object` and `json_schema`. Can also be specified in the prompt config. If specified in both, the prompt config takes precedence. |
| `seed` | Seed used for deterministic output. |
| `stop` | Defines a list of tokens that signal the end of the output. |
| `store` | Whether to store the conversation for future retrieval (boolean). |
| `temperature` | Controls the randomness of the AI's output for non-reasoning models. Promptfoo omits it for reasoning-capable models (o-series, `codex-mini-latest`, and GPT-5 family) because OpenAI ignores it there. |
| `tool_choice` | Controls whether the AI should use a tool. See [OpenAI Tools documentation](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) |
| `tools` | Allows you to define custom tools. See [OpenAI Tools documentation](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) |
| `top_p` | Controls nucleus sampling, a method that helps control the randomness of the AI's output. |
| `user` | A unique identifier representing your end-user, for tracking and abuse prevention. |
| `max_completion_tokens` | Maximum number of tokens for reasoning-capable Chat Completions models (o-series and GPT-5 family). For the Responses API, use `max_output_tokens` instead. |

Use `inputCost` and `outputCost` when a model has different prompt and completion rates. The legacy `cost` option remains a shared fallback. For audio-capable models, `audioInputCost` and `audioOutputCost` take precedence over `audioCost`.
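For example, to pin custom rates for a fine-tuned model (a sketch; the model ID and dollar figures are illustrative, and overrides are expressed in dollars per token):

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:chat:ft:gpt-5-mini:company-name:ID
    config:
      inputCost: 0.0000005 # $0.50 per 1M input tokens
      outputCost: 0.0000015 # $1.50 per 1M output tokens
```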
Here are the type declarations of `config` parameters:

```typescript
interface OpenAiConfig {
  // Completion parameters
  temperature?: number;
  max_tokens?: number;
  max_completion_tokens?: number;
  reasoning?: {
    effort?: 'none' | 'minimal' | 'low' | 'medium' | 'high' | 'xhigh' | null;
    summary?: 'auto' | 'concise' | 'detailed' | null;
  };
  top_p?: number;
  frequency_penalty?: number;
  presence_penalty?: number;
  best_of?: number;
  functions?: OpenAiFunction[];
  function_call?: 'none' | 'auto' | { name: string };
  tools?: OpenAiTool[];
  tool_choice?: 'none' | 'auto' | 'required' | { type: 'function'; function?: { name: string } };
  response_format?: { type: 'json_object' | 'json_schema'; json_schema?: object };
  stop?: string[];
  seed?: number;
  user?: string;
  metadata?: Record<string, string>;
  store?: boolean;
  prompt_cache_key?: string;
  prompt_cache_retention?: 'in_memory' | '24h' | null;
  passthrough?: object;

  // Function tool callbacks
  functionToolCallbacks?: Record<
    OpenAI.FunctionDefinition['name'],
    (arg: string) => Promise<string>
  >;

  // General OpenAI parameters
  apiKey?: string;
  apiKeyEnvar?: string;
  apiHost?: string;
  apiBaseUrl?: string;
  organization?: string;
  cost?: number;
  inputCost?: number;
  outputCost?: number;
  audioCost?: number;
  audioInputCost?: number;
  audioOutputCost?: number;
  headers?: { [key: string]: string };
  maxRetries?: number;
}
```

### Generating Multiple Responses

Use `passthrough` to set OpenAI's `n` parameter for generating multiple responses in a single request:

```yaml
providers:
  - id: openai:chat:gpt-4o
    config:
      passthrough:
        n: 3 # Generate 3 responses
```

When `n > 1`, the primary `output` contains the first choice's content, and all generated choices are available in the response metadata under `metadata.choices`. Each choice includes the full response object with `message`, `finish_reason`, and `index`.

### Reducing Embedding Dimensions

Use `passthrough` to send raw [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) fields such as `dimensions`. OpenAI supports `dimensions` on `text-embedding-3` and later models when you want a smaller vector size:

```yaml
providers:
  - id: openai:embedding:text-embedding-3-large
    config:
      passthrough:
        dimensions: 1024
```

## Models

OpenAI updates aliases, dated snapshots, and pricing frequently. Promptfoo supports explicit endpoint syntax like `openai:chat:` and `openai:responses:` for newly released models right away, while the tables below call out the common model IDs promptfoo knows about for routing and cost estimation. Check the official [OpenAI models docs](https://platform.openai.com/docs/models) and [pricing](https://openai.com/pricing) for the latest availability and rates.

### GPT-4.1

GPT-4.1 is OpenAI's flagship model for complex tasks with a 1,047,576 token context window and 32,768 max output tokens. Available in three variants with different price points:

| Model | Description | Input Price | Output Price |
| --- | --- | --- | --- |
| GPT-4.1 | Flagship model for complex tasks | $2.00 per 1M tokens | $8.00 per 1M tokens |
| GPT-4.1 Mini | More affordable, strong general capabilities | $0.40 per 1M tokens | $1.60 per 1M tokens |
| GPT-4.1 Nano | Most economical, good for high-volume tasks | $0.10 per 1M tokens | $0.40 per 1M tokens |

All variants support text and image input with text output and have a May 31, 2024 knowledge cutoff.
#### Usage Examples Standard model: ```yaml providers: - id: openai:chat:gpt-4.1 # or openai:responses:gpt-4.1 config: temperature: 0.7 ``` More affordable variants: ```yaml providers: - id: openai:chat:gpt-4.1-mini # or -nano variant ``` Specific snapshot versions are also available: ```yaml providers: - id: openai:chat:gpt-4.1-2025-04-14 # Standard - id: openai:chat:gpt-4.1-mini-2025-04-14 # Mini - id: openai:chat:gpt-4.1-nano-2025-04-14 # Nano ``` ### GPT-5.1 GPT-5.1 is a GPT-5 family model that emphasizes coding, agentic tasks, and more steerable output behavior. #### Available Models | Model | Description | Best For | | ------------------- | -------------------------------------------------- | ------------------------------------------- | | gpt-5.1 | Primary GPT-5.1 model | Complex reasoning and broad world knowledge | | gpt-5.1-2025-11-13 | Dated snapshot version | Locked behavior for production | | gpt-5.1-mini | Cost-optimized reasoning | Balanced speed, cost, and capability | | gpt-5.1-nano | High-throughput model | Simple instruction-following tasks | | gpt-5.1-codex | Specialized for coding tasks in Codex environments | Agentic coding workflows | | gpt-5.1-codex-max | Frontier agentic coding model with compaction | Long-running coding tasks and refactors | | gpt-5.1-chat-latest | Chat-optimized alias | Conversational applications | #### Key Features GPT-5.1 introduces several improvements over GPT-5: - **`none` reasoning mode**: New lowest reasoning setting for low-latency interactions (default setting) - **Increased steerability**: Better control over personality, tone, and output format - **Configurable verbosity**: Control output length with `low`, `medium`, or `high` settings (default: `medium`) #### Usage Examples Fast, low-latency responses: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.1 config: reasoning: effort: 'none' # Default setting - no reasoning tokens verbosity: 'low' # Concise outputs ``` Complex coding and reasoning tasks: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.1 config: reasoning: effort: 'high' # Maximum reasoning for complex tasks verbosity: 'medium' # Balanced output length max_output_tokens: 4096 ``` #### Reasoning Modes GPT-5.1 supports four reasoning effort levels: - **`none`** (default): No reasoning tokens, fastest responses, similar to non-reasoning models - **`low`**: Minimal reasoning for straightforward tasks - **`medium`**: Balanced reasoning for moderate complexity - **`high`**: Maximum reasoning for complex problem-solving #### Migration from GPT-5 GPT-5.1 with default settings (`none` reasoning) is designed as a drop-in replacement for GPT-5. Key differences: - GPT-5.1 defaults to `none` reasoning effort (GPT-5 defaulted to `low`) - GPT-5.1 has better-calibrated reasoning token consumption - Improved instruction-following and output formatting For tasks requiring reasoning, start with `medium` effort and increase to `high` if needed. ### GPT-5.1-Codex-Max GPT-5.1-Codex-Max is OpenAI's frontier agentic coding model, built on an updated foundational reasoning model trained on agentic tasks across software engineering, math, research, and more. It's designed for long-running, detailed coding work. 
#### Key Capabilities - **Compaction**: First model natively trained to operate across multiple context windows through compaction, coherently working over millions of tokens in a single task - **Long-running tasks**: Supports project-scale refactors, deep debugging sessions, and multi-hour agent loops - **Token efficiency**: 30% fewer thinking tokens compared to GPT-5.1-Codex at the same reasoning effort level - **Windows support**: First model trained to operate in Windows environments - **Improved collaboration**: Better performance as a coding partner in CLI environments #### Usage Examples ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.1-codex-max config: reasoning: effort: 'medium' # Recommended for most tasks max_output_tokens: 25000 # Reserve space for reasoning and outputs ``` For latency-insensitive tasks requiring maximum quality: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.1-codex-max config: reasoning: effort: 'xhigh' # Extra high reasoning for best results max_output_tokens: 40000 ``` :::warning GPT-5.1-Codex-Max is only available through the Responses API (`openai:responses:`). It does not work with the Chat Completions API (`openai:chat:`). ::: #### Reasoning Effort Levels - **`low`**: Minimal reasoning for straightforward tasks - **`medium`**: Balanced reasoning, recommended as daily driver - **`high`**: Maximum reasoning for complex problem-solving - **`xhigh`**: Extra high reasoning for non-latency-sensitive tasks requiring best results #### Best Practices - Use for agentic coding tasks in Codex or Codex-like environments - Reserve at least 25,000 tokens for reasoning and outputs when starting - Start with `medium` reasoning effort for most tasks - Use `xhigh` effort only for complex tasks where latency is not a concern - Review agent work before deploying to production :::note GPT-5.1-Codex-Max is recommended for use only in agentic coding environments and is not a general-purpose model like GPT-5.1. ::: ### GPT-5.2 GPT-5.2 is a GPT-5 family model for coding and agentic tasks, with both standard and pro variants. 
#### Available Models | Model | Description | Best For | | ---------------------- | ------------------------------- | ---------------------------------- | | gpt-5.2 | Standard GPT-5.2 model | Complex reasoning and coding tasks | | gpt-5.2-2025-12-11 | Snapshot version | Locked behavior for production | | gpt-5.2-chat-latest | Chat-optimized alias | Conversational applications | | gpt-5.2-codex | GPT-5.2 coding variant | Agentic coding workflows | | gpt-5.2-pro | Premium GPT-5.2 model | Highest-quality reasoning tasks | | gpt-5.2-pro-2025-12-11 | Snapshot version of GPT-5.2-pro | Locked behavior for production | #### Key Specifications - **Context window**: 400,000 tokens - **Max output tokens**: 128,000 tokens - **Reasoning support**: Full reasoning token support with configurable effort levels - **Pricing (`gpt-5.2`, `gpt-5.2-chat-latest`, `gpt-5.2-codex`)**: $1.75 per 1M input tokens, $14 per 1M output tokens - **Pricing (`gpt-5.2-pro`)**: $15 per 1M input tokens, $120 per 1M output tokens #### Usage Examples Standard GPT-5.2 variants are available via both the Chat Completions API and Responses API: **Chat Completions API:** ```yaml title="promptfooconfig.yaml" providers: - id: openai:chat:gpt-5.2-chat-latest config: max_completion_tokens: 4096 # With reasoning effort - id: openai:chat:gpt-5.2 config: reasoning_effort: 'medium' max_completion_tokens: 4096 ``` **Responses API:** ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.2-codex config: max_output_tokens: 4096 # With reasoning effort (nested format) - id: openai:responses:gpt-5.2 config: reasoning: effort: 'medium' max_output_tokens: 4096 ``` Fast, low-latency responses (no reasoning): ```yaml title="promptfooconfig.yaml" providers: # Chat API - id: openai:chat:gpt-5.2 config: reasoning_effort: 'none' max_completion_tokens: 2048 # Responses API - id: openai:responses:gpt-5.2 config: reasoning: effort: 'none' max_output_tokens: 2048 ``` GPT-5.2-pro (including dated snapshots) is best used via the Responses API: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.2-pro config: max_output_tokens: 8192 reasoning: effort: 'high' ``` #### Key Improvements over GPT-5.1 - **Reduced deception**: Significantly lower deception rates in production traffic - **Better safety compliance**: Improved cyber safety policy compliance - **Improved prompt injection resistance**: Enhanced robustness to known prompt injection attacks - **Enhanced sensitive topic handling**: Better performance on mental health and emotional reliance evaluations #### Reasoning Effort Levels - **`none`**: No reasoning tokens, fastest responses - **`low`**: Minimal reasoning for straightforward tasks - **`medium`**: Balanced reasoning for moderate complexity - **`high`**: Maximum reasoning for complex problem-solving ### GPT-5.3 Instant GPT-5.3 Instant is exposed as `gpt-5.3-chat-latest`. Promptfoo also supports GPT-5.3 coding variants for agentic/code workflows. #### Available Models | Model | Description | Pricing (Input / Output) | | ------------------- | ------------------------------------ | ------------------------- | | gpt-5.3-chat-latest | Chat-optimized alias | $1.75 / $14 per 1M tokens | | gpt-5.3-codex | GPT-5.3 coding model | $1.75 / $14 per 1M tokens | | gpt-5.3-codex-spark | Faster/cost-efficient coding variant | $0.50 / $4 per 1M tokens | #### Key Specifications - **Endpoint support**: Chat Completions API and Responses API - **Limits and pricing**: The `-latest` alias can move over time. 
Check [OpenAI model docs](https://platform.openai.com/docs/models) and [pricing](https://openai.com/pricing) for current context limits and rates. #### Usage Examples ```yaml title="promptfooconfig.yaml" providers: - id: openai:chat:gpt-5.3-chat-latest config: max_completion_tokens: 2048 - id: openai:responses:gpt-5.3-codex config: reasoning: effort: 'high' max_output_tokens: 4096 - id: openai:responses:gpt-5.3-chat-latest config: max_output_tokens: 2048 ``` ### GPT-5.5 GPT-5.5 is the latest GPT-5 family model for high-capability reasoning, professional work, and agentic workflows. #### Available Models | Model | Description | Pricing (Input / Output) | | ---------------------- | ----------------------------- | --------------------------- | | gpt-5.5 | Standard GPT-5.5 model | $5.00 / $30 per 1M tokens | | gpt-5.5-2026-04-23 | Dated snapshot of gpt-5.5 | $5.00 / $30 per 1M tokens | | gpt-5.5-pro | Premium GPT-5.5 pro model | $30.00 / $180 per 1M tokens | | gpt-5.5-pro-2026-04-23 | Dated snapshot of gpt-5.5-pro | $30.00 / $180 per 1M tokens | #### Key Specifications - **Long-context pricing**: `gpt-5.5` uses $10.00 input / $45.00 output per 1M tokens when prompts exceed 272,000 input tokens. - **Context window**: `gpt-5.5` and `gpt-5.5-pro` support 1,050,000 tokens. - **Max output tokens**: 128,000 tokens. - **Reasoning effort**: `gpt-5.5` supports `none`, `low`, `medium`, `high`, and `xhigh`. In Chat Completions, set `reasoning_effort`; in Responses API, set `reasoning.effort`. - **Endpoint support**: `gpt-5.5` supports Chat Completions and Responses API. `gpt-5.5-pro` is Responses API only and supports Batch API. - **Cached input**: `gpt-5.5` cached input tokens are $0.50 per 1M. `gpt-5.5-pro` has no cached-input discount. - **Cost estimates**: Promptfoo uses returned usage metadata for GPT-5.5 pricing and infers Batch, Flex, or Priority rates when the API response or configured `service_tier` identifies that tier. - **Long-running requests**: `gpt-5.5-pro` automatically receives the same 10-minute timeout as other GPT-5 pro models. #### Usage Examples ```yaml title="promptfooconfig.yaml" providers: - id: openai:chat:gpt-5.5 config: max_completion_tokens: 4096 reasoning_effort: 'low' verbosity: 'medium' - id: openai:responses:gpt-5.5 config: reasoning: effort: 'high' max_output_tokens: 4096 - id: openai:responses:gpt-5.5-pro config: reasoning: effort: 'xhigh' max_output_tokens: 8192 ``` ### GPT-5.4 GPT-5.4 is a GPT-5 family model for complex professional work, agentic coding, and tool-heavy workflows. #### Available Models | Model | Description | Pricing (Input / Output) | | ----------------------- | ------------------------------ | --------------------------- | | gpt-5.4 | Standard GPT-5.4 model | $2.50 / $15 per 1M tokens | | gpt-5.4-2026-03-05 | Dated snapshot of gpt-5.4 | $2.50 / $15 per 1M tokens | | gpt-5.4-mini | Smaller GPT-5.4 model | $0.75 / $4.50 per 1M tokens | | gpt-5.4-mini-2026-03-17 | Dated snapshot of gpt-5.4-mini | $0.75 / $4.50 per 1M tokens | | gpt-5.4-nano | Lowest-cost GPT-5.4 model | $0.20 / $1.25 per 1M tokens | | gpt-5.4-nano-2026-03-17 | Dated snapshot of gpt-5.4-nano | $0.20 / $1.25 per 1M tokens | | gpt-5.4-pro | Premium GPT-5.4 pro model | $30.00 / $180 per 1M tokens | | gpt-5.4-pro-2026-03-05 | Dated snapshot of gpt-5.4-pro | $30.00 / $180 per 1M tokens | #### Key Specifications - **Context window**: `gpt-5.4` and `gpt-5.4-pro` support 1,050,000 tokens. `gpt-5.4-mini` and `gpt-5.4-nano` support 400,000 tokens. 
- **Long-context pricing**: `gpt-5.4` and `gpt-5.4-pro` use higher long-context rates when prompts exceed 272,000 input tokens. - **Max output tokens**: 128,000 tokens - **Reasoning effort**: `gpt-5.4`, `gpt-5.4-mini`, and `gpt-5.4-nano` support `none`, `low`, `medium`, `high`, `xhigh`. `gpt-5.4-pro` supports `medium`, `high`, `xhigh`. - **Endpoint support**: `gpt-5.4`, `gpt-5.4-mini`, and `gpt-5.4-nano` support Chat Completions and Responses API. `gpt-5.4-pro` is Responses API only. Promptfoo's Codex SDK provider supports `gpt-5.4`, `gpt-5.4-pro`, and the newer GPT-5.5 line. - **Cached input**: `gpt-5.4` cached input tokens $0.25 per 1M, `gpt-5.4-mini` $0.075 per 1M, and `gpt-5.4-nano` $0.02 per 1M. `gpt-5.4-pro` has no cached-input discount. #### Usage Examples ```yaml title="promptfooconfig.yaml" providers: - id: openai:chat:gpt-5.4-mini config: max_completion_tokens: 2048 reasoning_effort: 'none' verbosity: 'low' - id: openai:chat:gpt-5.4 config: max_completion_tokens: 4096 reasoning_effort: 'low' - id: openai:responses:gpt-5.4-nano config: reasoning: effort: 'none' max_output_tokens: 1024 - id: openai:responses:gpt-5.4 config: reasoning: effort: 'high' max_output_tokens: 4096 - id: openai:responses:gpt-5.4-mini config: reasoning: effort: 'medium' max_output_tokens: 4096 - id: openai:responses:gpt-5.4-pro config: reasoning: effort: 'xhigh' max_output_tokens: 8192 ``` ### Reasoning Models (o1, o3, o3-pro, o3-mini, o4-mini) Reasoning models, like `o1`, `o3`, `o3-pro`, `o3-mini`, and `o4-mini`, are large language models trained with reinforcement learning to perform complex reasoning. These models excel in complex problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows. When using reasoning models, there are important differences in how tokens are handled: ```yaml title="promptfooconfig.yaml" providers: - id: openai:o1 config: reasoning: effort: 'medium' # Can be "low", "medium", or "high" max_completion_tokens: 25000 # Can also be set via OPENAI_MAX_COMPLETION_TOKENS env var ``` Unlike standard models that use `max_tokens`, reasoning models use: - `max_completion_tokens` to control the total tokens generated (both reasoning and visible output) - `reasoning` to control how thoroughly the model thinks before responding (with `effort`: none, low, medium, high; some GPT-5 family models also support `minimal` or `xhigh`) #### How Reasoning Models Work Reasoning models "think before they answer," generating internal reasoning tokens that: - Are not visible in the output - Count towards token usage and billing - Occupy space in the context window Both `o1` and `o3-mini` models have a 128,000 token context window, while `o3-pro` and `o4-mini` have a 200,000 token context window. OpenAI recommends reserving at least 25,000 tokens for reasoning and outputs when starting with these models. ## Images ### Sending images in prompts You can include images in the prompt by using content blocks. For example, here's an example config: ```yaml title="promptfooconfig.yaml" # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json prompts: - file://prompt.json providers: - openai:gpt-5 tests: - vars: question: 'What do you see?' url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' # ... 
``` And an example `prompt.json`: ```json title="prompt.json" [ { "role": "user", "content": [ { "type": "text", "text": "{{question}}" }, { "type": "image_url", "image_url": { "url": "{{url}}" } } ] } ] ``` See the [OpenAI vision example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-vision). ### Generating images OpenAI supports image generation via `openai:image:`. Supported models include: - `gpt-image-2` - OpenAI's latest image generation model with flexible custom sizes - `gpt-image-1.5` - High-quality GPT Image model with strong instruction following - `gpt-image-1` - High-quality image generation model - `gpt-image-1-mini` - Cost-efficient version of GPT Image 1 `dall-e-3` and `dall-e-2` remain available for backward compatibility, but use `gpt-image-2`, `gpt-image-1.5`, `gpt-image-1`, or `gpt-image-1-mini` for new evals. The `openai:image` provider uses the Image API generations endpoint. It supports text-to-image generation; image edit/reference inputs (`image`, `mask`, `input_fidelity`), streaming (`stream`/`partial_images`), and variations are not implemented in this provider. See the [OpenAI image generation example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-images). #### GPT Image 2 GPT Image 2 is OpenAI's latest image generation model. It supports the standard GPT Image output controls plus custom sizes that satisfy OpenAI's dimensional constraints. ```yaml title="promptfooconfig.yaml" providers: - id: openai:image:gpt-image-2 config: size: 1024x1024 # auto, common sizes, or custom WIDTHxHEIGHT quality: low # low, medium, high, or auto background: opaque # opaque or auto output_format: webp # png, jpeg, or webp output_compression: 80 # 0-100, only set with jpeg/webp moderation: auto # auto or low n: 1 # 1-10 images user: promptfoo-user # optional end-user identifier ``` | Parameter | Description | Options | | -------------------- | ---------------------------------- | ------------------------------------------------------------------------------------------- | | `size` | Image dimensions | `auto`, common sizes like `1024x1024`, `1024x1536`, `1536x1024`, or valid custom dimensions | | `quality` | Rendering quality | `low`, `medium`, `high`, `auto` | | `background` | Background handling | `opaque`, `auto` (`transparent` is not supported) | | `output_format` | Output image format | `png`, `jpeg`, `webp` | | `output_compression` | Compression level (jpeg/webp only) | `0-100` | | `moderation` | Content moderation strictness | `auto`, `low` | | `n` | Number of images to generate | `1-10` | | `user` | Optional end-user identifier | Any string | For custom `size` values, both dimensions must be multiples of 16, the maximum edge must be no larger than 3840px, the long edge to short edge ratio must be at most 3:1, and total pixels must be between 655,360 and 8,294,400. **Pricing:** | Quality | 1024x1024 | 1024x1536 | 1536x1024 | | ------- | --------- | --------- | --------- | | Low | $0.006 | $0.005 | $0.005 | | Medium | $0.053 | $0.041 | $0.041 | | High | $0.211 | $0.165 | $0.165 | These are output image estimates. Input text tokens may also apply, and OpenAI may return usage data for the request. For GPT Image 2 `quality: auto`, omitted quality, or custom sizes, promptfoo leaves `cost` unset and preserves the returned usage in `tokenUsage`/`metadata.usage` instead of guessing. #### GPT Image 1.5 GPT Image 1.5 is a high-quality image generation model with strong instruction following, prompt adherence, and photorealistic quality. 
It uses token-based pricing for more flexible cost control. ```yaml title="promptfooconfig.yaml" providers: - id: openai:image:gpt-image-1.5 config: size: 1024x1024 # 1024x1024, 1024x1536, 1536x1024, or auto quality: low # low, medium, high, or auto background: transparent # transparent, opaque, or auto output_format: webp # png, jpeg, or webp output_compression: 80 # 0-100, only set with jpeg/webp moderation: auto # auto or low ``` | Parameter | Description | Options | | -------------------- | --------------------------------------- | --------------------------------------------- | | `size` | Image dimensions | `1024x1024`, `1024x1536`, `1536x1024`, `auto` | | `quality` | Rendering quality | `low`, `medium`, `high`, `auto` | | `background` | Background transparency (png/webp only) | `transparent`, `opaque`, `auto` | | `output_format` | Output image format | `png`, `jpeg`, `webp` | | `output_compression` | Compression level (jpeg/webp only) | `0-100` | | `moderation` | Content moderation strictness | `auto`, `low` | **Pricing:** GPT Image 1.5 uses token-based pricing at $5/1M input text tokens, $10/1M output text tokens, $8/1M input image tokens, and $32/1M output image tokens. Estimated costs per image: | Quality | 1024x1024 | 1024x1536 | 1536x1024 | | ------- | --------- | --------- | --------- | | Low | ~$0.064 | ~$0.096 | ~$0.096 | | Medium | ~$0.128 | ~$0.192 | ~$0.192 | | High | ~$0.192 | ~$0.288 | ~$0.288 | #### GPT Image 1 GPT Image 1 is a high-quality image generation model with superior instruction following, text rendering, and real-world knowledge. ```yaml title="promptfooconfig.yaml" providers: - id: openai:image:gpt-image-1 config: size: 1024x1024 # 1024x1024, 1024x1536, 1536x1024, or auto quality: low # low, medium, high, or auto background: transparent # transparent, opaque, or auto output_format: webp # png, jpeg, or webp output_compression: 80 # 0-100, only set with jpeg/webp moderation: auto # auto or low ``` | Parameter | Description | Options | | -------------------- | --------------------------------------- | --------------------------------------------- | | `size` | Image dimensions | `1024x1024`, `1024x1536`, `1536x1024`, `auto` | | `quality` | Rendering quality | `low`, `medium`, `high`, `auto` | | `background` | Background transparency (png/webp only) | `transparent`, `opaque`, `auto` | | `output_format` | Output image format | `png`, `jpeg`, `webp` | | `output_compression` | Compression level (jpeg/webp only) | `0-100` | | `moderation` | Content moderation strictness | `auto`, `low` | **Pricing:** | Quality | 1024x1024 | 1024x1536 | 1536x1024 | | ------- | --------- | --------- | --------- | | Low | $0.011 | $0.016 | $0.016 | | Medium | $0.042 | $0.063 | $0.063 | | High | $0.167 | $0.25 | $0.25 | #### GPT Image 1 Mini GPT Image 1 Mini is a cost-efficient version of GPT Image 1 with the same capabilities at lower cost. 
```yaml title="promptfooconfig.yaml" providers: - id: openai:image:gpt-image-1-mini config: size: 1024x1024 # 1024x1024, 1024x1536, 1536x1024, or auto quality: low # low, medium, high, or auto background: transparent # transparent, opaque, or auto output_format: webp # png, jpeg, or webp output_compression: 80 # 0-100, only set with jpeg/webp moderation: auto # auto or low ``` **Pricing:** | Quality | 1024x1024 | 1024x1536 | 1536x1024 | | ------- | --------- | --------- | --------- | | Low | $0.005 | $0.006 | $0.006 | | Medium | $0.011 | $0.015 | $0.015 | | High | $0.036 | $0.052 | $0.052 | #### Example ```yaml title="promptfooconfig.yaml" # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json prompts: - 'In the style of Van Gogh: {{subject}}' - 'In the style of Dali: {{subject}}' providers: - openai:image:gpt-image-2 tests: - vars: subject: bananas - vars: subject: new york city ``` To display images in the web viewer, wrap vars or outputs in markdown image tags like so: ```markdown ![](/path/to/myimage.png) ``` Then, enable 'Render markdown' under Table Settings. ## Video Generation (Sora) OpenAI supports video generation via `openai:video:`. Supported models include: - `sora-2` - OpenAI's video generation model ($0.10/second) - `sora-2-pro` - Higher quality video generation ($0.30/second) ### Basic Usage ```yaml title="promptfooconfig.yaml" providers: - id: openai:video:sora-2 config: size: 1280x720 # 1280x720, 720x1280, 1792x1024, or 1024x1792 seconds: 8 # Duration: 4, 8, or 12 seconds ``` ### Configuration Options | Parameter | Description | Default | | ---------------------- | ------------------------------------------------------------------- | ---------- | | `size` | Video dimensions (`1280x720`, `720x1280`, `1792x1024`, `1024x1792`) | `1280x720` | | `seconds` | Duration in seconds (4, 8, or 12) | `8` | | `input_reference` | Base64 image data or file path for image-to-video | - | | `remix_video_id` | ID of a previous Sora video to remix | - | | `poll_interval_ms` | Polling interval for job status | `10000` | | `max_poll_time_ms` | Maximum time to wait for video generation | `600000` | | `download_thumbnail` | Download thumbnail preview | `true` | | `download_spritesheet` | Download spritesheet preview | `true` | ### Example Configuration ```yaml title="promptfooconfig.yaml" prompts: - 'A cinematic shot of: {{scene}}' providers: - id: openai:video:sora-2 config: size: 1280x720 seconds: 4 - id: openai:video:sora-2-pro config: size: 720x1280 seconds: 8 tests: - vars: scene: a cat riding a skateboard through a city - vars: scene: waves crashing on a beach at sunset ``` ### Image-to-Video Generation Generate videos starting from a source image using `input_reference`: ```yaml title="promptfooconfig.yaml" providers: - id: openai:video:sora-2 config: input_reference: file://assets/start-image.png seconds: 4 prompts: - 'Animate this image: the character slowly walks forward' ``` The `input_reference` accepts either a `file://` path or base64-encoded image data. ### Video Remixing Remix an existing Sora video with a new prompt using `remix_video_id`: ```yaml title="promptfooconfig.yaml" providers: - id: openai:video:sora-2 config: remix_video_id: video_abc123def456 prompts: - 'Make the scene more dramatic with stormy weather' ``` The `remix_video_id` is the video ID returned from a previous Sora generation (found in `response.video.id`). :::note Remixed videos are not cached since each remix produces unique results even with the same prompt. 
::: ### Viewing Generated Videos Videos are automatically displayed in the web viewer with playback controls. The viewer shows: - Video player with controls - Video metadata (model, size, duration) - Thumbnail preview (if enabled) Videos are stored in promptfoo's media storage (`~/.promptfoo/media/`) and served via the web interface. ### Pricing | Model | Cost per Second | | ---------- | --------------- | | sora-2 | $0.10 | | sora-2-pro | $0.30 | ## Web Search Support The OpenAI Responses API supports both the standard `web_search` tool and the `web_search_preview` tool family. The preview tool enables the `search-rubric` assertion type and remains required for deep research models. These tools let models search the web for current information and verify facts. ### Enabling Web Search To enable web search with the OpenAI Responses API, use the `openai:responses` provider format and add either the standard `web_search` tool or the preview `web_search_preview` tool to your configuration: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5.1 config: tools: - type: web_search ``` ### Using Web Search Assertions The `search-rubric` assertion type uses web search to quickly verify current information: - Real-time data (weather, stock prices, news) - Current events and statistics - Time-sensitive information - Quick fact verification Example configuration: ```yaml title="promptfooconfig.yaml" # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json prompts: - 'What is the current temperature in {{city}}?' providers: - id: openai:responses:gpt-5.1 config: tools: - type: web_search_preview tests: - vars: city: New York assert: - type: search-rubric value: Current temperature in New York City ``` ### Cost Considerations :::info Web search calls in the Responses API are billed separately from normal tokens: - The web search tool costs **$10 per 1,000 calls** for the standard tool and **$10-25 per 1,000 calls** for preview variants, plus any search content tokens where applicable - Only `web_search_call.action.type: search` incurs a search fee; `open_page` and `find_in_page` are observable actions but are not charged as separate searches - Each search-rubric assertion may perform one or more searches - Caching is enabled by default; use `--no-cache` to force fresh searches during development - See [OpenAI's pricing page](https://openai.com/api/pricing/) for current rates ::: ### Best Practices 1. **Use specific search queries**: More specific queries yield better verification results 2. **Use caching**: Caching is enabled by default; results are reused to avoid repeated searches 3. **Use appropriate models**: gpt-5.1-mini is recommended for cost-effective web search 4. **Monitor usage**: Track API costs, especially in CI/CD pipelines For more details on using search-rubric assertions, see the [Search-Rubric documentation](/docs/configuration/expected-outputs/model-graded/search-rubric). ## Tool Calling ### Using tools To set `tools` on an OpenAI provider, use the provider's `config` key. The model may return tool calls in two formats: 1. An array of tool calls: `[{type: 'function', function: {...}}]` 2. 
A message with tool calls: `{content: '...', tool_calls: [{type: 'function', function: {...}}]}`

Tools can be defined inline or loaded from an external file:

:::info Supported file formats

Tools can be loaded from external files in multiple formats:

```yaml
# Static data files
tools: file://./tools.yaml
tools: file://./tools.json

# Dynamic tool definitions from code (requires function name)
tools: file://./tools.py:get_tools
tools: file://./tools.js:getTools
tools: file://./tools.ts:getTools
```

Python and JavaScript files must export a function that returns the tool definitions array. The function can be synchronous or asynchronous.

**Asynchronous example:**

```javascript
// tools.js - Fetch tool definitions from API at runtime
export async function getTools() {
  const apiKey = process.env.INTERNAL_API_KEY;
  const response = await fetch('https://api.internal.com/tool-definitions', {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const tools = await response.json();
  return tools;
}
```

:::

```yaml title="promptfooconfig.yaml"
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - file://prompt.txt
providers:
  - id: openai:chat:gpt-5.4-mini
    // highlight-start
    config:
      # Load tools from external file
      tools: file://./weather_tools.yaml
      # Or define inline
      tools:
        [
          {
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "description": "Get the current weather in a given location",
              "parameters": {
                "type": "object",
                "properties": {
                  "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                  },
                  "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
                },
                "required": ["location"]
              }
            }
          }
        ]
      tool_choice: 'auto'
    // highlight-end
tests:
  - vars:
      city: Boston
    assert:
      - type: is-json
      - type: is-valid-openai-tools-call
      - type: javascript
        value: output[0].function.name === 'get_current_weather'
      - type: javascript
        value: JSON.parse(output[0].function.arguments).location === 'Boston, MA'
  - vars:
      city: New York
  # ...
```
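The config above loads tool definitions from `weather_tools.yaml`. As a rough sketch (assuming the same schema as the inline definition), that file would contain a YAML list of OpenAI tool objects:

```yaml title="weather_tools.yaml"
- type: function
  function:
    name: get_current_weather
    description: Get the current weather in a given location
    parameters:
      type: object
      properties:
        location:
          type: string
          description: The city and state, e.g. San Francisco, CA
        unit:
          type: string
          enum: [celsius, fahrenheit]
      required: [location]
```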
Sometimes OpenAI function calls don't match `tools` schemas. Use [`is-valid-openai-function-call`](/docs/configuration/expected-outputs/deterministic/#is-valid-openai-function-call) or [`is-valid-openai-tools-call`](/docs/configuration/expected-outputs/deterministic/#is-valid-openai-tools-call) assertions to enforce an exact schema match between the model's calls and your tool definitions.

To further test `tools` definitions, you can use the `javascript` assertion and/or `transform` directives. For example:

```yaml title="promptfooconfig.yaml"
tests:
  - vars:
      city: Boston
    assert:
      - type: is-json
      - type: is-valid-openai-tools-call
      - type: javascript
        value: output[0].function.name === 'get_current_weather'
      - type: javascript
        value: JSON.parse(output[0].function.arguments).location === 'Boston, MA'
  - vars:
      city: New York
    # transform returns only the 'name' property
    transform: output[0].function.name
    assert:
      - type: is-json
      - type: similar
        value: NYC
```

:::tip
Functions can use variables from test cases:

```js
{
  type: "function",
  function: {
    description: "Get temperature in {{city}}"
    // ...
  }
}
```

They can also include functions that dynamically reference vars:

```js
{
  type: "function",
  function: {
    name: "get_temperature",
    parameters: {
      type: "object",
      properties: {
        unit: {
          type: "string",
          // highlight-start
          enum: (vars) => vars.units,
          // highlight-end
        }
      },
    }
  }
}
```

:::

### Using functions

> `functions` and `function_call` are deprecated in favor of `tools` and `tool_choice`; see details in the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-function_call).

Use the `functions` config to define custom functions. Each function should be an object with a `name`, optional `description`, and `parameters`. For example:

```yaml title="promptfooconfig.yaml"
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - file://prompt.txt
providers:
  - id: openai:chat:gpt-5.4-mini
    // highlight-start
    config:
      functions:
        [
          {
            'name': 'get_current_weather',
            'description': 'Get the current weather in a given location',
            'parameters':
              {
                'type': 'object',
                'properties':
                  {
                    'location':
                      {
                        'type': 'string',
                        'description': 'The city and state, e.g. San Francisco, CA',
                      },
                    'unit': { 'type': 'string', 'enum': ['celsius', 'fahrenheit'] },
                  },
                'required': ['location'],
              },
          },
        ]
    // highlight-end
tests:
  - vars:
      city: Boston
    assert:
      // highlight-next-line
      - type: is-valid-openai-function-call
  - vars:
      city: New York
  # ...
```

Sometimes OpenAI function calls don't match `functions` schemas. Use [`is-valid-openai-function-call`](/docs/configuration/expected-outputs/deterministic#is-valid-openai-function-call) assertions to enforce an exact schema match between function calls and the function definition.

To further test function call definitions, you can use the `javascript` assertion and/or `transform` directives. For example:

```yaml title="promptfooconfig.yaml"
tests:
  - vars:
      city: Boston
    assert:
      - type: is-valid-openai-function-call
      - type: javascript
        value: output.name === 'get_current_weather'
      - type: javascript
        value: JSON.parse(output.arguments).location === 'Boston, MA'
  - vars:
      city: New York
    # transform returns only the 'name' property for this test case
    transform: output.name
    assert:
      - type: is-json
      - type: similar
        value: NYC
```

### Loading tools/functions from a file

Instead of duplicating function definitions across multiple configurations, you can reference an external YAML (or JSON) file that contains your functions. This allows you to maintain a single source of truth, which is particularly useful if you have multiple versions or frequently changing definitions.

:::tip
Tool definitions can be loaded from JSON, YAML, Python, or JavaScript files. For Python/JS files, specify a function name that returns the tool definitions: `file://tools.py:get_tools`
:::

To load your functions from a file, specify the file path in your provider configuration like so:

```yaml title="promptfooconfig.yaml"
providers:
  - file://./path/to/provider_with_function.yaml
```

You can also use a pattern to load multiple files:

```yaml title="promptfooconfig.yaml"
providers:
  - file://./path/to/provider_*.yaml
```

Here's an example of how your `provider_with_function.yaml` might look:

```yaml title="provider_with_function.yaml"
id: openai:chat:gpt-5.4-mini
config:
  functions:
    - name: get_current_weather
      description: Get the current weather in a given location
      parameters:
        type: object
        properties:
          location:
            type: string
            description: The city and state, e.g. 
San Francisco, CA unit: type: string enum: - celsius - fahrenheit description: The unit in which to return the temperature required: - location ``` ## Using `response_format` Promptfoo supports the `response_format` parameter, which allows you to specify the expected output format. `response_format` can be included in the provider config, or in the prompt config. #### Prompt config example ```yaml title="promptfooconfig.yaml" prompts: - label: 'Prompt #1' raw: 'You are a helpful math tutor. Solve {{problem}}' config: response_format: type: json_schema json_schema: ... ``` #### Provider config example ```yaml title="promptfooconfig.yaml" providers: - id: openai:chat:gpt-5.4-mini config: response_format: type: json_schema json_schema: ... ``` #### External file references To make it easier to manage large JSON schemas, external file references are supported for `response_format` in both Chat and Responses APIs. This is particularly useful for: - Reusing complex JSON schemas across multiple configurations - Managing large schemas in separate files for better organization - Version controlling schemas independently from configuration files ```yaml config: response_format: file://./path/to/response_format.json ``` The external file should contain the complete `response_format` configuration object: ```json title="response_format.json" { "type": "json_schema", "name": "event_extraction", "schema": { "type": "object", "properties": { "event_name": { "type": "string" }, "date": { "type": "string" }, "location": { "type": "string" } }, "required": ["event_name", "date", "location"], "additionalProperties": false } } ``` You can also use nested file references for the schema itself, which is useful for sharing schemas across multiple response formats: ```json title="response_format.json" { "type": "json_schema", "name": "event_extraction", "schema": "file://./schemas/event-schema.json" } ``` Variable rendering is supported in file paths using Nunjucks syntax: ```yaml config: response_format: file://./schemas/{{ schema_name }}.json ``` For a complete example with the Chat API, see the [OpenAI Structured Output example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-structured-output) or initialize it with: ```bash npx promptfoo@latest init --example openai-structured-output ``` For an example with the Responses API, see the [OpenAI Responses API example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-responses) and run: ```bash npx promptfoo@latest init --example openai-responses cd openai-responses npx promptfoo@latest eval -c promptfooconfig.external-format.yaml ``` #### Per-test structured output You can use different JSON schemas for different test cases using the `test.options` field. This allows a single prompt to produce different structured output formats depending on the test: ```yaml title="promptfooconfig.yaml" prompts: - 'Answer this question: {{question}}' providers: - openai:gpt-4o-mini # Parse JSON output so assertions can access properties directly defaultTest: options: transform: JSON.parse(output) tests: # Math problems use math schema - vars: question: 'What is 15 * 7?' 
options: response_format: file://./schemas/math-response-format.json assert: - type: javascript value: output.answer === 105 # Comparison questions use comparison schema - vars: question: 'Compare apples and oranges' options: response_format: file://./schemas/comparison-response-format.json assert: - type: javascript value: output.winner === 'item1' || output.winner === 'item2' || output.winner === 'tie' ``` Each schema file contains the complete `response_format` object. See the [per-test schema example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-structured-output/per-test-schema.yaml) for a full working configuration. ## Supported environment variables These OpenAI-related environment variables are supported: | Variable | Description | | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------ | | `OPENAI_TEMPERATURE` | Temperature model parameter, defaults to 0. Not supported by reasoning-capable models. | | `OPENAI_MAX_TOKENS` | `max_tokens` parameter, defaults to 1024. Used for non-reasoning requests. | | `OPENAI_MAX_COMPLETION_TOKENS` | `max_completion_tokens` parameter, defaults to 1024. Used by reasoning-capable chat/responses requests where applicable. | | `OPENAI_API_HOST` | Hostname to use (proxy-compatible). Takes precedence over both `OPENAI_API_BASE_URL` and `OPENAI_BASE_URL`. | | `OPENAI_API_BASE_URL` | Full base URL (protocol + host + optional port/path). Takes precedence over `OPENAI_BASE_URL`. | | `OPENAI_BASE_URL` | Alternate full base URL. Used if `OPENAI_API_BASE_URL` is not set. | | `OPENAI_API_KEY` | OpenAI API key. | | `OPENAI_ORGANIZATION` | The OpenAI organization key to use. | | `PROMPTFOO_DELAY_MS` | Number of milliseconds to delay between API calls. Useful if you are hitting OpenAI rate limits (defaults to 0). | | `PROMPTFOO_REQUEST_BACKOFF_MS` | Base number of milliseconds to backoff and retry if a request fails (defaults to 5000). | ## Evaluating assistants To test out an Assistant via OpenAI's Assistants API, first create an Assistant in the [API playground](https://platform.openai.com/playground). Set functions, code interpreter, and files for retrieval as necessary. Then, include the assistant in your config: ```yaml prompts: - 'Write a tweet about {{topic}}' providers: - openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgZ tests: - vars: topic: bananas # ... ``` Code interpreter, function calls, and retrievals will be included in the output alongside chat messages. Note that the evaluator creates a new thread for each eval. The following properties can be overwritten in provider config: - `model` - OpenAI model to use - `instructions` - System prompt - `tools` - Enabled [tools](https://platform.openai.com/docs/api-reference/runs/createRun) - `thread.messages` - A list of message objects that the thread is created with. 
- `temperature` - Temperature for the model - `toolChoice` - Controls whether the AI should use a tool - `tool_resources` - Tool resources to include in the thread - see [Assistant v2 tool resources](https://platform.openai.com/docs/assistants/migration) - `attachments` - File attachments to include in messages - see [Assistant v2 attachments](https://platform.openai.com/docs/assistants/migration) Here's an example of a more detailed config: ```yaml title="promptfooconfig.yaml" # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json prompts: - 'Write a tweet about {{topic}}' providers: // highlight-start - id: openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgZ config: model: gpt-5 instructions: "You always speak like a pirate" temperature: 0.2 toolChoice: type: file_search tools: - type: code_interpreter - type: file_search thread: messages: - role: user content: "Hello world" - role: assistant content: "Greetings from the high seas" // highlight-end tests: - vars: topic: bananas # ... ``` ### Automatically handling function tool calls You can specify JavaScript callbacks that are automatically called to create the output of a function tool call. This requires defining your config in a JavaScript file instead of YAML. ```js module.exports = /** @type {import('promptfoo').TestSuiteConfig} */ ({ prompts: 'Please add the following numbers together: {{a}} and {{b}}', providers: [ { id: 'openai:assistant:asst_fEhNN3MClMamLfKLkIaoIpgZ', config: { model: 'gpt-5', instructions: 'You can add two numbers together using the `addNumbers` tool', tools: [ { type: 'function', function: { name: 'addNumbers', description: 'Add two numbers together', parameters: { type: 'object', properties: { a: { type: 'number' }, b: { type: 'number' }, }, required: ['a', 'b'], additionalProperties: false, }, strict: true, }, }, ], /** * Map of function tool names to function callback. */ functionToolCallbacks: { // this function should accept a JSON-parsed value, and return a string // or a `Promise`. addNumbers: (parameters) => { const { a, b } = parameters; return JSON.stringify(a + b); }, }, }, }, ], tests: [ { vars: { a: 5, b: 6 }, }, ], }); ``` ## Audio capabilities OpenAI models with audio support (like `gpt-audio-1.5`, `gpt-audio`, `gpt-audio-mini`, `gpt-4o-audio-preview` and `gpt-4o-mini-audio-preview`) can process audio inputs and generate audio outputs. This enables testing speech-to-text, text-to-speech, and speech-to-speech capabilities. **Available audio models:** - `gpt-audio-1.5` - Flagship audio model ($2.50/$10 per 1M text tokens, $32/$64 per 1M audio tokens) - `gpt-audio` - General audio model ($2.50/$10 per 1M text tokens, $40/$80 per 1M audio tokens) - `gpt-audio-mini` - Cost-efficient audio model ($0.60/$2.40 per 1M text tokens, $10/$20 per 1M audio tokens) - `gpt-audio-mini-2025-12-15` - Dated snapshot of `gpt-audio-mini` - `gpt-4o-audio-preview` - Preview audio model - `gpt-4o-mini-audio-preview` - Preview mini audio model ### Using audio inputs You can include audio files in your prompts using the following format: ```json title="audio-input.json" [ { "role": "user", "content": [ { "type": "text", "text": "You are a helpful customer support agent. Listen to the customer's request and respond with a helpful answer." 
}, { "type": "input_audio", "input_audio": { "data": "{{audio_file}}", "format": "mp3" } } ] } ] ``` With a corresponding configuration: ```yaml title="promptfooconfig.yaml" prompts: - id: file://audio-input.json label: Audio Input providers: - id: openai:chat:gpt-4o-audio-preview config: modalities: ['text'] # also supports 'audio' tests: - vars: audio_file: file://assets/transcript1.mp3 assert: - type: llm-rubric value: Resolved the customer's issue ``` Supported audio file formats include WAV, MP3, OGG, AAC, M4A, and FLAC. ### Audio configuration options The audio configuration supports these parameters: | Parameter | Description | Default | Options | | --------- | ------------------------------ | ------- | --------------------------------------- | | `voice` | Voice for audio generation | alloy | alloy, echo, fable, onyx, nova, shimmer | | `format` | Audio format to generate | wav | wav, mp3, opus, aac | | `speed` | Speaking speed multiplier | 1.0 | Any number between 0.25 and 4.0 | | `bitrate` | Bitrate for compressed formats | - | e.g., "128k", "256k" | In the web UI, audio outputs display with an embedded player and transcript. For a complete working example, see the [OpenAI audio example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-audio) or initialize it with: ```bash npx promptfoo@latest init --example openai-audio ``` ### Audio transcription OpenAI provides dedicated transcription models for converting speech to text. These models charge per minute of audio rather than per token. **Available transcription models:** | Model | Description | Cost per minute | | -------------------------------------- | ------------------------------------ | --------------- | | `whisper-1` | Original Whisper transcription model | $0.006 | | `gpt-4o-transcribe` | GPT-4o optimized for transcription | $0.006 | | `gpt-4o-mini-transcribe` | Faster, more cost-effective option | $0.003 | | `gpt-4o-mini-transcribe-2025-12-15` | Dated mini transcription snapshot | $0.003 | | `gpt-4o-transcribe-diarize` | Identifies different speakers | $0.006 | | `gpt-4o-transcribe-diarize-2025-10-15` | Dated diarization snapshot | $0.006 | To use transcription models, specify the provider format `openai:transcription:`: ```yaml title="promptfooconfig.yaml" prompts: - file://sample-audio.mp3 providers: - id: openai:transcription:whisper-1 config: language: en # Optional: specify language for better accuracy temperature: 0 # Optional: 0 for more deterministic output - id: openai:transcription:gpt-4o-transcribe config: language: en prompt: This is a technical discussion about AI and machine learning. 
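  # Hypothetical third entry for cost comparison; assumes gpt-4o-mini-transcribe
  # accepts the same options as gpt-4o-transcribe (see the models table above)
  - id: openai:transcription:gpt-4o-mini-transcribe
    config:
      language: en
      temperature: 0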
  - id: openai:transcription:gpt-4o-transcribe-diarize
    config:
      num_speakers: 2 # Optional: expected number of speakers
      speaker_labels: ['Alice', 'Bob'] # Optional: provide speaker names

tests:
  - assert:
      - type: contains
        value: expected transcript content
```

#### Transcription configuration options

| Parameter                 | Description                                | Options                |
| ------------------------- | ------------------------------------------ | ---------------------- |
| `language`                | Language of the audio (ISO-639-1)          | e.g., 'en', 'es', 'fr' |
| `prompt`                  | Context to improve transcription accuracy  | Any text string        |
| `temperature`             | Controls randomness (0-1)                  | Number between 0 and 1 |
| `timestamp_granularities` | Get word or segment-level timestamps       | ['word', 'segment']    |
| `num_speakers`            | Expected number of speakers (diarization)  | Number                 |
| `speaker_labels`          | Names for speakers (diarization)           | Array of strings       |

Supported audio formats include MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM.

#### Diarization example

The diarization model identifies different speakers in the audio:

```yaml title="promptfooconfig.yaml"
prompts:
  - file://interview.mp3

providers:
  - id: openai:transcription:gpt-4o-transcribe-diarize
    config:
      num_speakers: 2
      speaker_labels: ['Interviewer', 'Guest']

tests:
  - assert:
      - type: contains
        value: Interviewer
      - type: contains
        value: Guest
```

For a complete working example, see the [OpenAI audio transcription example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-audio-transcription) or initialize it with:

```bash
npx promptfoo@latest init --example openai-audio-transcription
```

## Realtime API Models

The Realtime API allows for real-time communication with models like `gpt-realtime-1.5` and `gpt-realtime` using WebSockets, supporting both text and audio inputs/outputs with streaming responses.

### Supported Realtime Models

- `gpt-realtime-1.5` - Flagship realtime model ($4/$16 per 1M text tokens, $32/$64 per 1M audio tokens)
- `gpt-realtime` - General-availability realtime model ($4/$16 per 1M text tokens, $32/$64 per 1M audio tokens)
- `gpt-realtime-mini` - Cost-efficient realtime model ($0.60/$2.40 per 1M text tokens, $10/$20 per 1M audio tokens)
- `gpt-realtime-mini-2025-12-15` - Dated snapshot of `gpt-realtime-mini`
- `gpt-4o-realtime-preview-2024-12-17` and `gpt-4o-mini-realtime-preview-2024-12-17` - Legacy preview models

### Using Realtime API

To use the OpenAI Realtime API, use the provider format `openai:realtime:`:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:realtime:gpt-realtime-1.5
    config:
      modalities: ['text', 'audio']
      voice: 'alloy'
      instructions: 'You are a helpful assistant.'
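      # Optional audio codecs (see the options table below); 'pcm16' is the
      # default and is shown here explicitly for illustration
      input_audio_format: 'pcm16'
      output_audio_format: 'pcm16'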
temperature: 0.7 websocketTimeout: 60000 # 60 seconds # Optional: point to custom/proxy endpoints; WS URL is derived automatically # https:// → wss://, http:// → ws:// # Example: wss://my-custom-api.com/v1/realtime # Example: ws://localhost:8080/v1/realtime # apiBaseUrl: 'https://my-custom-api.com/v1' ``` ### Realtime-specific Configuration Options The Realtime API configuration supports these parameters in addition to standard OpenAI parameters: | Parameter | Description | Default | Options | | ---------------------------- | ------------------------------------------------------------------------------- | ---------------------- | ------------------------------------------------------------------- | | `modalities` | Types of content the model can process and generate | ['text', 'audio'] | 'text', 'audio' | | `voice` | Voice for audio generation | 'alloy' | alloy, ash, ballad, coral, echo, sage, shimmer, verse, cedar, marin | | `instructions` | System instructions for the model | 'You are a helpful...' | Any text string | | `input_audio_format` | Format of audio input | 'pcm16' | 'pcm16', 'g711_ulaw', 'g711_alaw' | | `output_audio_format` | Format of audio output | 'pcm16' | 'pcm16', 'g711_ulaw', 'g711_alaw' | | `websocketTimeout` | Timeout for WebSocket connection (milliseconds) | 30000 | Any number | | `max_response_output_tokens` | Maximum tokens in model response. Invalid Realtime values fall back to `'inf'`. | 'inf' | Integer from 1-4096 or 'inf' | | `tools` | Array of tool definitions for function calling | [] | Array of tool objects | | `tool_choice` | Controls how tools are selected | 'auto' | 'none', 'auto', 'required', or object | #### Custom endpoints and proxies (Realtime) The Realtime provider respects the same base URL configuration as other OpenAI providers. The WebSocket URL is derived from `getApiUrl()` by converting protocols: `https://` → `wss://` and `http://` → `ws://`. You can use this to target Azure-compatible endpoints, proxies, or local/dev servers: ```yaml providers: - id: openai:realtime:gpt-realtime-1.5 config: apiBaseUrl: 'https://my-custom-api.com/v1' # connects to wss://my-custom-api.com/v1/realtime modalities: ['text'] temperature: 0.7 ``` Environment variables `OPENAI_API_BASE_URL` and `OPENAI_BASE_URL` also apply to Realtime WebSocket connections. ### Function Calling with Realtime API The Realtime API supports function calling via tools, similar to the Chat API. Here's an example configuration: ```yaml title="promptfooconfig.yaml" providers: - id: openai:realtime:gpt-realtime-1.5 config: tools: - type: function name: get_weather description: Get the current weather for a location parameters: type: object properties: location: type: string description: The city and state, e.g. 
San Francisco, CA required: ['location'] tool_choice: 'auto' ``` ### Complete Example For a complete working example that demonstrates the Realtime API capabilities, see the [OpenAI Realtime API example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-realtime) or initialize it with: ```bash npx promptfoo@latest init --example openai-realtime ``` This example includes: - Basic single-turn interactions with the Realtime API - Multi-turn conversations with persistent context - Conversation threading with separate conversation IDs - JavaScript prompt function for properly formatting messages - Function calling with the Realtime API - Detailed documentation on handling content types correctly ### Input and Message Format When using the Realtime API with promptfoo, you can specify the prompt in JSON format: ```json title="realtime-input.json" [ { "role": "user", "content": [ { "type": "text", "text": "{{question}}" } ] } ] ``` The Realtime API supports the same multimedia formats as the Chat API, allowing you to include images and audio in your prompts. ### Multi-Turn Conversations The Realtime API supports multi-turn conversations with persistent context. For implementation details and examples, see the [OpenAI Realtime example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-realtime), which demonstrates both single-turn interactions and conversation threading using the `conversationId` metadata property. > **Important**: When implementing multi-turn conversations, use `type: "input_text"` for user inputs and `type: "text"` for assistant responses. ## Responses API OpenAI's Responses API is the most advanced interface for generating model responses, supporting text and image inputs, function calling, and conversation state. It provides access to OpenAI's full suite of features including reasoning models like o1, o3, and o4 series. 
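Conversation state is carried with the `store` and `previous_response_id` options (both documented in the configuration table below). A minimal sketch, where `resp_abc123` is a hypothetical ID returned by an earlier stored response:

```yaml
providers:
  - id: openai:responses:gpt-5
    config:
      store: true # persist this response so its ID can be referenced later
      previous_response_id: resp_abc123 # hypothetical ID from a prior stored response
```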
### Supported Responses Models

The Responses API supports a wide range of models, including:

- `gpt-5.5` - GPT-5.5 model ($5/$30 per 1M tokens)
- `gpt-5.5-2026-04-23` - Dated snapshot of gpt-5.5
- `gpt-5.5-pro` - Premium GPT-5.5 model ($30/$180 per 1M tokens)
- `gpt-5.5-pro-2026-04-23` - Dated snapshot of gpt-5.5-pro
- `gpt-5.4` - GPT-5.4 model ($2.50/$15 per 1M tokens)
- `gpt-5.4-2026-03-05` - Dated snapshot of gpt-5.4
- `gpt-5.4-mini` - Smaller GPT-5.4 model ($0.75/$4.50 per 1M tokens)
- `gpt-5.4-mini-2026-03-17` - Dated snapshot of gpt-5.4-mini
- `gpt-5.4-nano` - Lowest-cost GPT-5.4 model ($0.20/$1.25 per 1M tokens)
- `gpt-5.4-nano-2026-03-17` - Dated snapshot of gpt-5.4-nano
- `gpt-5.4-pro` - Premium GPT-5.4 model ($30/$180 per 1M tokens)
- `gpt-5.4-pro-2026-03-05` - Dated snapshot of gpt-5.4-pro
- `gpt-5` - Earlier GPT-5 family model
- `gpt-5-chat` - GPT-5 chat alias
- `gpt-5-codex` - GPT-5 based coding model optimized for code generation
- `gpt-5-pro` - Premium GPT-5 model with highest reasoning capability ($15/$120 per 1M tokens)
- `gpt-5.1` - GPT-5.1 base model
- `gpt-5.1-chat-latest` - GPT-5.1 chat alias
- `gpt-5.2-chat-latest` - GPT-5.2 chat-optimized alias
- `gpt-5.2-codex` - GPT-5.2 coding variant
- `gpt-5.2-pro` - Premium GPT-5.2 model with highest reasoning capability ($15/$120 per 1M tokens)
- `gpt-5.3-chat-latest` - GPT-5.3 chat alias
- `o1` - Powerful reasoning model
- `o1-mini` - Smaller, more affordable reasoning model
- `o1-pro` - Enhanced reasoning model with more compute
- `o3` - Powerful reasoning model for complex tasks
- `o3-mini` - Smaller, more affordable reasoning model
- `o3-pro` - Highest-tier reasoning model
- `o4-mini` - Latest fast, cost-effective reasoning model
- `codex-mini-latest` - Fast reasoning model optimized for the Codex CLI

### Using the Responses API

To use the OpenAI Responses API, use the provider format `openai:responses:`:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:gpt-5
    config:
      temperature: 0.7
      max_output_tokens: 500
      instructions: 'You are a helpful, creative AI assistant.'
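      # Optional extras from the configuration table below
      metadata:
        team: 'evals' # key-value tags attached to the response
      truncation: 'auto' # let the API drop middle context on overflow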
``` ### Responses-specific Configuration Options The Responses API configuration supports these parameters in addition to standard OpenAI parameters: | Parameter | Description | Default | Options | | ---------------------- | -------------------------------------------------------------- | ---------- | ----------------------------------- | | `instructions` | System instructions for the model | None | Any text string | | `include` | Additional response payloads to return, such as search results | None | Array of OpenAI include values | | `max_output_tokens` | Maximum tokens to generate in the response | 1024 | Any number | | `metadata` | Key-value pairs attached to the model response | None | Map of string keys to string values | | `parallel_tool_calls` | Allow model to run tool calls in parallel | true | Boolean | | `previous_response_id` | ID of a previous response for multi-turn context | None | String | | `store` | Whether to store the response for later retrieval | true | Boolean | | `truncation` | Strategy to handle context window overflow | 'disabled' | 'auto', 'disabled' | | `reasoning` | Configuration for reasoning models | None | Object with `effort` field | ### MCP (Model Context Protocol) Support The Responses API supports OpenAI's MCP integration, allowing models to use remote MCP servers to perform tasks. MCP tools enable access to external services and APIs through a standardized protocol. #### Basic MCP Configuration To use MCP tools with the Responses API, add them to the `tools` array: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5 config: tools: - type: mcp server_label: deepwiki server_url: https://mcp.deepwiki.com/mcp require_approval: never ``` #### MCP Tool Configuration Options | Parameter | Description | Required | Options | | ------------------ | --------------------------------------- | -------- | ---------------------------------------- | | `type` | Tool type (must be 'mcp') | Yes | 'mcp' | | `server_label` | Label to identify the MCP server | Yes | Any string | | `server_url` | URL of the remote MCP server | Yes | Valid URL | | `require_approval` | Approval settings for tool calls | No | 'never' or object with approval settings | | `allowed_tools` | Specific tools to allow from the server | No | Array of tool names | | `headers` | Custom headers for authentication | No | Object with header key-value pairs | #### Authentication with MCP Servers Most MCP servers require authentication. Use the `headers` parameter to provide API keys or tokens: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5 config: tools: - type: mcp server_label: stripe server_url: https://mcp.stripe.com headers: Authorization: 'Bearer sk-test_...' require_approval: never ``` #### Filtering MCP Tools To limit which tools are available from an MCP server, use the `allowed_tools` parameter: ```yaml title="promptfooconfig.yaml" providers: - id: openai:responses:gpt-5 config: tools: - type: mcp server_label: deepwiki server_url: https://mcp.deepwiki.com/mcp allowed_tools: ['ask_question'] require_approval: never ``` #### Approval Settings By default, OpenAI requires approval before sharing data with MCP servers. 
You can configure approval settings:

```yaml title="promptfooconfig.yaml"
# Never require approval for all tools
providers:
  - id: openai:responses:gpt-5
    config:
      tools:
        - type: mcp
          server_label: deepwiki
          server_url: https://mcp.deepwiki.com/mcp
          require_approval: never
---
# Never require approval for specific tools only
providers:
  - id: openai:responses:gpt-5
    config:
      tools:
        - type: mcp
          server_label: deepwiki
          server_url: https://mcp.deepwiki.com/mcp
          require_approval:
            never:
              tool_names: ["ask_question", "read_wiki_structure"]
```

#### Complete MCP Example

```yaml title="promptfooconfig.yaml"
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - 'What are the transport protocols supported in the MCP specification for {{repo}}?'

providers:
  - id: openai:responses:gpt-5
    config:
      tools:
        - type: mcp
          server_label: deepwiki
          server_url: https://mcp.deepwiki.com/mcp
          require_approval: never
          allowed_tools: ['ask_question']

tests:
  - vars:
      repo: modelcontextprotocol/modelcontextprotocol
    assert:
      - type: contains
        value: 'transport protocols'
```

For a complete working example, see the [OpenAI MCP example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-mcp) or initialize it with:

```bash
npx promptfoo@latest init --example openai-mcp
```

### Reasoning Models

When using reasoning models like `o1`, `o1-pro`, `o3`, `o3-pro`, `o3-mini`, or `o4-mini`, you can control the reasoning effort:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:o3
    config:
      reasoning:
        effort: 'medium' # Can be "low", "medium", or "high"
      max_output_tokens: 1000
```

Reasoning models "think before they answer," generating internal reasoning that isn't visible in the output but counts toward token usage and billing.

### o3 and o4-mini Models

OpenAI offers advanced reasoning models in the o-series. These models provide different performance and efficiency profiles:

- **o3**: Powerful reasoning model, optimized for complex mathematical, scientific, and coding tasks
- **o4-mini**: Efficient reasoning model with strong performance in coding and visual tasks at lower cost

Both models feature:

- Large context window (200,000 tokens)
- High maximum output tokens (100,000 tokens)

For current specifications and pricing information, refer to [OpenAI's pricing page](https://openai.com/pricing).

Example configuration:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:o3
    config:
      reasoning:
        effort: 'high'
      max_output_tokens: 2000
  - id: openai:responses:o4-mini
    config:
      reasoning:
        effort: 'medium'
      max_output_tokens: 1000
```

### Deep Research Models (Responses API Only)

Deep research models (`o3-deep-research`, `o4-mini-deep-research`) are specialized reasoning models designed for complex research tasks that require web search capabilities.
Available models:

- `o3-deep-research` - Most powerful deep research model ($10/1M input, $40/1M output)
- `o3-deep-research-2025-06-26` - Snapshot version
- `o4-mini-deep-research` - Faster, more affordable ($2/1M input, $8/1M output)
- `o4-mini-deep-research-2025-06-26` - Snapshot version

All deep research models:

- **Require** the `web_search_preview` tool to be configured
- Support a 200,000-token context window
- Support up to 100,000 output tokens
- May take 2-10 minutes to complete research tasks
- Use significant tokens for reasoning before generating output

Example configuration:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      max_output_tokens: 50000 # High limit recommended
      tools:
        - type: web_search_preview # Required
```

#### Advanced Configuration

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:o3-deep-research
    config:
      max_output_tokens: 100000
      max_tool_calls: 50 # Limit searches to control cost/latency
      background: true # Recommended for long-running tasks
      store: true # Store conversation for 30 days
      tools:
        - type: web_search_preview # Required
        - type: code_interpreter # Optional: For data analysis
          container:
            type: auto
        - type: mcp # Optional: Connect to private data
          server_label: mycompany_data
          server_url: https://api.mycompany.com/mcp
          require_approval: never # Must be 'never' for deep research
```

#### Response Format

Deep research models return specialized output items:

- **web_search_call**: Web search actions (search, open_page, find_in_page)
- **code_interpreter_call**: Code execution for analysis
- **message**: Final answer with inline citations and annotations

Example response structure:

```json
{
  "output": [
    {
      "type": "web_search_call",
      "action": { "type": "search", "query": "latest AI research papers 2025" }
    },
    {
      "type": "message",
      "content": [
        {
          "type": "output_text",
          "text": "Based on my research...",
          "annotations": [
            {
              "url": "https://arxiv.org/...",
              "title": "Paper Title",
              "start_index": 123,
              "end_index": 145
            }
          ]
        }
      ]
    }
  ]
}
```

#### Best Practices

1. **Use Background Mode**: For production, always use `background: true` to handle long response times
2. **Set High Token Limits**: Use `max_output_tokens: 50000` or higher
3. **Configure Timeouts**: Rely on the automatic 10-minute timeout; raise `PROMPTFOO_EVAL_TIMEOUT_MS` only for longer runs
4. **Control Costs**: Use `max_tool_calls` to limit the number of searches
5. **Enhance Prompts**: Consider using a faster model to clarify/rewrite prompts before deep research

#### Timeout Configuration

Deep research models automatically use appropriate timeouts:

- If `PROMPTFOO_EVAL_TIMEOUT_MS` is set, it will be used for the API call
- Otherwise, deep research models default to a 10-minute timeout (600,000ms)
- Regular models continue to use the standard 5-minute timeout

Example:

```bash
# Set a custom timeout for all evaluations
export PROMPTFOO_EVAL_TIMEOUT_MS=900000 # 15 minutes

# Or set the default API timeout (affects all providers)
export REQUEST_TIMEOUT_MS=600000 # 10 minutes
```

:::tip
Deep research models require high `max_output_tokens` values (50,000+) and long timeouts. The 10-minute timeout is applied automatically; set `PROMPTFOO_EVAL_TIMEOUT_MS` only if your research tasks need more.
:::

:::warning
The `web_search_preview` tool is **required** for deep research models. The provider will return an error if this tool is not configured.
:::

### GPT-5 Pro Timeout Configuration

`gpt-5-pro`, `gpt-5.2-pro`, `gpt-5.4-pro`, and `gpt-5.5-pro` are long-running models that often require extended timeouts due to advanced reasoning.
Like deep research models, these variants automatically receive a 10-minute timeout (600,000ms) instead of the standard 5-minute timeout. **Automatic timeout behavior:** - GPT-5 pro variants automatically get a 10-minute timeout (600,000ms) - **no configuration needed** - If you need longer, set `PROMPTFOO_EVAL_TIMEOUT_MS` (e.g., 900000 for 15 minutes) - `REQUEST_TIMEOUT_MS` is **ignored** for GPT-5 pro variants (the automatic timeout takes precedence) **Most users won't need any timeout configuration** - the automatic 10-minute timeout is sufficient for most GPT-5 pro requests. **If you experience timeouts, configure this:** ```bash # Only if you need more than the automatic 10 minutes export PROMPTFOO_EVAL_TIMEOUT_MS=1200000 # 20 minutes # For infrastructure reliability (recommended) export PROMPTFOO_RETRY_5XX=true # Retry 502 Bad Gateway errors export PROMPTFOO_REQUEST_BACKOFF_MS=10000 # Longer retry backoff # Reduce concurrency to avoid rate limits promptfoo eval --max-concurrency 2 ``` **Common GPT-5 pro errors and solutions:** If you encounter errors with GPT-5 pro models: 1. **Request timed out** - If a GPT-5 pro model needs more than the automatic 10 minutes, set `PROMPTFOO_EVAL_TIMEOUT_MS=1200000` (20 minutes) 2. **502 Bad Gateway** - Enable `PROMPTFOO_RETRY_5XX=true` to retry Cloudflare/OpenAI infrastructure timeouts 3. **getaddrinfo ENOTFOUND** - Transient DNS errors; reduce concurrency with `--max-concurrency 2` 4. **Upstream connection errors** - OpenAI load balancer issues; increase backoff with `PROMPTFOO_REQUEST_BACKOFF_MS=10000` :::tip GPT-5 pro models automatically get a 10-minute timeout. If you see infrastructure errors (502, DNS failures), enable `PROMPTFOO_RETRY_5XX=true` and reduce concurrency. ::: ### Sending Images in Prompts The Responses API supports structured prompts with text, image, and file inputs. Example: ```json title="prompt.json" [ { "type": "message", "role": "user", "content": [ { "type": "input_text", "text": "Describe what you see in this image about {{topic}}." }, { "type": "image_url", "image_url": { "url": "{{image_url}}" } } ] } ] ``` File inputs can use the same structured prompt format with `type: "input_file"`. Set `detail: "high"` when you need higher-quality file rendering; otherwise OpenAI defaults to `low`. ```json title="prompt.json" [ { "type": "message", "role": "user", "content": [ { "type": "input_text", "text": "Summarize the attached contract." }, { "type": "input_file", "file_id": "file_abc123", "detail": "high" } ] } ] ``` ### Prompt Caching and Included Tool Results Use `prompt_cache_key` for stable repeated prefixes and `prompt_cache_retention: 24h` when you want extended prompt caching. GPT-5.5, GPT-5.5 Pro, and future Responses models require extended retention, so `prompt_cache_retention: in_memory` will fail for those models. The `include` option requests extra structured payloads in the raw Responses object. 
For example, `web_search_call.results` returns search results when you need to inspect them in assertions or downstream tooling:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:gpt-5.5
    config:
      prompt_cache_key: repeated-policy-prefix
      prompt_cache_retention: 24h
      include:
        - web_search_call.results
        - reasoning.encrypted_content
```

### Function Calling

The Responses API supports tool and function calling, similar to the Chat API:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:responses:gpt-5
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Get the current weather for a location
            parameters:
              type: object
              properties:
                location:
                  type: string
                  description: The city and state, e.g. San Francisco, CA
              required: ['location']
      tool_choice: 'auto'
```

### Using with Azure

The Responses API can also be used with Azure OpenAI endpoints by configuring the `apiHost`:

```yaml
providers:
  - id: openai:responses:gpt-4.1
    config:
      apiHost: 'your-resource.openai.azure.com'
      apiKey: '{{ env.AZURE_API_KEY }}' # or set OPENAI_API_KEY env var
      temperature: 0.7
      instructions: 'You are a helpful assistant.'
      response_format: file://./response-schema.json
```

Azure configurations using the legacy `apiHost`, the newer `apiBaseUrl`, or the OpenAI endpoint environment variables all support the same Responses reasoning and verbosity options.

For comprehensive Azure Responses API documentation, see the [Azure provider documentation](/docs/providers/azure#azure-responses-api).

### Complete Example

For a complete working example, see the [OpenAI Responses API example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-responses) or initialize it with:

```bash
npx promptfoo@latest init --example openai-responses
```

## Troubleshooting

### OpenAI rate limits

Promptfoo automatically handles OpenAI rate limits with retry and adaptive concurrency. See [Rate Limits](/docs/configuration/rate-limits) for details.

If you need manual control, you can:

1. **Reduce concurrency** with `--max-concurrency 1` in the CLI or `evaluateOptions.maxConcurrency` in config
2. **Add fixed delays** with `--delay 3000` (milliseconds) or `evaluateOptions.delay` in config
3. **Adjust backoff** with the `PROMPTFOO_REQUEST_BACKOFF_MS` environment variable (default: 5000ms)

### OpenAI flakiness

To retry HTTP requests that fail with 5xx Internal Server errors, set the `PROMPTFOO_RETRY_5XX` environment variable to `true`.

## Agentic Providers

OpenAI offers several agentic providers for different use cases:

### Agents SDK

Test multi-turn agentic workflows with the [OpenAI Agents provider](/docs/providers/openai-agents). This provider supports the [@openai/agents](https://github.com/openai/openai-agents-js) SDK with tools, handoffs, and tracing. For the Python `openai-agents` SDK, use the [OpenAI Agents Python SDK guide](/docs/guides/evaluate-openai-agents-python).

```yaml
providers:
  - id: openai:agents:my-agent
    config:
      agent: file://./agents/support-agent.ts
      tools: file://./tools/support-tools.ts
      maxTurns: 10
      modelSettings:
        retry:
          maxRetries: 2
          policy: providerSuggested
```

See the [OpenAI Agents documentation](/docs/providers/openai-agents) for full configuration options, retry policies, and examples.

### Codex SDK

For agentic coding tasks with working directory access and structured JSON output, use the [OpenAI Codex SDK provider](/docs/providers/openai-codex-sdk). This provider supports GPT-5.5, GPT-5.4, and Codex-optimized GPT-5 models for code generation.
You can select a model inline with `openai:codex:gpt-5.5`, or set `config.model` when you need additional options:

```yaml
providers:
  - id: openai:codex-sdk
    config:
      model: gpt-5.5
      working_dir: ./src
      output_schema:
        type: object
        properties:
          code: { type: string }
          explanation: { type: string }
```

Promptfoo preserves SDK-reported input, output, cached input, and reasoning output tokens in `tokenUsage` when Codex returns them.

See the [OpenAI Codex SDK documentation](/docs/providers/openai-codex-sdk) for thread management, structured output, and Git-aware operations.

### Codex App Server

For app-server protocol evals, use the [OpenAI Codex App Server provider](/docs/providers/openai-codex-app-server). It starts `codex app-server` as a local child process and is intended for rich-client behavior such as streamed app-server items, approval requests, MCP elicitations, skill/plugin/app connector events, and thread lifecycle metadata. It does not attach to an already-running Codex Desktop app process.
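A minimal sketch of a provider entry, assuming defaults suffice; protocol-specific options are covered in the [provider documentation](/docs/providers/openai-codex-app-server):

```yaml
providers:
  - openai:codex-app-server # or the equivalent alias: openai:codex-desktop
```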