---
title: Rate Limits
description: Configure automatic rate limit handling with exponential backoff, header-aware delays, and adaptive concurrency for LLM provider APIs.
sidebar_label: Rate Limits
sidebar_position: 15
---

# Rate Limits

Promptfoo automatically handles rate limits from LLM providers. When a provider returns HTTP 429 or a similar rate limit error, the request is retried with exponential backoff.

## Automatic Handling

Rate limit handling is built into the evaluator and requires no configuration:

- **Automatic retry**: Failed requests are retried up to 3 times with exponential backoff by default (overridable per provider via `maxRetries`, including `0` to disable retries)
- **Header-aware delays**: Respects `retry-after` headers from providers
- **Adaptive concurrency**: Reduces concurrent requests when rate limits are hit
- **Per-provider isolation**: Each provider and API key has separate rate limit tracking

### Supported Headers

Promptfoo parses rate limit headers from major providers:

| Provider     | Headers                                                                                                          |
| ------------ | ---------------------------------------------------------------------------------------------------------------- |
| OpenAI       | `x-ratelimit-remaining-requests`, `x-ratelimit-limit-requests`, `x-ratelimit-remaining-tokens`, `retry-after-ms` |
| Anthropic    | `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-tokens-remaining`, `retry-after`                  |
| Azure OpenAI | `x-ratelimit-remaining-requests`, `retry-after-ms`, `retry-after`                                                |
| Generic      | `retry-after`, `ratelimit-remaining`, `ratelimit-reset`                                                          |
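As a rough illustration of how the retry headers in this table can be turned into a wait time, here is a minimal sketch. `retryDelayMs` is a hypothetical helper, not promptfoo's internal parser; the precedence of `retry-after-ms` over `retry-after` and the handling of the HTTP-date form are assumptions.

```javascript
// Hypothetical helper (not promptfoo's API): converts retry headers
// into a delay in milliseconds, or null if no usable header is present.
function retryDelayMs(headers, now = Date.now()) {
  // HTTP header names are case-insensitive, so normalize to lowercase.
  const h = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = String(value).trim();
  }

  // Assumption: `retry-after-ms` wins when both headers are present,
  // because it is the more precise of the two.
  const ms = h['retry-after-ms'];
  if (ms !== undefined && /^\d+$/.test(ms)) {
    return Number(ms);
  }

  const raw = h['retry-after'];
  if (raw !== undefined) {
    // Delay-seconds form, e.g. "60".
    if (/^\d+$/.test(raw)) {
      return Number(raw) * 1000;
    }
    // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const when = Date.parse(raw);
    if (!Number.isNaN(when)) {
      return Math.max(0, when - now);
    }
  }

  return null;
}
```

Returning `null` rather than `0` lets a caller distinguish "no guidance from the provider" from "retry immediately" and fall back to its own backoff schedule.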

### Transient Error Handling

Promptfoo automatically retries requests that fail with transient server errors:

| Status Code | Description         | Retry Condition                                      |
| ----------- | ------------------- | ---------------------------------------------------- |
| 502         | Bad Gateway         | Status text contains "bad gateway"                   |
| 503         | Service Unavailable | Status text contains "service unavailable"           |
| 504         | Gateway Timeout     | Status text contains "gateway timeout"               |
| 524         | A Timeout Occurred  | Status text contains "timeout" (Cloudflare-specific) |

These errors are retried up to 3 times with exponential backoff (1s, 2s, 4s). The status text check ensures that permanent failures (like authentication errors that happen to use 502) are not retried.
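The retry condition and the 1s, 2s, 4s schedule can be sketched as follows. This is an illustration of the rules in the table, not promptfoo's source; `isTransientError` and `backoffDelayMs` are hypothetical names, and the case-insensitive comparison is an assumption.

```javascript
// Status codes and status-text fragments taken from the table above.
const TRANSIENT_STATUS_TEXT = {
  502: 'bad gateway',
  503: 'service unavailable',
  504: 'gateway timeout',
  524: 'timeout',
};

// Hypothetical helper: a response counts as transient only when both
// the status code and the status text match.
function isTransientError(status, statusText) {
  const fragment = TRANSIENT_STATUS_TEXT[status];
  if (!fragment) {
    return false;
  }
  // Assumption: the comparison ignores case.
  return String(statusText || '').toLowerCase().includes(fragment);
}

// Exponential backoff: the delay doubles on each attempt (0-indexed).
function backoffDelayMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

// Delays for the three retries: [1000, 2000, 4000]
const delays = [0, 1, 2].map((attempt) => backoffDelayMs(attempt));
```

With this check, a 502 whose status text is `Unauthorized` is treated as a permanent failure and is not retried.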

### How Adaptive Concurrency Works

The scheduler uses AIMD (Additive Increase, Multiplicative Decrease) to optimize throughput:

1. When a rate limit is hit, concurrency is reduced by 50%
2. After sustained successful requests, concurrency increases by 1
3. When the remaining quota reported in response headers drops below 10%, concurrency is proactively reduced

This allows you to set a higher `maxConcurrency` and let promptfoo find the optimal rate automatically.
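The three rules above can be sketched as a small controller. This is a simplified model for illustration, not promptfoo's scheduler: the class name, the `successesPerIncrease` threshold, and the choice to halve on low quota are all assumptions, since the text does not specify how many successes count as "sustained" or how large the proactive reduction is.

```javascript
// Hypothetical AIMD controller (illustrative only).
class AimdConcurrency {
  constructor({ max, min = 1, successesPerIncrease = 10 }) {
    this.max = max; // ceiling, i.e. the configured maxConcurrency
    this.min = min; // floor for adaptive reductions
    this.successesPerIncrease = successesPerIncrease; // assumed threshold
    this.current = max;
    this.successStreak = 0;
  }

  // Multiplicative decrease: halve concurrency, never below the floor.
  onRateLimit() {
    this.current = Math.max(this.min, Math.floor(this.current / 2));
    this.successStreak = 0;
  }

  // Additive increase: after a streak of successes, add one slot,
  // never above the configured maximum.
  onSuccess() {
    this.successStreak += 1;
    if (this.successStreak >= this.successesPerIncrease && this.current < this.max) {
      this.current += 1;
      this.successStreak = 0;
    }
  }

  // Proactive reduction when headers report less than 10% quota left.
  // Assumption: this sketch reuses the halving rule for that case.
  onQuota(remaining, limit) {
    if (limit > 0 && remaining / limit < 0.1) {
      this.onRateLimit();
    }
  }
}
```

For example, starting from `max: 20`, one rate limit drops concurrency to 10, and each subsequent streak of successes raises it by one until it reaches 20 again.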

## Configuration

### Concurrency

Control the maximum number of concurrent requests:

```yaml
evaluateOptions:
  maxConcurrency: 10
```

Or via CLI:

```bash
promptfoo eval --max-concurrency 10
```

The adaptive scheduler lowers concurrency when rate limits are encountered and raises it again as requests succeed, but it never exceeds your configured maximum.

### Fixed Delay

Add a fixed delay between requests (in addition to any rate limit backoff):

```yaml
evaluateOptions:
  delay: 1000 # milliseconds
```

Or via CLI:

```bash
promptfoo eval --delay 1000
```

Or via environment variable:

```bash
PROMPTFOO_DELAY_MS=1000 promptfoo eval
```

### Backoff Configuration

Promptfoo has two retry layers:

1. **Provider-level retry** (scheduler): Retries `callApi()` with 1-second base backoff, up to 3 times by default. If a provider config sets `maxRetries`, the scheduler uses that value (including `0` to disable scheduler retries entirely).
2. **HTTP-level retry**: Retries failed HTTP requests. Defaults to 4 retries, or the provider's `maxRetries` when set.

When a provider config includes `maxRetries`, promptfoo propagates that value to both layers. Explicit per-call overrides (e.g. a provider that passes a specific `maxRetries` to `fetchWithRetries`) still take precedence. For direct `fetchWithProxy` calls, transient retries (502/503/504/524) are disabled when the provider sets `maxRetries: 0`.
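The precedence described above can be summarized in a few lines. This is a sketch of the stated rules only; `resolveRetries` and its parameter names are hypothetical and do not correspond to a promptfoo function.

```javascript
// Defaults as stated in the text above.
const DEFAULT_SCHEDULER_RETRIES = 3;
const DEFAULT_HTTP_RETRIES = 4;

// Hypothetical helper: resolves the retry count for each layer.
// `??` is used instead of `||` so that an explicit 0 is honored.
function resolveRetries({ providerMaxRetries, perCallMaxRetries } = {}) {
  return {
    // Scheduler layer: provider config, else the default.
    scheduler: providerMaxRetries ?? DEFAULT_SCHEDULER_RETRIES,
    // HTTP layer: explicit per-call override, else provider config, else the default.
    http: perCallMaxRetries ?? providerMaxRetries ?? DEFAULT_HTTP_RETRIES,
  };
}
```

Under these rules, `resolveRetries({ providerMaxRetries: 0 })` yields zero retries at both layers, while a per-call override changes only the HTTP layer.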

For example, to fail fast on rate limits, disable retries for a provider:

```yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      maxRetries: 0
```

Environment variables for the scheduler:

| Environment Variable                   | Description                                | Default  |
| -------------------------------------- | ------------------------------------------ | -------- |
| `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER` | Disable adaptive concurrency (use fixed)   | false    |
| `PROMPTFOO_MIN_CONCURRENCY`            | Minimum concurrency (floor for adaptive)   | 1        |
| `PROMPTFOO_SCHEDULER_QUEUE_TIMEOUT_MS` | Timeout for queued requests (0 to disable) | 300000ms |

Environment variables for HTTP-level retry:

| Environment Variable           | Description                       | Default |
| ------------------------------ | --------------------------------- | ------- |
| `PROMPTFOO_REQUEST_BACKOFF_MS` | Base delay for HTTP retry backoff | 5000ms  |
| `PROMPTFOO_RETRY_5XX`          | Retry on HTTP 5xx server errors   | false   |

Example:

```bash
PROMPTFOO_REQUEST_BACKOFF_MS=10000 PROMPTFOO_RETRY_5XX=true promptfoo eval
```

The scheduler's retry handles most rate limiting automatically. The HTTP-level retry provides additional resilience for network issues.

## Provider-Specific Notes

### OpenAI

OpenAI has separate rate limits for requests and tokens. The scheduler tracks both. For high-volume evaluations:

```yaml
evaluateOptions:
  maxConcurrency: 20 # Scheduler will adapt down if needed
```

See [OpenAI troubleshooting](/docs/providers/openai#troubleshooting) for additional options.

### Anthropic

Anthropic rate limits are typically per-minute. The scheduler respects `retry-after` headers from the API.

### Custom Providers

Custom providers trigger automatic retry when the error they return contains any of the following:

- "429"
- "rate limit"
- "too many requests"
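The matching rule in this list can be approximated with a simple substring check, which is useful for verifying that your provider's error messages will be recognized. `looksLikeRateLimit` is a hypothetical helper written for this page, and the case-insensitive comparison is an assumption.

```javascript
// Substrings from the list above.
const RATE_LIMIT_MARKERS = ['429', 'rate limit', 'too many requests'];

// Hypothetical helper: returns true when an error (Error object or
// string) contains one of the rate limit markers.
function looksLikeRateLimit(error) {
  const message = error instanceof Error ? error.message : String(error ?? '');
  const lower = message.toLowerCase(); // assumption: match ignores case
  return RATE_LIMIT_MARKERS.some((marker) => lower.includes(marker));
}
```

An error such as `new Error('HTTP 429: slow down')` matches, while `new Error('Invalid API key')` does not, so the latter would not be retried as a rate limit.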

To provide retry timing, include headers in your response metadata:

```javascript
return {
  output: 'response',
  metadata: {
    headers: {
      'retry-after': '60', // seconds
    },
  },
};
```

## Debugging

To see rate limit events, enable debug logging:

```bash
LOG_LEVEL=debug promptfoo eval -c config.yaml
```

Events logged:

- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - Provider limits discovered from headers
- `ratelimit:warning` - Approaching rate limit threshold
- `concurrency:decreased` / `concurrency:increased` - Adaptive concurrency changes
- `request:retrying` - Retry in progress

## Best Practices

1. **Start with higher concurrency** - Set `maxConcurrency` to your desired throughput; the scheduler will adapt down if needed

2. **Use caching** - Enable [caching](/docs/configuration/caching) to avoid re-running identical requests

3. **Monitor debug logs** - If evaluations are slow, check for frequent `ratelimit:hit` events

4. **Consider provider tiers** - Higher API tiers typically have higher rate limits; the scheduler will automatically use whatever limits the provider allows

## Disabling Automatic Handling

The scheduler is always active but has minimal overhead. To keep concurrency fixed rather than adaptive, set `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true`. For fully deterministic behavior (e.g., in tests), use:

```yaml
evaluateOptions:
  maxConcurrency: 1
  delay: 1000
```

This ensures sequential execution with fixed delays between requests.