# Adaptive Rate Limit Scheduler - Architecture

## Overview

The adaptive rate limit scheduler automatically handles provider rate limits during evaluations. It's **zero-configuration** - users don't need to change anything. The scheduler transparently wraps all provider calls with intelligent rate limit detection, retry logic, and adaptive concurrency management.

## Design Goals

The scheduler addresses common challenges when running evaluations against rate-limited APIs:

- **No manual tuning**: Users shouldn't need to guess the right `-j` (concurrency) value
- **Automatic recovery**: Rate limit errors (429) should be retried, not fail permanently
- **Prevent cascading failures**: High concurrency shouldn't cause mass failures
- **Zero configuration**: Works out of the box with sensible defaults

## Architecture

```text
┌────────────────────────────────────────────────────────────────────────┐
│                               Evaluator                                │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │                        RateLimitRegistry                         │ │
│  │            (Central coordinator - one per evaluation)            │ │
│  │                                                                  │ │
│  │  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐  │ │
│  │  │ProviderRateLimit │ │ProviderRateLimit │ │ProviderRateLimit │  │ │
│  │  │      State       │ │      State       │ │      State       │  │ │
│  │  │  (openai/key1)   │ │  (openai/key2)   │ │   (anthropic)    │  │ │
│  │  │                  │ │                  │ │                  │  │ │
│  │  │ ┌─────────────┐  │ │ ┌─────────────┐  │ │ ┌─────────────┐  │  │ │
│  │  │ │  SlotQueue  │  │ │ │  SlotQueue  │  │ │ │  SlotQueue  │  │  │ │
│  │  │ │   (FIFO)    │  │ │ │   (FIFO)    │  │ │ │   (FIFO)    │  │  │ │
│  │  │ └─────────────┘  │ │ └─────────────┘  │ │ └─────────────┘  │  │ │
│  │  │ ┌─────────────┐  │ │ ┌─────────────┐  │ │ ┌─────────────┐  │  │ │
│  │  │ │  Adaptive   │  │ │ │  Adaptive   │  │ │ │  Adaptive   │  │  │ │
│  │  │ │ Concurrency │  │ │ │ Concurrency │  │ │ │ Concurrency │  │  │ │
│  │  │ └─────────────┘  │ │ └─────────────┘  │ │ └─────────────┘  │  │ │
│  │  └──────────────────┘ └──────────────────┘ └──────────────────┘  │ │
│  └──────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
```

## Component Responsibilities

### RateLimitRegistry

**File**: `src/scheduler/rateLimitRegistry.ts`

Central coordinator that:

- Creates/retrieves per-provider state based on rate limit keys
- Routes provider calls to the appropriate state
- Aggregates metrics across all providers
- Emits events for monitoring

```typescript
// Usage (automatic in evaluator)
const result = await registry.execute(provider, () => provider.callApi(...), {
  getHeaders: (result) => result.metadata?.headers,
  isRateLimited: (result, error) => error?.message?.includes('429'),
  getRetryAfter: (result, error) =>
    parseRetryAfter(result.metadata?.headers?.['retry-after']),
});
```

### ProviderRateLimitState

**File**: `src/scheduler/providerRateLimitState.ts`

Per-provider state manager that:

- Manages the slot queue for concurrency control
- Tracks rate limit headers from responses
- Implements retry logic with exponential backoff
- Adapts concurrency based on success/failure patterns
- Collects latency metrics

### SlotQueue

**File**: `src/scheduler/slotQueue.ts`

FIFO queue with concurrency limiting:

- Acquires/releases "slots" for concurrent requests
- Blocks when at max concurrency or when quota is exhausted
- Tracks remaining requests/tokens from headers
- Schedules queue processing after rate limit windows

Key insight: **race-condition-free** slot allocation. All requests queue first, and slots are then granted strictly in FIFO order, so a late arrival can never jump ahead of an earlier waiter. A minimal sketch follows.
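To make that concrete, here is a minimal sketch of the queue-then-grant pattern. It is illustrative only, not the actual `slotQueue.ts` (which also tracks header-derived quotas and schedules wakeups after rate limit windows); all names and structure here are assumptions:

```typescript
// Illustrative sketch of FIFO slot allocation -- not the real slotQueue.ts.
class SlotQueue {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private maxConcurrency: number) {}

  // Every caller enqueues; slots are granted strictly in FIFO order,
  // so no late arrival can "steal" a slot from an earlier waiter.
  acquire(): Promise<void> {
    return new Promise((resolve) => {
      this.waiters.push(resolve);
      this.drain();
    });
  }

  release(): void {
    this.active--;
    this.drain();
  }

  private drain(): void {
    while (this.active < this.maxConcurrency && this.waiters.length > 0) {
      this.active++;
      this.waiters.shift()!(); // wake the oldest waiter
    }
  }
}

// Usage: bracket each provider call with acquire/release.
async function withSlot<T>(queue: SlotQueue, fn: () => Promise<T>): Promise<T> {
  await queue.acquire();
  try {
    return await fn();
  } finally {
    queue.release();
  }
}
```

Because allocation happens only inside `drain()`, which runs on the single JavaScript event loop, there is no window in which two requests can race for the same slot.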
### AdaptiveConcurrency

**File**: `src/scheduler/adaptiveConcurrency.ts`

Dynamic concurrency adjustment:

- **On rate limit**: Reduce concurrency by 50% (multiplicative decrease)
- **On sustained success**: Increase by 1 (additive increase)
- **Proactive throttling**: Reduce when approaching limits (via headers)

This implements AIMD (Additive Increase, Multiplicative Decrease) - the same algorithm TCP uses for congestion control. A sketch of these update rules appears at the end of this section, after the design decisions.

### HeaderParser

**File**: `src/scheduler/headerParser.ts`

Parses rate limit headers from multiple providers:

- **OpenAI**: `x-ratelimit-remaining-requests`, `x-ratelimit-limit-requests`
- **Anthropic**: `anthropic-ratelimit-requests-remaining`
- **Generic**: `retry-after`, `retry-after-ms`, `ratelimit-reset`

### RetryPolicy

**File**: `src/scheduler/retryPolicy.ts`

Determines retry behavior:

- Exponential backoff with jitter
- Respects server `retry-after` headers
- Configurable max retries (default: 3)
- Retries on: rate limits, timeouts, 502/503/504

## Data Flow

```text
1. Evaluator calls registry.execute(provider, callFn)
   │
   ▼
2. Registry gets/creates ProviderRateLimitState for this provider
   │
   ▼
3. State.executeWithRetry() is called
   │
   ▼
4. SlotQueue.acquire() - wait for available slot
   │
   ▼
5. Execute callFn() - actual provider API call
   │
   ▼
6. Parse response headers → update rate limit state
   │
   ▼
7. Check if rate limited:
   ├─ Yes → retry with backoff, reduce concurrency
   └─ No  → record success, maybe increase concurrency
   │
   ▼
8. SlotQueue.release() - free slot for next request
   │
   ▼
9. Return result (or throw after max retries)
```

## Rate Limit Key Generation

Each provider gets a unique "rate limit key" based on:

- Provider ID (e.g., "openai:chat:gpt-4o")
- API key hash (different keys = different rate limits)
- Organization ID (if applicable)

This ensures:

- Same provider + same key = shared rate limit state
- Same provider + different keys = separate rate limits
- Different providers = completely isolated

## Key Design Decisions

### 1. Zero Configuration

Users shouldn't need to tune rate limit settings. The scheduler learns from response headers and adapts automatically.

### 2. Fail-Safe Defaults

- Default max concurrency: 4 (conservative)
- Default retry delay: 60 seconds (when no header is present)
- Max retries: 3 (prevents infinite loops)

### 3. Proactive Throttling

Don't wait for 429 errors. When headers show less than 10% of quota remaining, proactively reduce concurrency.

### 4. Per-Provider Isolation

Different providers have different rate limits. Don't let OpenAI rate limits affect Anthropic calls.

### 5. Transparent Integration

The scheduler wraps `provider.callApi()` without changing the interface. Existing code works unchanged.
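As promised above, here is a minimal sketch of the AIMD update rules (halve on rate limit, +1 after a run of successes, proactive backoff under 10% remaining quota). Thresholds, names, and structure are assumptions for illustration, not the actual `adaptiveConcurrency.ts`:

```typescript
// Illustrative AIMD controller -- names and thresholds are assumptions.
class AdaptiveConcurrency {
  private successStreak = 0;

  constructor(
    private limit = 4, // conservative default, per "Fail-Safe Defaults"
    private readonly maxLimit = 32,
    private readonly streakToIncrease = 10,
  ) {}

  get current(): number {
    return this.limit;
  }

  // Multiplicative decrease: halve on a 429, never below 1.
  onRateLimit(): void {
    this.limit = Math.max(1, Math.floor(this.limit / 2));
    this.successStreak = 0;
  }

  // Additive increase: +1 only after a sustained run of successes.
  onSuccess(): void {
    if (++this.successStreak >= this.streakToIncrease) {
      this.limit = Math.min(this.maxLimit, this.limit + 1);
      this.successStreak = 0;
    }
  }

  // Proactive throttling: back off before a 429 when response headers
  // show remaining quota dropping under 10%.
  onHeaders(remaining: number, total: number): void {
    if (total > 0 && remaining / total < 0.1) {
      this.limit = Math.max(1, Math.floor(this.limit / 2));
      this.successStreak = 0;
    }
  }
}
```

The asymmetry is deliberate: growing slowly and shrinking fast is what lets many independent clients converge on a fair share of a shared limit, just as in TCP congestion control.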
## Metrics

The scheduler tracks:

- `totalRequests` - All requests attempted
- `completedRequests` - Successful completions
- `failedRequests` - Permanent failures (after retries are exhausted)
- `rateLimitHits` - Times a 429 was encountered
- `retriedRequests` - Requests that required at least one retry
- `avgLatencyMs`, `p50LatencyMs`, `p99LatencyMs` - Latency distribution

## Events

For monitoring/debugging, the scheduler emits:

- `slot:acquired` / `slot:released` - Concurrency tracking
- `ratelimit:hit` - Rate limit encountered
- `ratelimit:learned` - First time seeing a provider's limits
- `ratelimit:warning` - Approaching a rate limit
- `concurrency:increased` / `concurrency:decreased` - Adaptive changes
- `request:retrying` - Retry in progress

## Testing

256 tests covering:

- Unit tests for each component
- Edge cases (negative values, zero values, overflow)
- Race condition prevention
- Integration with the evaluator

## Performance Characteristics

- **Overhead**: Minimal - just slot acquisition and header parsing
- **Memory**: O(providers) - one state object per unique rate limit key
- **Latency buffer**: Circular buffer over the last 100 requests, O(1) insertion (see the sketch below)
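As a rough illustration of that latency buffer, here is one way to get O(1) insertion with on-demand percentiles over a fixed window. This is a hypothetical sketch, not the scheduler's actual code:

```typescript
// Hypothetical circular latency buffer -- illustrates the O(1) insertion
// described above; not the scheduler's actual implementation.
class LatencyBuffer {
  private samples: number[];
  private next = 0;
  private count = 0;

  constructor(private readonly capacity = 100) {
    this.samples = new Array(capacity);
  }

  // O(1): overwrite the oldest sample once the buffer is full.
  record(latencyMs: number): void {
    this.samples[this.next] = latencyMs;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Percentiles (p50, p99, ...) computed on demand by sorting a copy --
  // O(n log n), which is negligible for n <= 100.
  percentile(p: number): number | undefined {
    if (this.count === 0) return undefined;
    const sorted = this.samples.slice(0, this.count).sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}
```

Keeping only the last 100 samples bounds memory per provider and biases the percentiles toward recent behavior, which is what adaptive scheduling actually needs.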