# compare-gpt-model-tiers-mmlu-pro (GPT Model Tiers MMLU-Pro Comparison)

You can run this example with:

```bash
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
cd compare-gpt-model-tiers-mmlu-pro
```

This example benchmarks the full, mini, and nano tiers of OpenAI's GPT models on MMLU-Pro, a more challenging successor to MMLU with up to 10 answer options per question.

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- Hugging Face account and access token (optional for public MMLU-Pro data, useful for higher rate limits)

## Hugging Face Authentication

For higher rate limits or private datasets, authenticate with Hugging Face:

1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one
2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Set your token as an environment variable:

   ```bash
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```env
   HF_TOKEN=your_token_here
   ```

## Running the Eval

1. If you have not already done so, get a local copy of the promptfooconfig:

   ```bash
   npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
   cd compare-gpt-model-tiers-mmlu-pro
   ```

2. Run the evaluation:

   ```bash
   npx promptfoo@latest eval
   ```

3. View the results:

   ```bash
   npx promptfoo@latest view
   ```

## What's Being Tested

This comparison evaluates all three model tiers (full, mini, and nano) on the same 100 MMLU-Pro questions, which span multiple subject categories.

## Test Structure

The configuration in `promptfooconfig.yaml` includes:

1. **Prompt Template**: Renders all available MMLU-Pro options dynamically and asks for a final answer in a fixed format
2. **Quality Checks**:
   - 60-second timeout per question
   - Required final answer format (`Therefore, the answer is X`)
   - Deterministic JavaScript scoring that compares the parsed final letter against `answer`
3. **Model Configuration**:
   - 1200 max completion tokens for concise reasoning plus the final answer
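
The option rendering and deterministic scoring described above can be sketched in plain JavaScript. This is a minimal illustration, not the code shipped in `promptfooconfig.yaml`: the helper names (`renderOptions`, `parseFinalLetter`, `scoreAnswer`) and the `A) ...` option layout are assumptions made for this example. The sketch assumes the scorer receives the model output and the expected letter from the dataset's `answer` field.

```javascript
// Hypothetical helpers for illustration; names and layout are assumptions,
// not taken from the example's actual configuration.
const LETTERS = 'ABCDEFGHIJ'; // MMLU-Pro questions have up to 10 options

// Render however many options a question has as "A) ...", "B) ...", one per line.
function renderOptions(options) {
  return options
    .slice(0, LETTERS.length)
    .map((text, i) => `${LETTERS[i]}) ${text}`)
    .join('\n');
}

// Return the letter from the LAST "Therefore, the answer is X" in the output,
// or null when the required format is missing.
function parseFinalLetter(output) {
  const pattern = /Therefore, the answer is\s*\(?([A-J])\b/gi;
  const matches = [...String(output).matchAll(pattern)];
  if (matches.length === 0) return null;
  return matches[matches.length - 1][1].toUpperCase();
}

// Deterministic pass/fail: the parsed letter must equal the expected answer.
function scoreAnswer(output, expected) {
  const letter = parseFinalLetter(output);
  return letter !== null && letter === String(expected).trim().toUpperCase();
}
```

Taking the last match rather than the first means a model that revises its answer mid-response is scored on its final statement, and a response without the required sentence fails instead of being guessed at.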

## Customizing

You can modify the test by editing `promptfooconfig.yaml`:

1. **Change the number of MMLU-Pro questions** by adjusting `limit` (the example evaluates 100 questions):

   ```yaml
   tests:
     - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=250
   ```

2. **Adjust model parameters**:

   ```yaml
   providers:
     - id: openai:chat:gpt-5.4
       config:
         max_completion_tokens: 1500
   ```
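
When choosing a value for `limit`, it can help to estimate how much sampling noise a given question count leaves in each model's accuracy. The sketch below uses the standard normal approximation for a binomial proportion. The function names and the 0.7 accuracy are hypothetical values chosen for illustration, not results from this eval.

```javascript
// Standard error of an accuracy estimate measured on n independent questions
// (normal approximation to the binomial distribution).
function accuracyStandardError(accuracy, n) {
  return Math.sqrt((accuracy * (1 - accuracy)) / n);
}

// Approximate 95% margin of error, in accuracy points (0 to 1).
function marginOfError95(accuracy, n) {
  return 1.96 * accuracyStandardError(accuracy, n);
}

// Hypothetical accuracy of 0.7, compared across question counts.
for (const n of [100, 250]) {
  const margin = marginOfError95(0.7, n);
  console.log(`n=${n}: about +/- ${(margin * 100).toFixed(1)} points`);
}
```

If the gap between two model tiers is smaller than this margin, a larger `limit` is likely needed before treating the difference as meaningful.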

## Additional Resources

- [OpenAI provider documentation](https://promptfoo.dev/docs/providers/openai)
- [MMLU-Pro benchmark details](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)