---
sidebar_label: Mixtral vs GPT
description: Compare Mixtral vs GPT performance on custom datasets using automated benchmarks and evaluation metrics to identify the optimal model for your use case
---

# Mixtral vs GPT: Run a benchmark with your own data

In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-5-mini, and GPT-5. We will use `promptfoo`, a command-line interface (CLI) tool, to run evaluations and compare the performance of these models based on a set of prompts and test cases.

![mixtral and gpt comparison](/img/docs/mixtral-vs-gpt.png)

## Requirements

- `promptfoo` CLI installed on your system.
- Access to Replicate for Mixtral.
- Access to OpenAI for GPT-5-mini and GPT-5.
- API keys for Replicate (`REPLICATE_API_TOKEN`) and OpenAI (`OPENAI_API_KEY`).

## Step 1: Initial Setup

Create a new directory for your comparison project:

```sh
mkdir mixtral-gpt-comparison
cd mixtral-gpt-comparison
```

## Step 2: Configure the models

Create a `promptfooconfig.yaml` with the models you want to compare. Here's an example configuration with Mixtral, GPT-5-mini, and GPT-5:

```yaml title="promptfooconfig.yaml"
providers:
  - replicate:mistralai/mixtral-8x22b-instruct-v0.1
  - openai:gpt-5-mini
  - openai:gpt-5
```

Set your API keys as environment variables:

```sh
export REPLICATE_API_TOKEN=your_replicate_api_token
export OPENAI_API_KEY=your_openai_api_key
```

:::info

In this example, we're using Replicate, but you can also use providers like [HuggingFace](/docs/providers/huggingface), [TogetherAI](/docs/providers/togetherai), etc.:

```yaml
- huggingface:text-generation:mistralai/Mistral-7B-Instruct-v0.3
- id: openai:chat:mistralai/Mixtral-8x22B-Instruct-v0.1
  config:
    apiBaseUrl: https://api.together.xyz/v1
```

Local options such as [ollama](/docs/providers/ollama), [vllm](/docs/providers/vllm), and [localai](/docs/providers/localai) also exist. See [providers](/docs/providers) for all options.

:::

### Optional: Configure model parameters

Customize the behavior of each model by setting parameters such as `max_tokens` (OpenAI) or `max_new_tokens` (Replicate):

```yaml title="promptfooconfig.yaml"
providers:
  - id: openai:gpt-5-mini
    # highlight-start
    config:
      max_tokens: 128
    # highlight-end
  - id: openai:gpt-5
    # highlight-start
    config:
      max_tokens: 128
    # highlight-end
  - id: replicate:mistralai/mixtral-8x22b-instruct-v0.1
    # highlight-start
    config:
      temperature: 0.01
      max_new_tokens: 128
    # highlight-end
```

## Step 3: Set up your prompts

Set up the prompts that you want to run for each model. In this case, we'll use a single simple prompt, since the goal is to compare model performance on identical inputs:

```yaml title="promptfooconfig.yaml"
prompts:
  - 'Answer this as best you can: {{query}}'
```

If desired, you can test multiple prompts (just add more to the list), or test [different prompts for each model](/docs/configuration/prompts#model-specific-prompts).
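For example, to see how phrasing affects each model, you might add a second prompt variant. The wording below is just an illustration; every prompt in the list is run against every provider and test case, so the comparison grid grows accordingly:

```yaml title="promptfooconfig.yaml"
prompts:
  - 'Answer this as best you can: {{query}}'
  - 'You are a concise expert assistant. Answer in two sentences or fewer: {{query}}'
```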
## Step 4: Add test cases

Define the test cases that you want to use for the evaluation. This includes setting up variables that will be interpolated into the prompts:

```yaml title="promptfooconfig.yaml"
tests:
  - vars:
      query: 'What is the capital of France?'
    assert:
      - type: contains
        value: 'Paris'
  - vars:
      query: 'Explain the theory of relativity.'
    assert:
      - type: contains
        value: 'Einstein'
  - vars:
      query: 'Write a poem about the sea.'
    assert:
      - type: llm-rubric
        value: 'The poem should evoke imagery such as waves or the ocean.'
  - vars:
      query: 'What are the health benefits of eating apples?'
    assert:
      - type: contains
        value: 'vitamin'
  - vars:
      query: "Translate 'Hello, how are you?' into Spanish."
    assert:
      - type: similar
        value: 'Hola, ¿cómo estás?'
  - vars:
      query: 'Output a JSON list of colors'
    assert:
      - type: is-json
      - type: latency
        threshold: 5000
```

Optionally, you can set up assertions like the ones above to automatically grade each output for correctness.

## Step 5: Run the comparison

With everything configured, run the evaluation using the `promptfoo` CLI:

```sh
npx promptfoo@latest eval
```

This command will execute each test case against each configured model and record the results.

To visualize the results, use the `promptfoo` viewer:

```sh
npx promptfoo@latest view
```

It will show results like so:

![mixtral and gpt comparison](/img/docs/mixtral-vs-gpt.png)

You can also output the results to a file in various formats, such as JSON, YAML, or CSV:

```sh
npx promptfoo@latest eval -o results.csv
```

## Conclusion

The comparison will provide you with a side-by-side performance view of Mixtral, GPT-5-mini, and GPT-5 based on your test cases. Use this data to make informed decisions about which LLM best suits your application.

While public benchmarks like [Arena](https://lmarena.ai/) tell you how these models perform on _generic_ tasks, they are no substitute for running a benchmark on your _own_ data and use cases.

The examples above highlight a few cases where GPT outperforms Mixtral: notably, GPT-5 was better at following JSON output instructions. But GPT-5-mini had the highest eval score because of the latency requirement that we added to one of the test cases. Overall, the best choice will depend largely on the test cases that you construct and your own application constraints.
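For reference, here is how the pieces above fit together in a single `promptfooconfig.yaml` (using the optional model parameters from Step 2). The test list is abbreviated; the full set of cases appears in Step 4:

```yaml title="promptfooconfig.yaml"
prompts:
  - 'Answer this as best you can: {{query}}'

providers:
  - id: openai:gpt-5-mini
    config:
      max_tokens: 128
  - id: openai:gpt-5
    config:
      max_tokens: 128
  - id: replicate:mistralai/mixtral-8x22b-instruct-v0.1
    config:
      temperature: 0.01
      max_new_tokens: 128

tests:
  - vars:
      query: 'What is the capital of France?'
    assert:
      - type: contains
        value: 'Paris'
  # ... remaining test cases from Step 4 ...
```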