---
sidebar_position: 21
sidebar_label: Dataset generation
title: Dataset Generation - Automated Test Data Creation
description: Generate comprehensive test datasets automatically using promptfoo. Create diverse test cases, personas, and edge cases for thorough LLM evaluation.
keywords:
  [
    dataset generation,
    automated testing,
    test data creation,
    LLM datasets,
    evaluation data,
    test automation,
    synthetic data,
  ]
pagination_prev: configuration/scenarios
pagination_next: configuration/huggingface-datasets
---

# Dataset generation

Your dataset is the heart of your LLM eval. To the extent possible, it should closely represent true inputs into your LLM app.

promptfoo can extend existing datasets and help make them more comprehensive and diverse using the `promptfoo generate dataset` command. This guide will walk you through the process of generating datasets using `promptfoo`.

### Prepare your prompts

Before generating a dataset, you need to have your `prompts` ready, and _optionally_ `tests`:

```yaml
prompts:
  - 'Act as a travel guide for {{location}}'
  - 'I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}'

tests:
  - vars:
      location: 'San Francisco'
  - vars:
      location: 'Wyoming'
  - vars:
      location: 'Kyoto'
  - vars:
      location: 'Great Barrier Reef'
```

Alternatively, you can specify your [prompts as CSV](/docs/configuration/prompts#csv-files-csv):

```yaml
prompts: file://travel-guide-prompts.csv
```

where the CSV looks like:

```csv title="travel-guide-prompts.csv"
prompt
"Act as a travel guide for {{location}}"
"I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}"
```

### Run `promptfoo generate dataset`

Dataset generation uses your prompts and any existing test cases to generate new, unique test cases that can be used for evaluation.

Run the command in the same directory as your config:

```sh
promptfoo generate dataset
```

This will output the `tests` YAML to your terminal.

If you want to write the new dataset to a YAML:

```sh
promptfoo generate dataset -o tests.yaml
```

a CSV:

```sh
promptfoo generate dataset -o tests.csv
```

Or if you want to edit the existing config in-place:

```sh
promptfoo generate dataset -w
```

### Loading from output files

When using the `-o` flag, you will need to include the generated dataset within the [tests](/docs/configuration/test-cases) block of your configuration file.

For example:

```yaml
prompts:
  - 'Act as a travel guide for {{location}}'
  - 'I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}'

tests:
  - file://tests.csv
  - vars:
      location: 'San Francisco'
  - vars:
      location: 'Wyoming'
  - vars:
      location: 'Kyoto'
  - vars:
      location: 'Great Barrier Reef'
```

### Customize the generation process

You can customize the dataset generation process by providing additional options to the `promptfoo generate dataset` command. Below is a table of supported parameters:

| Parameter                  | Description                                                             |
| -------------------------- | ----------------------------------------------------------------------- |
| `-c, --config`             | Path to the configuration file.                                         |
| `-i, --instructions`       | Specific instructions for the LLM to follow when generating test cases. |
| `-o, --output [path]`      | Path to output file. Supports CSV and YAML.                             |
| `-w, --write`              | Write the generated test cases directly to the configuration file.      |
| `--numPersonas`            | Number of personas to generate for the dataset.                         |
| `--numTestCasesPerPersona` | Number of test cases to generate per persona.                           |
| `--provider`               | Provider to use for the dataset generation. Eg: openai:chat:gpt-5       |

For example:

```sh
promptfoo generate dataset --config path_to_config.yaml --output path_to_output.yaml --instructions "Consider edge cases related to international travel"
```

### Using a custom provider

The `--provider` flag specifies the LLM used to generate test cases. This is separate from the providers in your config file (which are the targets being tested).

By default, dataset generation uses OpenAI (`OPENAI_API_KEY`). To use a different provider, set the appropriate environment variables:

```bash
# Azure OpenAI
export AZURE_OPENAI_API_KEY=your-key
export AZURE_API_HOST=your-host.openai.azure.com
export AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment

promptfoo generate dataset
```

Alternatively, use the `--provider` flag with any supported provider:

```bash
promptfoo generate dataset --provider openai:chat:gpt-5-mini
```

For more control, create a provider config file:

```yaml title="synthesis-provider.yaml"
id: openai:responses:gpt-5.2
config:
  reasoning:
    effort: medium
  max_output_tokens: 4096
```

```bash
promptfoo generate dataset --provider file://synthesis-provider.yaml
```

You can also use a Python provider:

```bash
promptfoo generate dataset --provider file://synthesis-provider.py
```