---
sidebar_position: 21
sidebar_label: Dataset generation
title: Dataset Generation - Automated Test Data Creation
description: Generate comprehensive test datasets automatically using promptfoo. Create diverse test cases, personas, and edge cases for thorough LLM evaluation.
keywords:
  [
    dataset generation,
    automated testing,
    test data creation,
    LLM datasets,
    evaluation data,
    test automation,
    synthetic data,
  ]
pagination_prev: configuration/scenarios
pagination_next: configuration/huggingface-datasets
---

# Dataset generation

Your dataset is the heart of your LLM eval. To the extent possible, it should closely represent true inputs into your LLM app.

promptfoo can extend existing datasets and help make them more comprehensive and diverse using the `promptfoo generate dataset` command. This guide walks through generating datasets with `promptfoo`.

### Prepare your prompts

Before generating a dataset, you need to have your `prompts` ready, and _optionally_ `tests`:

```yaml
prompts:
  - 'Act as a travel guide for {{location}}'
  - 'I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}'

tests:
  - vars:
      location: 'San Francisco'
  - vars:
      location: 'Wyoming'
  - vars:
      location: 'Kyoto'
  - vars:
      location: 'Great Barrier Reef'
```

Alternatively, you can specify your [prompts as CSV](/docs/configuration/prompts#csv-files-csv):

```yaml
prompts: file://travel-guide-prompts.csv
```

where the CSV looks like:

```csv title="travel-guide-prompts.csv"
prompt
"Act as a travel guide for {{location}}"
"I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}"
```

### Run `promptfoo generate dataset`

Dataset generation uses your prompts and any existing test cases to generate new, unique test cases that can be used for evaluation.

Run the command in the same directory as your config:

```sh
promptfoo generate dataset
```

This will output the `tests` YAML to your terminal.

To write the new dataset to a YAML file:

```sh
promptfoo generate dataset -o tests.yaml
```

To write it to a CSV file:

```sh
promptfoo generate dataset -o tests.csv
```

Or, to edit the existing config in place:

```sh
promptfoo generate dataset -w
```

### Loading from output files

When using the `-o` flag, you will need to include the generated dataset within the [tests](/docs/configuration/test-cases) block of your configuration file.

For example:

```yaml
prompts:
  - 'Act as a travel guide for {{location}}'
  - 'I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My current location is {{location}}'

tests:
  - file://tests.csv
  - vars:
      location: 'San Francisco'
  - vars:
      location: 'Wyoming'
  - vars:
      location: 'Kyoto'
  - vars:
      location: 'Great Barrier Reef'
```

### Customize the generation process

You can customize the dataset generation process by providing additional options to the `promptfoo generate dataset` command. Below is a table of supported parameters:

| Parameter                  | Description                                                             |
| -------------------------- | ----------------------------------------------------------------------- |
| `-c, --config`             | Path to the configuration file.                                         |
| `-i, --instructions`       | Specific instructions for the LLM to follow when generating test cases. |
| `-o, --output [path]`      | Path to output file. Supports CSV and YAML.                             |
| `-w, --write`              | Write the generated test cases directly to the configuration file.      |
| `--numPersonas`            | Number of personas to generate for the dataset.                         |
| `--numTestCasesPerPersona` | Number of test cases to generate per persona.                           |
| `--provider`               | Provider to use for the dataset generation, e.g. `openai:chat:gpt-5`.   |

For example:

```sh
promptfoo generate dataset --config path_to_config.yaml --output path_to_output.yaml --instructions "Consider edge cases related to international travel"
```

### Using a custom provider

The `--provider` flag specifies the LLM used to generate test cases. This is separate from the providers in your config file (which are the targets being tested).

By default, dataset generation uses OpenAI (`OPENAI_API_KEY`). To use a different provider, set the appropriate environment variables:

```bash
# Azure OpenAI
export AZURE_OPENAI_API_KEY=your-key
export AZURE_API_HOST=your-host.openai.azure.com
export AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment

promptfoo generate dataset
```

Alternatively, use the `--provider` flag with any supported provider:

```bash
promptfoo generate dataset --provider openai:chat:gpt-5-mini
```

For more control, create a provider config file:

```yaml title="synthesis-provider.yaml"
id: openai:responses:gpt-5.2
config:
  reasoning:
    effort: medium
  max_output_tokens: 4096
```

```bash
promptfoo generate dataset --provider file://synthesis-provider.yaml
```

You can also use a Python provider:

```bash
promptfoo generate dataset --provider file://synthesis-provider.py
```