---
sidebar_position: 2
sidebar_label: Reference
title: Configuration Reference - Complete API Documentation
description: Comprehensive reference for all promptfoo configuration options, properties, and settings. Complete API documentation for evaluation setup.
keywords:
  [
    promptfoo reference,
    configuration API,
    evaluation options,
    provider settings,
    test configuration,
    assertion types,
  ]
pagination_prev: configuration/guide
pagination_next: configuration/prompts
---

# Reference

Here is the main structure of the promptfoo configuration file:

### Config

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | No | Optional description of what your LLM is trying to do |
| tags | Record | No | Optional tags to describe the test suite (e.g. `env: production`, `application: chatbot`) |
| providers | [ProvidersConfig](#providersconfig) | Yes, unless `targets` is set | One or more [LLM APIs](/docs/providers) to use. Exactly one of `providers` or `targets` must be set. |
| targets | [ProvidersConfig](#providersconfig) | Yes, unless `providers` is set | Alias for `providers`, commonly used in [red team](/docs/red-team) configs. Exactly one of `targets` or `providers` must be set. |
| prompts | string \| string[] \| Record \| Prompt[] | Yes | One or more [prompts](/docs/configuration/prompts) to load |
| tests | string \| (string \| [Test Case](#test-case) \| [Test Generator Config](#test-generator-config))[] \| [Test Generator Config](#test-generator-config) | No | Path to a [test file](/docs/configuration/test-cases), inline tests, or a generator. If omitted, promptfoo runs each prompt/provider pair once with empty vars. |
| scenarios | (string \| [Scenario](#scenario))[] | No | [Scenario](/docs/configuration/scenarios) files or inline scenario definitions |
| defaultTest | `file://${string}` \| Partial [Test Case](#test-case) | No | Sets the [default properties](/docs/configuration/guide#default-test-cases) for each test case. Can be an inline object or a `file://` path to an external YAML/JSON file. |
| outputPath | string \| string[] | No | Where to write output. Writes to console/web viewer if not set. See [output formats](/docs/configuration/outputs). |
| sharing | boolean \| object | No | Enables or configures [result sharing](/docs/usage/sharing) with optional `apiBaseUrl` and `appBaseUrl` fields |
| nunjucksFilters | Record | No | Map of [Nunjucks](https://mozilla.github.io/nunjucks/) filter names to file paths |
| env | Record | No | Environment variables to set for the test run. These values will override existing environment variables. Can be used to set API keys and other configuration values needed by providers. |
| derivedMetrics | [DerivedMetric](#derivedmetric)[] | No | Metrics calculated after the eval from named assertion scores |
| extensions | string[] \| null | No | List of [extension files](#extension-hooks) to load. Each extension is a file path with a function name. Can be Python (.py) or JavaScript (.js) files. Supported hooks are 'beforeAll', 'afterAll', 'beforeEach', 'afterEach'. |
| metadata | Record | No | Arbitrary metadata stored with the eval config |
| redteam | RedteamConfig | No | [Red team](/docs/red-team/configuration) configuration |
| writeLatestResults | boolean | No | Write latest results to promptfoo storage so they can be viewed in the web UI |
| tracing | TracingConfig | No | [OpenTelemetry tracing](/docs/tracing) configuration |
| evaluateOptions.maxConcurrency | number | No | Maximum number of concurrent requests. Defaults to 4 |
| evaluateOptions.repeat | number | No | Number of times to run each test case. Defaults to 1 |
| evaluateOptions.delay | number | No | Force the test runner to wait after each API call (milliseconds). Defaults to 0 |
| evaluateOptions.showProgressBar | boolean | No | Whether to display the progress bar |
| evaluateOptions.cache | boolean | No | Whether to use disk [cache](/docs/configuration/caching) for results (default: true) |
| evaluateOptions.timeoutMs | number | No | Timeout in milliseconds for each individual test case/provider API call. When reached, that specific test is marked as an error. Default is 0 (no timeout). |
| evaluateOptions.maxEvalTimeMs | number | No | Maximum total runtime in milliseconds for the entire evaluation process. When reached, all remaining tests are marked as errors and the evaluation ends. Default is 0 (no limit). |
| commandLineOptions | [CommandLineOptions](#commandlineoptions) | No | Default values for command-line options. These values will be used unless overridden by actual command-line arguments. |
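For example, a minimal configuration combining several of these properties might look like this. It is an illustrative sketch (prompt, provider, and test values are placeholders); see the [configuration guide](/docs/configuration/guide) for complete examples:

```yaml title="promptfooconfig.yaml"
description: 'Translation quality eval'
prompts:
  - 'Translate "{{text}}" to {{language}}'
providers:
  - openai:gpt-5-mini
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Is a fluent {{language}} translation'
tests:
  - vars:
      text: 'Good morning'
      language: French
```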
### Test Case

A test case represents a single example input that is fed into all prompts and providers.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | No | Description of what you're testing |
| vars | Record \| string \| string[] | No | Key-value pairs to substitute in the prompt. If `vars` is a string or string array, promptfoo loads test vars from those file paths. See [Test Case Configuration](/docs/configuration/test-cases) for loading vars from files. |
| provider | string \| ProviderOptions \| ApiProvider | No | Override the default [provider](/docs/providers) for this specific test case |
| providers | string[] | No | Filter which providers this test runs against. Supports labels, IDs, and wildcards (e.g., `openai:*`). See [filtering tests by provider](/docs/configuration/test-cases#filtering-tests-by-provider). |
| prompts | string[] | No | Filter this test to run only with specific prompts (by label or ID). Supports wildcards like `Math:*`. See [Filtering Tests by Prompt](/docs/configuration/test-cases#filtering-tests-by-prompt). |
| providerOutput | string \| Record | No | Precomputed provider output. When set, promptfoo skips calling the provider and runs assertions directly against this output. |
| assert | ([Assertion](#assertion) \| [Assertion Set](#assertion-set))[] | No | List of automatic checks to run on the LLM output. See [assertions & metrics](/docs/configuration/expected-outputs) for all available types. |
| assertScoringFunction | `file://` JavaScript/Python path \| function | No | Custom scoring function that combines named assertion scores into the final grading result. |
| threshold | number | No | Test will fail if the combined score of assertions is less than this number |
| metadata | Record | No | Additional metadata to include with the test case, useful for [filtering](/docs/configuration/test-cases#metadata-in-csv) or grouping results |
| options | Object | No | Additional configuration settings for the test case |
| options.transformVars | string \| function | No | A filepath (js or py), JavaScript snippet, or Node.js function that runs on the vars before they are substituted into the prompt. See [transforming input variables](/docs/configuration/guide#transforming-input-variables). |
| options.transform | string \| function | No | A filepath (js or py), JavaScript snippet, or Node.js function that runs on LLM output before assertions. See [transforming outputs](/docs/configuration/guide#transforming-outputs). |
| options.postprocess | string \| function | No | Deprecated alias for `options.transform` |
| options.prefix | string | No | Text to prepend to the prompt |
| options.suffix | string | No | Text to append to the prompt |
| options.provider | string \| ProviderOptions \| ApiProvider \| Record | No | The API provider to use for [model-graded](/docs/configuration/expected-outputs/model-graded) assertion grading |
| options.rubricPrompt | string \| string[] \| ChatMessage[] | No | Custom prompt for [model-graded](/docs/configuration/expected-outputs/model-graded) assertions |
| options.factuality | object | No | Score weights for factuality assertions (`subset`, `superset`, `agree`, `disagree`, `differButFactual`) |
| options.disableVarExpansion | boolean | No | If true, arrays in `vars` are not expanded into multiple test cases |
| options.disableConversationVar | boolean | No | If true, promptfoo does not include the implicit `_conversation` variable in the prompt |
| options.disableDefaultAsserts | boolean | No | If true, this test case does not inherit assertions from `defaultTest.assert`; other `defaultTest` properties still apply |
| options.runSerially | boolean | No | If true, run this test case without concurrency regardless of global settings |
| options.storeOutputAs | string | No | The output of this test will be stored as a variable, which can be used in subsequent tests. See [multi-turn conversations](/docs/configuration/chat#using-storeoutputas). |
| options.\<provider config key\> | any | No | Provider-specific config fields (e.g., `response_format`, `responseSchema`) are passed through to the provider. Use `file://` to load from external files. See [Per-test provider config](#per-test-provider-config). |

### Test Generator Config

Use a test generator config when `tests` should be produced by a JavaScript or Python generator.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| path | string | Yes | Path to the generator function, e.g. `file://path/to/tests.py:function_name` |
| config | Record | No | Configuration passed to the generator. Values may reference `file://` paths. |
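For example, a JavaScript generator can expand a small dataset into test cases. This is a minimal sketch: the file name and data are illustrative, and it assumes the generator receives the `config` object as its argument and returns an array of test case objects:

```javascript
// tests.js - referenced from the config as:
//   tests:
//     path: file://tests.js:generateTests
//     config:
//       languages: [French, German]
async function generateTests(config) {
  const languages = (config && config.languages) || ['French', 'German'];
  return languages.map((language) => ({
    description: `Translate to ${language}`,
    vars: { language, text: 'Good morning' },
    assert: [{ type: 'llm-rubric', value: `Is a fluent ${language} translation` }],
  }));
}

module.exports = { generateTests };
```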
#### Per-test provider config {#per-test-provider-config}

Test-level `options` can include provider-specific configuration fields that override the provider's default config for that specific test. This is useful for:

- Using different structured output schemas per test
- Varying temperature or other parameters for specific test cases
- Testing the same prompt with different model configurations

```yaml
tests:
  - vars:
      question: 'What is 2 + 2?'
    options:
      # Provider-specific: loaded from external file
      response_format: file://./schemas/math-response.json
      # Provider-specific: inline override
      temperature: 0
```

The external file must contain the complete configuration object. For OpenAI structured outputs:

```json title="schemas/math-response.json"
{
  "type": "json_schema",
  "json_schema": {
    "name": "math_response",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "answer": { "type": "number" },
        "explanation": { "type": "string" }
      },
      "required": ["answer", "explanation"],
      "additionalProperties": false
    }
  }
}
```

See the [OpenAI structured outputs guide](/docs/providers/openai#using-response_format) for more details.

### Assertion

More details on using assertions, including examples, are available [here](/docs/configuration/expected-outputs).

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Type of assertion. See [assertion types](/docs/configuration/expected-outputs#assertion-types) for all available types. `not-` prefixes are supported for most base assertion types. |
| value | string \| string[] \| number \| object \| function | No | The expected value, if applicable |
| config | Record | No | Extra configuration passed to the assertion or assertion value function |
| threshold | number | No | The threshold value, applicable only to certain types such as [`similar`](/docs/configuration/expected-outputs/similar), [`cost`](/docs/configuration/expected-outputs/deterministic#cost), [`javascript`](/docs/configuration/expected-outputs/javascript), [`python`](/docs/configuration/expected-outputs/python) |
| weight | number | No | Weight of this assertion compared to other assertions in the test case. Defaults to 1 |
| provider | string \| ProviderOptions \| ApiProvider \| Record | No | Some assertions (type = [`similar`](/docs/configuration/expected-outputs/similar), [`llm-rubric`](/docs/configuration/expected-outputs/model-graded/llm-rubric), [model-graded-\*](/docs/configuration/expected-outputs/model-graded)) require an [LLM provider](/docs/providers) |
| rubricPrompt | string \| string[] \| ChatMessage[] | No | Override the grading rubric for model-graded assertions |
| metric | string | No | The label for this result. Assertions with the same `metric` will be aggregated together. See [named metrics](/docs/configuration/expected-outputs#defining-named-metrics). |
| transform | string \| function | No | Transform the output before running this assertion. This receives the test-transformed output. |
| contextTransform | string \| function | No | Transform provider-normalized output into context for [context-based assertions](/docs/configuration/expected-outputs/model-graded#context-based). See [Context Transform](/docs/configuration/expected-outputs/model-graded#dynamically-via-context-transform) for more details. |
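For example, a test case can combine a deterministic check with a weighted, model-graded assertion that feeds a named metric (values are illustrative):

```yaml
tests:
  - vars:
      question: 'What is the refund window?'
    assert:
      - type: contains
        value: '30 days'
        weight: 1
      - type: llm-rubric
        value: 'Answers the question politely'
        weight: 2
        metric: politeness
        provider: openai:gpt-5-mini
```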
### Assertion Set

An assertion set groups multiple assertions and can define its own threshold, metric, weight, and shared config.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | `assert-set` | Yes | Marks this item as an assertion set |
| assert | [Assertion](#assertion)[] | Yes | Assertions in the set |
| threshold | number | No | Required score for the set |
| weight | number | No | Weight of this set compared to other assertions or assertion sets |
| metric | string | No | Named metric for the set |
| config | Record | No | Shared config passed into every assertion in the set |
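For example, an assertion set can gate a group of related checks behind a single threshold (an illustrative sketch):

```yaml
assert:
  - type: assert-set
    threshold: 0.5
    metric: formatting
    assert:
      - type: starts-with
        value: 'Subject:'
      - type: contains
        value: 'Dear'
```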
### CommandLineOptions

Set default values for command-line options. These defaults will be used unless overridden by command-line arguments.

| Property | Type | Description |
| --- | --- | --- |
| **Basic Configuration** | | |
| description | string | Description of what your LLM is trying to do |
| config | string[] | Path(s) to configuration files |
| envPath | string \| string[] | Path(s) to .env file(s). When multiple files are specified, later files override earlier values. |
| **Input Files** | | |
| prompts | string[] | One or more paths to [prompt files](/docs/configuration/prompts) |
| providers | string[] | One or more [LLM provider](/docs/providers) identifiers |
| tests | string | Path to CSV file with [test cases](/docs/configuration/test-cases) |
| vars | string | Path to CSV file with test variables |
| assertions | string | Path to [assertions](/docs/configuration/expected-outputs) file |
| modelOutputs | string | Path to JSON file containing model outputs |
| **Prompt Modifications** | | |
| promptPrefix | string | Text to prepend to every prompt |
| promptSuffix | string | Text to append to every prompt |
| generateSuggestions | boolean | Generate new prompts and append them to the prompt list |
| suggestionsCount | integer | Number of prompt variations to generate when `generateSuggestions` is enabled (default `1`, max `50`). May also be set under `evaluateOptions`; the CLI flag `--suggest-prompts` is equivalent. |
| **Test Execution** | | |
| maxConcurrency | number | Maximum number of concurrent requests |
| repeat | number | Number of times to run each test case |
| delay | number | Delay between API calls in milliseconds |
| grader | string | [Provider](/docs/providers) that will grade [model-graded](/docs/configuration/expected-outputs/model-graded) outputs |
| var | object | Set test variables as key-value pairs (e.g. `{key1: 'value1', key2: 'value2'}`) |
| **Filtering** | | |
| filterPattern | string | Only run tests whose description matches the regular expression pattern |
| filterPrompts | string | Only run tests with prompts whose `id` or `label` matches this regex |
| filterProviders | string | Only run tests with providers matching this regex (matches against provider `id` or `label`) |
| filterTargets | string | Only run tests with targets matching this regex (alias for filterProviders) |
| filterFirstN | number | Only run the first N test cases |
| filterRange | string | Run test cases in a zero-based `start:end` range. The end index is exclusive |
| filterSample | number | Run a random sample of N test cases |
| filterMetadata | string \| string[] | Only run tests matching metadata filters in `key=value` format. Multiple filters are combined with AND logic. |
| filterErrorsOnly | string | Only run tests that resulted in errors from a previous output path or eval ID |
| filterFailing | string | Only run non-passing tests (assertion failures and errors) from a previous output path or eval ID |
| filterFailingOnly | string | Only run assertion failures from a previous output path or eval ID, excluding errors |
| **Output & Display** | | |
| output | string[] | [Output file](/docs/configuration/outputs) paths (csv, txt, json, yaml, yml, html) |
| table | boolean | Show output table (default: true, disable with --no-table) |
| tableCellMaxLength | number | Maximum length of table cells in console output |
| progressBar | boolean | Whether to display progress bar during evaluation |
| verbose | boolean | Enable verbose output |
| share | boolean | Whether to create a shareable URL |
| noShare | boolean | Disable sharing, overriding config-based sharing |
| **Caching & Storage** | | |
| cache | boolean | Whether to use disk [cache](/docs/configuration/caching) for results (default: true) |
| write | boolean | Whether to write results to promptfoo directory (default: true) |
| **Other Options** | | |
| watch | boolean | Whether to watch for config changes and re-run automatically |
| retryErrors | boolean | Retry all ERROR results from the latest eval |
| extension | string[] | Extension hooks to load from the CLI (same format as top-level `extensions`) |

#### Example

```yaml title="promptfooconfig.yaml"
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
prompts:
  - prompt1.txt
  - prompt2.txt
providers:
  - openai:gpt-5
tests: tests.csv

# Set default command-line options
commandLineOptions:
  envPath: # Load from multiple .env files (later overrides earlier)
    - .env
    - .env.local
  maxConcurrency: 10
  repeat: 3
  delay: 1000
  verbose: true
  grader: openai:gpt-5-mini
  table: true
  cache: false
  tableCellMaxLength: 100

  # Filtering options
  filterPattern: 'auth.*' # Only run tests with 'auth' in description
  filterProviders: 'openai.*' # Only test OpenAI providers
  filterRange: '0:100' # Run tests 0 through 99
  filterSample: 50 # Random sample of 50 tests

  # Prompt modifications
  promptPrefix: 'You are a helpful assistant. '
  promptSuffix: "\n\nPlease be concise."

  # Variables
  var:
    temperature: '0.7'
    max_tokens: '1000'
```

With this configuration, running `npx promptfoo eval` will use these defaults.
You can still override them:

```bash
# Uses maxConcurrency: 10 from config
npx promptfoo eval

# Overrides maxConcurrency to 5
npx promptfoo eval --max-concurrency 5
```

### AssertionValueFunctionContext

When using [JavaScript](/docs/configuration/expected-outputs/javascript) or [Python](/docs/configuration/expected-outputs/python) assertions, your function receives a context object with the following interface:

```typescript
interface AssertionValueFunctionContext {
  // Raw prompt sent to LLM
  prompt: string | undefined;

  // Test case variables
  vars: Record<string, VarValue>;

  // The complete test case (see #test-case)
  test: AtomicTestCase;

  // Log probabilities from the LLM response, if available
  logProbs: number[] | undefined;

  // Configuration passed to the assertion
  config?: Record<string, any>;

  // The provider that generated the response (see /docs/providers)
  provider: ApiProvider | undefined;

  // The complete provider response (see #providerresponse)
  providerResponse: ProviderResponse | undefined;

  // OpenTelemetry trace data when tracing is enabled and the assertion uses trace context
  trace?: TraceData;
}
```
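For example, a file-based JavaScript assertion can use this context to compare the output against a test variable. This is a minimal sketch; the file name and logic are illustrative:

```javascript
// assert-language.js
// Reference it from a test as: { type: 'javascript', value: 'file://assert-language.js' }
module.exports = (output, context) => {
  const expected = context.vars.language; // a test case variable
  const pass = typeof output === 'string' && output.includes(expected);
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? `Output mentions ${expected}` : `Expected the output to mention ${expected}`,
  };
};
```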
### VarValue

`VarValue` is the value type accepted in test `vars`, assertion contexts, and provider call contexts.

```typescript
type VarValue = string | number | boolean | object | unknown[];
```

### TraceData

`TraceData` is available to trace-aware assertions when tracing is enabled.

```typescript
interface TraceSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTime: number;
  endTime?: number;
  attributes?: Record<string, any>;
  statusCode?: number;
  statusMessage?: string;
}

interface TraceData {
  traceId: string;
  evaluationId: string;
  testCaseId: string;
  metadata?: Record<string, any>;
  spans: TraceSpan[];
}
```

:::note
promptfoo supports `.js` and `.json` file extensions in addition to `.yaml`.

It automatically loads `promptfooconfig.*`, but you can use a custom config file with `promptfoo eval -c path/to/config`.
:::

## Extension Hooks

Promptfoo supports extension hooks that allow you to run custom code that modifies the evaluation state at specific points in the evaluation lifecycle. These hooks are defined in extension files specified in the `extensions` property of the configuration.

### Available Hooks

| Name | Description | Context |
| --- | --- | --- |
| beforeAll | Runs before the entire test suite begins | `{ suite: TestSuite }` |
| afterAll | Runs after the entire test suite has finished | `{ results: EvaluateResult[], prompts: CompletedPrompt[], suite: TestSuite, evalId: string, config: Partial<UnifiedConfig> }` |
| beforeEach | Runs before each individual test | `{ test: TestCase }` |
| afterEach | Runs after each individual test | `{ test: TestCase, result: EvaluateResult }` |

### Session Management in Hooks

For multi-turn conversations or stateful interactions, hooks can be used to manage per-test sessions (i.e. "conversation threads").

#### Pre-Test Session Definition

A common pattern is to create a session on your server in the `beforeEach` hook and clean it up in the `afterEach` hook:

```javascript
export async function extensionHook(hookName, context) {
  if (hookName === 'beforeEach') {
    const res = await fetch('http://localhost:8080/session');
    const sessionId = await res.text();
    // Scope the session id to the current test case
    return { test: { ...context.test, vars: { ...context.test.vars, sessionId } } };
  }
  if (hookName === 'afterEach') {
    // Read the session id from the test case scope
    const id = context.test.vars.sessionId;
    await fetch(`http://localhost:8080/session/${id}`, { method: 'DELETE' });
  }
}
```

See the working [stateful-session-management example](https://github.com/promptfoo/promptfoo/tree/main/examples/config-stateful-session-management) for a complete implementation.

#### Test-Time Session Definition

Session ids returned by your provider in `response.sessionId` will be used as the session id for the test case. If the provider does not return a session id, the test variables (`vars.sessionId`) will be used as a fallback.

**For HTTP providers**, you extract session IDs from server responses using a `sessionParser` configuration. The session parser tells promptfoo how to extract the session ID from response headers or body, which then becomes `response.sessionId`. For example:

```yaml
providers:
  - id: http
    config:
      url: 'https://example.com/api'
      # Session parser extracts ID from response → becomes response.sessionId
      sessionParser: 'data.headers["x-session-id"]'
      headers:
        # Use the extracted session ID in subsequent requests
        'x-session-id': '{{sessionId}}'
```

See the [HTTP provider session management documentation](/docs/providers/http#session-management) for complete details on configuring session parsers.

The resulting session ID is made available in the `afterEach` hook context at:

```javascript
context.result.metadata.sessionId;
```

**Note:** For regular providers, the sessionId comes from either `response.sessionId` (provider-generated via session parser or direct provider support) or `vars.sessionId` (set in a beforeEach hook or test config). The priority is: `response.sessionId` > `vars.sessionId`.

For example:

```javascript
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    const sessionId = context.result.metadata.sessionId;
    if (sessionId) {
      console.log(`Test completed with session: ${sessionId}`);
      // You can use this sessionId for tracking, logging, or cleanup
    }
  }
}
```

For iterative red team strategies (e.g., jailbreak, tree search), the `sessionIds` array is made available in the `afterEach` hook context at:

```javascript
context.result.metadata.sessionIds;
```

This is an array containing all session IDs from the iterative exploration process. Each iteration may have its own session ID, allowing you to track the full conversation history across multiple attempts.
Example usage for iterative providers:

```javascript
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    // For regular providers - single session ID
    const sessionId = context.result.metadata.sessionId;

    // For iterative providers (jailbreak, tree search) - array of session IDs
    const sessionIds = context.result.metadata.sessionIds;

    if (sessionIds && Array.isArray(sessionIds)) {
      console.log(`Jailbreak completed with ${sessionIds.length} iterations`);
      sessionIds.forEach((id, index) => {
        console.log(`  Iteration ${index + 1}: session ${id}`);
      });
      // You can use these sessionIds for detailed tracking of the attack path
    }
  }
}
```

Note: The `sessionIds` array only contains defined session IDs - any iterations without a session ID are filtered out.

### Implementing Hooks

To implement these hooks, create a JavaScript or Python file with a function that handles the hooks you want to use. Then, specify the path to this file and the function name in the `extensions` array in your configuration.

:::note
A custom function name receives all event types (`beforeAll`, `afterAll`, `beforeEach`, `afterEach`) with the legacy `(hookName, context)` calling convention. If the function name is exactly one of the hook names, promptfoo only runs it for that hook and calls it as `(context, { hookName })`.
:::

Example configuration:

```yaml
extensions:
  - file://path/to/your/extension.js:extensionHook
  - file://path/to/your/extension.py:extension_hook
```

:::important
When specifying an extension in the configuration, you must include the function name after the file path, separated by a colon (`:`). This tells promptfoo which function to call in the extension file.
:::

Python example extension file:

```python
from typing import Optional

def extension_hook(hook_name, context) -> Optional[dict]:
    # Perform any necessary setup
    if hook_name == 'beforeAll':
        print(f"Setting up test suite: {context['suite'].get('description', '')}")
        # Add an additional test case to the suite:
        context["suite"]["tests"].append(
            {
                "vars": {
                    "body": "It's a beautiful day",
                    "language": "Spanish",
                },
                "assert": [{"type": "contains", "value": "Es un día hermoso."}],
            }
        )
        # Add an additional default assertion to the suite:
        context["suite"]["defaultTest"]["assert"].append({"type": "is-json"})
        return context

    # Perform any necessary teardown or reporting
    elif hook_name == 'afterAll':
        print(f"Test suite completed: {context['suite'].get('description', '')}")
        print(f"Total tests: {len(context['results'])}")

    # Prepare for individual test
    elif hook_name == 'beforeEach':
        print(f"Running test: {context['test'].get('description', '')}")
        # Change all languages to pirate-dialect
        context["test"]["vars"]["language"] = f'Pirate {context["test"]["vars"]["language"]}'
        return context

    # Clean up after individual test or log results
    elif hook_name == 'afterEach':
        print(f"Test completed: {context['test'].get('description', '')}. Pass: {context['result'].get('success', False)}")
```

JavaScript example extension file:

```javascript
async function extensionHook(hookName, context) {
  // Perform any necessary setup
  if (hookName === 'beforeAll') {
    console.log(`Setting up test suite: ${context.suite.description || ''}`);
    // Add an additional test case to the suite:
    context.suite.tests.push({
      vars: {
        body: "It's a beautiful day",
        language: 'Spanish',
      },
      assert: [{ type: 'contains', value: 'Es un día hermoso.' }],
    });
    return context;
  }
  // Perform any necessary teardown or reporting
  else if (hookName === 'afterAll') {
    console.log(`Test suite completed: ${context.suite.description || ''}`);
    console.log(`Total tests: ${context.results.length}`);
  }
  // Prepare for individual test
  else if (hookName === 'beforeEach') {
    console.log(`Running test: ${context.test.description || ''}`);
    // Change all languages to pirate-dialect
    context.test.vars.language = `Pirate ${context.test.vars.language}`;
    return context;
  }
  // Clean up after individual test or log results
  else if (hookName === 'afterEach') {
    console.log(
      `Test completed: ${context.test.description || ''}. Pass: ${context.result.success || false}`,
    );
  }
}

module.exports = extensionHook;
```
These hooks provide powerful extensibility to your promptfoo evaluations, allowing you to implement custom logic for setup, teardown, logging, or integration with other systems.

The extension function receives the `hookName` and a `context` object, which contains relevant data for each hook type. You can use this information to perform actions specific to each stage of the evaluation process.

The `beforeAll`, `beforeEach`, and `afterEach` hooks may mutate specific properties of their respective `context` arguments in order to modify evaluation state. To persist these changes, the hook must return the modified context. All merges are **shallow**: returned properties replace existing values at the top level. Nested objects (e.g., `metadata: { nested: { a: 1 } }`) are replaced entirely, not deep-merged.

#### beforeAll

| Property | Type | Description |
| --- | --- | --- |
| `context.suite.prompts` | [`Prompt[]`](#prompt) | The prompts to be evaluated. |
| `context.suite.providerPromptMap` | `Record` | A map of provider IDs to prompt labels. |
| `context.suite.tests` | [`TestCase[]`](#test-case) | The test cases to be evaluated. |
| `context.suite.scenarios` | [`Scenario[]`](#scenario) | The [scenarios](/docs/configuration/scenarios) to be evaluated. |
| `context.suite.defaultTest` | [`TestCase`](#test-case) | The default test case to be evaluated. |
| `context.suite.nunjucksFilters` | `Record` | A map of [Nunjucks](https://mozilla.github.io/nunjucks/) filters. |
| `context.suite.derivedMetrics` | `DerivedMetric[]` | [Derived metrics](/docs/configuration/expected-outputs#creating-derived-metrics). |
| `context.suite.redteam` | `RedteamConfig` | The [red team](/docs/red-team) configuration to be evaluated. |

#### beforeEach

| Property | Type | Description |
| --- | --- | --- |
| `context.test` | [`TestCase`](#test-case) | The test case to be evaluated. |

#### afterEach

| Property | Type | Description |
| --- | --- | --- |
| `context.result.namedScores` | `Record` | Custom numeric metrics (e.g., `num_turns`, `cost_usd`). |
| `context.result.metadata` | `Record` | Structured data (e.g., tool call details, URLs). |
| `context.result.response.metadata` | `Record` | Response-level metadata (e.g., session viewer URLs). |

Fields like `success`, `score`, and `response.output` are **not** overridable from `afterEach`.
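For example, an `afterEach` hook can attach a custom metric and extra metadata to the result. This is a minimal sketch that follows the return-the-modified-context pattern shown above; the metric name and values are illustrative:

```javascript
async function extensionHook(hookName, context) {
  if (hookName === 'afterEach') {
    return {
      result: {
        ...context.result,
        // Spread the existing values: merges are shallow, so returned objects replace them wholesale
        namedScores: { ...context.result.namedScores, num_turns: 3 },
        metadata: { ...context.result.metadata, reviewed: true },
      },
    };
  }
}

module.exports = extensionHook;
```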
#### afterAll

The `afterAll` hook is intended for side effects (sending to monitoring, cleanup, etc.) and its return value is not persisted. Use it for read-only operations on the completed evaluation.

| Property | Type | Description |
| --- | --- | --- |
| `context.suite` | [`TestSuite`](#testsuite) | The completed test suite |
| `context.results` | [`EvaluateResult`](#evaluateresult)[] | All evaluation results as plain objects |
| `context.prompts` | [`CompletedPrompt`](#completedprompt)[] | Completed prompts with metrics |
| `context.evalId` | string | Unique identifier for this eval run |
| `context.config` | `Partial<UnifiedConfig>` | The full evaluation configuration |

## Provider-related types

### Guardrails

GuardrailResponse is an object that represents guardrail results returned by a provider. It includes flags indicating whether the prompt or output failed guardrails.

```typescript
interface GuardrailResponse {
  flagged?: boolean;
  flaggedInput?: boolean;
  flaggedOutput?: boolean;
  reason?: string;
}
```

## Transformation Pipeline

Understanding the transformation pipeline is crucial for complex evaluations, especially for [RAG systems](/docs/guides/evaluate-rag) which require [context-based assertions](/docs/configuration/expected-outputs/model-graded#context-based). Here's how transforms are applied:

### Execution Flow

```mermaid
graph LR
    subgraph Provider
        A
        B
    end
    subgraph Test Cases
        D
        E
        F
        G
    end
    A[API Response] --> B[Provider transformResponse]
    B --> D[options.transform]
    B --> E[Assert contextTransform]
    D --> F[Test Assertions]
    E --> G[Context Assertions]
```

### Complete Example: RAG System Evaluation

This example demonstrates how different transforms work together in a RAG evaluation:

```yaml
providers:
  - id: 'http://localhost:3000/api/rag'
    config:
      # Step 1: Provider transform - normalize API response structure
      transformResponse: |
        // API returns: { status: "success", data: { answer: "...", sources: [...] } }
        // Transform to: { answer: "...", sources: [...] }
        json.data

tests:
  - vars:
      query: 'What is the refund policy?'
    options:
      # Step 2a: Test transform - extract answer for general assertions
      # Receives output from transformResponse: { answer: "...", sources: [...] }
      transform: 'output.answer'
    assert:
      # Regular assertion uses test-transformed output (just the answer string)
      - type: contains
        value: '30 days'

      # Context assertions use contextTransform
      - type: context-faithfulness
        # Step 2b: Context transform - extract sources
        # Also receives output from transformResponse: { answer: "...", sources: [...] }
        contextTransform: 'output.sources.map(s => s.content).join("\n")'
        threshold: 0.9

      # Another assertion can have its own transform
      - type: equals
        value: 'confident'
        # Step 3: Assertion-level transform (applied after test transform)
        # Receives: "30-day refund policy" (the test-transformed output)
        transform: |
          output.includes("30") ? "confident" : "uncertain"
```

### Key Points

1. **Provider Transform** (`transformResponse`): Applied first to normalize provider responses
2. **Test Case Transforms**:
   - `options.transform`: Modifies output for regular assertions
   - `contextTransform`: Extracts context for context-based assertions
   - Both receive the provider-transformed output directly
3. **Assertion Transform**: Applied to already-transformed output for specific assertions
### ProvidersConfig

```typescript
type ProvidersConfig =
  | string
  | ProviderFunction
  | ApiProvider
  | (string | ProviderFunction | ApiProvider | Record<string, ProviderOptions> | ProviderOptions)[];
```

### ProviderFunction

A ProviderFunction is a function that takes a prompt as an argument and returns a Promise that resolves to a ProviderResponse. It allows you to define custom logic for calling an API.

```typescript
type ProviderFunction = (
  prompt: string,
  context?: CallApiContextParams,
  options?: { includeLogProbs?: boolean; abortSignal?: AbortSignal },
) => Promise<ProviderResponse>;
```
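For example, a JavaScript config can supply a provider function inline. This is a minimal sketch: the echo logic stands in for your own API call, and the file and prompt contents are illustrative:

```javascript
// promptfooconfig.js
module.exports = {
  prompts: ['Say hello to {{name}}'],
  providers: [
    async (prompt, context) => {
      // Call your own API here; this stand-in just echoes the rendered prompt
      return {
        output: `Echo: ${prompt}`,
        tokenUsage: { total: 0, prompt: 0, completion: 0 },
      };
    },
  ],
  tests: [{ vars: { name: 'World' } }],
};
```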
### CallApiContextParams

`CallApiContextParams` is the context passed to provider `callApi` implementations and model-graded assertion providers.

```typescript
interface CallApiContextParams {
  filters?: Record<string, (...args: any[]) => string>;
  getCache?: any;
  logger?: any;
  originalProvider?: ApiProvider;
  prompt: Prompt;
  vars: Record<string, VarValue>;
  debug?: boolean;
  test?: AtomicTestCase;
  bustCache?: boolean;

  // W3C Trace Context headers
  traceparent?: string;
  tracestate?: string;

  // Evaluation metadata
  evaluationId?: string;
  testCaseId?: string;
  testIdx?: number;
  promptIdx?: number;
  repeatIndex?: number;
}
```

### ProviderOptions

ProviderOptions is an object that includes the `id` of the provider and an optional `config` object that can be used to pass provider-specific configurations.

For providers with built-in cost estimation, `config` can also include pricing overrides such as `cost`, `inputCost`, and `outputCost`. When supported, `inputCost` and `outputCost` take precedence over the legacy shared `cost` value. OpenAI audio-capable models also support `audioCost`, `audioInputCost`, and `audioOutputCost`.

```typescript
interface ProviderOptions {
  id?: ProviderId;
  label?: string;
  config?: any;

  // List of prompt labels to include (exact, group prefix like "group", or wildcard "group:*")
  prompts?: string[];

  // Transform the output, either with inline Javascript, external py/js script, or a function
  // See /docs/configuration/guide#transforming-outputs
  transform?: string | TransformFunction;

  // Sleep this long before each request
  delay?: number;

  // Provider-specific environment overrides
  env?: EnvOverrides;

  // Multi-input definitions for red team targets
  inputs?: Inputs;
}
```

### ProviderResponse

ProviderResponse is an object that represents the response from a provider. It includes the output from the provider, any error that occurred, information about token usage, and a flag indicating whether the response was cached.

```typescript
interface ProviderResponse {
  cached?: boolean;
  cost?: number; // required for cost assertion (see /docs/configuration/expected-outputs/deterministic#cost)
  error?: string;
  output?: any;
  raw?: any;
  prompt?: string | ChatMessage[]; // actual prompt sent, if different from rendered prompt
  metadata?: {
    redteamFinalPrompt?: string;
    http?: {
      status: number;
      statusText: string;
      headers?: Record<string, string>;
      requestHeaders?: Record<string, string>;
    };
    [key: string]: any;
  };
  tokenUsage?: TokenUsage;
  materializationHandled?: boolean;
  materializedVars?: Record<string, any>;
  inputMaterialization?: Record<string, any>;
  providerTransformedOutput?: any;
  logProbs?: number[]; // required for perplexity assertion (see /docs/configuration/expected-outputs/deterministic#perplexity)
  latencyMs?: number;
  isRefusal?: boolean; // the provider has explicitly refused to generate a response (see /docs/configuration/expected-outputs/deterministic#is-refusal)
  finishReason?: string;
  sessionId?: string;
  conversationEnded?: boolean;
  conversationEndReason?: string;
  guardrails?: GuardrailResponse;
  isBase64?: boolean;
  format?: string;
  audio?: {
    id?: string;
    data?: string;
    blobRef?: BlobRef;
    transcript?: string;
    [key: string]: any;
  };
  video?: { id?: string; blobRef?: BlobRef; url?: string; model?: string; [key: string]: any };
  images?: ImageOutput[];
}
```

### ProviderEmbeddingResponse

ProviderEmbeddingResponse is an object that represents the response from a provider's embedding API. It includes the embedding from the provider, any error that occurred, and information about token usage.

```typescript
interface ProviderEmbeddingResponse {
  cached?: boolean;
  cost?: number;
  error?: string;
  embedding?: number[];
  latencyMs?: number;
  tokenUsage?: Partial<TokenUsage>;
  metadata?: {
    transformed?: boolean;
    originalText?: string;
    [key: string]: any;
  };
}
```

## Evaluation inputs

### TestSuite

`TestSuite` is the resolved runtime suite passed to extension hooks after providers, prompts, tests, filters, and other config have been loaded.

```typescript
interface TestSuite {
  tags?: Record<string, string>;
  description?: string;
  providers: ApiProvider[];
  prompts: Prompt[];
  providerPromptMap?: Record<string, string[]>;
  tests?: TestCase[];
  scenarios?: Scenario[];
  defaultTest?: `file://${string}` | Omit<TestCase, 'description'>;
  nunjucksFilters?: Record<string, (...args: any[]) => string>;
  env?: EnvOverrides;
  derivedMetrics?: DerivedMetric[];
  extensions?: string[] | null;
  redteam?: RedteamConfig;
  tracing?: TracingConfig;
}
```

### TestSuiteConfiguration

The source type name for this pre-parse configuration shape is `TestSuiteConfig`.

```typescript
interface TestSuiteConfig {
  // Optional tags to describe the test suite
  tags?: Record<string, string>;

  // Optional description of what you're trying to test
  description?: string;

  // One or more LLM APIs to use, for example: openai:gpt-5-mini, openai:gpt-5, localai:chat:vicuna
  providers: ProvidersConfig;

  // One or more prompts
  prompts: string | (string | Prompt)[] | Record<string, string>;

  // Path to a test file, OR list of LLM prompt variations (aka "test case")
  tests?: string | (string | TestCase | TestGeneratorConfig)[] | TestGeneratorConfig;

  // Scenarios, groupings of data and tests to be evaluated
  scenarios?: (string | Scenario)[];

  // Sets the default properties for each test case. Useful for setting an assertion, on all test cases, for example.
  defaultTest?: `file://${string}` | Omit<TestCase, 'description'>;

  // Path to write output. Writes to console/web viewer if not set.
  outputPath?: string | string[];

  // Determines whether or not sharing is enabled.
  sharing?:
    | boolean
    | {
        apiBaseUrl?: string;
        appBaseUrl?: string;
      };

  // Nunjucks filters
  nunjucksFilters?: Record<string, string>;

  // Envar overrides
  env?: EnvOverrides | Record<string, string>;

  // Metrics to calculate after the eval has completed
  derivedMetrics?: DerivedMetric[];

  // Extension hooks
  extensions?: string[] | null;

  // Arbitrary metadata about this configuration
  metadata?: Record<string, any>;

  // Red team configuration
  redteam?: RedteamConfig;

  // Whether to write latest results to promptfoo storage. This enables you to use the web viewer.
  writeLatestResults?: boolean;

  // OpenTelemetry tracing configuration
  tracing?: TracingConfig;
}
```
### UnifiedConfig

UnifiedConfig is an object that includes the test suite configuration, evaluation options, and command line options. It is used to hold the complete configuration for the evaluation.

```typescript
interface UnifiedConfig extends Omit<TestSuiteConfig, 'providers'> {
  // Exactly one of providers or targets must be set.
  providers?: ProvidersConfig;
  targets?: ProvidersConfig;
  evaluateOptions?: EvaluateOptions;
  commandLineOptions?: Partial<CommandLineOptions>;
}
```

### Scenario

`Scenario` is an object that represents a group of test cases to be evaluated. It includes a description, default test case configuration, and a list of test cases.

```typescript
interface Scenario {
  description?: string;
  config: Partial<TestCase>[];
  tests: TestCase[];
}
```

Also, see [this table here](/docs/configuration/scenarios#configuration) for descriptions.

### DerivedMetric

`DerivedMetric` calculates a metric from named assertion scores after the eval has completed.

```typescript
interface DerivedMetric {
  name: string;
  value: string | ((namedScores: Record<string, number>, context: RunEvalOptions) => number);
}
```
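For example, an F1-style score can be derived from two named assertion metrics. This sketch assumes `precision` and `recall` are named metrics defined via assertion `metric` labels:

```yaml
derivedMetrics:
  - name: f1
    value: '2 * precision * recall / (precision + recall)'
```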
### RunEvalOptions

`RunEvalOptions` is the per-row execution context passed into derived metric callbacks.

```typescript
interface RunEvalOptions {
  provider: ApiProvider;
  prompt: Prompt;
  delay: number;
  test: AtomicTestCase;
  testSuite?: TestSuite;
  nunjucksFilters?: Record<string, (...args: any[]) => string>;
  evaluateOptions?: EvaluateOptions;
  testIdx: number;
  promptIdx: number;
  repeatIndex: number;
  conversations?: Record<
    string,
    { prompt: string | object; input: string; output: string | object; metadata?: object }[]
  >;
  registers?: Record<string, any>;
  isRedteam: boolean;
  concurrency?: number;
  evalId?: string;
  abortSignal?: AbortSignal;
}
```

### Prompt

A `Prompt` is what it sounds like. When specifying a prompt object in a static config, it should look like this:

```typescript
type PromptConfigObject =
  | {
      id: string; // Path, usually prefixed with file://
      label?: string; // How to display it in outputs and web UI
      raw?: string; // Optional inline prompt text
    }
  | {
      raw: string; // Inline prompt text
      label: string; // How to display it in outputs and web UI
      id?: string;
      template?: string;
      display?: string; // Deprecated: use label
      function?: PromptFunction;
      config?: any; // Provider config merged for this prompt
    };
```

When passing a `Prompt` object directly to the Javascript library:

```typescript
interface Prompt {
  // The actual prompt
  raw: string;

  // How it should appear in the UI
  label: string;

  // A function to generate a prompt on a per-input basis. Overrides the raw prompt.
  function?: (context: {
    vars: Record<string, VarValue>;
    provider?: ApiProvider;
  }) => Promise<string | object>;
}
```

### TokenUsage

```typescript
interface TokenUsage {
  prompt?: number;
  completion?: number;
  cached?: number;
  total?: number;
  numRequests?: number;
  completionDetails?: CompletionTokenDetails;
  assertions?: TokenUsage;
}

interface CompletionTokenDetails {
  reasoning?: number;
  acceptedPrediction?: number;
  rejectedPrediction?: number;
  cacheReadInputTokens?: number;
  cacheCreationInputTokens?: number;
}
```

### PromptMetrics

`PromptMetrics` is passed to `EvaluateOptions.progressCallback` and stored on completed prompts.

```typescript
interface PromptMetrics {
  score: number;
  testPassCount: number;
  testFailCount: number;
  testErrorCount: number;
  assertPassCount: number;
  assertFailCount: number;
  totalLatencyMs: number;
  tokenUsage: TokenUsage;
  namedScores: Record<string, number>;
  namedScoresCount: Record<string, number>;
  namedScoreWeights?: Record<string, number>;
  redteam?: {
    pluginPassCount: Record<string, number>;
    pluginFailCount: Record<string, number>;
    strategyPassCount: Record<string, number>;
    strategyFailCount: Record<string, number>;
  };
  cost: number;
}
```

### EvaluateOptions

EvaluateOptions is an object that includes options for how the evaluation should be performed. It includes the maximum concurrency for API calls, whether to show a progress bar, a callback for progress updates, the number of times to repeat each test, and a delay between tests.

```typescript
interface EvaluateOptions {
  cache?: boolean;
  delay?: number;
  eventSource?: string;
  generateSuggestions?: boolean;
  suggestionsCount?: number;
  /** Deprecated: use maxConcurrency: 1 or -j 1 instead. */
  interactiveProviders?: boolean;
  maxConcurrency?: number;
  repeat?: number;
  showProgressBar?: boolean;
  timeoutMs?: number;
  maxEvalTimeMs?: number;
  isRedteam?: boolean;
  silent?: boolean;
  abortSignal?: AbortSignal;
  progressCallback?: (
    completed: number,
    total: number,
    index: number,
    evalStep: RunEvalOptions,
    metrics: PromptMetrics,
  ) => void;
}
```
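In a YAML config, these options live under the `evaluateOptions` key documented in the [Config](#config) table, for example (values are illustrative):

```yaml
evaluateOptions:
  maxConcurrency: 8
  repeat: 2
  delay: 500
  timeoutMs: 30000
  showProgressBar: true
```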
## Evaluation outputs

### EvaluateTable

EvaluateTable is an object that represents the results of the evaluation in a tabular format. It includes a header with the prompts and variables, and a body with the outputs and variables for each test case.

```typescript
interface EvaluateTable {
  head: {
    prompts: CompletedPrompt[];
    vars: string[];
  };
  body: EvaluateTableRow[];
}

interface EvaluateTableRow {
  description?: string;
  outputs: EvaluateTableOutput[];
  vars: string[];
  test: AtomicTestCase;
  testIdx: number;
}
```

### EvaluateTableOutput

EvaluateTableOutput is an object that represents the output of a single evaluation in a tabular format. It includes the pass/fail result, score, output text, prompt, latency, token usage, and grading result.

```typescript
// 0 = none, 1 = assertion failure, 2 = error
type ResultFailureReason = 0 | 1 | 2;

interface EvaluateTableOutput {
  cost: number;
  failureReason: ResultFailureReason;
  gradingResult?: GradingResult | null;
  id: string;
  latencyMs: number;
  metadata?: Record<string, any>;
  namedScores: Record<string, number>;
  pass: boolean;
  prompt: string;
  provider?: string;
  response?: ProviderResponse;
  score: number;
  testCase: AtomicTestCase;
  text: string;
  tokenUsage?: Partial<TokenUsage>;
  error?: string | null;
  audio?: ProviderResponse['audio'];
  video?: ProviderResponse['video'];
  images?: ImageOutput[];
}
```

### EvaluateSummary

EvaluateSummary is an object that represents a summary of the evaluation results. It includes the version of the evaluator, the results of each evaluation, a table of the results, and statistics about the evaluation.

The latest version is 3. It removed the table and added a new `prompts` property.

```typescript
interface EvaluateSummaryV3 {
  version: 3;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  prompts: CompletedPrompt[];
  stats: EvaluateStats;
}
```

```typescript
interface EvaluateSummaryV2 {
  version: number;
  timestamp: string; // ISO 8601 datetime
  results: EvaluateResult[];
  table: EvaluateTable;
  stats: EvaluateStats;
}
```

### EvaluateStats

EvaluateStats is an object that includes statistics about the evaluation. It includes the number of successful and failed tests, and the total token usage.

```typescript
interface EvaluateStats {
  successes: number;
  failures: number;
  errors: number;
  tokenUsage: Required<TokenUsage>;
  durationMs?: number;
  generationDurationMs?: number;
  evaluationDurationMs?: number;
}
```

### EvaluateResult

EvaluateResult roughly corresponds to a single "cell" in the grid comparison view. It includes information on the provider, prompt, and other inputs, as well as the outputs.

```typescript
interface EvaluateResult {
  id?: string;
  description?: string;
  promptIdx: number;
  testIdx: number;
  testCase: AtomicTestCase;
  promptId: string;
  provider: Pick<ProviderOptions, 'id' | 'label'>;
  prompt: Prompt;
  vars: Record<string, VarValue>;
  response?: ProviderResponse;
  error?: string | null;
  failureReason: ResultFailureReason;
  success: boolean;
  score: number;
  latencyMs: number;
  gradingResult?: GradingResult | null;
  namedScores: Record<string, number>;
  cost?: number;
  metadata?: Record<string, any>;
  tokenUsage?: Required<TokenUsage>;
}
```

### GradingResult

GradingResult is an object that represents the result of grading a test case. It includes whether the test case passed, the score, the reason for the result, the tokens used, and the results of any component assertions.

```typescript
interface ResultSuggestion {
  type: string;
  action: 'replace-prompt' | 'pre-filter' | 'post-filter' | 'note';
  value: string;
}

interface GradingResult {
  pass: boolean; // did test pass?
  score: number; // score between 0 and 1
  reason: string; // plaintext reason for outcome
  namedScores?: Record<string, number>; // labeled metrics attached to this result
  namedScoreWeights?: Record<string, number>; // weighted denominator for namedScores
  tokensUsed?: TokenUsage; // tokens consumed by the test
  componentResults?: GradingResult[]; // nested component results
  assertion?: Assertion; // source assertion
  comment?: string; // user comment
  suggestions?: ResultSuggestion[]; // suggested follow-up actions
  metadata?: {
    pluginId?: string;
    strategyId?: string;
    context?: string | string[];
    contextUnits?: string[];
    renderedAssertionValue?: string;
    renderedGradingPrompt?: string;
    graderError?: true;
    [key: string]: any;
  };
}
```

### CompletedPrompt

CompletedPrompt is an object that represents a prompt that has been evaluated. It includes the raw prompt, the provider, metrics, and other information.

```typescript
interface CompletedPrompt {
  id?: string;
  raw: string;
  template?: string;
  display?: string;
  label: string;
  function?: PromptFunction;
  // These config options are merged into the provider config.
  config?: any;
  provider: string;
  metrics?: {
    score: number;
    testPassCount: number;
    testFailCount: number;
    testErrorCount: number;
    assertPassCount: number;
    assertFailCount: number;
    totalLatencyMs: number;
    tokenUsage: TokenUsage;
    namedScores: Record<string, number>;
    namedScoresCount: Record<string, number>;
    namedScoreWeights?: Record<string, number>;
    redteam?: {
      pluginPassCount: Record<string, number>;
      pluginFailCount: Record<string, number>;
      strategyPassCount: Record<string, number>;
      strategyFailCount: Record<string, number>;
    };
    cost: number;
  };
}
```