---
sidebar_position: 7
description: 'Comprehensive overview of model-graded evaluation techniques leveraging AI models to assess quality, safety, and accuracy'
---

# Model-graded metrics

promptfoo supports several types of model-graded assertions:

Output-based:

- [`llm-rubric`](/docs/configuration/expected-outputs/model-graded/llm-rubric) - Promptfoo's general-purpose grader; uses an LLM to evaluate outputs against custom criteria or rubrics.
- [`search-rubric`](/docs/configuration/expected-outputs/model-graded/search-rubric) - Like `llm-rubric` but with web search capabilities for verifying current information.
- [`model-graded-closedqa`](/docs/configuration/expected-outputs/model-graded/model-graded-closedqa) - Checks if LLM answers meet specific requirements using OpenAI's public evals prompts.
- [`factuality`](/docs/configuration/expected-outputs/model-graded/factuality) - Evaluates factual consistency between LLM output and a reference statement. Uses OpenAI's public evals prompt to determine if the output is factually consistent with the reference.
- [`g-eval`](/docs/configuration/expected-outputs/model-graded/g-eval) - Uses chain-of-thought prompting to evaluate outputs against custom criteria following the G-Eval framework.
- [`answer-relevance`](/docs/configuration/expected-outputs/model-graded/answer-relevance) - Evaluates whether LLM output is directly related to the original query.
- [`similar`](/docs/configuration/expected-outputs/similar) - Checks semantic similarity between output and expected value using embedding models.
- [`pi`](/docs/configuration/expected-outputs/model-graded/pi) - Alternative scoring approach using a dedicated evaluation model to score inputs/outputs against criteria.
- [`classifier`](/docs/configuration/expected-outputs/classifier) - Runs LLM output through HuggingFace text classifiers for detection of tone, bias, toxicity, and other properties. See [classifier grading docs](/docs/configuration/expected-outputs/classifier).
- [`moderation`](/docs/configuration/expected-outputs/moderation) - Uses OpenAI's moderation API to ensure LLM outputs are safe and comply with usage policies. See [moderation grading docs](/docs/configuration/expected-outputs/moderation).
- [`select-best`](/docs/configuration/expected-outputs/model-graded/select-best) - Compares multiple outputs from different prompts/providers and selects the best one based on custom criteria.
- [`max-score`](/docs/configuration/expected-outputs/model-graded/max-score) - Selects the output with the highest aggregate score based on other assertion results.

Context-based:

- [`context-recall`](/docs/configuration/expected-outputs/model-graded/context-recall) - Ensures that ground truth appears in context.
- [`context-relevance`](/docs/configuration/expected-outputs/model-graded/context-relevance) - Ensures that context is relevant to the original query.
- [`context-faithfulness`](/docs/configuration/expected-outputs/model-graded/context-faithfulness) - Ensures that LLM output is supported by context.

Conversational:

- [`conversation-relevance`](/docs/configuration/expected-outputs/model-graded/conversation-relevance) - Ensures that responses remain relevant throughout a conversation.

Trajectory-based:

- [`trajectory:goal-success`](#trajectorygoal-success) - Uses an LLM judge to decide whether a traced agent run achieved its goal.

Context-based assertions are particularly useful for evaluating RAG systems. For complete RAG evaluation examples, see the [RAG Evaluation Guide](/docs/guides/evaluate-rag).
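As a quick orientation before the detailed examples below, here is a minimal sketch of how a couple of these assertion types attach to an ordinary test case (the provider, prompt, and values are placeholders; any supported provider works):

```yaml
providers:
  - openai:gpt-5-mini # placeholder; swap in any supported provider

prompts:
  - |
    Answer the question using only the provided context.
    Question: {{query}}
    Context: {{context}}

tests:
  - vars:
      query: What is the capital of France?
      context: 'Paris is the capital of France. It has a population of over 2 million people.'
    assert:
      # Output-based: grade the answer against a freeform rubric
      - type: llm-rubric
        value: Answers the question directly, without hedging or apologies
      # Context-based: check that the answer is supported by the supplied context
      - type: context-faithfulness
        threshold: 0.8
```

Each assertion type below follows this same pattern; only the `type`, `value`, and grading behavior change.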
## Examples (output-based)

Example of `llm-rubric` and/or `model-graded-closedqa`:

```yaml
assert:
  - type: model-graded-closedqa # or llm-rubric
    # Make sure the LLM output adheres to this criteria:
    value: Is not apologetic
```

Example of factuality check:

```yaml
assert:
  - type: factuality
    # Make sure the LLM output is consistent with this statement:
    value: Sacramento is the capital of California
```

For more information on factuality, see the [guide on LLM factuality](/docs/guides/factuality-eval).

Example of pi scorer:

```yaml
assert:
  - type: pi
    # Evaluate output based on this criteria:
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
```

## trajectory:goal-success {#trajectorygoal-success}

Use `trajectory:goal-success` when you care about whether an agent actually completed a task, not just whether it used a specific tool or produced a plausible final sentence. This assertion requires trace data. Promptfoo summarizes the traced trajectory, includes the final output, and asks a grading model whether the run achieved the goal you specify.

```yaml
tests:
  - vars:
      order_id: '123'
    assert:
      - type: trajectory:goal-success
        value: 'Determine the shipping status for order {{ order_id }} and tell the user whether it has shipped'
```

Like other model-graded assertions, you can set `threshold`, `provider`, or `rubricPrompt`:

```yaml
tests:
  - assert:
      - type: trajectory:goal-success
        value: Resolve the user's issue and provide the correct next step
        threshold: 0.8
        provider: openai:gpt-5-mini
```

This works best alongside deterministic trajectory checks such as [`trajectory:tool-used`](/docs/configuration/expected-outputs/deterministic/#trajectorytool-used), [`trajectory:tool-args-match`](/docs/configuration/expected-outputs/deterministic/#trajectorytool-args-match), or [`trajectory:tool-sequence`](/docs/configuration/expected-outputs/deterministic/#trajectorytool-sequence) when the exact path through the task also matters.

Prepend `not-` to flag runs that achieved a **forbidden** goal (`type: not-trajectory:goal-success`). Inversion only flips real grader verdicts — judge transport or parse failures still report as failures, so a broken judge cannot silently turn into a passing "did not achieve forbidden goal" result.

## Non-English Evaluation

For multilingual evaluation output with compatible assertion types, use a custom `rubricPrompt`:

```yaml
defaultTest:
  options:
    rubricPrompt: |
      [
        {
          "role": "system",
          "content": "Du bewertest Ausgaben nach Kriterien. Antworte mit JSON: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE Antworten auf Deutsch."
        },
        {
          "role": "user",
          "content": "Ausgabe: {{ output }}\nKriterium: {{ rubric }}"
        }
      ]
  assert:
    - type: llm-rubric
      # German: "Responds helpfully"
      value: 'Antwortet hilfreich'
    - type: g-eval
      # German: "Clear and precise"
      value: 'Klar und präzise'
    - type: model-graded-closedqa
      # German: "Gives direct answer"
      value: 'Gibt direkte Antwort'
```

The system message instructs the grader, in German, to evaluate outputs against the criteria and respond with JSON of the form `{"reason": "string", "pass": boolean, "score": number}`, answering entirely in German; the user message passes `{{ output }}` and `{{ rubric }}` as "Ausgabe" (output) and "Kriterium" (criterion). Keep comments out of the JSON itself so the rubric prompt remains valid JSON.

This produces German reasoning: `{"reason": "Die Antwort ist hilfreich und klar.", "pass": true, "score": 1.0}`

**Note:** This approach works with `llm-rubric`, `g-eval`, and `model-graded-closedqa`. Other assertions like `factuality` and `context-recall` require specific output formats and need assertion-specific prompts.
For more language options and alternative approaches, see the [llm-rubric language guide](/docs/configuration/expected-outputs/model-graded/llm-rubric#non-english-evaluation).

Here's an example output that indicates PASS/FAIL based on LLM assessment ([see example setup and outputs](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-self-grading)):

[![LLM prompt quality evaluation with PASS/FAIL expectations](https://user-images.githubusercontent.com/310310/236690475-b05205e8-483e-4a6d-bb84-41c2b06a1247.png)](https://user-images.githubusercontent.com/310310/236690475-b05205e8-483e-4a6d-bb84-41c2b06a1247.png)

### Using variables in the rubric

You can use test `vars` in the LLM rubric. This example uses the `question` variable to help detect hallucinations:

```yaml
providers:
  - openai:gpt-5-mini
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Says that it is uncertain or unable to answer the question: "{{question}}"'
tests:
  - vars:
      question: What's the weather in New York?
  - vars:
      question: Who won the latest football match between the Giants and 49ers?
```

## Examples (comparison)

The `select-best` assertion type is used to compare multiple outputs in the same TestCase row and select the one that best meets a specified criterion. Here's an example of how to use `select-best` in a configuration file:

```yaml
prompts:
  - 'Write a tweet about {{topic}}'
  - 'Write a very concise, funny tweet about {{topic}}'
providers:
  - openai:gpt-5
tests:
  - vars:
      topic: bananas
    assert:
      - type: select-best
        value: choose the funniest tweet
  - vars:
      topic: nyc
    assert:
      - type: select-best
        value: choose the tweet that contains the most facts
```

The `max-score` assertion type is used to objectively select the output with the highest score from other assertions:

```yaml
prompts:
  - 'Write a summary of {{article}}'
  - 'Write a detailed summary of {{article}}'
  - 'Write a comprehensive summary of {{article}} with key points'
providers:
  - openai:gpt-5
tests:
  - vars:
      article: 'AI safety research is accelerating...'
    assert:
      - type: contains
        value: 'AI safety'
      - type: contains
        value: 'research'
      - type: llm-rubric
        value: 'Summary captures the main points accurately'
      - type: max-score
        value:
          method: average # Use average of all assertion scores
        threshold: 0.7 # Require at least 70% score to pass
```

## Overriding the LLM grader

By default, model-graded asserts use promptfoo's built-in grading provider. Promptfoo chooses that provider from the credentials available in the environment; for example, OpenAI, Anthropic, Gemini, Mistral, GitHub Models, Azure OpenAI, and Codex login credentials can each activate a different default. If you do not have access to the selected default or prefer a different judge, you can override the grader.

There are several ways to do this, depending on your preferred workflow:

1. Using the `--grader` CLI option:

   ```
   promptfoo eval --grader openai:gpt-5-mini
   ```

2. Using `test.options` or `defaultTest.options` on a per-test or test suite basis:

   ```yaml
   defaultTest:
     options:
       provider: openai:gpt-5-mini
   tests:
     - description: Use LLM to evaluate output
       assert:
         - type: llm-rubric
           value: Is spoken like a pirate
   ```
3. Using `assertion.provider` on a per-assertion basis:

   ```yaml
   tests:
     - description: Use LLM to evaluate output
       assert:
         - type: llm-rubric
           value: Is spoken like a pirate
           provider: openai:gpt-5-mini
   ```

Use the `provider.config` field to set custom parameters such as `temperature`, `max_tokens`, or API host:

```yaml
tests:
  - assert:
      - type: llm-rubric
        value: Is not apologetic and provides a clear, concise answer
        provider:
          id: openai:gpt-5-mini
          config:
            temperature: 0
```

This works at every level where a grader can be set — per-assertion (`assertion.provider`), per-test (`test.options.provider`), and globally (`defaultTest.options.provider`). If you configure a full provider object globally, do not also add a shorthand `provider: openai:chat:...` to the assertion. Assertion-level providers take precedence, so the global provider object's `config` values such as `apiBaseUrl`, `apiKey`, `temperature`, or `showThinking` will not be inherited. Either remove the assertion-level provider or repeat the full provider object there.

:::note
The built-in OpenAI grader already uses `temperature=0` by default, so you only need to set it when overriding the grader with a custom `provider` block that would otherwise inherit a non-zero default. GPT-5 series reasoning models ignore `temperature` entirely.

The built-in OpenAI grader may spend hidden reasoning tokens internally, but promptfoo receives the final grader output without private reasoning text prepended to the output string. The `showThinking: false` guidance below is for OpenAI-compatible or local judge providers that return reasoning fields such as `reasoning` or `reasoning_content`.
:::

[Custom providers](/docs/providers/custom-api) are supported as well.

### OpenAI-compatible thinking judges

Self-hosted OpenAI-compatible judges such as [vLLM](/docs/providers/vllm), LocalAI, and llamafile can return reasoning in a separate field while putting the final answer in `content`. Set `showThinking: false` on the judge provider so promptfoo uses only the final `content` for grading:

```yaml
defaultTest:
  options:
    provider:
      id: openai:chat:llm_judge
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: empty
        temperature: 0
        max_tokens: 10000
        showThinking: false
```

This is not specific to `llm-rubric`: if reasoning leaks into the graded text, JSON-first metrics may parse scratchpad JSON instead of the final verdict, `answer-relevance` may embed questions with `Thinking:` prepended, RAG metrics may score scratchpad sentences or attribution markers, and `select-best` may read a scratchpad number as the winning index.

For vLLM specifically, `showThinking: false` only removes reasoning after vLLM has parsed it into a separate field such as `reasoning_content`. If `max_tokens` or the server context window is too small, vLLM may return an unfinished `<think>` block in `content`; increase the budget or disable thinking for judge requests. For vLLM models whose chat template enables thinking by default, you can also disable thinking at request time. See the [vLLM judge guide](/docs/providers/vllm#use-vllm-as-an-llm-judge) for complete Qwen, GPT-OSS, and GLM examples.

### Multiple graders

Some assertions (such as `answer-relevance`) use multiple types of providers.
To override both the embedding and text providers separately, you can do something like this:

```yaml
defaultTest:
  options:
    provider:
      text:
        id: azureopenai:chat:gpt-4-deployment
        config:
          apiHost: xxx.openai.azure.com
      embedding:
        id: azureopenai:embeddings:text-embedding-ada-002-deployment
        config:
          apiHost: xxx.openai.azure.com
```

If you are implementing a custom provider, `text` providers require a `callApi` function that returns a [`ProviderResponse`](/docs/configuration/reference/#providerresponse), whereas embedding providers require a `callEmbeddingApi` function that returns a [`ProviderEmbeddingResponse`](/docs/configuration/reference/#providerembeddingresponse).

## Overriding the rubric prompt

For the greatest control over the output of `llm-rubric`, you may set a custom prompt using the `rubricPrompt` property of `TestCase` or `Assertion`.

The rubric prompt has two built-in variables that you may use:

- `{{output}}` - The output of the LLM (you probably want to use this)
- `{{rubric}}` - The `value` of the llm-rubric `assert` object

:::tip Object handling in variables
When `{{output}}` or `{{rubric}}` contain objects, they are automatically converted to JSON strings by default to prevent display issues.

To access object properties directly (e.g., `{{output.text}}`), enable object property access:

```bash
export PROMPTFOO_DISABLE_OBJECT_STRINGIFY=true
promptfoo eval
```

For details, see the [object template handling guide](/docs/usage/troubleshooting#object-template-handling).
:::

In this example, we set `rubricPrompt` under `defaultTest`, which applies it to every test in this test suite:

```yaml
defaultTest:
  options:
    rubricPrompt: >
      [
        {
          "role": "system",
          "content": "Grade the output by the following specifications, keeping track of the points scored:\n\nDid the output mention {{x}}? +1 point\nDid the output describe {{y}}? +1 point\nDid the output ask to clarify {{z}}? +1 point\n\nCalculate the score but always pass the test. Output your response in the following JSON format:\n{pass: true, score: number, reason: string}"
        },
        {
          "role": "user",
          "content": "Output: {{ output }}"
        }
      ]
```

See the [full example](https://github.com/promptfoo/promptfoo/blob/main/examples/eval-custom-grading-prompt/promptfooconfig.yaml).

### Image-based rubric prompts

`llm-rubric` can also grade responses that reference images. Provide a `rubricPrompt` in OpenAI chat format that includes an image, and use a vision-capable provider such as `openai:gpt-5`.

```yaml
defaultTest:
  options:
    provider: openai:gpt-5
    rubricPrompt: |
      [
        {
          "role": "system",
          "content": "Evaluate if the answer matches the image. Respond with JSON {reason:string, pass:boolean, score:number}"
        },
        {
          "role": "user",
          "content": [
            { "type": "image_url", "image_url": { "url": "{{image_url}}" } },
            { "type": "text", "text": "Output: {{ output }}\nRubric: {{ rubric }}" }
          ]
        }
      ]
```

### select-best rubric prompt

For control over the `select-best` rubric prompt, you may use the variables `{{outputs}}` (a list of strings) and `{{criteria}}` (a string). Promptfoo expects the grading LLM's response to contain the index of the winning output.
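For example, you can supply a custom comparison prompt via `rubricPrompt`, the same way as for the other graders. This is a minimal sketch, assuming the judge's numeric reply is interpreted as a zero-based index into `{{outputs}}`:

```yaml
defaultTest:
  options:
    rubricPrompt: |
      You are comparing candidate outputs against this criteria: {{criteria}}

      {% for output in outputs %}
      Output {{ loop.index0 }}: {{ output }}
      {% endfor %}

      Reply with only the index of the best output.
```

Pair this with a multi-prompt config and a `select-best` assertion, as in the comparison example earlier on this page.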
## Classifiers

Classifiers can be used to detect tone, bias, toxicity, helpfulness, and much more. See [classifier documentation](/docs/configuration/expected-outputs/classifier).

---

## Context-based

Context-based assertions are a special class of model-graded assertions that evaluate whether the LLM's output is supported by context provided at inference time. They are particularly useful for evaluating RAG systems.

- [`context-recall`](/docs/configuration/expected-outputs/model-graded/context-recall) - Ensures that ground truth appears in context.
- [`context-relevance`](/docs/configuration/expected-outputs/model-graded/context-relevance) - Ensures that context is relevant to the original query.
- [`context-faithfulness`](/docs/configuration/expected-outputs/model-graded/context-faithfulness) - Ensures that LLM output is supported by context.

### Defining context

Context can be defined in one of two ways: statically using test case variables, or dynamically from the provider's response.

#### Statically via test variables

Set `context` as a variable in your test case:

```yaml
tests:
  - vars:
      context: 'Paris is the capital of France. It has a population of over 2 million people.'
    assert:
      - type: context-recall
        value: 'Paris is the capital of France'
        threshold: 0.8
```

#### Dynamically via Context Transform

Defining `contextTransform` allows you to construct context from provider responses. This is particularly useful for RAG systems.

```yaml
assert:
  - type: context-faithfulness
    contextTransform: 'output.citations.join("\n")'
    threshold: 0.8
```

The `contextTransform` property accepts a JavaScript expression as a string. The expression is evaluated with two variables in scope, `output` and `context`, and **must return a non-empty string.**

```typescript
/**
 * The context transform function signature.
 */
type ContextTransform = (output: Output, context: Context) => string;

/**
 * The provider's response output.
 */
type Output = string | object;

/**
 * Metadata about the test case, prompt, and provider response.
 */
type Context = {
  // Test case variables
  vars: Record<string, string | object>;

  // Raw prompt sent to LLM
  prompt: {
    label: string;
  };

  // Provider-specific metadata.
  // The documentation for each provider will describe any available metadata.
  metadata?: object;
};
```

For example, given the following provider response:

```typescript
/**
 * A response from a fictional Research Knowledge Base.
 */
type ProviderResponse = {
  output: {
    content: string;
  };
  metadata: {
    retrieved_docs: {
      content: string;
    }[];
  };
};
```

```yaml
assert:
  - type: context-faithfulness
    contextTransform: 'output.content'
    threshold: 0.8
  - type: context-relevance
    # Note: `ProviderResponse['metadata']` is accessible as `context.metadata`
    contextTransform: 'context.metadata.retrieved_docs.map(d => d.content).join("\n")'
    threshold: 0.7
```

If your expression may return `undefined` or `null`, for example because no context is available, add a fallback:

```yaml
contextTransform: 'output.context ?? "No context found"'
```

If you expect your context to be non-empty but it comes back empty, you can debug the provider response by returning a stringified version of it:

```yaml
contextTransform: 'JSON.stringify(output, null, 2)'
```

### Examples

Context-based metrics require a `query` and context. You must also set the `threshold` property on your test (all scores are normalized between 0 and 1).

Here's an example config using statically-defined (`test.vars.context`) context:
```yaml
prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
    Here is some context that you can use to write your response:
    {{context}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
      context: file://docs/reimbursement.md
    assert:
      - type: contains
        value: '$500'
      - type: factuality
        value: the employee's manager is responsible for approvals
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: max purchase price without approval is $500. Talk to Fred before submitting anything.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
  - vars:
      query: How many weeks is maternity leave?
      context: file://docs/maternity.md
    assert:
      - type: factuality
        value: maternity leave is 4 months
      - type: answer-relevance
        threshold: 0.9
      - type: context-recall
        threshold: 0.9
        value: The company offers 4 months of maternity leave, unless you are an elephant, in which case you get 22 months of maternity leave.
      - type: context-relevance
        threshold: 0.9
      - type: context-faithfulness
        threshold: 0.9
```

Alternatively, if your system returns context in the response, like in a RAG system, you can use `contextTransform`:

```yaml
prompts:
  - |
    You are an internal corporate chatbot.
    Respond to this query: {{query}}
providers:
  - openai:gpt-5
tests:
  - vars:
      query: What is the max purchase that doesn't require approval?
    assert:
      - type: context-recall
        contextTransform: 'output.context'
        threshold: 0.9
        value: max purchase price without approval is $500
      - type: context-relevance
        contextTransform: 'output.context'
        threshold: 0.9
      - type: context-faithfulness
        contextTransform: 'output.context'
        threshold: 0.9
```

## Transforming outputs for context assertions

### Transform: Extract answer before context grading

```yaml
providers:
  - echo
tests:
  - vars:
      prompt: '{"answer": "Paris is the capital of France", "confidence": 0.95}'
      context: 'France is a country in Europe. Its capital city is Paris, which has over 2 million residents.'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer' # Grade only the answer field
        threshold: 0.9
      - type: context-recall
        transform: 'JSON.parse(output).answer' # Check if answer appears in context
        value: 'Paris is the capital of France'
        threshold: 0.8
```

### Context transform: Extract context from provider response

```yaml
providers:
  - echo
tests:
  - vars:
      prompt: '{"answer": "Returns accepted within 30 days", "sources": ["Returns are accepted for 30 days from purchase", "30-day money-back guarantee"]}'
      query: 'What is the return policy?'
    assert:
      - type: context-faithfulness
        transform: 'JSON.parse(output).answer'
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Extract sources as context
        threshold: 0.9
      - type: context-relevance
        contextTransform: 'JSON.parse(output).sources.join(". ")' # Check if context is relevant to query
        threshold: 0.8
```

### Transform response: Normalize RAG system output

```yaml
providers:
  - id: http://rag-api.example.com/search
    config:
      transformResponse: 'json.data' # Extract data field from API response
tests:
  - vars:
      query: 'What are the office hours?'
    assert:
      - type: context-faithfulness
        transform: 'output.answer' # After transformResponse, extract answer
        contextTransform: 'output.documents.map(d => d.text).join(" ")' # Extract documents as context
        threshold: 0.85
```

**Processing order:** API call → `transformResponse` → `transform` → `contextTransform` → context assertion

## Common patterns and troubleshooting

### Understanding pass vs. score behavior

Model-graded assertions like `llm-rubric` determine PASS/FAIL using two mechanisms:

1. **Without threshold**: PASS depends only on the grader's `pass` field (defaults to `true` if omitted)
2. **With threshold**: PASS requires both `pass === true` AND `score >= threshold`

This means a result like `{"pass": true, "score": 0}` will pass without a threshold, but fail with `threshold: 1`.

**Common issue**: Tests show PASS even when scores are low

```yaml
# ❌ Problem: All tests pass regardless of score
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    # No threshold set - always passes if grader doesn't return explicit pass: false
```

**Solutions**:

```yaml
# ✅ Option A: Add threshold to make score drive PASS/FAIL
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    threshold: 1 # Only pass when score >= 1

# ✅ Option B: Have grader control pass explicitly
assert:
  - type: llm-rubric
    value: |
      Return {"pass": true, "score": 1} if the response is correct
      Return {"pass": false, "score": 0} if the response is incorrect
```

### Threshold usage across assertion types

Different assertion types use thresholds differently:

```yaml
assert:
  # Similarity-based (0-1 range)
  - type: context-faithfulness
    threshold: 0.8 # Requires 80%+ faithfulness

  # Binary scoring (0 or 1)
  - type: llm-rubric
    value: 'Is helpful and accurate'
    threshold: 1 # Requires perfect score

  # Custom scoring (any range)
  - type: pi
    value: 'Quality of response'
    threshold: 0.7
```

For more details on pass/score semantics, see the [llm-rubric documentation](/docs/configuration/expected-outputs/model-graded/llm-rubric#pass-vs-score-semantics).

## Other assertion types

For more info on assertions, see [Test assertions](/docs/configuration/expected-outputs).