---
sidebar_position: 8
description: 'Evaluate LLM outputs against custom criteria with the G-Eval framework using chain-of-thought prompting'
---

# G-Eval

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs against custom criteria. It's based on the paper ["G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"](https://arxiv.org/abs/2303.16634) (Liu et al., Microsoft).

## How to use it

To use G-Eval in your test configuration:

```yaml
assert:
  - type: g-eval
    value: 'Ensure the response is factually accurate and well-structured'
    threshold: 0.7 # Optional, defaults to 0.7
```

For non-English evaluation output, see the [multilingual evaluation guide](/docs/configuration/expected-outputs/model-graded#non-english-evaluation).

You can also provide multiple evaluation criteria as an array:

```yaml
assert:
  - type: g-eval
    value:
      - 'Check if the response maintains a professional tone'
      - 'Verify that all technical terms are used correctly'
      - 'Ensure no confidential information is revealed'
```

## How it works

G-Eval uses `gpt-4.1-2025-04-14` by default to evaluate outputs based on your specified criteria. The evaluation process:

1. Takes your evaluation criteria
2. Uses chain-of-thought prompting to analyze the output
3. Returns a normalized score between 0 and 1

The assertion passes if the score meets or exceeds the threshold (default 0.7). When `value` is an array, each criterion is graded independently and the scores are averaged; the averaged score is compared against the threshold. An empty array is a configuration error and fails with a clear reason.
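To make the averaging concrete, here is a minimal sketch. The per-criterion scores in the comments are hypothetical, purely for illustration; actual scores depend on the grader:

```yaml
assert:
  - type: g-eval
    value:
      - 'Explains the concept accurately' # hypothetical grader score: 0.9
      - 'Uses plain language' # hypothetical grader score: 0.6
    threshold: 0.7
# average = (0.9 + 0.6) / 2 = 0.75, which meets the 0.7 threshold, so the assertion passes
```

Note that a single very low criterion score can be masked by strong scores elsewhere; if each criterion must pass on its own, use separate `g-eval` assertions instead of an array.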
## Negation with `not-g-eval`

Prepend `not-` to invert the assertion, which is useful for "must not" criteria:

```yaml
assert:
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
```

`not-g-eval` passes when the grader score is **below** the threshold. Transport or parse failures from the grader are reported as failures in both directions: a grader error is not treated as evidence that the criterion was or was not met, so inversion never silently turns a failed grader call into a pass.
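Positive and negated assertions can be combined in a single `assert` block, for example to require helpfulness while forbidding PII leakage. The criteria wording here is illustrative, not prescribed:

```yaml
assert:
  - type: g-eval
    value: 'The response answers the user question helpfully'
    threshold: 0.7
  - type: not-g-eval
    value: 'The response leaks personally identifiable information'
    threshold: 0.7
```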
## Customizing the evaluator

Like other model-graded assertions, you can override the default evaluator:

```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider: openai:gpt-5-mini
```

Or globally via test options:

```yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini
```

To set grader parameters such as `temperature` for repeatability, expand the shorthand into an `id` + `config` block:

```yaml
assert:
  - type: g-eval
    value: 'Ensure response is factually accurate'
    provider:
      id: openai:gpt-5-mini
      config:
        temperature: 0
```

See the [llm-rubric grader override docs](/docs/configuration/expected-outputs/model-graded/llm-rubric#setting-grader-parameters-temperature-etc) for more detail.
## Example

Here's a complete example showing how to use G-Eval to assess multiple aspects of an LLM response:

```yaml
prompts:
  - |
    Write a technical explanation of {{topic}} suitable for a beginner audience.

providers:
  - openai:gpt-5

tests:
  - vars:
      topic: 'quantum computing'
    assert:
      - type: g-eval
        value:
          - 'Explains technical concepts in simple terms'
          - 'Maintains accuracy without oversimplification'
          - 'Includes relevant examples or analogies'
          - 'Avoids unnecessary jargon'
        threshold: 0.8
```
## Further reading

- [Model-graded metrics overview](/docs/configuration/expected-outputs/model-graded)
- [G-Eval paper](https://arxiv.org/abs/2303.16634)