---
title: Max-score assertion
description: Configure the `max-score` assertion to deterministically pick the highest-scoring output based on other assertions' scores.
sidebar_label: Max Score
---

# Max Score

The `max-score` assertion selects the output with the highest aggregate score from other assertions. Unlike `select-best`, which relies on LLM judgment, `max-score` makes an objective, deterministic selection from the quantitative scores those assertions produce.

## When to use max-score

Use `max-score` when you want to:

- Select the best output based on objective, measurable criteria
- Combine multiple metrics with different importance (weights)
- Have transparent, reproducible selection without extra LLM API calls for the selection step
- Select outputs based on a combination of correctness, quality, and other metrics

## How it works

1. All regular assertions run first on each output.
2. `max-score` collects the scores from these assertions.
3. It calculates an aggregate score for each output (average by default).
4. It selects the output with the highest aggregate score.
5. It returns `pass=true` for the highest-scoring output and `pass=false` for the others.
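
The steps above can be sketched in plain Python. This is an illustrative model only, not promptfoo's implementation: `select_max_score` and its input shape (one list of assertion scores per output) are hypothetical names chosen for the example, and weights are ignored here.

```python
from statistics import fmean


def select_max_score(scores_per_output: list[list[float]]) -> list[bool]:
    """Return one pass flag per output; only the top-scoring output passes."""
    # Step 3: aggregate each output's assertion scores (unweighted average).
    aggregates = [fmean(scores) for scores in scores_per_output]
    # Step 4: pick the highest aggregate. index() returns the first match,
    # so the earliest output wins a tie.
    winner = aggregates.index(max(aggregates))
    # Step 5: True for the winner, False for every other output.
    return [i == winner for i in range(len(aggregates))]


# Three outputs, each scored by three assertions (hypothetical scores).
flags = select_max_score([[1.0, 0.5, 1.0], [1.0, 1.0, 1.0], [0.0, 0.5, 1.0]])
print(flags)  # [False, True, False]
```

The second output has the highest average (1.0), so it is the only one flagged as passing.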

## Basic usage

```yaml
prompts:
  - 'Write a function to {{task}}'
  - 'Write an efficient function to {{task}}'
  - 'Write a well-documented function to {{task}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Regular assertions that score each output
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      - type: contains
        value: 'def fibonacci'
      # Max-score selects the output with highest average score
      - type: max-score
```

## Configuration options

### Aggregation method

Choose how scores are combined:

```yaml
assert:
  - type: max-score
    value:
      method: average # Options: average (default) | sum
```

### Weighted scoring

Give different importance to different assertions by specifying weights per assertion type:

```yaml
assert:
  - type: python # Test correctness
  - type: llm-rubric # Test quality
    value: 'Well documented'
  - type: max-score
    value:
      weights:
        python: 3 # Correctness is 3x more important
        llm-rubric: 1 # Documentation keeps the default weight of 1
```

#### How weights work

- Each assertion type can have a custom weight (default: 1.0)
- For `method: average`, the final score is: `sum(score × weight) / sum(weights)`
- For `method: sum`, the final score is: `sum(score × weight)`
- Weights apply to all assertions of that type

Example calculation with `method: average`:

```
Output A: python=1.0, llm-rubric=0.5, contains=1.0
Weights:  python=3,   llm-rubric=1,   contains=1 (default)

Score = (1.0×3 + 0.5×1 + 1.0×1) / (3 + 1 + 1)
      = (3.0 + 0.5 + 1.0) / 5
      = 0.9
```
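
The same arithmetic can be checked with a short Python sketch. The `aggregate` helper and the `(assertion_type, score)` pair format are assumptions made for illustration; they are not part of promptfoo's API. Scores are passed as pairs rather than a dict because one assertion type can appear several times, and its weight applies to each occurrence.

```python
def aggregate(scores, weights=None, method="average"):
    """Combine (assertion_type, score) pairs into a single number.

    `weights` maps assertion type to weight; unlisted types default to 1.0.
    """
    weights = weights or {}
    weighted_sum = 0.0
    total_weight = 0.0
    for assertion_type, score in scores:
        weight = weights.get(assertion_type, 1.0)
        weighted_sum += score * weight
        total_weight += weight
    if method == "sum":
        return weighted_sum
    # method == "average": sum(score x weight) / sum(weights)
    return weighted_sum / total_weight


output_a = [("python", 1.0), ("llm-rubric", 0.5), ("contains", 1.0)]
weights = {"python": 3, "llm-rubric": 1}  # "contains" falls back to 1.0

average_score = aggregate(output_a, weights)            # 4.5 / 5
sum_score = aggregate(output_a, weights, method="sum")  # 4.5
print(average_score, sum_score)
```

With `method: sum` the denominator is dropped, so outputs evaluated by more assertions can accumulate higher totals.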

### Minimum threshold

Require a minimum score for selection:

```yaml
assert:
  - type: max-score
    value:
      threshold: 0.7 # Only select if the aggregate score is >= 0.7
```
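
One way to picture the threshold is as a gate applied after the best output has been found. The sketch below assumes the threshold is compared against the winning output's aggregate score; `pick_winner` is a hypothetical helper, not a promptfoo function.

```python
from typing import Optional


def pick_winner(aggregates: list[float], threshold: Optional[float] = None) -> Optional[int]:
    """Return the index of the best output, or None if it misses the threshold."""
    # max() over indices keeps the first index when scores are tied.
    best = max(range(len(aggregates)), key=lambda i: aggregates[i])
    if threshold is not None and aggregates[best] < threshold:
        return None  # even the best output scored below the threshold
    return best


print(pick_winner([0.55, 0.65, 0.60], threshold=0.7))  # None
print(pick_winner([0.55, 0.85, 0.60], threshold=0.7))  # 1
```

Without a threshold, the highest-scoring output is always returned, however low its score.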

## Scoring details

- **Binary assertions** (pass/fail): Score as 1.0 or 0.0
- **Scored assertions**: Use the numeric score (typically 0-1 range)
- **Default weights**: 1.0 for all assertions
- **Tie breaking**: First output wins (deterministic)
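
The mapping from assertion results to scores can be sketched as follows. The result shape (`{"pass": ..., "score": ...}`) and the `to_score` helper are assumptions for illustration and may not match promptfoo's internal types.

```python
def to_score(result: dict) -> float:
    """Map one assertion result to a numeric score.

    Assumes a hypothetical result shape with a boolean "pass" key and an
    optional numeric "score" key.
    """
    score = result.get("score")
    if score is None:
        # Binary assertion: pass/fail becomes 1.0 / 0.0.
        return 1.0 if result["pass"] else 0.0
    # Scored assertion: use the numeric score as reported.
    return float(score)


results = [
    {"pass": True},                 # e.g. a passing `contains` check
    {"pass": False},                # e.g. a failing `is-json` check
    {"pass": True, "score": 0.75},  # e.g. an `llm-rubric` grade
]
scores = [to_score(r) for r in results]
print(scores)  # [1.0, 0.0, 0.75]
```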

## Examples

### Example 1: Multi-criteria code selection

```yaml
prompts:
  - 'Write a Python function to {{task}}'
  - 'Write an optimized Python function to {{task}}'
  - 'Write a documented Python function to {{task}}'

providers:
  - openai:gpt-5-mini

tests:
  - vars:
      task: 'merge two sorted lists'
    assert:
      - type: python
        value: |
          list1 = [1, 3, 5]
          list2 = [2, 4, 6]
          result = merge_lists(list1, list2)
          assert result == [1, 2, 3, 4, 5, 6]

      - type: llm-rubric
        value: 'Code has O(n+m) time complexity'

      - type: llm-rubric
        value: 'Code is well documented with docstring'

      - type: max-score
        value:
          weights:
            python: 3 # Correctness most important
            llm-rubric: 1 # Each quality metric has weight 1
```

### Example 2: Content generation selection

```yaml
prompts:
  - 'Explain {{concept}} simply'
  - 'Explain {{concept}} in detail'
  - 'Explain {{concept}} with examples'

providers:
  - anthropic:claude-3-haiku-20240307

tests:
  - vars:
      concept: 'machine learning'
    assert:
      - type: llm-rubric
        value: 'Explanation is accurate'

      - type: llm-rubric
        value: 'Explanation is clear and easy to understand'

      - type: contains
        value: 'example'

      - type: max-score
        value:
          method: average # All criteria equally important
```

### Example 3: API response selection

```yaml
tests:
  - vars:
      query: 'weather in Paris'
    assert:
      - type: is-json

      - type: contains-json
        value:
          required: ['temperature', 'humidity', 'conditions']

      - type: llm-rubric
        value: 'Response includes all requested weather data'

      - type: latency
        threshold: 1000 # Under 1 second

      - type: max-score
        value:
          weights:
            is-json: 2 # Must be valid JSON
            contains-json: 2 # Must have required fields
            llm-rubric: 1 # Quality check
            latency: 1 # Performance matters
```

## Comparison with select-best

| Feature          | max-score                        | select-best         |
| ---------------- | -------------------------------- | ------------------- |
| Selection method | Aggregate scores from assertions | LLM judgment        |
| API calls        | No extra calls (reuses scores)   | One per eval        |
| Reproducibility  | Deterministic                    | May vary            |
| Best for         | Objective criteria               | Subjective criteria |
| Transparency     | Shows exact scores               | Shows LLM reasoning |
| Cost             | Free (no extra API calls)        | Costs per API call  |

## Edge cases

- **No other assertions**: Error - max-score requires at least one assertion to aggregate
- **Tie scores**: First output wins (by index)
- **All outputs fail**: Still selects the highest scorer ("least bad")
- **Below threshold**: No output selected if threshold is specified and not met
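
A few of these edge cases can be modeled in a small Python sketch. It is a simplified illustration under stated assumptions (unweighted averages, a hypothetical `select_index` helper, and a `ValueError` standing in for the error), not promptfoo's actual behavior or error type.

```python
def select_index(scores_per_output):
    """Pick the index of the output with the highest average score."""
    if not scores_per_output or any(not s for s in scores_per_output):
        # Nothing to aggregate: models the "no other assertions" error case.
        raise ValueError("max-score requires at least one assertion to aggregate")
    averages = [sum(s) / len(s) for s in scores_per_output]
    # index() returns the first maximum, so ties go to the earliest output.
    return averages.index(max(averages))


# Every output does poorly; the "least bad" one is still selected.
print(select_index([[0.0, 0.0], [0.0, 0.5], [0.0, 0.25]]))  # 1
# Identical scores: the first output wins.
print(select_index([[0.5, 0.5], [0.5, 0.5]]))  # 0
```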

## Tips

1. **Use several specific assertions**: Each well-targeted assertion gives `max-score` more signal for separating outputs
2. **Weight important criteria**: Use weights to emphasize what matters most
3. **Combine with select-best**: You can use both in the same test for comparison
4. **Debug with scores**: The output shows aggregate scores for transparency

## Further reading

- [Model-graded metrics](/docs/configuration/expected-outputs/model-graded) for other model-based assertions
- [Select best](/docs/configuration/expected-outputs/model-graded/select-best) for subjective selection
- [Assertions](/docs/configuration/expected-outputs) for all available assertion types