# eval-bert-score (BERTScore Evaluation)

Use BERTScore to measure semantic similarity between LLM outputs and reference text.

```bash
npx promptfoo@latest init --example eval-bert-score
cd eval-bert-score
```

## Setup

```bash
pip install -r requirements.txt
```

Note: The first run downloads the BERT model (~1.4GB), which is then cached locally.

## Usage

### Basic Example

```yaml
# promptfooconfig.yaml
tests:
  - vars:
      text: 'Hello world'
      reference: 'Hi there'
    assert:
      - type: python
        value: file://bertscore_check.py
        threshold: 0.7 # Pass if F1 similarity >= 0.7
```

Run: `promptfoo eval`

### Advanced Example

Compare the output against multiple valid references and keep the best match:

```yaml
# promptfooconfig-advanced.yaml
assert:
  - type: python
    value: |
      from bert_score import score

      references = [
          "First valid answer",
          "Second valid answer",
          "Third valid answer",
      ]

      scores = []
      for ref in references:
          _, _, F1 = score([output], [ref], lang='en', verbose=False)
          scores.append(F1.item())

      return max(scores)  # Use the best match
```

Run: `promptfoo eval -c promptfooconfig-advanced.yaml`

## How It Works

BERTScore returns a similarity score from 0 to 1:

- 0.9+ = Nearly identical meaning
- 0.7-0.9 = Similar meaning
- <0.7 = Different meaning

[Learn more](https://arxiv.org/abs/1904.09675)
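To get a feel for where a given pair of strings lands on this scale, you can run `bert-score` directly. A minimal sketch, assuming the packages from `requirements.txt` are installed:

```python
# Quick sanity check of BERTScore outside promptfoo.
from bert_score import score

# score() takes parallel lists of candidates and references and returns
# (precision, recall, F1) tensors with one entry per pair.
P, R, F1 = score(['Hello world'], ['Hi there'], lang='en', verbose=False)
print(f'F1 = {F1.item():.3f}')  # compare against your chosen threshold
```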
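## Example Assertion Script

The example ships with its own `bertscore_check.py`; the sketch below shows one way such a script could look, using promptfoo's `get_assert(output, context)` hook for file-based Python assertions. The `reference` var name matches the basic config above and is otherwise an assumption.

```python
# bertscore_check.py -- illustrative sketch; the file bundled with the
# example may differ.
from bert_score import score


def get_assert(output, context):
    """Called by promptfoo for `type: python` file assertions.

    Returning a float lets promptfoo compare it against `threshold`.
    """
    # Pull the reference text from the test's vars (see the basic config).
    reference = context['vars']['reference']

    # F1 is the standard single-number BERTScore similarity.
    _, _, f1 = score([output], [reference], lang='en', verbose=False)
    return f1.item()
```

Returning the raw F1 keeps the pass/fail decision in the config's `threshold`, so the same script can be reused across tests with different strictness.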