# eval-conversation-relevance (Conversation Relevance)

You can run this example with:

```bash
npx promptfoo@latest init --example eval-conversation-relevance
cd eval-conversation-relevance
```

This example demonstrates how to use the `conversation-relevance` assertion to evaluate whether chatbot responses remain relevant throughout a conversation.

## What is Conversation Relevance?

The conversation relevance metric evaluates whether each response in a conversation is relevant to the context and previous messages. It uses a sliding window approach to analyze conversation segments.

## Running the Example

1. Install promptfoo:

   ```bash
   npm install -g promptfoo
   ```

2. Set your OpenAI API key:

   ```bash
   export OPENAI_API_KEY=your-api-key
   ```

3. Run the evaluation:

   ```bash
   promptfoo eval
   ```

## Example Test Cases

### 1. Single-turn Evaluation

Tests basic relevance for a single query-response pair about travel to Paris.

### 2. Multi-turn Travel Conversation

Evaluates a complete conversation about travel planning where all responses should be relevant.

### 3. Conversation with Irrelevant Response

Demonstrates detection of an off-topic response (a stock market comment) in the middle of a conversation about wedding planning.

### 4. Technical Support Conversation

Shows a high-quality technical support conversation with a high relevance threshold (0.95).

## Configuration Options

- `threshold`: Minimum score required to pass (0-1)
- `config.windowSize`: Number of messages in each sliding window (default: 5)
- `provider`: Override the default grading model

See the example configuration at the end of this README for these options in context.

## Interpreting Results

- **Score**: Proportion of conversation windows deemed relevant
- **Pass/Fail**: Whether the score meets the threshold
- **Reason**: Explanation when responses are found irrelevant

## Tips

1. Use lower thresholds (0.7-0.8) for general conversations
2. Use higher thresholds (0.9-0.95) for specialized domains like technical support
3. Adjust the window size based on conversation complexity
4. Consider using a more capable grading model (e.g., GPT-4) for complex conversations

## How Scoring Works

The metric evaluates each message position using a sliding window approach. For example, with a 5-message conversation and a window size of 3:

- Window 1: Message 1 only (evaluates whether Response 1 is relevant)
- Window 2: Messages 1-2 (evaluates whether Response 2 is relevant given context)
- Window 3: Messages 1-3 (evaluates whether Response 3 is relevant given context)
- Window 4: Messages 2-4 (evaluates whether Response 4 is relevant given context)
- Window 5: Messages 3-5 (evaluates whether Response 5 is relevant given context)

Each window evaluates whether the **last** assistant response in that window is relevant. The final score is:

```text
Score = Number of Relevant Windows / Total Number of Windows
```
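To make the windowing arithmetic concrete, here is a minimal TypeScript sketch of the scoring logic. It is illustrative only: `Message`, `buildWindows`, `scoreConversation`, and `isRelevant` are hypothetical names, not promptfoo internals, and the `isRelevant` callback stands in for the LLM grader.

```typescript
// Illustrative sketch only: these names are hypothetical, not promptfoo internals.
type Message = { role: 'user' | 'assistant'; content: string };

// One window per message position, capped at `windowSize` trailing messages.
// With 5 messages and a window size of 3 this reproduces the windows listed above.
function buildWindows(messages: Message[], windowSize: number): Message[][] {
  const windows: Message[][] = [];
  for (let end = 1; end <= messages.length; end++) {
    windows.push(messages.slice(Math.max(0, end - windowSize), end));
  }
  return windows;
}

// Score = relevant windows / total windows. `isRelevant` stands in for the
// LLM grader judging the last assistant response in each window.
function scoreConversation(
  messages: Message[],
  windowSize: number,
  isRelevant: (window: Message[]) => boolean,
): number {
  const windows = buildWindows(messages, windowSize);
  return windows.filter(isRelevant).length / windows.length;
}
```

With the 5-message example above, a grader that flags one of the five windows as irrelevant yields a score of 4 / 5 = 0.8, which passes a 0.7 threshold but fails a 0.9 threshold.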
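## Example Assertion Configuration

As a reference for the options listed under "Configuration Options", here is a hedged sketch of how the assertion might appear in a `promptfooconfig.yaml` test. The key names mirror this README; consult the promptfoo documentation for the exact schema of your version, and treat the provider ID as a placeholder.

```yaml
# A minimal sketch, not a verified config: key names follow the
# "Configuration Options" section above; check the promptfoo docs
# for the exact schema of your promptfoo version.
tests:
  - assert:
      - type: conversation-relevance
        threshold: 0.8 # pass if at least 80% of windows are relevant
        config:
          windowSize: 3 # messages per sliding window (default: 5)
        # provider: openai:gpt-4 # optional: override the grading model (placeholder ID)
```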