---
sidebar_position: 25
description: 'Evaluate conversation coherence by checking if LLM responses maintain context relevance across multi-turn dialogues'
---

# Conversation Relevance

The `conversation-relevance` assertion evaluates whether responses in a conversation remain relevant throughout the dialogue. This is particularly useful for chatbot applications where maintaining conversational coherence is critical.

## How it works

The conversation relevance metric uses a sliding window approach to evaluate conversations:

1. **Single-turn evaluation**: For simple query-response pairs, it checks if the response is relevant to the input
2. **Multi-turn evaluation**: For conversations, it creates sliding windows of messages and evaluates if each assistant response is relevant within its conversational context
3. **Scoring**: The final score is the proportion of windows where the response was deemed relevant
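To make the sliding window concrete, the sketch below shows one plausible way a four-turn conversation could be windowed with `windowSize: 3`. The exact window boundaries and stride are illustrative assumptions based on the description above, not a specification of the implementation:

```text
conversation: [turn 1, turn 2, turn 3, turn 4]    windowSize: 3

window 1: [turn 1]                    -> is reply 1 relevant?
window 2: [turn 1, turn 2]            -> is reply 2 relevant?
window 3: [turn 1, turn 2, turn 3]    -> is reply 3 relevant?
window 4: [turn 2, turn 3, turn 4]    -> is reply 4 relevant?

score = relevant windows / total windows
e.g. 3 of 4 windows judged relevant -> score 0.75 (fails threshold: 0.8)
```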
## Basic usage

```yaml
assert:
  - type: conversation-relevance
    threshold: 0.8
```

## Using with conversations

The assertion works with the special `_conversation` variable that contains an array of input/output pairs:

```yaml
tests:
  - vars:
      _conversation:
        - input: 'What is the capital of France?'
          output: 'The capital of France is Paris.'
        - input: 'What is its population?'
          output: 'Paris has a population of about 2.2 million people.'
        - input: 'Tell me about famous landmarks there.'
          output: 'Paris is famous for the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral.'
    assert:
      - type: conversation-relevance
        threshold: 0.8
```

:::note
`_conversation` message content is treated as literal runtime data and is not rendered as a Nunjucks template. Template syntax such as `{{ vars.value }}` or `{{ env.API_KEY }}` is preserved verbatim for security.
:::

## Configuration options

### Window size

Control how many conversation turns are considered in each sliding window:

```yaml
assert:
  - type: conversation-relevance
    threshold: 0.8
    config:
      windowSize: 3 # Default is 5
```

### Custom grading rubric

Override the default relevance evaluation prompt:

```yaml
assert:
  - type: conversation-relevance
    threshold: 0.8
    rubricPrompt: |
      Evaluate if the assistant's response is relevant to the user's query.
      Consider the conversation context when making your judgment.
      Output JSON with 'verdict' (yes/no) and 'reason' fields.
```

## Examples

### Basic single-turn evaluation

When evaluating a single turn, the assertion uses the prompt and output from the test case:

```yaml
prompts:
  - 'Explain {{topic}}'
providers:
  - openai:gpt-5
tests:
  - vars:
      topic: 'machine learning'
    assert:
      - type: conversation-relevance
        threshold: 0.8
```

### Multi-turn conversation with context

```yaml
tests:
  - vars:
      _conversation:
        - input: "I'm planning a trip to Japan."
          output: 'That sounds exciting! When are you planning to visit?'
        - input: 'Next spring. What should I see?'
          output: 'Spring is perfect for cherry blossoms! Visit Tokyo, Kyoto, and Mount Fuji.'
        - input: 'What about food recommendations?'
          output: 'Try sushi, ramen, tempura, and wagyu beef. Street food markets are amazing too!'
    assert:
      - type: conversation-relevance
        threshold: 0.9
        config:
          windowSize: 3
```

### Detecting off-topic responses

This example shows how the metric catches irrelevant responses:

```yaml
tests:
  - vars:
      _conversation:
        - input: 'What is 2+2?'
          output: '2+2 equals 4.'
        - input: 'What about 3+3?'
          output: 'The capital of France is Paris.' # Irrelevant response
        - input: 'Can you solve 5+5?'
          output: '5+5 equals 10.'
    assert:
      - type: conversation-relevance
        threshold: 0.8
        config:
          windowSize: 2
```

## Special considerations

### Vague inputs

The metric is designed to handle vague inputs appropriately. Vague responses to vague inputs (like greetings) are considered acceptable:

```yaml
tests:
  - vars:
      _conversation:
        - input: 'Hi there!'
          output: 'Hello! How can I help you today?'
        - input: 'How are you?'
          output: "I'm doing well, thank you! How are you?"
    assert:
      - type: conversation-relevance
        threshold: 0.8
```

### Short conversations

If the conversation has fewer messages than the window size, the entire conversation is evaluated as a single window. For example, a two-turn conversation evaluated with the default `windowSize` of 5 produces a single window containing both turns.

## Provider configuration

Like other model-graded assertions, you can override the default grading provider:

```yaml
assert:
  - type: conversation-relevance
    threshold: 0.8
    provider: openai:gpt-5-mini
```

Or set it globally:

```yaml
defaultTest:
  options:
    provider: anthropic:claude-3-7-sonnet-latest
```

## See also

- [Context relevance](/docs/configuration/expected-outputs/model-graded/context-relevance) - For evaluating if context is relevant to a query
- [Answer relevance](/docs/configuration/expected-outputs/model-graded/answer-relevance) - For evaluating if an answer is relevant to a question
- [Model-graded metrics](/docs/configuration/expected-outputs/model-graded) - Overview of all model-graded assertions

## Citation

This implementation is adapted from [DeepEval's Conversation Relevancy metric](https://docs.confident-ai.com/docs/metrics-conversation-relevancy).