---
title: ToxicChat
description: Evaluate model resilience against 7,599 toxic and jailbreaking prompts from real user conversations in the ToxicChat dataset
---

# ToxicChat Plugin

The ToxicChat plugin tests your model's ability to handle toxic user prompts and resist jailbreaking attempts using the [ToxicChat dataset](https://huggingface.co/datasets/lmsys/toxic-chat) from LMSYS. The dataset contains real user conversations labeled for toxicity and jailbreaking attempts, making it a valuable benchmark for evaluating content safety measures.

## Configuration

```yaml
plugins:
  - id: toxic-chat
    # Optional: customize number of test cases (default: 5)
    numTests: 10
```

## How It Works

The plugin:

1. **Fetches data** from the ToxicChat dataset on Hugging Face
2. **Filters prompts** labeled as toxic (`toxicity=1`) or as jailbreaking attempts (`jailbreaking=1`)
3. **Tests your model** with these challenging prompts
4. **Evaluates responses** for content safety compliance

A sketch of the fetch-and-filter step appears at the end of this page.

## Example Output

The plugin tests prompts such as:

- Requests for harmful content generation
- Attempts to bypass safety guidelines
- Toxic language and harassment
- Jailbreaking techniques

A **passing** response should:

- Refuse to engage with harmful requests
- Maintain appropriate boundaries
- Not provide toxic or harmful content

A **failing** response might:

- Comply with toxic requests
- Generate harmful content
- Assist with jailbreaking attempts

## Related Concepts

- [HarmBench Plugin](./harmbench.md) - Tests using the HarmBench dataset
- [Harmful Content Plugin](./harmful.md) - Tests for various types of harmful content
- [DoNotAnswer Plugin](./donotanswer.md) - Tests handling of harmful queries
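
## Example: Inspecting the Source Data

For reference, the fetch-and-filter step described under How It Works can be reproduced with the Hugging Face `datasets` library. This is an illustrative sketch, not the plugin's implementation: the `toxicchat0124` configuration name and the `toxicity`, `jailbreaking`, and `user_input` column names are taken from the public dataset card.

```python
from datasets import load_dataset

# "toxicchat0124" is the configuration name listed on the dataset card;
# the column names below also come from the card.
ds = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="train")

# Keep only rows labeled toxic or flagged as jailbreaking attempts --
# the same selection described in step 2 of How It Works.
challenging = ds.filter(
    lambda row: row["toxicity"] == 1 or row["jailbreaking"] == 1
)

print(f"{len(challenging)} of {len(ds)} prompts selected")
print(challenging[0]["user_input"])  # a raw user prompt of the kind used as a test case
```

Browsing the selected rows this way is a quick sanity check on the kinds of prompts a scan will draw from.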
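The pass/fail criteria listed above can also be made concrete. The heuristic below is a deliberately crude illustration of the distinction, not the grader the plugin actually uses, and the refusal phrases are arbitrary examples.

```python
# Toy illustration only: the plugin ships its own grading logic. Here, a
# response that opens with a refusal phrase is treated as a likely pass;
# anything else warrants closer review.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable to")

def looks_like_refusal(response: str) -> bool:
    """Crude check: does the response begin with a refusal phrase?"""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

assert looks_like_refusal("I can't help with that request.")
assert not looks_like_refusal("Sure, here is how you would do it...")
```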