---
sidebar_label: BeaverTails
description: Red team LLM safety boundaries using BeaverTails dataset of 330k harmful prompts to protect AI systems from abuse, hate speech, and malicious content exploitation
---

# BeaverTails Dataset for LLM Safety Testing

## Overview

The BeaverTails plugin uses the [BeaverTails dataset](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), a collection of 330,000 prompts annotated for safety and published by PKU-Alignment, to test LLM systems against a comprehensive set of potentially harmful inputs.

The dataset covers a wide range of harmful content categories, including:

- Animal and human abuse
- Criminal activities and unethical behavior
- Discrimination and hate speech
- Explicit adult content
- Misinformation and dangerous advice
- Privacy violations
- Violence and terrorism

## How it works

The plugin:

- Fetches test cases from the BeaverTails dataset
- Filters out safe/benign test cases to focus on potentially harmful prompts
- Randomly selects prompts from the filtered set
- Tests whether the model appropriately refuses or safely handles the harmful prompts

## Quick Start

1. Set your Hugging Face API token as an environment variable:

   ```bash
   export HF_TOKEN=your_huggingface_token # or HF_API_TOKEN
   ```

2. Enable the plugin in your configuration:

   ```yaml title="promptfooconfig.yaml"
   redteam:
     plugins:
       - beavertails
   ```

## Configuration

You can control the number of test cases using the `numTests` parameter (defaults to 5):

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```

### Filtering by Subcategory

The BeaverTails dataset labels each entry with one or more of 14 harm subcategories. You can filter test cases to focus on the subcategories most relevant to your application:

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 10
      config:
        subcategories:
          - self_harm
          - privacy_violation
```

#### Available Subcategories

The following subcategories are available:

- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, or theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement of it

**Example: Testing only self-harm content**

```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 20
      config:
        subcategories:
          - self_harm
```

This focused approach allows you to:

- Test specific safety boundaries relevant to your application
- Get more granular insight into model performance by category
- Focus testing resources on high-priority harm categories
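Under the hood, subcategory filtering maps onto the per-row category labels in the dataset. Here is a minimal sketch of that kind of filtering using the Hugging Face `datasets` library; the field and split names follow the dataset card, but the plugin's actual implementation may differ:

```python
import random

from datasets import load_dataset

# Load the 330k training split (HF_TOKEN must be set if the dataset is gated).
rows = load_dataset("PKU-Alignment/BeaverTails", split="330k_train")

# Subcategories of interest, mirroring the `subcategories` config above.
wanted = {"self_harm", "privacy_violation"}

# Keep only unsafe rows flagged with at least one wanted subcategory.
# Each row's `category` field maps subcategory names to booleans.
harmful = [
    row
    for row in rows
    if not row["is_safe"] and any(row["category"].get(c) for c in wanted)
]

# Randomly sample prompts from the filtered set, analogous to `numTests: 10`.
sampled = random.sample(harmful, k=min(10, len(harmful)))
prompts = [row["prompt"] for row in sampled]
```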
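Once the plugin is configured, you can generate and evaluate the test cases using promptfoo's standard red team workflow:

```bash
# Generate adversarial test cases and run them against your target
npx promptfoo@latest redteam run

# View the results
npx promptfoo@latest redteam report
```

Each model output is then graded as described below.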
## Grading

The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks whether the output:

- Refuses to engage with harmful content
- Avoids promoting or assisting with harmful activities
- Maintains appropriate boundaries
- Provides safe and ethical responses

Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.

## Related Concepts

- [Types of LLM Vulnerabilities](../llm-vulnerability-types.md)
- [Prompt Injection](../strategies/prompt-injection.md)
- [Jailbreak strategies](../strategies/composite-jailbreaks.md)