# redteam-beavertails (BeaverTails Red Team Testing)

This example demonstrates how to use the BeaverTails plugin for red team testing, including subcategory filtering.

## Overview

The BeaverTails plugin uses the [BeaverTails dataset](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) (330k harmful prompts) to test AI safety boundaries. The subcategory filtering feature allows you to focus on specific harm categories relevant to your application.

## Quick Start

You can run this example with:

```bash
npx promptfoo@latest init --example redteam-beavertails
cd redteam-beavertails
```

## Configuration Files

This example includes two configuration files:

1. **`promptfooconfig.yaml`** - Basic evaluation using HuggingFace dataset directly
2. **`promptfooconfig-subcategories.yaml`** - Red team generation with subcategory filtering

## Setup

1. Set up your HuggingFace API token (required for both configs):

   ```bash
   export HF_TOKEN=your_huggingface_token
   ```

2. Set up your provider API key:

   ```bash
   export OPENAI_API_KEY=your_openai_api_key
   ```

## Usage

### Basic Evaluation

Run the basic evaluation using the BeaverTails-Evaluation dataset:

```bash
promptfoo eval
```

View results:

```bash
promptfoo view
```

### Red Team with Subcategory Filtering

Generate test cases with subcategory filtering:

```bash
promptfoo redteam generate -c promptfooconfig-subcategories.yaml
```

This will create a `redteam.yaml` file with filtered test cases. Then evaluate your model:

```bash
promptfoo eval
```

View results:

```bash
promptfoo view
```

## Subcategory Filtering

The `promptfooconfig-subcategories.yaml` configuration shows two approaches:

1. **Basic usage** - Tests all harmful categories:

   ```yaml
   - id: beavertails
     numTests: 5
   ```

2. **Filtered usage** - Tests only specific subcategories:

   ```yaml
   - id: beavertails
     numTests: 5
     config:
       subcategories:
         - self_harm
         - privacy_violation
   ```

### Available Subcategories

- `animal_abuse` - Cruelty, harm, or distress to animals
- `child_abuse` - Physical, emotional, or sexual abuse of minors
- `controversial_topics,politics` - Sensitive political or social issues
- `discrimination,stereotype,injustice` - Unfair treatment based on characteristics
- `drug_abuse,weapons,banned_substance` - Illegal substance or weapon misuse
- `financial_crime,property_crime,theft` - Fraud, embezzlement, theft
- `hate_speech,offensive_language` - Hostile communication targeting groups
- `misinformation_regarding_ethics,laws_and_safety` - False information about critical topics
- `non_violent_unethical_behavior` - Morally wrong actions without violence
- `privacy_violation` - Unauthorized access or disclosure of personal information
- `self_harm` - Self-inflicted harm or suicidal content
- `sexually_explicit,adult_content` - Inappropriate sexual material
- `terrorism,organized_crime` - Terrorist or organized crime content
- `violence,aiding_and_abetting,incitement` - Physical harm or encouragement

## Learn More

- [BeaverTails Plugin Documentation](https://promptfoo.dev/docs/red-team/plugins/beavertails/)
- [Red Team Testing Guide](https://promptfoo.dev/docs/red-team/quickstart/)
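
For orientation, the plugin entries shown under Subcategory Filtering are list items that would sit inside a larger red team configuration. The following is a minimal sketch, not the actual contents of `promptfooconfig-subcategories.yaml`: the `description` text, the `targets` entry, and the choice of `openai:gpt-4o-mini` are assumptions added for illustration, and only the `beavertails` plugin entry comes from this README.

```yaml
# Hypothetical surrounding config; only the beavertails entry is taken from this README.
description: BeaverTails red team with subcategory filtering (illustrative)

# Assumed target; substitute the provider you are testing.
targets:
  - id: openai:gpt-4o-mini

redteam:
  plugins:
    # Filtered usage: draw test cases only from the listed subcategories.
    - id: beavertails
      numTests: 5
      config:
        subcategories:
          - self_harm
          - privacy_violation
```

Some subcategory identifiers in the list above contain commas (for example `drug_abuse,weapons,banned_substance`). Assuming each identifier is matched as a single string, quoting such a value in YAML keeps it unambiguous.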