---
sidebar_label: HarmBench
description: Red team LLM safety using HarmBench's 400 harmful behavior tests, with filters for specific semantic and functional categories
---

# HarmBench Plugin

The HarmBench plugin uses the [HarmBench dataset](https://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasets) published by the [Center for AI Safety](https://www.safe.ai/) to test LLM responses to harmful prompts. The dataset evaluates models across 400 key harmful behaviors, including:

- Chemical and biological threats (e.g., dangerous substances, weapons)
- Illegal activities (e.g., theft, fraud, trafficking)
- Misinformation and conspiracy theories
- Harassment and hate speech
- General harmful requests
- Cybercrime (e.g., malware, system exploitation)
- Copyright violations

Unlike many of our other plugins, the HarmBench plugin uses a statically generated dataset instead of dynamically generating test cases. Plugins that do generate cases dynamically include [Harmful behavior detection](harmful.md) (which was partially based on HarmBench's classifications for harmful behavior), [excessive agency](excessive-agency.md), [hallucination](hallucination.md), and others. HarmBench is still a useful tool, but it is not a substitute for a constantly evolving, dynamically generated set of test cases.

## Configuration

To include the HarmBench plugin in your LLM red teaming setup:

```yaml
redteam:
  plugins:
    - harmbench
```

You can control the number of test cases using the `numTests` parameter:

```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 25 # The default is 5, but you can define the limit here.
```

### Filtering by Category

You can run a subset of HarmBench by filtering on the dataset's semantic categories:

```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 20
      config:
        categories:
          - cybercrime_intrusion
          - misinformation_disinformation
```

The available semantic categories are:

- `chemical_biological`
- `copyright`
- `cybercrime_intrusion`
- `harassment_bullying`
- `harmful`
- `illegal`
- `misinformation_disinformation`

Common aliases such as `cybercrime`, `misinformation`, and `chemical and biological` are also accepted and normalized to the canonical values above.

### Filtering by Functional Category

HarmBench also distinguishes between `standard`, `contextual`, and `copyright` behaviors. You can filter by those functional slices as well:

```yaml
redteam:
  plugins:
    - id: harmbench
      numTests: 20
      config:
        categories:
          - misinformation
        functionalCategories:
          - contextual
```

When you set both semantic and functional filters, Promptfoo generates tests only from behaviors that match both.

The available functional categories are:

- `standard`
- `contextual`
- `copyright`

## References

- [HarmBench Paper](https://arxiv.org/abs/2402.04249)
- [HarmBench Dataset](https://github.com/centerforaisafety/HarmBench/tree/main/data/behavior_datasets)
- [Center for AI Safety](https://www.safe.ai/)

## Related Concepts

- [Types of LLM vulnerabilities](/docs/red-team/llm-vulnerability-types/) - Full vulnerability and plugin directory with category mapping
- [Evaluating LLM safety with HarmBench](/docs/guides/evaling-with-harmbench)
- [Harmful Content Plugin](harmful.md)
- [BeaverTails Plugin](beavertails.md)
- [CyberSecEval Plugin](cyberseceval.md)
- [Pliny Plugin](pliny.md)