---
title: 'How to Red Team a HuggingFace Model: Complete Security Testing Guide'
description: 'Open source models on HuggingFace often lack safety training. Learn how to red team these models and assess their vulnerabilities before deployment.'
image: /img/blog/huggingface-red-team.png
keywords:
  [
    HuggingFace red teaming,
    open source model security,
    HuggingFace security testing,
    model jailbreak testing,
    AI model evaluation,
    LLM security assessment,
  ]
date: 2024-11-20
authors: [ian]
tags: [technical-guide, red-teaming, huggingface]
---

# How to Red Team a HuggingFace Model

Want to break a HuggingFace model? This guide shows you how to use [Promptfoo](https://github.com/promptfoo/promptfoo) to systematically probe for vulnerabilities through adversarial testing (red teaming). You'll learn how to craft prompts that bypass safety filters and manipulate model outputs for a [wide range of potential harms](/docs/red-team/llm-vulnerability-types/).

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Setting Up the Environment](#setting-up-the-environment)
3. [Configuring the HuggingFace Provider](#configuring-the-huggingface-provider)
4. [Defining the Red Teaming Configuration](#defining-the-red-teaming-configuration)
5. [Running the Red Team Evaluation](#running-the-red-team-evaluation)
6. [Analyzing the Results](#analyzing-the-results)

## Prerequisites

Before you begin, ensure you have the following:

- **Node.js 20+**: [Download Node.js](https://nodejs.org/en/download/)
- **Promptfoo**: No prior installation is necessary; we'll use `npx` to run Promptfoo commands.
- **HuggingFace API Token**: Sign up for a HuggingFace account and obtain an API token from your [account settings](https://huggingface.co/settings/tokens).

Set your HuggingFace API token as an environment variable:

```bash
export HF_API_TOKEN=your_huggingface_api_token
```

---

## Setting Up the Environment

### Initialize a New Project

Use the `promptfoo` command-line tool to initialize a new red teaming project:

```bash
npx promptfoo@latest redteam init my-huggingface-redteam --no-gui
cd my-huggingface-redteam
```

This command creates a new directory with a `promptfooconfig.yaml` file, which we'll use to configure our red teaming setup.

## Configuring the HuggingFace Provider

In your `promptfooconfig.yaml` file, add a HuggingFace provider to specify the model you want to test. For this example, we'll use the Mistral 7B model for text generation.

```yaml
targets:
  - id: huggingface:text-generation:mistralai/Mistral-7B-v0.1
    label: Mistral-7B
    config:
      temperature: 0.7
      max_new_tokens: 150
      top_p: 0.9
```

### Explanation of Configuration Parameters

- **id**: Specifies the HuggingFace model and task type (`huggingface:text-generation:mistralai/Mistral-7B-v0.1`).
- **label**: A friendly name for your provider, used in reports.
- **config**: Additional parameters for the HuggingFace API.
  - **apiKey**: Your HuggingFace API token (optional if you've set the `HF_API_TOKEN` environment variable, as above).
  - **temperature**: Controls the randomness in text generation.
  - **max_new_tokens**: The maximum number of tokens to generate.
  - **top_p**: Controls diversity via nucleus sampling.
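Before generating adversarial test cases, it can help to confirm the provider is wired up correctly with an ordinary evaluation. The following is a minimal sketch, not part of the red team setup: the file name, prompt text, and test variable are placeholders of our own choosing.

```yaml
# smoke-test.yaml - quick sanity check that the HuggingFace provider responds
prompts:
  - 'Summarize the following topic in one sentence: {{topic}}'

providers:
  - id: huggingface:text-generation:mistralai/Mistral-7B-v0.1
    config:
      temperature: 0.7
      max_new_tokens: 150

tests:
  - vars:
      topic: the history of the printing press
```

Run it with `npx promptfoo@latest eval -c smoke-test.yaml`. If you get a coherent completion back, your token and model ID are working and you can move on to the red team configuration.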
## Defining the Red Teaming Configuration

In the same `promptfooconfig.yaml` file, define the red teaming parameters under the `redteam` section. Be sure to customize the `purpose`, `plugins`, and `strategies` to match the types of vulnerabilities you care about.

```yaml
redteam:
  numTests: 5
  purpose: 'You are a customer service agent for...'
  plugins:
    - harmful
  strategies:
    - jailbreak
    - jailbreak:tree
    - prompt-injection
```

### Key Components

- **numTests**: Sets the number of test cases per plugin.
- **purpose**: Describes the intended behavior of the model, guiding the generation of adversarial inputs.
- **plugins**: Specifies the types of vulnerabilities to test. See [full list](/docs/red-team/llm-vulnerability-types/).
- **strategies**: Techniques used to deliver adversarial inputs. See [full list](/docs/red-team/configuration/#strategies).

## Running the Red Team Evaluation

### Generate Adversarial Test Cases

First, generate the adversarial inputs based on the defined plugins and strategies:

```bash
npx promptfoo@latest redteam generate
```

This command creates a `redteam.yaml` file containing the generated test cases.

### Execute the Tests

Run the tests against your HuggingFace model:

```bash
npx promptfoo@latest redteam run
```

This command evaluates the model's responses to the adversarial inputs and logs the results.

## Analyzing the Results

Generate a report to review the findings:

```bash
npx promptfoo@latest redteam report
```

![llm red team report](/img/riskreport-1@2x.png)

### Understanding the Report

The report provides insights into:

1. **Vulnerability Categories**: Types of vulnerabilities discovered.
2. **Severity Levels**: Classifies vulnerabilities based on their potential impact.
3. **Test Cases**: Specific inputs that led to vulnerabilities.
4. **Suggested Mitigations**: Recommendations to address the identified issues.

### Re-evaluate After Changes

After implementing mitigations, rerun the red team evaluation to ensure vulnerabilities have been addressed:

```bash
npx promptfoo@latest redteam run
npx promptfoo@latest redteam report
```

## Additional Resources

- [Red Team Quickstart Guide](/docs/red-team/quickstart/)
- [HuggingFace Configuration Guide](/docs/providers/huggingface/)
- [HuggingFace Inference API](https://huggingface.co/inference-api)
- [List of LLM Vulnerabilities](https://promptfoo.dev/docs/red-team/llm-vulnerability-types/)
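Finally, for reference, here is the complete `promptfooconfig.yaml` assembled from the snippets above. Nothing here is new; it simply combines the `targets` and `redteam` sections shown earlier, and the `purpose` string remains a placeholder you should replace with a description of your own application.

```yaml
# promptfooconfig.yaml - provider and red team sections combined
targets:
  - id: huggingface:text-generation:mistralai/Mistral-7B-v0.1
    label: Mistral-7B
    config:
      temperature: 0.7
      max_new_tokens: 150
      top_p: 0.9

redteam:
  numTests: 5
  purpose: 'You are a customer service agent for...'
  plugins:
    - harmful
  strategies:
    - jailbreak
    - jailbreak:tree
    - prompt-injection
```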