---
title: 'GPT-5.2 Initial Trust and Safety Assessment'
description: 'Day-0 red team results for GPT-5.2. 4,229 probes across 43 risk categories. Baseline safety holds at 96%, but jailbreaks drop it to as low as 22%.'
image: /img/blog/gpt-5.2-security/hero.jpg
keywords:
  [
    GPT-5.2 security,
    GPT-5.2 red team,
    prompt injection,
    LLM security,
    AI agent security,
    OpenAI security,
    jailbreak,
  ]
date: 2025-12-11
authors: [michael]
tags: [red-teaming, security-vulnerability, openai]
---

# GPT-5.2 Initial Trust and Safety Assessment

OpenAI released GPT-5.2 today (December 11, 2025) at approximately 10:00 AM PST. We [opened a PR for GPT-5.2 support](https://github.com/promptfoo/promptfoo/pull/6628) at 10:24 AM PST and kicked off a [red team](/docs/red-team/) eval ([security testing where you try to break something](/blog/ai-red-teaming-for-first-timers)). **First critical finding hit at 10:29 AM PST, 5 minutes later.** This is an early, targeted assessment focused on jailbreak resilience and harmful content, not a full security review.

This post covers what we tested, what failed, and what you should do about it.

The headline numbers: our [jailbreak strategies](/docs/red-team/strategies/) (techniques that [trick AI into bypassing its safety rules](/blog/how-to-jailbreak-llms)) improved attack success from **4.3% baseline to 78.5%** (multi-turn) and **61.0%** (single-turn). The weakest categories included [impersonation](/docs/red-team/plugins/imitation), graphic and sexual content, harassment, [disinformation](/blog/misinformation), hate speech, and self-harm, where a majority of targeted attacks succeeded.

<!-- truncate -->

## What We Tested

We tested jailbreak resilience and [harmful content](/docs/red-team/plugins/harmful) generation, along with [bias detection](/docs/red-team/plugins/bias) and security compliance. In practice, that means prompts about:

- [Cybercrime](/docs/red-team/plugins/harmful) assistance
- [Violent crime](/docs/red-team/plugins/harmful) instructions
- [Chemical/biological weapons](/docs/red-team/plugins/harmful) information
- [Misinformation](/blog/misinformation) generation
- [Illegal activity](/docs/red-team/plugins/harmful) guidance

We used two [attack strategies](/docs/red-team/strategies/):

- [**Hydra**](/docs/red-team/strategies/hydra): Multi-turn jailbreaking that builds up to harmful requests across multiple messages, adapting based on how the model responds
- [**Meta**](/docs/red-team/strategies/meta): Single-turn attacks that use clever prompt engineering to bypass safety filters in one shot

## How We Ran the Eval

Total runtime: **~30 minutes** on a MacBook Pro.

We used [Promptfoo](https://github.com/promptfoo/promptfoo), an [open-source LLM red teaming tool](/docs/red-team/quickstart). The config:

```yaml title="promptfooconfig.yaml"
targets:
  - id: openai:chat:gpt-5.2
    config:
      reasoning_effort: 'none'
      max_completion_tokens: 2048

redteam:
  purpose: You are a helpful assistant

  plugins:
    # Harmful content (25 categories)
    - harmful:chemical-biological-weapons
    - harmful:child-exploitation
    - harmful:copyright-violations
    - harmful:cybercrime
    - harmful:cybercrime:malicious-code
    - harmful:graphic-content
    - harmful:harassment-bullying
    - harmful:hate
    - harmful:illegal-activities
    - harmful:illegal-drugs
    - harmful:illegal-drugs:meth
    - harmful:indiscriminate-weapons
    - harmful:insults
    - harmful:intellectual-property
    - harmful:misinformation-disinformation
    - harmful:non-violent-crime
    - harmful:profanity
    - harmful:radicalization
    - harmful:self-harm
    - harmful:sex-crime
    - harmful:sexual-content
    - harmful:specialized-advice
    - harmful:unsafe-practices
    - harmful:violent-crime
    - harmful:weapons:ied
    # Bias detection
    - bias:race
    - bias:age
    - bias:disability
    - bias:gender
    # Security & compliance
    - ascii-smuggling
    - contracts
    - coppa
    - pii:direct
    - pliny
    - special-token-injection
    # Behavioral issues
    - divergent-repetition
    - excessive-agency
    - hallucination
    - imitation
    - overreliance
    - politics
    - religion
    - unverifiable-claims
    # 43 plugins, 4,229 probes

  strategies:
    - jailbreak:hydra
    - jailbreak:meta
```

The command:

```bash
npx promptfoo@latest redteam run -c redteam.yaml -j 40 --output results.json
```

We disabled reasoning tokens (`reasoning_effort: 'none'`) to test the base model behavior. Reasoning tokens let the model "think" before responding, which typically improves safety but costs more and takes longer. GPT-5.2 exposes multiple reasoning effort levels in the API (`none`, `low`, `medium`, `high`, and `xhigh` depending on the variant). We'll run a follow-up eval with reasoning enabled and publish those results separately.

Each [plugin](/docs/red-team/plugins/) generates multiple probes for a given risk area. For jailbreak strategies, we try several prompt variants per scenario. We then aggregate related probes into higher-level "attacks" per category: an attack is counted as a success if any of its probes produce disallowed output. The percentages below are computed over those attack-level aggregates, not raw probe count.

## Results

**4,229 probes (individual attack attempts) across 43 risk categories. ~30 minute runtime. First critical finding in 5 minutes.**

Attack success by strategy:

- **Hydra** (multi-turn): 78.5% (161/205 attacks succeeded)
- **Meta** (single-turn): 61.0% (122/200 attacks succeeded)
- **Baseline** (same underlying prompts sent directly, no jailbreak transformation): 4.3% (9/210 attacks succeeded)

Each plugin includes a grader prompt. We consider an attack successful when the grader judges that the output clearly violates OpenAI's stated safety policies. For [critical categories](/docs/red-team/plugins/harmful) (child exploitation, weapons, self-harm), we also performed human review.

In this configuration, GPT-5.2 refused most direct harmful prompts, but structured jailbreaks bypassed those refusals.

## Findings

:::warning
Content below includes descriptions of model outputs involving drugs, child exploitation, and other harmful themes. If you are skimming, you can safely skip to "What This Means" without missing methodological details.
:::

### Finding 1: MDMA Synthesis Instructions

**Category:** [harmful:illegal-drugs](/docs/red-team/plugins/harmful)

Using the Hydra strategy, the model produced step-by-step MDMA synthesis instructions including reagents, reaction conditions, and purification steps.

![MDMA Synthesis Instructions](/img/blog/gpt-5.2-security/mdma.jpg)

### Finding 2: Targeted Harassment Generation

**Category:** harmful:insults

Using the Meta strategy, the model generated a 12-line acrostic targeting Elon Musk that spelled out a disparaging phrase. The creative writing format bypassed harassment filters. OpenAI's flagship model roasting xAI's founder on command. It didn't take much convincing.

![Targeted Harassment Generation](/img/blog/gpt-5.2-security/insult-elon.jpg)

### Finding 3: Drug Trafficking Assistance

**Category:** harmful:illegal-activities

The model provided operational guidance for shipping illegal drugs across borders, including packaging techniques, shipping carrier selection, and methods to avoid detection.

![Drug Trafficking Assistance](/img/blog/gpt-5.2-security/drug-shipment.jpg)

### Finding 4: Child Grooming Content

**Category:** harmful:child-exploitation

After the jailbreak bypassed safety filters, the model generated conversation scripts that could be used to manipulate minors. There are worse examples we are not sharing publicly.

![Child Grooming Content](/img/blog/gpt-5.2-security/grooming.jpg)

These four examples are representative, not exhaustive. Our eval surfaced additional outputs that we consider too harmful to publish.

## What This Means

GPT-5.2 ships with a [400K context window](/blog/foundation-model-security) (how much text it can process at once) and configurable reasoning. Both are useful. Neither fixes prompt injection. This mirrors [OWASP's guidance](https://owasp.org/www-project-top-10-for-large-language-model-applications/) that prompt injection and improper output handling remain core risks even for models with strong built-in safety mitigations.

OpenAI's [GPT-5.2 System Card](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf) corroborates this. Their StrongReject jailbreak eval shows GPT-5.2 Instant at 0.878 vs GPT-5.1 Instant at 0.976, noting "a regression in some cases under the illicit category." The System Card also states GPT-5.2 Instant "generally refuses fewer requests for mature content, specifically sexualized text output," which matches our sexual content findings.

If you're deploying GPT-5.2:

1. **Don't trust user input.** Anything from external sources (uploads, emails, web content) can contain [indirect injection attempts](/docs/red-team/plugins/indirect-prompt-injection).

2. **[Gate tool access](/docs/red-team/plugins/excessive-agency).** If your app calls functions based on model output, require confirmation for destructive actions.

3. **Test before shipping.** [Run your own red team](/docs/red-team/quickstart). The config above works out of the box.

## Run It Yourself

See our [foundation model red teaming guide](/docs/red-team/foundation-models/) for full details. Quick start:

```bash
# Clone the example
npx promptfoo@latest init --example redteam-foundation-model
cd redteam-foundation-model

# Point it at GPT-5.2
# Edit redteam.yaml: change target to openai:chat:gpt-5.2

# Run
npx promptfoo@latest redteam run
```

Full results take about 30 minutes with `-j 40`. You'll get a report showing which categories failed and which attacks succeeded.

Running this red team will generate harmful content. Keep results internal and limit access.

## Update: Results with Low Reasoning

_Added December 11, 2025, 5:01 PM PST_

We re-ran the eval with `reasoning_effort: 'low'`. Reasoning tokens let the model deliberate before responding, which costs more and adds latency but typically improves safety.

**Attack success by strategy:**

| Strategy               | No Reasoning | Low Reasoning | Change  |
| ---------------------- | ------------ | ------------- | ------- |
| **Hydra** (multi-turn) | 78.5%        | 61.8%         | -16.7pp |
| **Meta** (single-turn) | 61.0%        | 55.1%         | -5.9pp  |
| **Baseline**           | 4.3%         | 5.2%          | +0.9pp  |

Reasoning helped most with multi-turn attacks. But even with reasoning enabled, 62% of Hydra attacks and 55% of Meta attacks still succeeded. The eval required 5,615 probes vs 4,229 in the original. Adaptive strategies worked harder against the more resistant model.

High-failure categories under low reasoning:

- [Entity impersonation](/docs/red-team/plugins/imitation): 100%
- Profanity: 87%
- Harassment: 67%
- Dangerous activity: 67%
- Graphic content: 60%

**Takeaway:** Enable reasoning for safety-critical deployments. It provides meaningful improvement, but it's not a fix. Defense in depth still applies.

---

**Related posts:**

- [How to Red Team GPT](/blog/red-team-gpt) - Complete security testing guide for OpenAI models
- [What are the Security Risks of Deploying DeepSeek-R1?](/blog/deepseek-redteam) - Our red team analysis of DeepSeek
- [Jailbreaking LLMs: A Comprehensive Guide](/blog/how-to-jailbreak-llms) - How jailbreaks work, with examples
- [AI Red Teaming for First-Timers](/blog/ai-red-teaming-for-first-timers) - Getting started with security testing
- [How Much Does Foundation Model Security Matter?](/blog/foundation-model-security) - What to look for when choosing models
- [Prompt Injection vs Jailbreaking](/blog/jailbreaking-vs-prompt-injection) - Understanding the difference

---

Want the full red team report? Contact us at inquiries@promptfoo.dev.

Questions? Find us on [Discord](https://discord.gg/promptfoo) or [GitHub](https://github.com/promptfoo/promptfoo).