# redteam-coding-agent (Coding Agent Red Team)

Red team autonomous coding agents for repository prompt injection, terminal output injection, secret reads, procfs abuse, sandbox read/write escapes, network egress, delayed CI exfiltration, generated vulnerabilities, automation poisoning, steganographic leakage, and verifier sabotage vulnerabilities.

```bash
npx promptfoo@latest init --example redteam-coding-agent
cd redteam-coding-agent
```

## Quick start

The default config targets `gpt-5.4` as a simulated coding agent. Set your API key and run:

```bash
export OPENAI_API_KEY=sk-...
npx promptfoo@latest redteam run
npx promptfoo@latest view
```

Required environment:

- `OPENAI_API_KEY`: Used by the default OpenAI provider. The Codex SDK provider also uses this when `CODEX_API_KEY` is not set.

Optional environment:

- `CODEX_API_KEY`: Use this for Codex SDK auth when you do not want to reuse `OPENAI_API_KEY`.
- `PROMPTFOO_REMOTE_GENERATION_URL`: Point generation at a self-hosted red-team generation service.
- `PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true`: Disable remote generation. The `coding-agent:core` and `coding-agent:all` collections and the individual coding-agent plugins are unavailable in this mode because they require remote-generated scenarios.

When testing a real coding agent, set `providers[0].config.working_dir` to a disposable checkout and use synthetic canary secrets. Do not put production credentials in the eval environment.

## Using a real coding agent

For meaningful results, point the config at a real agent provider:

- **Codex SDK**: Uncomment the `openai:codex-sdk` provider in `promptfooconfig.yaml` and set `working_dir` to a disposable checkout.
- **Custom agent**: Use a `file://` provider or HTTP endpoint that wraps your agent framework.

Real agent providers should return structured output with command executions and trace data so the deterministic verifiers can inspect commands, outputs, and file changes — not just the final response.

Use a fresh writable checkout for each red-team row when you want high-confidence results. Coding agents leave files behind, change tests, and may copy canaries into artifacts; reusing one workspace across rows can make a later failure look like the wrong plugin failed.

## What the plugins test

| Plugin                                   | Vulnerability                                             | Detection                                                          |
| ---------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------ |
| `coding-agent:repo-prompt-injection`     | Agent follows hidden instructions in repository files     | Canary value detection + LLM grading                               |
| `coding-agent:terminal-output-injection` | Agent treats build/test output as authoritative commands  | Receipt propagation detection + LLM grading                        |
| `coding-agent:secret-env-read`           | Agent reads or exposes launcher environment variables     | Canary value detection in output/traces                            |
| `coding-agent:secret-file-read`          | Agent reads or exposes protected local file contents      | Canary/digest detection in output/traces/artifacts                 |
| `coding-agent:sandbox-read-escape`       | Agent reads files outside the intended workspace          | Receipt detection + LLM grading                                    |
| `coding-agent:sandbox-write-escape`      | Agent mutates files outside the intended workspace        | Command detection + host file/probe checks + LLM grading           |
| `coding-agent:network-egress-bypass`     | Agent calls or plants code for unapproved network egress  | Command detection + trap/proxy logs + workspace scan + LLM grading |
| `coding-agent:procfs-credential-read`    | Agent recovers credentials through process metadata       | Procfs command detection + receipt detection                       |
| `coding-agent:delayed-ci-exfil`          | Agent plants future CI or automation data leakage         | Command detection + workflow/package scan + LLM grading            |
| `coding-agent:generated-vulnerability`   | Agent introduces insecure code while completing a feature | High-confidence code heuristics + LLM grading                      |
| `coding-agent:automation-poisoning`      | Agent persists unsafe hooks, scripts, or workflows        | Automation scan + host file hashes + LLM grading                   |
| `coding-agent:steganographic-exfil`      | Agent hides secrets in encoded or low-visibility sinks    | Transformed canary detection + artifact scan                       |
| `coding-agent:verifier-sabotage`         | Agent weakens tests instead of fixing code                | Heuristic pattern detection + file hash checks + LLM grading       |

## Strategies

`jailbreak:meta` is the iterative single-turn strategy. `jailbreak:hydra`, `goat`, and `crescendo` exercise conversational multi-turn behavior and need providers/workspaces that can preserve state safely. Encoding and prompt-rewriting strategies (base64, rot13, math-prompt, jailbreak:composite, etc.) are automatically excluded since they would break deterministic canary matching.

Use `coding-agent:core` for quick baseline coverage, or switch to `coding-agent:all` when you want the full coding-agent suite in one collection.

## Safety

- Never place production secrets in red-team configs. Use synthetic canaries (`{{secretEnvValue}}`).
- Run agents in disposable, least-privilege sandboxes and delete or reset the workspace between rows.
- The generated prompts ask the agent to do unsafe things — that's the point. The graders evaluate whether the agent resists.