---
sidebar_label: CI/CD
title: CI/CD Integration for LLM Eval and Security
description: Automate LLM testing in CI/CD pipelines with GitHub Actions, GitLab CI, and Jenkins for continuous security and quality checks
keywords:
  [
    ci/cd,
    continuous integration,
    llm testing,
    automated evaluation,
    security scanning,
    github actions,
  ]
---

# CI/CD Integration for LLM Evaluation and Security

Integrate promptfoo into your CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, and ensure quality before deployment. This guide covers modern CI/CD workflows for both quality testing and security scanning.

## Why CI/CD for LLM Apps?

- **Catch regressions early** - Test prompt changes before they reach production
- **Security scanning** - Automated red teaming and vulnerability detection
- **Quality gates** - Enforce minimum performance thresholds
- **Compliance** - Generate reports for OWASP, NIST, and other frameworks
- **Cost control** - Track token usage and API costs over time

## Quick Start

If you're using GitHub Actions, check out our [dedicated GitHub Actions guide](/docs/integrations/github-action) or the [GitHub Marketplace action](https://github.com/marketplace/actions/test-llm-outputs).

For other platforms, here's a basic example:

```bash
# Run eval (no global install required)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Run security scan (red teaming)
npx promptfoo@latest redteam run
```

## Prerequisites

- Node.js 20 or newer installed in your CI environment
- LLM provider API keys (stored as secure environment variables)
- A promptfoo configuration file (`promptfooconfig.yaml`)
- (Optional) Docker for containerized environments

## Core Concepts

### 1. Eval vs Red Teaming

Promptfoo supports two main CI/CD workflows:

**Eval** - Test prompt quality and performance:

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml
```

**Red Teaming** - Security vulnerability scanning:

```bash
npx promptfoo@latest redteam run
```

See our [red team quickstart](/docs/red-team/quickstart) for security testing details.

### 2. Output Formats

Promptfoo supports multiple output formats for different CI/CD needs:

```bash
# JSON for programmatic processing
npx promptfoo@latest eval -o results.json

# HTML for human-readable reports
npx promptfoo@latest eval -o report.html

# XML for enterprise tools
npx promptfoo@latest eval -o results.xml

# Multiple formats
npx promptfoo@latest eval -o results.json -o report.html
```

Learn more about [output formats and processing](/docs/configuration/outputs).

:::info Enterprise Feature
SonarQube integration is available in [Promptfoo Enterprise](/docs/enterprise/). Use the standard JSON output format and process it for SonarQube import.
:::

### 3. Quality Gates

Fail the build when quality thresholds aren't met:

```bash
# Fail on any test failures
npx promptfoo@latest eval --fail-on-error

# Custom threshold checking
npx promptfoo@latest eval -o results.json
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)
if (( $(echo "$PASS_RATE < 95" | bc -l) )); then
  echo "Quality gate failed: Pass rate ${PASS_RATE}% < 95%"
  exit 1
fi
```

See [assertions and metrics](/docs/configuration/expected-outputs) for comprehensive validation options.
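The pipeline examples below all assume a `promptfooconfig.yaml` at the repository root. If you don't have one yet, a minimal sketch looks like this (the prompt, provider, and assertions are illustrative placeholders, not required values):

```yaml title="promptfooconfig.yaml"
# Illustrative starter config; swap in your own prompts, provider, and assertions
prompts:
  - 'Summarize this support ticket in one sentence: {{ticket}}'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: 'My order arrived two weeks late and the box was damaged.'
    assert:
      - type: contains
        value: late
```

See the [Configuration Guide](/docs/configuration/guide) for the full schema.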
## Platform-Specific Guides

### GitHub Actions

```yaml title=".github/workflows/eval.yml"
name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - name: Cache promptfoo
        uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('prompts/**') }}
          restore-keys: |
            ${{ runner.os }}-promptfoo-

      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
        run: |
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json \
            -o report.html

      - name: Check quality gate
        run: |
          FAILURES=$(jq '.results.stats.failures' results.json)
          if [ "$FAILURES" -gt 0 ]; then
            echo "❌ Eval failed with $FAILURES failures"
            exit 1
          fi
          echo "✅ All tests passed!"

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: |
            results.json
            report.html
```

For red teaming in CI/CD:

```yaml title=".github/workflows/redteam.yml"
name: Security Scan
on:
  schedule:
    - cron: '0 0 * * *' # Daily
  workflow_dispatch:

jobs:
  red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run red team scan
        uses: promptfoo/promptfoo-action@v1
        with:
          type: 'redteam'
          config: 'promptfooconfig.yaml'
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
```

See also: [Standalone GitHub Action example](https://github.com/promptfoo/promptfoo/tree/main/examples/integration-github-action).

### GitLab CI

See our [detailed GitLab CI guide](/docs/integrations/gitlab-ci).

```yaml title=".gitlab-ci.yml"
image: node:20

evaluate:
  script:
    - |
      npx promptfoo@latest eval \
        -c promptfooconfig.yaml \
        --share \
        -o output.json \
        -o report.html
  variables:
    OPENAI_API_KEY: ${OPENAI_API_KEY}
    PROMPTFOO_CACHE_PATH: .cache/promptfoo
  cache:
    key: ${CI_COMMIT_REF_SLUG}-promptfoo
    paths:
      - .cache/promptfoo
  artifacts:
    paths:
      - output.json
      - report.html
```

### Jenkins

See our [detailed Jenkins guide](/docs/integrations/jenkins).

```groovy title="Jenkinsfile"
pipeline {
  agent any

  environment {
    OPENAI_API_KEY = credentials('openai-api-key')
    PROMPTFOO_CACHE_PATH = "${WORKSPACE}/.cache/promptfoo"
  }

  stages {
    stage('Evaluate') {
      steps {
        sh '''
          npx promptfoo@latest eval \
            -c promptfooconfig.yaml \
            --share \
            -o results.json
        '''
      }
    }

    stage('Quality Gate') {
      steps {
        script {
          def results = readJSON file: 'results.json'
          def failures = results.results.stats.failures
          if (failures > 0) {
            error("Eval failed with ${failures} failures")
          }
        }
      }
    }
  }
}
```

### Other Platforms

- [Azure Pipelines](/docs/integrations/azure-pipelines)
- [AWS CodeCommit](/docs/integrations/aws-codecommit)
- [CircleCI](/docs/integrations/circle-ci)
- [Bitbucket Pipelines](/docs/integrations/bitbucket-pipelines)
- [Travis CI](/docs/integrations/travis-ci)
- [n8n workflows](/docs/integrations/n8n)
- [Looper](/docs/integrations/looper)

## Advanced Patterns

### 1. Docker-based CI/CD

Create a custom Docker image with promptfoo pre-installed:

```dockerfile title="Dockerfile"
FROM node:20-slim

WORKDIR /app

# Install the CLI at build time so each CI run doesn't re-download it
RUN npm install -g promptfoo

COPY . .

CMD ["promptfoo", "eval"]
```
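A typical pipeline step builds the image once and runs evals inside the container; a minimal sketch, where the `promptfoo-ci` tag is an arbitrary choice:

```bash
# Build the image (tag is arbitrary)
docker build -t promptfoo-ci .

# Run the default eval, passing the provider key through from the CI environment
docker run --rm -e OPENAI_API_KEY promptfoo-ci

# Override the default command, e.g. for a red team scan
docker run --rm -e OPENAI_API_KEY promptfoo-ci promptfoo redteam run
```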
### 2. Parallel Testing

Test multiple models or configurations in parallel:

```yaml
# GitHub Actions example
strategy:
  matrix:
    model: [gpt-4, gpt-3.5-turbo, claude-3-opus]
steps:
  - name: Test ${{ matrix.model }}
    run: |
      npx promptfoo@latest eval \
        --providers.0.config.model=${{ matrix.model }} \
        -o results-${{ matrix.model }}.json
```

### 3. Scheduled Security Scans

Run comprehensive security scans on a schedule:

```yaml
# GitHub Actions
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Full red team scan
        run: |
          npx promptfoo@latest redteam generate \
            --plugins harmful,pii,contracts \
            --strategies jailbreak,prompt-injection
          npx promptfoo@latest redteam run
```

### 4. SonarQube Integration

:::info Enterprise Feature
Direct SonarQube output format is available in [Promptfoo Enterprise](/docs/enterprise/). For open-source users, export to JSON and transform the results.
:::

For enterprise environments, integrate with SonarQube:

```yaml
# Export results for SonarQube processing
- name: Run promptfoo security scan
  run: |
    npx promptfoo@latest eval \
      --config promptfooconfig.yaml \
      -o results.json

# Transform results for SonarQube (custom script required)
- name: Transform for SonarQube
  run: |
    node transform-to-sonarqube.js results.json > sonar-report.json

- name: SonarQube scan
  env:
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
  run: |
    sonar-scanner \
      -Dsonar.externalIssuesReportPaths=sonar-report.json
```

See our [SonarQube integration guide](/docs/integrations/sonarqube) for detailed setup.
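Promptfoo doesn't ship a `transform-to-sonarqube.js`; it's a script you write against the JSON output schema (documented under [Processing Results](#processing-results) below). A minimal sketch targeting SonarQube's generic issue import format, where the engine/rule IDs, severity, and the file the issues are attached to are all assumptions you'd tune:

```javascript title="transform-to-sonarqube.js"
// Sketch: map failed promptfoo results onto SonarQube's generic issue format.
const fs = require('fs');

const results = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));

const issues = results.results.outputs
  .filter((o) => !o.pass)
  .map((o) => ({
    engineId: 'promptfoo', // arbitrary engine name
    ruleId: 'llm-eval-failure', // arbitrary rule id
    severity: 'CRITICAL',
    type: 'VULNERABILITY',
    primaryLocation: {
      message: o.error || 'LLM eval assertion failed',
      filePath: 'promptfooconfig.yaml', // file the issue is reported against
    },
  }));

// Without a textRange, SonarQube reports each issue at the file level
console.log(JSON.stringify({ issues }, null, 2));
```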
## Processing Results

### Parsing JSON Output

The output JSON follows this schema:

```typescript
interface OutputFile {
  evalId?: string;
  results: {
    stats: {
      successes: number;
      failures: number;
      errors: number;
    };
    outputs: Array<{
      pass: boolean;
      score: number;
      error?: string;
      // ... other fields
    }>;
  };
  config: UnifiedConfig;
  shareableUrl: string | null;
}
```

Example processing script:

```javascript title="process-results.js"
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Calculate metrics
const passRate =
  (results.results.stats.successes /
    (results.results.stats.successes + results.results.stats.failures)) *
  100;

console.log(`Pass rate: ${passRate.toFixed(2)}%`);
console.log(`Shareable URL: ${results.shareableUrl}`);

// Check for specific failures
const criticalFailures = results.results.outputs.filter(
  (o) => o.error?.includes('security') || o.error?.includes('injection'),
);

if (criticalFailures.length > 0) {
  console.error('Critical security failures detected!');
  process.exit(1);
}
```

### Posting Results

Post eval results to PR comments, Slack, or other channels:

```bash
# Extract and post results
SHARE_URL=$(jq -r '.shareableUrl' results.json)
PASS_RATE=$(jq '.results.stats.successes / (.results.stats.successes + .results.stats.failures) * 100' results.json)

# Post to GitHub PR
gh pr comment --body "
## Promptfoo Eval Results
- Pass rate: ${PASS_RATE}%
- [View detailed results](${SHARE_URL})
"
```
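The same variables work for other channels. For Slack, posting to an incoming webhook is enough; a sketch, where the `SLACK_WEBHOOK_URL` secret and message shape are illustrative:

```bash
# Post the summary to a Slack incoming webhook stored as a CI secret
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"Promptfoo eval: ${PASS_RATE}% pass rate - ${SHARE_URL}\"}"
```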
## Caching Strategies

Optimize CI/CD performance with proper caching:

```yaml
# Set cache location
env:
  PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
  PROMPTFOO_CACHE_TTL: 86400 # 24 hours

# Cache configuration
cache:
  key: promptfoo-${{ hashFiles('prompts/**', 'promptfooconfig.yaml') }}
  paths:
    - ~/.cache/promptfoo
```

## Security Best Practices

1. **API Key Management**
   - Store API keys as encrypted secrets
   - Use least-privilege access controls
   - Rotate keys regularly

2. **Network Security**
   - Use private runners for sensitive data
   - Restrict outbound network access
   - Consider on-premise deployments for enterprise

3. **Data Privacy**
   - Enable output stripping for sensitive data:

   ```bash
   export PROMPTFOO_STRIP_RESPONSE_OUTPUT=true
   export PROMPTFOO_STRIP_TEST_VARS=true
   ```

4. **Audit Logging**
   - Keep eval history
   - Track who triggered security scans
   - Monitor for anomalous patterns

## Troubleshooting

### Common Issues

| Issue         | Solution                                         |
| ------------- | ------------------------------------------------ |
| Rate limits   | Enable caching, reduce concurrency with `-j 1`   |
| Timeouts      | Increase timeout values, use `--max-concurrency` |
| Memory issues | Use streaming mode, process results in batches   |
| Cache misses  | Check cache key includes all relevant files      |

### Debug Mode

Enable detailed logging:

```bash
LOG_LEVEL=debug npx promptfoo@latest eval -c config.yaml
```

## Real-World Examples

### Automated Testing Examples

- [Self-grading example](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-self-grading) - Automated LLM evaluation
- [Custom grading prompts](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-custom-grading-prompt) - Complex evaluation logic
- [Store and reuse outputs](https://github.com/promptfoo/promptfoo/tree/main/examples/config-store-and-reuse-outputs) - Multi-step testing

### Security Examples

- [Red team starter](https://github.com/promptfoo/promptfoo/tree/main/examples/redteam-starter) - Basic security testing
- [RAG red team tests](https://github.com/promptfoo/promptfoo/tree/main/examples/redteam-rag) - Customer service red team testing (RBAC, competitors, harmful content)
- [DoNotAnswer dataset](https://github.com/promptfoo/promptfoo/tree/main/examples/redteam-donotanswer) - Harmful content detection

### Integration Examples

- [GitHub Action standalone](https://github.com/promptfoo/promptfoo/tree/main/examples/integration-github-action) - Custom GitHub workflows
- [JSON output processing](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-json-output) - Result parsing patterns
- [CSV test data](https://github.com/promptfoo/promptfoo/tree/main/examples/simple-test) - Bulk test management

## Related Documentation

### Configuration & Testing

- [Configuration Guide](/docs/configuration/guide) - Complete setup instructions
- [Test Cases](/docs/configuration/test-cases) - Writing effective tests
- [Assertions & Metrics](/docs/configuration/expected-outputs) - Validation strategies
- [Python Assertions](/docs/configuration/expected-outputs/python) - Custom Python validators
- [JavaScript Assertions](/docs/configuration/expected-outputs/javascript) - Custom JS validators

### Security & Red Teaming

- [Red Team Architecture](/docs/red-team/architecture) - Security testing framework
- [OWASP Top 10 for LLMs](/docs/red-team/owasp-llm-top-10) - Security compliance
- [RAG Security Testing](/docs/red-team/rag) - Testing retrieval systems
- [MCP Security Testing](/docs/red-team/mcp-security-testing) - Model Context Protocol security

### Enterprise & Scaling

- [Enterprise Features](/docs/enterprise/) - Team collaboration and compliance
- [Red Teams in Enterprise](/docs/enterprise/red-teams) - Organization-wide security
- [Service Accounts](/docs/enterprise/service-accounts) - Automated access

## See Also

- [GitHub Actions Integration](/docs/integrations/github-action)
- [Red Team Quickstart](/docs/red-team/quickstart)
- [Enterprise Features](/docs/enterprise/)
- [Configuration Reference](/docs/configuration/reference)