---
title: Meta Llama API
description: Use Meta's hosted Llama API service for text generation and multimodal tasks with promptfoo
---

# Meta Llama API

The Llama API provider lets you use Meta's hosted Llama models through Meta's official API service. It covers the Llama 4 multimodal models and the Llama 3.3 text models, as well as accelerated variants served by partners such as Cerebras and Groq.

## Setup

First, you'll need to get an API key from Meta:

1. Visit [llama.developer.meta.com](https://llama.developer.meta.com)
2. Sign up for an account and join the waitlist
3. Create an API key in the dashboard
4. Set the API key as an environment variable:

```bash
export LLAMA_API_KEY="your_api_key_here"
```

## Configuration

Use the `llamaapi:` prefix to specify Llama API models:

```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:chat:Llama-3.3-8B-Instruct # Explicit chat format
```

### Provider Options

```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.7 # Controls randomness (0.0-2.0)
      max_tokens: 1000 # Maximum response length
      top_p: 0.9 # Nucleus sampling parameter
      frequency_penalty: 0 # Reduce repetition (-2.0 to 2.0)
      presence_penalty: 0 # Encourage topic diversity (-2.0 to 2.0)
      stream: false # Set to true to enable streaming responses
```

## Available Models

### Meta-Hosted Models

#### Llama 4 (Multimodal)

- **`Llama-4-Maverick-17B-128E-Instruct-FP8`**: Multimodal mixture-of-experts model (17B active parameters, 128 experts) for image and text understanding
- **`Llama-4-Scout-17B-16E-Instruct-FP8`**: Smaller multimodal mixture-of-experts model (17B active parameters, 16 experts) for image and text understanding

Both Llama 4 models support:

- **Input**: Text and images
- **Output**: Text
- **Context Window**: 128k tokens
- **Rate Limits**: 3,000 RPM, 1M TPM

#### Llama 3.3 (Text-Only)

- **`Llama-3.3-70B-Instruct`**: 70B-parameter text model, the higher-quality option of the two
- **`Llama-3.3-8B-Instruct`**: 8B-parameter text model, the lighter and faster option

Both Llama 3.3 models support:

- **Input**: Text only
- **Output**: Text
- **Context Window**: 128k tokens
- **Rate Limits**: 3,000 RPM, 1M TPM

### Accelerated Variants (Preview)

For applications requiring ultra-low latency:

- **`Cerebras-Llama-4-Maverick-17B-128E-Instruct`** (32k context, 900 RPM, 300k TPM)
- **`Cerebras-Llama-4-Scout-17B-16E-Instruct`** (32k context, 600 RPM, 200k TPM)
- **`Groq-Llama-4-Maverick-17B-128E-Instruct`** (128k context, 1000 RPM, 600k TPM)

Note: Accelerated variants are text-only and don't support image inputs.

## Features

### Text Generation

Basic text generation works with all models:

```yaml
providers:
  - llamaapi:Llama-3.3-70B-Instruct

prompts:
  - 'Explain quantum computing in simple terms'

tests:
  - vars: {}
    assert:
      - type: contains
        value: 'quantum'
```

### Multimodal (Image + Text)

Llama 4 models can process images alongside text:

```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8

prompts:
  - role: user
    content:
      - type: text
        text: 'What do you see in this image?'
      - type: image_url
        image_url:
          url: 'https://example.com/image.jpg'

tests:
  - vars: {}
    assert:
      - type: llm-rubric
        value: 'Accurately describes the image content'
```

#### Image Requirements

- **Supported formats**: JPEG, PNG, GIF, ICO
- **Maximum file size**: 25MB per image
- **Maximum images per request**: 9
- **Input methods**: URL or base64 encoding
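
The limits above can be checked before a request is sent. The following is a minimal sketch, not part of promptfoo or the Llama API: `validate_images` and `to_data_url` are hypothetical helpers, the format check relies on the file extension only, "25MB" is assumed to mean 25 MiB, and the data URL form is assumed to be what the API accepts for base64 input.

```python
import base64
from pathlib import PurePath

# Limits copied from the list above.
MAX_IMAGE_BYTES = 25 * 1024 * 1024  # assumption: "25MB" means 25 MiB
MAX_IMAGES_PER_REQUEST = 9
# MIME types are an assumption; the source only names the formats.
MIME_BY_EXTENSION = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".ico": "image/x-icon",
}


def validate_images(images: list[tuple[str, bytes]]) -> list[str]:
    """Return a list of problems; an empty list means the batch looks acceptable."""
    problems = []
    if len(images) > MAX_IMAGES_PER_REQUEST:
        problems.append(f"too many images: {len(images)} > {MAX_IMAGES_PER_REQUEST}")
    for name, data in images:
        ext = PurePath(name).suffix.lower()
        if ext not in MIME_BY_EXTENSION:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if len(data) > MAX_IMAGE_BYTES:
            problems.append(f"{name}: {len(data)} bytes exceeds the 25MB limit")
    return problems


def to_data_url(name: str, data: bytes) -> str:
    """Encode image bytes as a base64 data URL (hypothetical helper)."""
    mime = MIME_BY_EXTENSION[PurePath(name).suffix.lower()]
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Checking the extension is cheap but does not prove the bytes are a valid image; inspecting the file header would be a stricter alternative.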

### JSON Structured Output

Generate responses following a specific JSON schema:

```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.1
      response_format:
        type: json_schema
        json_schema:
          name: product_review
          schema:
            type: object
            properties:
              rating:
                type: number
                minimum: 1
                maximum: 5
              summary:
                type: string
              pros:
                type: array
                items:
                  type: string
              cons:
                type: array
                items:
                  type: string
            required: ['rating', 'summary']

prompts:
  - 'Review this product: {{product_description}}'

tests:
  - vars:
      product_description: 'Wireless headphones with great sound quality but short battery life'
    assert:
      - type: is-json
      - type: javascript
        value: 'JSON.parse(output).rating >= 1 && JSON.parse(output).rating <= 5'
```
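
The JavaScript assertion above checks only the `rating` range. A fuller check could be written in Python, sketched here under the assumption that it is loaded as a file-based Python assertion exposing a `get_assert(output, context)` function; the checks themselves mirror the `product_review` schema from the config and nothing more.

```python
import json


def get_assert(output: str, context=None) -> bool:
    """Return True if output is JSON that fits the product_review schema."""
    try:
        review = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(review, dict):
        return False

    rating = review.get("rating")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, (int, float)):
        return False
    if not 1 <= rating <= 5:
        return False

    # "summary" is required and must be a string.
    if not isinstance(review.get("summary"), str):
        return False

    # "pros" and "cons" are optional arrays of strings.
    for key in ("pros", "cons"):
        items = review.get(key, [])
        if not isinstance(items, list):
            return False
        if not all(isinstance(item, str) for item in items):
            return False
    return True
```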

### Tool Calling

Enable models to call external functions:

```yaml
providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Get current weather for a location
            parameters:
              type: object
              properties:
                location:
                  type: string
                  description: City and state, e.g. San Francisco, CA
                unit:
                  type: string
                  enum: ['celsius', 'fahrenheit']
              required: ['location']

prompts:
  - "What's the weather like in {{city}}?"

tests:
  - vars:
      city: 'New York, NY'
    assert:
      - type: function-call
        value: get_weather
      - type: javascript
        value: "output.arguments.location.includes('New York')"
```
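
Outside of an eval, an application still has to run the function the model asked for. The sketch below shows one way to dispatch such a call. It assumes an OpenAI-style tool call object (`function.name` plus `function.arguments`), which the source does not confirm for the Llama API, and the `get_weather` body is a stand-in rather than a real weather lookup.

```python
import json


def get_weather(location: str, unit: str = "fahrenheit") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Maps tool names declared in the config to local Python functions.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(tool_call: dict) -> dict:
    """Run the local function named by an (assumed) OpenAI-style tool call."""
    function = tool_call["function"]
    name = function["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    arguments = function["arguments"]
    # Arguments often arrive as a JSON string; accept a dict as well.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    return TOOLS[name](**arguments)
```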

### Streaming

Enable real-time response streaming:

```yaml
providers:
  - id: llamaapi:Llama-3.3-8B-Instruct
    config:
      stream: true
      temperature: 0.7

prompts:
  - 'Write a short story about {{topic}}'

tests:
  - vars:
      topic: 'time travel'
    assert:
      - type: contains
        value: 'time'
```

## Rate Limits and Quotas

All rate limits are applied per team (across all API keys):

| Model Type      | Requests/min | Tokens/min      |
| --------------- | ------------ | --------------- |
| Standard Models | 3,000        | 1,000,000       |
| Cerebras Models | 600-900      | 200,000-300,000 |
| Groq Models     | 1,000        | 600,000         |

Rate limit information is available in response headers:

- `x-ratelimit-limit-tokens`: Total token limit
- `x-ratelimit-remaining-tokens`: Remaining tokens
- `x-ratelimit-limit-requests`: Total request limit
- `x-ratelimit-remaining-requests`: Remaining requests
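
A client can read these headers to decide when to slow down. The following is a minimal sketch, assuming the header values are plain integers (the source does not state their format); `parse_rate_limits`, `should_throttle`, and the 10% threshold are illustrative choices, not part of any library.

```python
# Header names from the list above, mapped to shorter keys.
RATE_LIMIT_HEADERS = {
    "x-ratelimit-limit-tokens": "limit_tokens",
    "x-ratelimit-remaining-tokens": "remaining_tokens",
    "x-ratelimit-limit-requests": "limit_requests",
    "x-ratelimit-remaining-requests": "remaining_requests",
}


def parse_rate_limits(headers: dict[str, str]) -> dict[str, int]:
    """Extract the rate limit headers that are present, as integers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        key: int(lowered[header])
        for header, key in RATE_LIMIT_HEADERS.items()
        if header in lowered
    }


def should_throttle(limits: dict[str, int], threshold: float = 0.1) -> bool:
    """True when remaining requests or tokens drop below the threshold fraction."""
    for kind in ("requests", "tokens"):
        limit = limits.get(f"limit_{kind}")
        remaining = limits.get(f"remaining_{kind}")
        if limit and remaining is not None and remaining / limit < threshold:
            return True
    return False
```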

## Model Selection Guide

### Choose Llama 4 Models When:

- You need multimodal capabilities (text + images)
- You want the most advanced reasoning and intelligence
- Quality is more important than speed
- You're building complex AI applications

### Choose Llama 3.3 Models When:

- You only need text processing
- You want a balance of quality and speed
- Cost efficiency is important
- You're building chatbots or content generation tools

### Choose Accelerated Variants When:

- Ultra-low latency is critical
- You're building real-time applications
- Text-only processing is sufficient
- You can work within reduced context windows (Cerebras models)
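
The guide above can be expressed as a simple filter over the model list. This is a sketch only: the `ModelInfo` type and `candidate_models` function are hypothetical, and the context sizes treat "128k" and "32k" as 128,000 and 32,000 tokens, which is an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    images: bool          # accepts image input
    accelerated: bool     # Cerebras or Groq variant
    context_tokens: int   # approximate context window


# Values taken from the "Available Models" section of this page.
MODELS = [
    ModelInfo("Llama-4-Maverick-17B-128E-Instruct-FP8", True, False, 128_000),
    ModelInfo("Llama-4-Scout-17B-16E-Instruct-FP8", True, False, 128_000),
    ModelInfo("Llama-3.3-70B-Instruct", False, False, 128_000),
    ModelInfo("Llama-3.3-8B-Instruct", False, False, 128_000),
    ModelInfo("Cerebras-Llama-4-Maverick-17B-128E-Instruct", False, True, 32_000),
    ModelInfo("Cerebras-Llama-4-Scout-17B-16E-Instruct", False, True, 32_000),
    ModelInfo("Groq-Llama-4-Maverick-17B-128E-Instruct", False, True, 128_000),
]


def candidate_models(
    needs_images: bool = False,
    needs_low_latency: bool = False,
    context_tokens: int = 0,
) -> list[str]:
    """Return model names that satisfy the stated requirements."""
    return [
        model.name
        for model in MODELS
        if (model.images or not needs_images)
        and model.accelerated == needs_low_latency
        and model.context_tokens >= context_tokens
    ]
```

Asking for both image input and low latency returns an empty list, which reflects the note that accelerated variants are text-only.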

## Best Practices

### Multimodal Usage

1. **Optimize image sizes**: Larger images consume more tokens
2. **Use appropriate formats**: JPEG for photos, PNG for graphics
3. **Batch multiple images**: Send up to 9 images in a single request where possible

### Token Management

1. **Monitor context windows**: 32k-128k depending on model
2. **Use `max_tokens` appropriately**: Control response length
3. **Estimate image tokens**: ~145 tokens per 336x336 pixel tile
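
The per-tile figure above can be turned into a rough budget check. The sketch below assumes an image is split into a grid of 336x336 tiles with partial tiles rounded up; the source gives only the per-tile cost, so the tiling rule and both function names are assumptions, and real token counts may differ.

```python
import math

TILE_SIZE = 336        # pixels per tile side, from the estimate above
TOKENS_PER_TILE = 145  # approximate tokens per tile, from the estimate above


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate, assuming a grid of tiles with partial tiles rounded up."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return tiles * TOKENS_PER_TILE


def fits_context(
    prompt_tokens: int,
    image_sizes: list[tuple[int, int]],
    max_tokens: int,
    context_window: int,
) -> bool:
    """Check that prompt, images, and the reserved response fit in the window."""
    image_tokens = sum(estimate_image_tokens(w, h) for w, h in image_sizes)
    return prompt_tokens + image_tokens + max_tokens <= context_window
```

For example, a 1000x700 image covers a 3x3 grid under this assumption, so it is estimated at 9 tiles.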

### Error Handling

1. **Implement retry logic**: For rate limits and transient failures
2. **Validate inputs**: Check image formats and sizes
3. **Monitor rate limits**: Read the rate limit response headers and slow down before you hit a limit

### Performance Optimization

1. **Choose the right model**: Balance quality vs. speed vs. cost
2. **Use streaming**: For better user experience with long responses
3. **Cache responses**: When appropriate for your use case
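
Response caching in application code can be as simple as keying on everything that affects the output. The following is a minimal in-memory sketch; `ResponseCache` is a hypothetical class, and the `call` argument stands for whatever function actually sends the request.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """In-memory cache keyed by model, prompt, and config (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def key(model: str, prompt: str, config: dict) -> str:
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "config": config}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        prompt: str,
        config: dict,
        call: Callable[[str, str, dict], str],
    ) -> str:
        """Return the cached response, calling the API function only on a miss."""
        cache_key = self.key(model, prompt, config)
        if cache_key not in self._store:
            self._store[cache_key] = call(model, prompt, config)
        return self._store[cache_key]
```

Caching makes the most sense with a low `temperature`, since sampled outputs at higher temperatures are expected to vary between calls.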

## Troubleshooting

### Authentication Issues

```
Error: 401 Unauthorized
```

- Verify your `LLAMA_API_KEY` environment variable is set
- Check that your API key is valid at llama.developer.meta.com
- Ensure you have access to the Llama API (currently in preview)
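
The first check in the list can be automated so that a missing key fails early with a clear message. This is a small sketch with a hypothetical `get_llama_api_key` helper; it confirms only that the variable is set and non-empty, not that the key is valid.

```python
import os
from collections.abc import Mapping


def get_llama_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get("LLAMA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("LLAMA_API_KEY is not set or is empty")
    return key
```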

### Rate Limiting

```
Error: 429 Too Many Requests
```

- Check your current rate limit usage
- Implement exponential backoff retry logic
- Consider distributing load across different time periods
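
Exponential backoff, as suggested in the list above, can be sketched as follows. `RateLimitError` is a hypothetical exception standing in for however your HTTP client reports a 429, and the delays, attempt count, and 10% jitter are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by the caller's client on a 429 response."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on RateLimitError, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the function can be exercised in tests without actually waiting.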

### Model Errors

```
Error: Model not found
```

- Verify the model name spelling
- Check model availability in your region
- Ensure you're using supported model IDs

### Image Processing Issues

```
Error: Invalid image format
```

- Check image format (JPEG, PNG, GIF, ICO only)
- Verify image size is under 25MB
- Ensure the image URL is publicly accessible

## Data Privacy

Meta states the following data commitments for the Llama API:

- ✅ **No training on your data**: Your inputs and outputs are not used for model training
- ✅ **Encryption**: Data encrypted at rest and in transit
- ✅ **No ads**: Data not used for advertising
- ✅ **Storage separation**: Strict access controls and isolated storage
- ✅ **Compliance**: Regular vulnerability management and compliance audits

## Comparison with Other Providers

| Feature        | Llama API    | OpenAI | Anthropic |
| -------------- | ------------ | ------ | --------- |
| Multimodal     | ✅ (Llama 4) | ✅     | ✅        |
| Tool Calling   | ✅           | ✅     | ✅        |
| JSON Schema    | ✅           | ✅     | ❌        |
| Streaming      | ✅           | ✅     | ✅        |
| Context Window | 32k-128k     | 128k   | 200k      |
| Data Training  | ❌           | ❌     | ❌        |

## Examples

Check out the [examples directory](https://github.com/promptfoo/promptfoo/tree/main/examples/provider-llama-cpp) for:

- **Basic chat**: Simple text generation
- **Multimodal**: Image understanding tasks
- **Structured output**: JSON schema validation
- **Tool calling**: Function calling examples
- **Model comparison**: Performance benchmarking

## Related Providers

- [OpenAI](/docs/providers/openai) - Similar API structure and capabilities
- [Anthropic](/docs/providers/anthropic) - Alternative AI provider
- [Together AI](/docs/providers/togetherai) - Hosts various open-source models including Llama
- [OpenRouter](/docs/providers/openrouter) - Provides access to multiple AI models including Llama

For questions and support, visit the [Llama API documentation](https://llama.developer.meta.com/docs) or join the [promptfoo Discord community](https://discord.gg/promptfoo).