---
title: 'Evaluating ElevenLabs Voice AI'
description: 'Step-by-step guide for testing ElevenLabs voice AI with Promptfoo - from TTS quality testing to conversational agent evaluation'
---

# Evaluating ElevenLabs Voice AI

This guide walks you through testing ElevenLabs voice AI capabilities using Promptfoo, from basic text-to-speech quality testing to advanced conversational agent evaluation.

## Part 1: Text-to-Speech Quality Testing

Let's start by comparing two TTS models on latency and cost, then listen to the generated audio to judge quality.

### Step 1: Setup

Install Promptfoo and set your API key:

```sh
npm install -g promptfoo
export ELEVENLABS_API_KEY=your_api_key_here
```
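Before running an eval, it can help to confirm the key is actually visible to the process. The following is a minimal Node.js sketch; `missingEnvVars` is a hypothetical helper written for this guide, not part of Promptfoo.

```javascript
// Hypothetical preflight check; `missingEnvVars` is not a Promptfoo API.
function missingEnvVars(env, required) {
  // A variable counts as missing when it is unset or contains only whitespace.
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

const missing = missingEnvVars(process.env, ['ELEVENLABS_API_KEY']);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(', ')}`);
} else {
  console.log('ELEVENLABS_API_KEY is set');
}
```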

### Step 2: Create Your First Config

Create `promptfooconfig.yaml`:

```yaml
description: 'Compare ElevenLabs TTS models for customer service greetings'

prompts:
  - "Thank you for calling TechSupport Inc. My name is Alex, and I'll be assisting you today. How can I help?"

providers:
  - label: Flash Model (Fastest)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      outputFormat: mp3_44100_128

  - label: Turbo Model (Best Quality)
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_turbo_v2_5
      outputFormat: mp3_44100_128

tests:
  - description: Both models complete within 3 seconds
    assert:
      - type: latency
        threshold: 3000

  - description: Cost is under $0.01 per greeting
    assert:
      - type: cost
        threshold: 0.01
```

### Step 3: Run Your First Eval

```sh
promptfoo eval
```

You'll see results comparing both models. The numbers below are illustrative; actual latency and cost vary from run to run:

```text
┌─────────────────────────┬──────────┬──────────┐
│ Prompt                  │ Flash    │ Turbo    │
├─────────────────────────┼──────────┼──────────┤
│ Thank you for calling...│ ✓ Pass   │ ✓ Pass   │
│ Latency: <3s            │ 847ms    │ 1,234ms  │
│ Cost: <$0.01            │ $0.003   │ $0.004   │
└─────────────────────────┴──────────┴──────────┘
```
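The `latency` and `cost` assertions come down to a threshold comparison per provider. The sketch below replays that logic on the sample numbers from the table above. The `checkThresholds` helper and the record shape are assumptions for illustration, not Promptfoo internals, and the comparison is assumed to be inclusive.

```javascript
// Illustrative only: mirrors the two thresholds from promptfooconfig.yaml.
const THRESHOLDS = { latencyMs: 3000, cost: 0.01 };

// Sample values copied from the example output table above.
const results = [
  { label: 'Flash', latencyMs: 847, cost: 0.003 },
  { label: 'Turbo', latencyMs: 1234, cost: 0.004 },
];

// Returns one pass/fail record per provider (comparison assumed inclusive).
function checkThresholds(rows, limits) {
  return rows.map((row) => ({
    label: row.label,
    latencyOk: row.latencyMs <= limits.latencyMs,
    costOk: row.cost <= limits.cost,
  }));
}

for (const r of checkThresholds(results, THRESHOLDS)) {
  const latency = r.latencyOk ? 'pass' : 'fail';
  const cost = r.costOk ? 'pass' : 'fail';
  console.log(`${r.label}: latency ${latency}, cost ${cost}`);
}
```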

### Step 4: View Results

Open the web UI to listen to the audio:

```sh
promptfoo view
```

## Part 2: Voice Customization

Now let's optimize voice settings for different use cases.

### Step 5: Add Voice Settings

Update your config:

```yaml
description: 'Test voice settings for different scenarios'

prompts:
  - 'Welcome to our automated system.' # Formal announcement
  - 'Hey there! Thanks for reaching out.' # Casual greeting
  - 'I understand your frustration. Let me help.' # Empathetic response

providers:
  - label: Professional Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.8 # Consistent tone
        similarity_boost: 0.85
        speed: 0.95

  - label: Friendly Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.4 # More variation
        similarity_boost: 0.75
        speed: 1.1 # Slightly faster

  - label: Empathetic Voice
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5
      voiceSettings:
        stability: 0.5
        similarity_boost: 0.7
        style: 0.8 # More expressive
        speed: 0.9 # Slower, calmer

tests:
  - vars:
      scenario: formal
    provider: Professional Voice
    assert:
      - type: javascript
        value: output.includes("Welcome") || output.includes("system")

  - vars:
      scenario: casual
    provider: Friendly Voice
    assert:
      - type: latency
        threshold: 2000

  - vars:
      scenario: empathy
    provider: Empathetic Voice
    assert:
      - type: cost
        threshold: 0.01
```

Run the eval:

```sh
promptfoo eval
promptfoo view  # Compare the different voice styles
```
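A typo in a voice setting can be hard to spot by ear, so a quick range check before running the eval may save a round trip. This sketch assumes that `stability`, `similarity_boost`, and `style` accept values from 0 to 1 and that `speed` accepts 0.7 to 1.2; treat those ranges as assumptions and confirm them against the current ElevenLabs API reference. `validateVoiceSettings` is a hypothetical helper.

```javascript
// Assumed valid ranges; confirm against the current ElevenLabs API reference.
const RANGES = new Map([
  ['stability', [0, 1]],
  ['similarity_boost', [0, 1]],
  ['style', [0, 1]],
  ['speed', [0.7, 1.2]],
]);

// Returns a list of human-readable problems; an empty list means nothing was flagged.
function validateVoiceSettings(settings) {
  const problems = [];
  for (const [key, value] of Object.entries(settings)) {
    const range = RANGES.get(key);
    if (!range) {
      problems.push(`${key}: unknown setting`);
    } else if (typeof value !== 'number' || value < range[0] || value > range[1]) {
      problems.push(`${key}: ${value} is outside ${range[0]}..${range[1]}`);
    }
  }
  return problems;
}

// The "Empathetic Voice" settings from the config above.
console.log(validateVoiceSettings({ stability: 0.5, similarity_boost: 0.7, style: 0.8, speed: 0.9 }));
// A deliberately bad value to show what a reported problem looks like.
console.log(validateVoiceSettings({ stability: 1.4 }));
```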

## Part 3: Speech-to-Text Accuracy

Test transcription accuracy by creating a TTS → STT pipeline.

### Step 6: Create Transcription Pipeline

Create `transcription-test.yaml`:

```yaml
description: 'Test TTS → STT accuracy pipeline'

prompts:
  - |
    The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM.
    Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet.
    Conference room B has been reserved for this meeting.

providers:
  # Step 1: Generate audio
  - label: tts-generator
    id: elevenlabs:tts:rachel
    config:
      modelId: eleven_flash_v2_5

tests:
  - description: Generate audio and verify quality
    provider: tts-generator
    assert:
      - type: javascript
        value: |
          // Verify audio was generated
          const result = JSON.parse(output);
          return result.audio && result.audio.sizeBytes > 0;
```
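The inline assertion above throws if `output` is not valid JSON, which surfaces as an error rather than a clear failure. One option is to move the check into a file and reference it with `value: file://assert-audio.js`. The file name is hypothetical, and the sketch assumes the same `audio.sizeBytes` field used in the inline assertion. It returns a result object with a reason so failures are easier to read.

```javascript
// assert-audio.js (hypothetical file name)
// Assumes the provider output is a JSON string with an `audio.sizeBytes` field,
// as in the inline assertion above.
function audioWasGenerated(output) {
  let result;
  try {
    result = typeof output === 'string' ? JSON.parse(output) : output;
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  // Treat a missing `audio` object the same as an empty one.
  const sizeBytes = result && result.audio ? result.audio.sizeBytes : undefined;
  const pass = typeof sizeBytes === 'number' && sizeBytes > 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? `Audio generated (${sizeBytes} bytes)` : 'No audio data in output',
  };
}

module.exports = audioWasGenerated;
```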

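Now add STT to verify accuracy. First run the config above with `promptfoo eval -c transcription-test.yaml` and save the generated audio as `audio/generated-speech.mp3`. Then create a second config, `stt-accuracy.yaml`: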

```yaml
description: 'Test STT accuracy'

prompts:
  - file://audio/generated-speech.mp3 # Audio from previous eval

providers:
  - id: elevenlabs:stt
    config:
      modelId: eleven_speech_to_text_v1
      calculateWER: true

tests:
  - vars:
      referenceText: 'The quarterly sales meeting is scheduled for Thursday, March 15th at 2:30 PM. Please bring your laptop, quarterly reports, and the Q4 projections spreadsheet. Conference room B has been reserved for this meeting.'
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          // Check Word Error Rate is under 5%
          if (result.wer_result) {
            console.log('WER:', result.wer_result.wer);
            return result.wer_result.wer < 0.05;
          }
          return false;
```

Run the STT eval:

```sh
promptfoo eval -c stt-accuracy.yaml
```
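Word Error Rate is the word-level edit distance between the reference and the transcript (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a standalone implementation for intuition only. The normalization step (lowercasing and stripping punctuation) is an assumption and may differ from what the provider does when `calculateWER` is enabled.

```javascript
// Lowercase, replace punctuation with spaces, and split on whitespace.
// This normalization is an assumption; the provider's may differ.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

// Word-level Levenshtein distance divided by the reference length.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  // WER is undefined for an empty reference; by convention return 0 or 1 here.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j]: distance between the reference words processed so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

console.log(wordErrorRate('Conference room B has been reserved.', 'conference room be has been reserved'));
```

With a 5% threshold, a reference of about 40 words tolerates at most two word-level errors, so short test sentences make the check strict.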

## Part 4: Conversational Agent Testing

Test a complete voice agent with evaluation criteria.

### Step 7: Create Agent Config

Create `agent-test.yaml`:

```yaml
description: 'Test customer support agent performance'

prompts:
  - |
    User: Hi, I'm having trouble with my account
    User: I can't log in with my password
    User: My email is user@example.com
    User: I already tried resetting it twice

providers:
  - id: elevenlabs:agents
    config:
      # Create an ephemeral agent for testing
      agentConfig:
        name: Support Agent
        prompt: |
          You are a helpful customer support agent for TechCorp.
          Your job is to:
          1. Greet customers warmly
          2. Understand their issue
          3. Collect necessary information (email, account number)
          4. Provide clear next steps
          5. Maintain a professional, empathetic tone

          Never make promises you can't keep. Always set clear expectations.
        voiceId: 21m00Tcm4TlvDq8ikWAM # Rachel
        llmModel: gpt-5-mini

      # Define evaluation criteria
      evaluationCriteria:
        - name: greeting
          description: Agent greets the user warmly
          weight: 0.8
          passingThreshold: 0.8

        - name: information_gathering
          description: Agent asks for email or account details
          weight: 1.0
          passingThreshold: 0.9

        - name: empathy
          description: Agent acknowledges user frustration
          weight: 0.9
          passingThreshold: 0.7

        - name: next_steps
          description: Agent provides clear next steps
          weight: 1.0
          passingThreshold: 0.9

        - name: professionalism
          description: Agent maintains professional tone
          weight: 0.8
          passingThreshold: 0.8

      # Limit conversation for testing
      maxTurns: 8
      timeout: 60000

tests:
  - description: Agent passes all critical evaluation criteria
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          const criteria = result.analysis.evaluation_criteria_results;

          // Check that critical criteria passed
          const critical = ['information_gathering', 'next_steps', 'professionalism'];
          const criticalPassed = criteria
            .filter(c => critical.includes(c.name))
            .every(c => c.passed);

          console.log('Criteria Results:');
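          criteria.forEach(c => {
            // Guard against a missing score so logging never throws
            const score = typeof c.score === 'number' ? c.score.toFixed(2) : 'n/a';
            console.log(`  ${c.name}: ${c.passed ? '✓' : '✗'} (score: ${score})`);
          });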

          return criticalPassed;

  - description: Agent conversation stays within turn limit
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          return result.transcript.length <= 8;

  - description: Agent responds within reasonable time
    assert:
      - type: latency
        threshold: 60000
```

Run the agent eval:

```sh
promptfoo eval -c agent-test.yaml
```
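The config assigns a `weight` to each criterion, but the assertion above only checks pass/fail on the critical ones. If a single summary number is useful, one possible aggregation is a weighted mean of the per-criterion scores. This is a hypothetical sketch: the source does not state how weights are combined, and the sample scores are made up for illustration.

```javascript
// Weights copied from the evaluationCriteria block in agent-test.yaml.
const WEIGHTS = {
  greeting: 0.8,
  information_gathering: 1.0,
  empathy: 0.9,
  next_steps: 1.0,
  professionalism: 0.8,
};

// Weighted mean of criterion scores; entries without a known weight
// or a numeric score are skipped.
function weightedScore(criteria, weights) {
  let total = 0;
  let weightSum = 0;
  for (const c of criteria) {
    const w = weights[c.name];
    if (typeof w !== 'number' || typeof c.score !== 'number') continue;
    total += w * c.score;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

// Made-up scores for illustration, not real eval output.
const sample = [
  { name: 'greeting', passed: true, score: 1.0 },
  { name: 'information_gathering', passed: true, score: 0.9 },
  { name: 'empathy', passed: false, score: 0.5 },
  { name: 'next_steps', passed: true, score: 1.0 },
  { name: 'professionalism', passed: true, score: 0.9 },
];

console.log(weightedScore(sample, WEIGHTS).toFixed(3));
```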

### Step 8: Review Agent Performance

View detailed results:

```sh
promptfoo view
```

In the web UI, you'll see:

- Full conversation transcript
- Evaluation criteria scores
- Pass/fail for each criterion
- Conversation duration and cost
- Audio playback for each turn

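## Part 5: Tool Mocking

Test an agent that calls tools by mocking the tool responses, so the eval never touches a real backend.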

### Step 9: Add Tool Mocking

Create `agent-with-tools.yaml`:

```yaml
description: "Test agent with order lookup tool"

prompts:
  - |
    User: What's the status of my order?
    User: Order number ORDER-12345

providers:
  - id: elevenlabs:agents
    config:
      agentConfig:
        name: Support Agent with Tools
        prompt: You are a support agent. Use the order_lookup tool to check order status.
        voiceId: 21m00Tcm4TlvDq8ikWAM
        llmModel: gpt-5

        # Define available tools
        tools:
          - type: function
            function:
              name: order_lookup
              description: Look up order status by order number
              parameters:
                type: object
                properties:
                  order_number:
                    type: string
                    description: "The order number (format: ORDER-XXXXX)"
                required:
                  - order_number

      # Mock tool responses for testing
      toolMockConfig:
        order_lookup:
          response:
            order_number: "ORDER-12345"
            status: "Shipped"
            tracking_number: "1Z999AA10123456784"
            expected_delivery: "2024-03-20"

      evaluationCriteria:
        - name: uses_tool
          description: Agent uses the order_lookup tool
          weight: 1.0
          passingThreshold: 0.9

        - name: provides_tracking
          description: Agent provides tracking information
          weight: 1.0
          passingThreshold: 0.9

tests:
  - description: Agent successfully looks up order
    assert:
      - type: javascript
        value: |
          const result = JSON.parse(output);
          // Verify tool was called
          const toolCalls = result.transcript.filter(t =>
            t.role === 'tool_call'
          );
          return toolCalls.length > 0;

      - type: contains
        value: "1Z999AA10123456784"  # Tracking number from mock
```

Run with tool mocking:

```sh
promptfoo eval -c agent-with-tools.yaml
```
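Conceptually, a tool mock replaces the real backend call with a canned response keyed by tool name. The sketch below illustrates that idea in plain JavaScript. `callMockedTool` is a hypothetical dispatcher, not the provider's implementation, and the assumption that `ORDER-XXXXX` means five digits is ours.

```javascript
// Canned responses keyed by tool name, mirroring toolMockConfig above.
const toolMocks = {
  order_lookup: {
    response: {
      order_number: 'ORDER-12345',
      status: 'Shipped',
      tracking_number: '1Z999AA10123456784',
      expected_delivery: '2024-03-20',
    },
  },
};

// Hypothetical dispatcher: checks the argument format, then returns the mock.
function callMockedTool(name, args, mocks) {
  if (!Object.prototype.hasOwnProperty.call(mocks, name)) {
    throw new Error(`No mock configured for tool "${name}"`);
  }
  // Assumes ORDER-XXXXX means "ORDER-" followed by five digits.
  if (name === 'order_lookup' && !/^ORDER-\d{5}$/.test(args.order_number || '')) {
    return { error: 'order_number must match ORDER-XXXXX' };
  }
  return mocks[name].response;
}

console.log(callMockedTool('order_lookup', { order_number: 'ORDER-12345' }, toolMocks).status);
```

Because the mock always returns the same tracking number, the `contains` assertion in the config can check for that exact string.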

## Next Steps

You've learned to:

- ✅ Compare TTS models on latency and cost
- ✅ Customize voice settings for different scenarios
- ✅ Test STT accuracy with WER calculation
- ✅ Evaluate conversational agents with criteria
- ✅ Mock tools for agent testing

### Explore More

- **Audio processing**: Use audio isolation to remove background noise
- **Regression testing**: Track agent performance over time
- **Production monitoring**: Set up continuous testing
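
For regression testing, one simple approach is to compare per-criterion scores from the current run against a saved baseline and flag any drop beyond a tolerance. The sketch below assumes scores are available as plain `{ name: score }` objects. `findRegressions`, the 0.05 default tolerance, and the sample scores are all hypothetical.

```javascript
// Flags criteria whose score dropped by more than `tolerance` versus a baseline,
// plus any baseline criterion that is missing from the current run.
function findRegressions(baseline, current, tolerance = 0.05) {
  const regressions = [];
  for (const [name, before] of Object.entries(baseline)) {
    const after = current[name];
    if (typeof after !== 'number') {
      regressions.push({ name, before, after: null, reason: 'missing in current run' });
    } else if (before - after > tolerance) {
      regressions.push({ name, before, after, reason: 'score dropped' });
    }
  }
  return regressions;
}

// Illustrative scores only; in practice these would come from saved eval results.
const baseline = { greeting: 0.9, empathy: 0.8, next_steps: 0.95 };
const current = { greeting: 0.92, empathy: 0.6, next_steps: 0.93 };
console.log(findRegressions(baseline, current));
```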

### Example Projects

Check out complete examples:

- [examples/provider-elevenlabs/tts-advanced](https://github.com/promptfoo/promptfoo/tree/main/examples/provider-elevenlabs/tts-advanced)
- [examples/provider-elevenlabs/agents](https://github.com/promptfoo/promptfoo/tree/main/examples/provider-elevenlabs/agents)

### Resources

- [ElevenLabs Provider Reference](/docs/providers/elevenlabs)
- [Promptfoo Documentation](https://www.promptfoo.dev/docs/intro)
- [ElevenLabs API Docs](https://elevenlabs.io/docs)

## Troubleshooting

### Common Issues

**Agent conversations time out:**

- Increase `timeout`, or lower `maxTurns` so conversations finish sooner
- Simplify evaluation criteria
- Use faster LLM models

**High costs during testing:**

- Use `gpt-5-mini` instead of `gpt-5`
- Enable caching for repeated tests
- Implement LLM cascading
- Test with shorter prompts first

**Evaluation criteria always failing:**

- Start with simple, objective criteria
- Lower passing thresholds during development
- Review agent transcript to understand behavior
- Add more specific criteria descriptions

**Audio quality issues:**

- Try different `outputFormat` settings
- Adjust voice settings (stability, similarity_boost)
- Test with different models
- Consider using Turbo over Flash for quality

### Getting Help

- [GitHub Issues](https://github.com/promptfoo/promptfoo/issues)
- [Discord Community](https://discord.gg/promptfoo)
- [ElevenLabs Support](https://elevenlabs.io/support)