# provider-elevenlabs/stt (ElevenLabs Speech-to-Text)

This example demonstrates how to use the ElevenLabs STT provider for audio transcription testing.

## Quick Start

```bash
npx promptfoo@latest init --example provider-elevenlabs/stt
cd provider-elevenlabs/stt
export ELEVENLABS_API_KEY=your_api_key_here
npx promptfoo@latest eval
```

## Features

- **Audio Transcription**: Convert speech to text with high accuracy
- **Speaker Diarization**: Identify and separate multiple speakers in audio
- **Word Error Rate (WER)**: Measure transcription accuracy against reference text
- **Multi-format Support**: MP3, WAV, FLAC, M4A, OGG, OPUS, WebM

## Setup

1. **Set your API key**:

   ```bash
   export ELEVENLABS_API_KEY=your_api_key_here
   ```

2. **Prepare audio files**:

   Create an `audio/` directory with your test audio files:

   ```bash
   mkdir -p audio
   # Place your audio files in the audio/ directory
   ```

3. **Run the evaluation**:

   ```bash
   promptfoo eval
   ```

## Configuration

### Basic Transcription

```yaml
providers:
  - id: elevenlabs:stt:basic
    config:
      modelId: eleven_speech_to_text_v1
      language: en # ISO 639-1 language code
```

### Speaker Diarization

Identify and label different speakers in your audio:

```yaml
providers:
  - id: elevenlabs:stt:diarization
    config:
      modelId: eleven_speech_to_text_v1
      diarization: true
      maxSpeakers: 3 # Optional: hint for expected number of speakers
```

The response will include speaker segments:

```json
{
  "text": "Full transcription...",
  "diarization": [
    {
      "speaker_id": "speaker_0",
      "text": "Hello, how are you?",
      "start_time_ms": 0,
      "end_time_ms": 2500,
      "confidence": 0.95
    },
    {
      "speaker_id": "speaker_1",
      "text": "I'm doing well, thanks!",
      "start_time_ms": 2500,
      "end_time_ms": 5000,
      "confidence": 0.92
    }
  ]
}
```

### Accuracy Testing with WER

Word Error Rate (WER) measures transcription accuracy. Lower is better (0 = perfect).
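
WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below illustrates that calculation for a single reference/hypothesis pair. It is not the provider's implementation: `normalize` and `wordErrorRate` are hypothetical helper names, and the normalization shown (lowercasing and stripping punctuation) is an assumption about what a typical WER pipeline does.

```javascript
// Illustrative sketch only; not the provider's actual implementation.
// normalize() and wordErrorRate() are hypothetical helper names.

// Lowercase, strip punctuation (keeping apostrophes), and split into words.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, '')
    .split(/\s+/)
    .filter(Boolean);
}

// Word-level Levenshtein distance divided by the reference length.
function wordErrorRate(reference, hypothesis) {
  const ref = normalize(reference);
  const hyp = normalize(hypothesis);
  if (ref.length === 0) {
    // WER is undefined for an empty reference; this sketch returns 0 or 1.
    return hyp.length === 0 ? 0 : 1;
  }

  // prev[j] = edit distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1; // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

const wer = wordErrorRate(
  'The quick brown fox jumps over the lazy dog',
  'the quick green fox jumps over the lazy dog.'
);
console.log(wer.toFixed(3)); // 0.111
```

For this pair, one substitution (`brown` to `green`) against nine reference words gives a WER of about 0.111; the case and trailing-period differences are removed by normalization and do not count as errors.
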

```yaml
providers:
  - id: elevenlabs:stt:accuracy
    config:
      modelId: eleven_speech_to_text_v1
      calculateWER: true
      referenceText: The quick brown fox jumps over the lazy dog
```

**WER Formula**: `(Substitutions + Deletions + Insertions) / Total Words in Reference`

The response includes detailed WER metrics (here, a `wer` of 0.05 is a 5% error rate):

```json
{
  "wer": 0.05,
  "substitutions": 1,
  "deletions": 0,
  "insertions": 0,
  "correct": 19,
  "totalWords": 20,
  "details": {
    "reference": "the quick brown fox jumps",
    "hypothesis": "the quick green fox jumps",
    "alignment": "REF: the quick brown fox jumps\nHYP: the quick green fox jumps\nOPS: SSSSS"
  }
}
```

**WER Interpretation**:

- **0.00 - 0.05**: Excellent (95%+ accurate)
- **0.05 - 0.10**: Good (90-95% accurate)
- **0.10 - 0.20**: Fair (80-90% accurate)
- **0.20+**: Poor (< 80% accurate)

## Supported Audio Formats

| Format    | Extension  | Notes                      |
| --------- | ---------- | -------------------------- |
| MP3       | .mp3       | Widely compatible          |
| MP4 Audio | .mp4, .m4a | AAC/MPEG-4 audio           |
| WAV       | .wav       | Uncompressed, high quality |
| FLAC      | .flac      | Lossless compression       |
| OGG       | .ogg       | Open format                |
| Opus      | .opus      | Modern, efficient codec    |
| WebM      | .webm      | Web-optimized              |

## Audio Input Methods

### Method 1: Config-level

```yaml
providers:
  - id: elevenlabs:stt
    config:
      audioFile: path/to/audio.mp3
```

### Method 2: Prompt-level

```yaml
prompts:
  - audio/sample1.mp3
  - audio/sample2.wav
```

### Method 3: Vars-level

```yaml
tests:
  - vars:
      audioFile: audio/sample.mp3
```

## Testing Assertions

### Cost Threshold

```yaml
tests:
  - assert:
      - type: cost
        threshold: 0.05 # Max $0.05 per transcription
```

### Latency Threshold

```yaml
tests:
  - assert:
      - type: latency
        threshold: 10000 # Max 10 seconds
```

### Transcription Quality

```yaml
tests:
  - assert:
      - type: contains
        value: expected phrase
      - type: not-contains
        value: incorrect phrase
```

### WER Threshold

```yaml
tests:
  - assert:
      - type: javascript
        value: |
          // Use ?? rather than || so a perfect WER of 0 is not replaced by the fallback
          const wer = context.vars.metadata?.wer?.wer ?? 1;
          // Multi-line scripts need an explicit return
          return wer < 0.1; // Less than 10% error
```

### Speaker Count

```yaml
tests:
  - assert:
      - type: javascript
        value: |
          const diarization = context.vars.metadata?.transcription?.diarization || [];
          const uniqueSpeakers = new Set(diarization.map(s => s.speaker_id));
          return uniqueSpeakers.size === 2; // Expect 2 speakers
```

## Language Support

ElevenLabs STT supports 30+ languages. Specify the language using ISO 639-1 codes:

```yaml
config:
  language: en # English
  # language: es # Spanish
  # language: fr # French
  # language: de # German
  # language: it # Italian
  # language: pt # Portuguese
  # language: ja # Japanese
  # language: ko # Korean
  # language: zh # Chinese
```

**Auto-detection**: Omit `language` to let the API detect the language automatically.

## Cost Information

STT pricing is based on audio duration:

- **Free tier**: 1 hour/month
- **Paid tiers**: ~$0.10 per minute (~$0.00167 per second)

The provider automatically tracks and reports costs in the evaluation results.

## Advanced Usage

### Batch Transcription

```yaml
prompts:
  - audio/batch1.mp3
  - audio/batch2.mp3
  - audio/batch3.mp3

providers:
  - id: elevenlabs:stt
    config:
      modelId: eleven_speech_to_text_v1

# Test all files with consistent assertions
tests:
  - assert:
      - type: cost
        threshold: 0.10
      - type: latency
        threshold: 15000
```

### Multi-language Testing

```yaml
providers:
  - id: elevenlabs:stt:english
    config:
      language: en
  - id: elevenlabs:stt:spanish
    config:
      language: es
  - id: elevenlabs:stt:autodetect
    config: {} # No language specified = auto-detect

prompts:
  - audio/english_sample.mp3
  - audio/spanish_sample.mp3
```

### Accuracy Comparison

Compare transcription accuracy across different audio qualities:

```yaml
prompts:
  - audio/high_quality_48khz.wav
  - audio/medium_quality_16khz.mp3
  - audio/low_quality_8khz.mp3

providers:
  - id: elevenlabs:stt
    config:
      calculateWER: true
      referenceText: This is the expected transcription text

tests:
  - description: High quality should have WER < 5%
    vars:
      audioFile: audio/high_quality_48khz.wav
    assert:
      - type: javascript
        value: (context.vars.metadata?.wer?.wer ?? 1) < 0.05
  - description: Medium quality should have WER < 10%
    vars:
      audioFile: audio/medium_quality_16khz.mp3
    assert:
      - type: javascript
        value: (context.vars.metadata?.wer?.wer ?? 1) < 0.10
```

## Troubleshooting

### API Key Issues

```bash
# Verify your API key is set
echo $ELEVENLABS_API_KEY

# Or set it inline
ELEVENLABS_API_KEY=your_key promptfoo eval
```

### Audio File Not Found

```text
Error: Failed to read audio file: ENOENT: no such file or directory
```

**Solution**: Use absolute paths or paths relative to the config file:

```yaml
prompts:
  - /absolute/path/to/audio.mp3
  - ./relative/path/to/audio.mp3
```

### Unsupported Format

```text
Error: Unsupported audio format
```

**Solution**: Convert your audio to a supported format (MP3, WAV, etc.) using a tool like `ffmpeg`:

```bash
ffmpeg -i input.video -vn -acodec mp3 output.mp3
```

### High WER on Clear Audio

If you're getting unexpectedly high WER:

1. **Check reference text** - ensure it matches the spoken words exactly
2. **Specify language** - auto-detection may choose the wrong language
3. **Audio quality** - ensure audio is clear with minimal background noise
4. **Normalization** - the WER calculation normalizes text (lowercases it and removes punctuation), so differences in case or punctuation alone should not raise the score

## API Reference

### Config Options

| Option          | Type    | Default                        | Description                    |
| --------------- | ------- | ------------------------------ | ------------------------------ |
| `modelId`       | string  | `eleven_speech_to_text_v1`     | STT model to use               |
| `language`      | string  | auto-detect                    | ISO 639-1 language code        |
| `diarization`   | boolean | `false`                        | Enable speaker identification  |
| `maxSpeakers`   | number  | -                              | Expected number of speakers    |
| `audioFile`     | string  | -                              | Path to audio file             |
| `audioFormat`   | string  | auto-detect                    | Audio format override          |
| `referenceText` | string  | -                              | Expected transcription for WER |
| `calculateWER`  | boolean | `false`                        | Calculate Word Error Rate      |
| `baseUrl`       | string  | `https://api.elevenlabs.io/v1` | API endpoint                   |
| `timeout`       | number  | `120000`                       | Request timeout (ms)           |
| `retries`       | number  | `3`                            | Number of retry attempts       |

## Related Examples

- [ElevenLabs TTS](../tts/) - Text-to-Speech synthesis
- [ElevenLabs Isolation](../isolation/) - Audio cleanup quality comparison

## Resources

- [ElevenLabs STT Documentation](https://elevenlabs.io/docs/speech-to-text)
- [Supported Languages](https://elevenlabs.io/languages)
- [Word Error Rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate)