# compare-deepseek-r1-vs-openai-o1 (DeepSeek-R1 vs OpenAI o1 Comparison)

You can run this example with:

```bash
npx promptfoo@latest init --example compare-deepseek-r1-vs-openai-o1
cd compare-deepseek-r1-vs-openai-o1
```

This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- DeepSeek API key set as `DEEPSEEK_API_KEY`
- Hugging Face account and access token (for the MMLU dataset)

## Hugging Face Authentication

To access the MMLU dataset, you'll need to authenticate with Hugging Face:

1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one
2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Set your token as an environment variable:

   ```bash
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```env
   HF_TOKEN=your_token_here
   ```

## Running the Eval

1. Get a local copy of the promptfoo config. You can clone this repository and, from the root directory, run:

   ```bash
   cd examples/compare-deepseek-r1-vs-openai-o1
   ```

   or fetch the example directly with:

   ```bash
   promptfoo init --example compare-deepseek-r1-vs-openai-o1
   ```

2. Run the evaluation:

   ```bash
   promptfoo eval
   ```

3. View the results in a web interface:

   ```bash
   promptfoo view
   ```

## What's Being Tested

This comparison evaluates both models on reasoning tasks from the MMLU benchmark, specifically:

1. **Abstract Algebra**: Advanced mathematical reasoning
2. **Formal Logic**: Logical statement analysis
3. **High School Mathematics**: Core problem-solving
4. **College Mathematics**: Advanced mathematical concepts
5. **Logical Fallacies**: Flaw identification in reasoning

Each subject uses 10 questions to keep the test manageable.
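The subject list above translates into MMLU dataset subsets in the config. As a rough sketch of how the two providers and the test cases might be wired together (the provider IDs and subset names here are illustrative; your generated `promptfooconfig.yaml` is the source of truth):

```yaml
# Illustrative sketch only; check the generated promptfooconfig.yaml for exact values.
providers:
  - openai:o1
  - deepseek:deepseek-reasoner # DeepSeek-R1

tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=abstract_algebra&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=formal_logic&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=high_school_mathematics&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=college_mathematics&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=logical_fallacies&limit=10
```

The `limit=10` parameter is what keeps each subject to 10 questions; raising it trades runtime and API cost for a larger sample.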
You can edit this in `promptfooconfig.yaml`.

## Test Structure

The configuration in `promptfooconfig.yaml` defines:

1. **Prompt Template**: Encourages step-by-step reasoning for multiple-choice questions
2. **Quality Checks**:
   - 60-second timeout per question
   - Required step-by-step reasoning
   - Clear final answer format
3. **Evaluation Metrics**:
   - Accuracy
   - Reasoning quality
   - Response time
   - Format adherence

## Customizing

You can modify the test by editing `promptfooconfig.yaml`:

1. Add more MMLU subjects:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics
   ```

2. Try different prompting strategies:

   ```yaml
   prompts:
     # Zero-shot with step-by-step reasoning (default)
     - |
       You are an expert test taker. Please solve the following multiple choice question step by step.

       Question: {{question}}

       Options:
       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

     # Zero-shot with direct answer
     - |
       Question: {{question}}

       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Answer with just the letter (A/B/C/D) of the correct option.
   ```

3. Change the number of questions:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
   ```

4. Adjust quality requirements:

   ```yaml
   defaultTest:
     assert:
       - type: latency
         threshold: 30000 # Stricter 30-second timeout
   ```

## Additional Resources

- [DeepSeek provider documentation](https://promptfoo.dev/docs/providers/deepseek)
- [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu)