---
sidebar_label: 'Comparing open-source models'
description: 'Compare DeepSeek, Mistral, Qwen, and Llama performance on your custom datasets using automated benchmarks to select the best open-source model for your use case'
---

# Comparing Open-Source Models: Benchmark on Your Own Data

When it comes to building LLM apps, there is no one-size-fits-all benchmark. To maximize the quality of your LLM application, consider building your own benchmark to supplement public benchmarks.

This guide describes how to compare current open-source models like DeepSeek, Mistral, Qwen, and Llama using the `promptfoo` CLI. You can mix and match any combination of these models — just include the providers you want to test.

The end result is a view that compares the performance of your chosen models side-by-side:

![mistral, mixtral, and llama comparison](/img/docs/mistral-vs-mixtral-vs-llama.jpg)

## Requirements

This guide assumes that you have promptfoo [installed](/docs/installation). It uses OpenRouter for convenience, but you can follow these instructions for any provider.

## Set up the config

Initialize a new directory that will contain our prompts and test cases:

```sh
npx promptfoo@latest init --example compare-open-source-models
```

Now let's start editing `promptfooconfig.yaml`. Create a list of models we'd like to compare:

```yaml title="promptfooconfig.yaml"
providers:
  - openrouter:deepseek/deepseek-v3.2
  - openrouter:mistralai/mistral-small-3.2-24b-instruct
  - openrouter:meta-llama/llama-4-maverick
  - openrouter:qwen/qwen3-32b
```

We're using OpenRouter for convenience because it wraps everything in an OpenAI-compatible chat format, but you can use any [provider](/docs/providers) that supplies these models, including HuggingFace, Replicate, Groq, and more.

:::tip
If you prefer to run against locally hosted versions of these models, this can be done via [Ollama](/docs/providers/ollama), [LocalAI](/docs/providers/localai), or [Llama.cpp](/docs/providers/llama.cpp). See [Running Locally with Ollama](#running-locally-with-ollama) below.
:::

## Set up the prompts

Setting up prompts is straightforward. Just include one or more prompts with any `{{variables}}` you like:

```yaml
prompts:
  - 'Respond to this user input: {{message}}'
```

You should modify this prompt to match the use case you want to test. For example:

```yaml
prompts:
  - 'Summarize this article: {{article}}'
  - 'Generate a technical explanation for {{concept}}'
```
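If your prompts grow longer, you can also load them from files instead of writing them inline. A minimal sketch, assuming hypothetical files under a `prompts/` directory:

```yaml
prompts:
  - file://prompts/summarize_article.txt
  - file://prompts/explain_concept.txt
```

Each file contains plain text with the same `{{variables}}` syntax as inline prompts.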
<details>
<summary>Advanced: format prompts differently for each model</summary>

If you're using OpenRouter, you can skip model-specific prompt templates because OpenRouter normalizes requests into an OpenAI-compatible chat format. If you switch to raw model endpoints on another provider, prompt formatting may differ by model family. In that setup:

- keep one prompt template per model family
- assign templates to providers with labels (see the sketch below)
- confirm the expected chat format in the model card or the provider's runtime docs

</details>

:::tip
If you do use external prompt files, they are [Nunjucks templates](https://mozilla.github.io/nunjucks/), so you can use if statements, for loops, and filters for more complex prompts.
:::
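Here is a rough sketch of how labeled prompt templates can be assigned to specific providers. The file names and the Replicate model ID are placeholders, so confirm the exact syntax against the promptfoo configuration reference for your version:

```yaml
prompts:
  - id: file://prompts/chat_prompt.json
    label: chat_prompt
  - id: file://prompts/llama_prompt.txt
    label: llama_prompt

providers:
  # OpenAI-compatible chat endpoint uses the generic chat template
  - id: openrouter:deepseek/deepseek-v3.2
    prompts:
      - chat_prompt
  # Raw model endpoint uses a family-specific template
  - id: replicate:meta/meta-llama-3-8b-instruct
    prompts:
      - llama_prompt
```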
## Configure model parameters

Each model has a `config` field where you can specify additional parameters. Let's add `temperature` for each model:

```yaml title="promptfooconfig.yaml"
providers:
  - id: openrouter:deepseek/deepseek-v3.2
    config:
      temperature: 0.5
  - id: openrouter:mistralai/mistral-small-3.2-24b-instruct
    config:
      temperature: 0.5
  - id: openrouter:meta-llama/llama-4-maverick
    config:
      temperature: 0.5
  - id: openrouter:qwen/qwen3-32b
    config:
      temperature: 0.5
```

These settings will apply to all test cases run against these models.

## Set environment variables

To configure the OpenRouter provider, be sure to set the environment variable:

```sh
OPENROUTER_API_KEY=your_openrouter_api_key
```

## Add test cases

The `tests` field in `promptfooconfig.yaml` is where you add your test cases. Each test case is a dictionary with a `vars` field containing the variables to interpolate into the prompts.

Here are the test cases we will use:

```yaml title="promptfooconfig.yaml"
# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
  - vars:
      message: are hippos good swimmers?
  - vars:
      message: Did Henry VIII have any grandchildren?
  - vars:
      message: Would a cannibal cult be harmful to society?
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?
```

### Check for correctness

Let's set up a few assertions to automatically assess the output for correctness:

```yaml title="promptfooconfig.yaml"
# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
    assert:
      - type: llm-rubric
        value: does not claim to know the current weather in new york
  - vars:
      message: are hippos good swimmers?
    assert:
      - type: llm-rubric
        value: says that hippos are NOT good swimmers
  - vars:
      message: Did Henry VIII have any grandchildren?
    assert:
      - type: llm-rubric
        value: says that Henry VIII doesn't have grandchildren
  - vars:
      message: Would a cannibal cult be harmful to society?
    assert:
      - type: llm-rubric
        value: unequivocally says that a cannibal cult is harmful to society
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?
```

:::info
Learn more about setting up test assertions [here](/docs/configuration/expected-outputs).
:::

## Run the comparison

Once your config file is set up, you can run the comparison using the `promptfoo eval` command:

```sh
npx promptfoo@latest eval
```

This will run each of the test cases against each of the models and output the results.

Then, to open the web viewer, run `npx promptfoo@latest view`.

![mistral, mixtral, and llama comparison](/img/docs/mistral-vs-mixtral-vs-llama.jpg)

You can also output JSON, YAML, or CSV by specifying an output file:

```sh
npx promptfoo@latest eval -o output.csv
```

## Analyzing the results

After running the evaluation, look for patterns in the results:

- Which model is most accurate and relevant in its responses?
- Are there noticeable differences in how the models handle certain types of questions?
- What do these results imply for your specific application or use case?

Common differences worth tracking:

- which model refuses unsupported real-time or unverifiable claims
- which model is most concise versus most verbose
- how often each model follows formatting instructions without drift
- whether code and reasoning tasks trade off against conversational quality
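To quantify some of these differences across models, you can attach shared assertions to every test case. A minimal sketch using `defaultTest`; the thresholds below are illustrative, and latency and cost checks depend on what your provider and caching settings report, so treat this as a starting point rather than a definitive configuration:

```yaml title="promptfooconfig.yaml"
# ...
defaultTest:
  assert:
    # Fail if a response takes longer than 10 seconds
    - type: latency
      threshold: 10000
    # Fail if a single response costs more than half a cent
    - type: cost
      threshold: 0.005
    # Grade every answer for directness and instruction-following
    - type: llm-rubric
      value: answers the question directly and follows any formatting instructions
```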
## Running Locally with Ollama

If you prefer to run models locally, you can use [Ollama](/docs/providers/ollama) instead of OpenRouter. Just swap the providers:

```yaml title="promptfooconfig.yaml"
providers:
  - id: ollama:chat:mistral
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:llama4:scout
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:gemma2
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:phi4
    config:
      temperature: 0.01
      num_predict: 128
```

Make sure you've pulled the models first:

```sh
ollama pull mistral
ollama pull llama4:scout
ollama pull gemma2
ollama pull phi4
```

Everything else in the configuration stays the same.

## Conclusion

Ultimately, if you are considering these LLMs for a specific use case, you should evaluate them on that use case directly. Replace the test cases above with representative examples from your own workload to build a far more specific and useful benchmark.

View the [getting started](/docs/getting-started) guide to run your own LLM benchmarks.