---
title: Getting started
description: Learn how to set up your first promptfoo config file, create prompts, configure providers, and run your first LLM evaluation.
keywords: [getting started, setup, configuration, prompts, providers, evaluation, llm testing]
sidebar_position: 5
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Getting started

This guide will walk you through creating a working eval that tests prompts across multiple models and opens a web view for comparing outputs.

After [installing](/docs/installation) promptfoo, you can set up your first config file in a few ways:

## Running an example

Set up your first config file with a pre-built example by running this command with [npx](https://nodejs.org/en/download), [npm](https://nodejs.org/en/download), or [brew](https://brew.sh/):

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init --example getting-started
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
npm install -g promptfoo
promptfoo init --example getting-started
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
brew install promptfoo
promptfoo init --example getting-started
```

</TabItem>
</Tabs>

This will create a new directory with a [basic example](https://github.com/promptfoo/promptfoo/tree/main/examples/getting-started) that tests translation prompts across different models. The example includes:

- A configuration file, `promptfooconfig.yaml`, with sample prompts, providers, and test cases.
- A `README.md` file explaining how the example works.

Most providers need authentication. For OpenAI:

```sh
export OPENAI_API_KEY=sk-abc123
```

Then navigate to the example directory, run the eval, and view results:

<Tabs>
<TabItem value="npx" label="npx">

```bash
cd getting-started
npx promptfoo@latest eval
npx promptfoo@latest view
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
cd getting-started
promptfoo eval
promptfoo view
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
cd getting-started
promptfoo eval
promptfoo view
```

</TabItem>
</Tabs>

## Set up via the CLI

To start from scratch, run `promptfoo init` to create a config through an interactive CLI walkthrough:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo init
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo init
```

</TabItem>
</Tabs>

## Set up via the Web UI

If you prefer a visual interface, run `promptfoo eval setup` to configure your first eval through the web UI:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest eval setup
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo eval setup
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo eval setup
```

</TabItem>
</Tabs>

This opens a browser-based setup flow that walks you through creating prompts, choosing providers, and adding test cases.
*Screenshot: Promptfoo eval setup Web UI*
## Configuration

Now that you've created an initial configuration, you can update `promptfooconfig.yaml` with your own prompts, providers, and test cases:

1. **Set up your prompts**: Open `promptfooconfig.yaml` and add prompts that you want to test. Use double curly braces for variable placeholders: `{{variable_name}}`. For example:

   ```yaml
   prompts:
     - 'Convert the following English text to {{language}}: {{input}}'
   ```

   [» More information on setting up prompts](/docs/configuration/prompts)

2. **Add providers**: Add `providers` to specify AI models you want to test. Promptfoo supports 60+ providers including OpenAI, Anthropic, Google, and many others:

   ```yaml
   providers:
     - openai:chat:gpt-5.4
     - openai:chat:gpt-5.4-mini
     - anthropic:messages:claude-opus-4-6
     - google:gemini-3.1-pro-preview
     # Or use your own custom provider
     - file://path/to/custom/provider.py
   ```

   This includes cloud APIs, local models like [Ollama](/docs/providers/ollama), and custom [Python](/docs/providers/python) or [JavaScript](/docs/providers/custom-api) code.

   [» See all providers](/docs/providers)

3. **Add test inputs**: Add some example inputs for your prompts. Optionally, add [assertions](/docs/configuration/expected-outputs) to set output requirements that are checked automatically. For example:

   ```yaml
   tests:
     - vars:
         language: French
         input: Hello world
       assert:
         - type: contains
           value: 'Bonjour le monde'
     - vars:
         language: Spanish
         input: Where is the library?
       assert:
         - type: icontains
           value: 'Dónde está la biblioteca'
   ```

   When writing test cases, think of core use cases and potential failures that you want to make sure your prompts handle correctly.

   [» More information on setting up tests](/docs/configuration/guide)

4. **Run the evaluation**: Make sure you're in the directory containing `promptfooconfig.yaml`, then run:

   <Tabs>
   <TabItem value="npx" label="npx">

   ```bash
   npx promptfoo@latest eval
   ```

   </TabItem>
   <TabItem value="npm" label="npm">

   ```bash
   promptfoo eval
   ```

   </TabItem>
   <TabItem value="brew" label="brew">

   ```bash
   promptfoo eval
   ```

   </TabItem>
   </Tabs>

   This tests every prompt, model, and test case.

5. **Review outputs**: After the evaluation is complete, open the web viewer to review the outputs:

   <Tabs>
   <TabItem value="npx" label="npx">

   ```bash
   npx promptfoo@latest view
   ```

   </TabItem>
   <TabItem value="npm" label="npm">

   ```bash
   promptfoo view
   ```

   </TabItem>
   <TabItem value="brew" label="brew">

   ```bash
   promptfoo view
   ```

   </TabItem>
   </Tabs>

   ![Promptfoo Web UI showing evaluation results](/img/docs/custom-example-view.png)

### Asserts

The YAML configuration format runs each prompt through a series of test cases and checks if they meet the specified [asserts](/docs/configuration/expected-outputs/).

Asserts are _optional_. Many people get value out of reviewing outputs manually, and the web UI helps facilitate this.

:::tip
See the [Configuration docs](/docs/configuration/guide) for a detailed guide.
:::
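To see how these pieces fit together, here is a minimal complete `promptfooconfig.yaml` assembled from the snippets in the steps above. The description line is our own wording, and the second test case is deliberately left without asserts to show that manual review alone is a valid workflow:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Minimal translation eval assembled from the snippets above

prompts:
  - 'Convert the following English text to {{language}}: {{input}}'

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini

tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains
        value: 'Bonjour le monde'
  # No asserts on this case: review its outputs manually in the web UI
  - vars:
      language: Spanish
      input: Where is the library?
```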
## Examples

The examples below cover a few common eval patterns: prompt quality, model quality, RAG quality, and agent quality.

### Prompt quality

In [this example](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-self-grading), we evaluate whether adding adjectives to the personality of an assistant bot affects the responses. You can quickly set up this example by running:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init --example eval-self-grading
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo init --example eval-self-grading
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo init --example eval-self-grading
```

</TabItem>
</Tabs>

<details>
<summary>Show YAML file for this example</summary>

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Automatic response evaluation using LLM rubric scoring

# Load prompts
prompts:
  - file://prompts.txt

providers:
  - openai:chat:gpt-5.4

defaultTest:
  assert:
    - type: llm-rubric
      value: Do not mention that you are an AI or chat assistant
    - type: javascript # Shorter is better
      value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));

tests:
  - vars:
      name: Bob
      question: Can you help me find a specific product on your website?
  - vars:
      name: Jane
      question: Do you have any promotions or discounts currently available?
  - vars:
      name: Ben
      question: Can you check the availability of a product at a specific store location?
  - vars:
      name: Dave
      question: What are your shipping and return policies?
  - vars:
      name: Jim
      question: Can you provide more information about the product specifications or features?
  - vars:
      name: Alice
      question: Can you recommend products that are similar to what I've been looking at?
  - vars:
      name: Sophie
      question: Do you have any recommendations for products that are currently popular or trending?
  - vars:
      name: Jessie
      question: How can I track my order after it has been shipped?
  - vars:
      name: Kim
      question: What payment methods do you accept?
  - vars:
      name: Emily
      question: Can you help me with a problem I'm having with my account or order?
```

</details>
From the newly created directory, run `npx promptfoo@latest eval` to execute this example:

![promptfoo command line](/img/docs/self-grading.gif)

This command evaluates the prompts, substituting variable values, and outputs the results in your terminal. The `javascript` assertion above scores each response by length: outputs of 100 characters or fewer score 1.0, and the score falls linearly to 0 at 1,000 characters.

You can also output a [spreadsheet](https://docs.google.com/spreadsheets/d/1nanoj3_TniWrDl1Sj-qYqIMD6jwm5FBy15xPFdUTsmI/edit?usp=sharing), [JSON](https://github.com/promptfoo/promptfoo/blob/main/examples/simple-cli/output.json), YAML, or HTML.
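Besides passing an output path on the command line, the destination can live in the config itself. A minimal sketch, assuming the `outputPath` option from the configuration reference; the filename is illustrative, and the format follows from the extension:

```yaml
# `outputPath` is assumed from the configuration reference;
# results.csv is an illustrative filename, and the output format
# (.csv, .json, .yaml, .html) follows from the extension.
outputPath: results.csv
```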
### Model quality

In [this next example](https://github.com/promptfoo/promptfoo/tree/main/examples/compare-openai-models), we evaluate the difference between GPT-5.4 and GPT-5.4 Mini outputs for a given prompt. You can quickly set up this example by running:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init --example compare-openai-models
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo init --example compare-openai-models
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo init --example compare-openai-models
```

</TabItem>
</Tabs>

<details>
<summary>Show YAML file for this example</summary>

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Comparing OpenAI flagship and mini models performance on riddles

prompts:
  - 'Solve this riddle: {{riddle}}'

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini

defaultTest:
  assert:
    # Inference should always cost less than this (USD)
    - type: cost
      threshold: 0.002
    # Inference should always be faster than this (milliseconds)
    - type: latency
      threshold: 3000

tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
    assert:
      # Make sure the LLM output contains this word
      - type: contains
        value: echo
      # Use model-graded assertions to enforce free-form instructions
      - type: llm-rubric
        value: Do not apologize
  - vars:
      riddle: "You see a boat filled with people. It has not sunk, but when you look again you don't see a single person on the boat. Why?"
    assert:
      - type: llm-rubric
        value: explains that there are no single people (they are all married)
  - vars:
      riddle: 'The more of this there is, the less you see. What is it?'
    assert:
      - type: contains
        value: darkness
  - vars:
      riddle: >-
        I have keys but no locks. I have space but no room. You can enter, but
        can't go outside. What am I?
  - vars:
      riddle: >-
        I am not alive, but I grow; I don't have lungs, but I need air; I
        don't have a mouth, but water kills me. What am I?
  - vars:
      riddle: What can travel around the world while staying in a corner?
  - vars:
      riddle: Forward I am heavy, but backward I am not. What am I?
  - vars:
      riddle: >-
        The person who makes it, sells it. The person who buys it, never uses
        it. The person who uses it, doesn't know they're using it. What is it?
  - vars:
      riddle: I can be cracked, made, told, and played. What am I?
  - vars:
      riddle: What has keys but can't open locks?
  - vars:
      riddle: >-
        I'm light as a feather, yet the strongest person can't hold me for
        much more than a minute. What am I?
  - vars:
      riddle: >-
        I can fly without wings, I can cry without eyes. Whenever I go,
        darkness follows me. What am I?
  - vars:
      riddle: >-
        I am taken from a mine, and shut up in a wooden case, from which I am
        never released, and yet I am used by almost every person. What am I?
  - vars:
      riddle: >-
        David's father has three sons: Snap, Crackle, and _____? What is the
        name of the third son?
  - vars:
      riddle: >-
        I am light as a feather, but even the world's strongest man couldn't
        hold me for much longer than a minute. What am I?
```

</details>
Navigate to the newly created directory and run `npx promptfoo@latest eval` or `promptfoo eval`.

Also note that you can override parameters directly from the command line. For example, if you run this command:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest eval -r google:gemini-3.1-pro-preview google:gemini-2.5-pro
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo eval -r google:gemini-3.1-pro-preview google:gemini-2.5-pro
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo eval -r google:gemini-3.1-pro-preview google:gemini-2.5-pro
```

</TabItem>
</Tabs>

It produces the following table, with the Gemini models replacing the GPT models in the config:

![Side-by-side eval of LLM model quality, gemini-3.1-pro-preview vs gemini-2.5-pro](/img/cl-provider-override.png)

A similar approach can be used to run other model comparisons. For example, you can:

- Compare GPT-5.4 reasoning effort settings (see [GPT reasoning effort comparison](https://github.com/promptfoo/promptfoo/tree/main/examples/compare-gpt-reasoning-effort))
- Compare models with different temperatures (see [GPT temperature comparison](https://github.com/promptfoo/promptfoo/tree/main/examples/compare-gpt-temperature))
- Compare open-source models (see [Comparing Open-Source Models](/docs/guides/compare-open-source-models))
- Compare Retrieval-Augmented Generation (RAG) with LangChain vs. regular GPT (see [LangChain example](/docs/configuration/testing-llm-chains))

### RAG quality

In [this example](https://github.com/promptfoo/promptfoo/tree/main/examples/eval-rag), we evaluate whether RAG outputs are factual, relevant, and grounded in the retrieved context. You can quickly set up this example by running:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init --example eval-rag
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo init --example eval-rag
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo init --example eval-rag
```

</TabItem>
</Tabs>

From the newly created directory, run `npx promptfoo@latest eval` or `promptfoo eval` to grade outputs on factuality, answer relevance, context recall, context relevance, and context faithfulness. For a deeper walkthrough, see the [RAG evaluation guide](/docs/guides/evaluate-rag).
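For a sense of what those checks look like in a config file, here is a minimal sketch using promptfoo's model-graded assertion types. The prompt, question, context, and thresholds are illustrative and not taken from the example repo:

```yaml
# Illustrative RAG checks; values are not from the eval-rag example
prompts:
  - |-
    Answer the question using only the context below.

    Context: {{context}}

    Question: {{query}}

providers:
  - openai:chat:gpt-5.4

tests:
  - vars:
      query: What is our return window?
      context: Customers may return items within 30 days of delivery.
    assert:
      # Compare the answer against a reference answer
      - type: factuality
        value: Items can be returned within 30 days of delivery.
      # Is the answer relevant to the question?
      - type: answer-relevance
        threshold: 0.8
      # Does the retrieved context contain this ground-truth fact?
      - type: context-recall
        value: 30 days
      # Is the retrieved context relevant to the question?
      - type: context-relevance
        threshold: 0.8
      # Is the answer grounded in the retrieved context?
      - type: context-faithfulness
        threshold: 0.8
```

The context-based assertions grade against the `context` variable, so a single test case exercises both the retriever and the generator.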
### Agent quality

In [this example](https://github.com/promptfoo/promptfoo/tree/main/examples/openai-agents-basic), we evaluate an OpenAI Agents SDK workflow that uses tools for dice rolls, inventory checks, scene descriptions, and character stats. You can quickly set up this example by running:

<Tabs>
<TabItem value="npx" label="npx">

```bash
npx promptfoo@latest init --example openai-agents-basic
```

</TabItem>
<TabItem value="npm" label="npm">

```bash
promptfoo init --example openai-agents-basic
```

</TabItem>
<TabItem value="brew" label="brew">

```bash
promptfoo init --example openai-agents-basic
```

</TabItem>
</Tabs>

From the newly created directory, run `npm install`, then `npx promptfoo@latest eval` or `promptfoo eval` to test tool use and response quality across multi-turn scenarios. For task completion and trajectory checks, see the [agent evaluation guide](/docs/guides/evaluate-coding-agents) and [tracing docs](/docs/tracing).

## Next steps

Now that you've run your first eval, here are some ways to go deeper:

**Customize your setup:**

- [Configuration guide](/docs/configuration/guide) - Detailed walkthrough of all config options
- [Providers documentation](/docs/providers) - All 60+ supported AI models and services
- [Assertions & Metrics](/docs/configuration/expected-outputs) - Automatically grade outputs on a pass/fail basis

**Explore use cases:**

- [Agent evaluation](/docs/guides/evaluate-coding-agents) - Test whether agents complete tasks and follow expected trajectories
- [RAG evaluation](/docs/guides/evaluate-rag) - Test retrieval-augmented generation pipelines
- [Red teaming quickstart](/docs/red-team/quickstart) - Scan your LLM app for security vulnerabilities
- [CI/CD integration](/docs/integrations/github-action) - Run evals automatically on every PR

**Learn from examples:**

- [More examples](https://github.com/promptfoo/promptfoo/tree/main/examples) in our GitHub repository