---
sidebar_label: Guides
title: Eval Guides
description: Step-by-step tutorials for evaluating LLMs, comparing models, and testing integrations with promptfoo
---

# Eval Guides

Practical tutorials to help you get the most out of promptfoo evals. Each guide walks through a real-world scenario with working configuration examples you can adapt to your own use case.

## Evaluation Techniques

Learn how to measure specific aspects of LLM output quality. The config sketch after this list shows the common skeleton these guides build on.

- [Evaluating RAG Pipelines](/docs/guides/evaluate-rag) — Test retrieval-augmented generation end-to-end
- [Preventing Hallucinations](/docs/guides/prevent-llm-hallucinations) — Measure and reduce hallucination rates with perplexity, RAG, and controlled decoding
- [Evaluating JSON Outputs](/docs/guides/evaluate-json) — Validate structured LLM outputs against schemas
- [Evaluating Coding Agents](/docs/guides/evaluate-coding-agents) — Test AI coding assistants
- [Evaluating OSWorld with Inspect](/docs/guides/evaluate-osworld-with-inspect) — Run desktop computer-use benchmark tasks through Inspect
- [Testing Agent Skills](/docs/guides/test-agent-skills) — Compare Claude and Codex skill versions with routing, quality, cost, latency, and trace checks
- [Evaluating Factuality](/docs/guides/factuality-eval) — Score responses for factual accuracy
- [LLM as a Judge](/docs/guides/llm-as-a-judge) — Build reliable model-graded evals with rubrics, calibration, and multi-judge checks
- [Testing LLM Chains](/docs/configuration/testing-llm-chains) — Evaluate multi-step LLM pipelines
- [Choosing the Right Temperature](/docs/guides/evaluate-llm-temperature) — Find the optimal temperature setting for your use case
- [Evaluating Text-to-SQL](/docs/guides/text-to-sql-evaluation) — Measure SQL generation accuracy
- [Sandboxed Code Evaluations](/docs/guides/sandboxed-code-evals) — Safely run and test LLM-generated code
- [HLE Benchmark](/docs/guides/hle-benchmark) — Evaluate models on the Humanity's Last Exam benchmark
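Most of these guides start from the same skeleton: a `promptfooconfig.yaml` that pairs prompts and providers with test cases and assertions. Here is a minimal sketch; the model IDs, prompt, and assertion values are illustrative, not recommendations:

```yaml
# promptfooconfig.yaml (minimal sketch; values are illustrative)
prompts:
  - 'Summarize this support ticket as JSON with "summary" and "priority" keys: {{ticket}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-latest

tests:
  - vars:
      ticket: My March invoice was charged twice.
    assert:
      - type: is-json # structured-output check (see Evaluating JSON Outputs)
      - type: icontains
        value: invoice
      - type: llm-rubric # model-graded check (see LLM as a Judge)
        value: Accurately identifies a duplicate billing complaint.
```

Because every test runs against every provider in the list, the same config doubles as a side-by-side model comparison, which is the mechanism the guides in the next section rely on.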
## Model Comparisons

Compare LLM performance on your own data to make informed model selection decisions.

- [GPT vs Claude vs Gemini](/docs/guides/gpt-vs-claude-vs-gemini) — Benchmark the leading commercial models side-by-side
- [DeepSeek Benchmark](/docs/guides/deepseek-benchmark) — Evaluate DeepSeek against leading models
- [Choosing the Best GPT Model](/docs/guides/choosing-best-gpt-model) — Pick the right GPT variant for your workload
- [GPT-5.2 vs o3](/docs/guides/gpt-vs-reasoning-model) — Compare standard vs reasoning models
- [Comparing Open-Source Models](/docs/guides/compare-open-source-models) — Evaluate Mistral, Llama, Gemma, and Phi on custom datasets
- [GPT Model Tiers MMLU-Pro](/docs/guides/gpt-mmlu-comparison) — Measure GPT model tier performance on MMLU-Pro
- [Qwen vs Llama vs GPT](/docs/guides/qwen-benchmark) — Run a custom benchmark across model families
- [OpenAI vs Azure Benchmark](/docs/guides/azure-vs-openai) — Test whether Azure-hosted models match OpenAI directly
- [Mixtral vs GPT](/docs/guides/mixtral-vs-gpt) — Pit Mixtral against GPT on real tasks
- [Censored vs Uncensored Models](/docs/guides/censored-vs-uncensored-ollama) — Compare content filtering behavior with Ollama

## Integrations

Use promptfoo with popular frameworks and services. A custom-provider sketch follows the list.

- [Using LangChain PromptTemplate](/docs/guides/langchain-prompttemplate) — Integrate LangChain prompt templates with promptfoo
- [Evaluating OpenAI Assistants](/docs/guides/evaluate-openai-assistants) — Evaluate OpenAI's Assistants API
- [Evaluating CrewAI Agents](/docs/guides/evaluate-crewai) — Red team and evaluate CrewAI multi-agent workflows
- [Evaluating LangGraph](/docs/guides/evaluate-langgraph) — Test LangGraph agent applications
- [Evaluating ElevenLabs Voice AI](/docs/guides/evaluate-elevenlabs) — Test text-to-speech and voice AI outputs
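Most framework integrations follow the same pattern: wrap the chain or agent in a custom provider and point the config at it. A rough sketch, assuming a hypothetical `my_langchain_chain.py` that implements promptfoo's Python provider interface (a `call_api` function returning an `output` field); see the guides above for each framework's exact contract:

```yaml
# promptfooconfig.yaml (rough sketch; my_langchain_chain.py is hypothetical)
prompts:
  - 'Answer the customer question: {{question}}'

providers:
  - openai:gpt-4o-mini # hosted baseline for comparison
  - file://./my_langchain_chain.py # custom Python provider wrapping the chain

tests:
  - vars:
      question: How do I reset my password?
    assert:
      - type: icontains
        value: password
```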