---
sidebar_label: Guides
title: Eval Guides
description: Step-by-step tutorials for evaluating LLMs, comparing models, and testing integrations with promptfoo
---

# Eval Guides

Practical tutorials to help you get the most out of promptfoo evals. Each guide walks through a real-world scenario with working configuration examples you can adapt to your own use case.

## Evaluation Techniques

Learn how to measure specific aspects of LLM output quality. The config sketch after this list shows the common skeleton these guides build on.

- [Evaluating RAG Pipelines](/docs/guides/evaluate-rag) — Test retrieval-augmented generation end-to-end
- [Preventing Hallucinations](/docs/guides/prevent-llm-hallucinations) — Measure and reduce hallucination rates with perplexity, RAG, and controlled decoding
- [Evaluating JSON Outputs](/docs/guides/evaluate-json) — Validate structured LLM outputs against schemas
- [Evaluating Coding Agents](/docs/guides/evaluate-coding-agents) — Test AI coding assistants
- [Evaluating OSWorld with Inspect](/docs/guides/evaluate-osworld-with-inspect) — Run desktop computer-use benchmark tasks through Inspect
- [Testing Agent Skills](/docs/guides/test-agent-skills) — Compare Claude and Codex skill versions with routing, quality, cost, latency, and trace checks
- [Evaluating Factuality](/docs/guides/factuality-eval) — Score responses for factual accuracy
- [LLM as a Judge](/docs/guides/llm-as-a-judge) — Build reliable model-graded evals with rubrics, calibration, and multi-judge checks
- [Testing LLM Chains](/docs/configuration/testing-llm-chains) — Evaluate multi-step LLM pipelines
- [Choosing the Right Temperature](/docs/guides/evaluate-llm-temperature) — Find the optimal temperature setting for your use case
- [Evaluating Text-to-SQL](/docs/guides/text-to-sql-evaluation) — Measure SQL generation accuracy
- [Sandboxed Code Evaluations](/docs/guides/sandboxed-code-evals) — Safely run and test LLM-generated code
- [HLE Benchmark](/docs/guides/hle-benchmark) — Evaluate models on the Humanity's Last Exam benchmark
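Most of these guides start from the same skeleton: a `promptfooconfig.yaml` that pairs prompts and providers with test cases and assertions. Here is a minimal sketch; the model IDs, prompt, and assertion values are illustrative, not recommendations:

```yaml
# promptfooconfig.yaml (minimal sketch; values are illustrative)
prompts:
  - 'Summarize this support ticket as JSON with "summary" and "priority" keys: {{ticket}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-latest

tests:
  - vars:
      ticket: My March invoice was charged twice.
    assert:
      - type: is-json # structured-output check (see Evaluating JSON Outputs)
      - type: icontains
        value: invoice
      - type: llm-rubric # model-graded check (see LLM as a Judge)
        value: Accurately identifies a duplicate billing complaint.
```

Because every test runs against every provider in the list, the same config doubles as a side-by-side model comparison, which is the mechanism the guides in the next section rely on.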
## Model Comparisons

Compare LLM performance on your own data to make informed model selection decisions.

- [GPT vs Claude vs Gemini](/docs/guides/gpt-vs-claude-vs-gemini) — Benchmark the leading commercial models side-by-side
- [DeepSeek Benchmark](/docs/guides/deepseek-benchmark) — Evaluate DeepSeek against leading models
- [Choosing the Best GPT Model](/docs/guides/choosing-best-gpt-model) — Pick the right GPT variant for your workload
- [GPT-5.2 vs o3](/docs/guides/gpt-vs-reasoning-model) — Compare standard vs reasoning models
- [Comparing Open-Source Models](/docs/guides/compare-open-source-models) — Evaluate Mistral, Llama, Gemma, and Phi on custom datasets
- [GPT Model Tiers MMLU-Pro](/docs/guides/gpt-mmlu-comparison) — Measure GPT model tier performance on MMLU-Pro
- [Qwen vs Llama vs GPT](/docs/guides/qwen-benchmark) — Run a custom benchmark across model families
- [OpenAI vs Azure Benchmark](/docs/guides/azure-vs-openai) — Test whether Azure-hosted models match OpenAI directly
- [Mixtral vs GPT](/docs/guides/mixtral-vs-gpt) — Pit Mixtral against GPT on real tasks
- [Censored vs Uncensored Models](/docs/guides/censored-vs-uncensored-ollama) — Compare content filtering behavior with Ollama

## Integrations

Use promptfoo with popular frameworks and services. A custom-provider sketch follows the list.

- [Using LangChain PromptTemplate](/docs/guides/langchain-prompttemplate) — Integrate LangChain prompt templates with promptfoo
- [Evaluating OpenAI Assistants](/docs/guides/evaluate-openai-assistants) — Evaluate OpenAI's Assistants API
- [Evaluating CrewAI Agents](/docs/guides/evaluate-crewai) — Red team and evaluate CrewAI multi-agent workflows
- [Evaluating LangGraph](/docs/guides/evaluate-langgraph) — Test LangGraph agent applications
- [Evaluating ElevenLabs Voice AI](/docs/guides/evaluate-elevenlabs) — Test text-to-speech and voice AI outputs
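Most framework integrations follow the same pattern: wrap the chain or agent in a custom provider and point the config at it. A rough sketch, assuming a hypothetical `my_langchain_chain.py` that implements promptfoo's Python provider interface (a `call_api` function returning an `output` field); see the guides above for each framework's exact contract:

```yaml
# promptfooconfig.yaml (rough sketch; my_langchain_chain.py is hypothetical)
prompts:
  - 'Answer the customer question: {{question}}'

providers:
  - openai:gpt-4o-mini # hosted baseline for comparison
  - file://./my_langchain_chain.py # custom Python provider wrapping the chain

tests:
  - vars:
      question: How do I reset my password?
    assert:
      - type: icontains
        value: password
```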