---
sidebar_label: LLM Rubric
description: 'Create flexible custom rubrics using natural language to evaluate LLM outputs against specific quality and safety criteria'
---

# LLM Rubric

`llm-rubric` is promptfoo's general-purpose grader for "LLM as a judge" evaluation. It is similar to OpenAI's [model-graded-closedqa](/docs/configuration/expected-outputs) prompt, but can be more effective and robust in certain cases.

## How to use it

To use the `llm-rubric` assertion type, add it to your test configuration like this:

```yaml
assert:
  - type: llm-rubric
    # Specify the criteria for grading the LLM output:
    value: Is not apologetic and provides a clear, concise answer
```

This assertion will use a language model to grade the output based on the specified rubric.

## How it works

Under the hood, `llm-rubric` uses a model to evaluate the output based on the criteria you provide. By default, it uses different models depending on which API keys are available:

- **OpenAI API key**: `gpt-5`
- **Codex/ChatGPT login**: `openai:codex-sdk` when the Codex SDK package is installed, Codex is signed in, and no higher-priority API credentials are set
- **Anthropic API key**: `claude-sonnet-4-5-20250929`
- **Google AI Studio API key**: `gemini-2.5-pro` (GEMINI_API_KEY, GOOGLE_API_KEY, or PALM_API_KEY)
- **Google Vertex credentials**: `gemini-2.5-pro` (service account credentials)
- **Mistral API key**: `mistral-large-latest`
- **GitHub token**: `openai/gpt-5`
- **Azure credentials**: Your configured Azure GPT deployment

You can override this by setting the `provider` option (see below).

The Codex/ChatGPT login fallback is text-only. Assertions that need embeddings or moderation still require an API-key-backed provider override.

It asks the model to output a JSON object that looks like this:

```json
{
  "reason": "",
  "score": 0.5, // 0.0-1.0
  "pass": true // true or false
}
```

Use your knowledge of this structure to give special instructions in your rubric, for example:

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the output based on how funny it is. Grade it on a scale of 0.0 to 1.0, where:
      Score of 0.1: Only a slight smile.
      Score of 0.5: Laughing out loud.
      Score of 1.0: Rolling on the floor laughing.
      Anything funny enough to be on SNL should pass, otherwise fail.
```

## Using variables in the rubric

You can incorporate test variables into your LLM rubric. This is particularly useful for detecting hallucinations or ensuring the output addresses specific aspects of the input. Here's an example:

```yaml
providers:
  - openai:gpt-5
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
defaultTest:
  assert:
    - type: llm-rubric
      value: 'Provides a direct answer to the question: "{{question}}" without unnecessary elaboration'
tests:
  - vars:
      question: What is the capital of France?
  - vars:
      question: How many planets are in our solar system?
```

## Overriding the LLM grader

By default, `llm-rubric` uses `gpt-5` for grading (or another default grader, depending on your available credentials, as described above). You can override this in several ways:

1. Using the `--grader` CLI option:

   ```sh
   promptfoo eval --grader openai:gpt-5-mini
   ```

2. Using `test.options` or `defaultTest.options`:

   ```yaml
   defaultTest:
     // highlight-start
     options:
       provider: openai:gpt-5-mini
     // highlight-end
   tests:
     - assert:
         - type: llm-rubric
           value: Is written in a professional tone
   ```

3. Using `assertion.provider`:

   ```yaml
   tests:
     - description: Evaluate output using LLM
       assert:
         - type: llm-rubric
           value: Is written in a professional tone
           // highlight-start
           provider: openai:gpt-5-mini
           // highlight-end
   ```

### Setting grader parameters (temperature, etc.)

To pin `temperature`, `max_tokens`, or other provider-specific parameters on the grader, expand the `provider` shorthand into an object with `id` and `config`. This is the supported way to make grading more reproducible when swapping in a custom judge:

```yaml
assert:
  - type: llm-rubric
    value: Is not apologetic and provides a clear, concise answer
    // highlight-start
    provider:
      id: openai:gpt-5-mini
      config:
        temperature: 0
    // highlight-end
```

The same shape works under `defaultTest.options.provider` and `test.options.provider`. Provider precedence is exact: `assertion.provider` overrides `test.options.provider`, which overrides `defaultTest.options.provider`. If your default grader is a full object with `config`, do not add a shorthand `provider: openai:chat:...` on the assertion unless you also repeat the full object there.
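As a minimal sketch of that precedence (the grader models shown here are arbitrary picks, not requirements), this config keeps full objects at both levels, and the assertion-level grader wins:

```yaml
# Precedence, most specific wins: assertion > test.options > defaultTest.options
defaultTest:
  options:
    provider:
      id: openai:gpt-5-mini
      config:
        temperature: 0
tests:
  - assert:
      - type: llm-rubric
        value: Is written in a professional tone
        # Full object here too, since the default grader above is a full object
        provider:
          id: anthropic:messages:claude-sonnet-4-5-20250929
          config:
            temperature: 0
```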
:::note
The built-in OpenAI grader already defaults to `temperature=0`, so this override is only needed when you're pointing at a different model or provider whose default differs. GPT-5 series reasoning models ignore `temperature` and do not need it set.
:::

Custom `llm-rubric` providers can also return a `metadata` object in their `ProviderResponse`. promptfoo copies those keys onto the assertion's `GradingResult.metadata` alongside `renderedGradingPrompt`, which makes per-assertion fields such as upload IDs or trace IDs available in hooks like `afterEach`.

### OpenAI-compatible judges with thinking output

Some self-hosted OpenAI-compatible judges, including vLLM servers configured with reasoning parsers, return hidden reasoning separately from the final content. Promptfoo includes that reasoning in provider output by default. For a judge, this can break JSON parsing when the reasoning contains scratchpad objects before the final `{"pass": ..., "score": ..., "reason": ...}` verdict. Set `showThinking: false` on the judge provider.

See the [vLLM provider guide](/docs/providers/vllm#use-vllm-as-an-llm-judge) for a complete local judge recipe, including truncated `<think>` output and request-level thinking controls. The same rule applies to other model-graded assertions that use a text judge; see the [model-graded overview](/docs/configuration/expected-outputs/model-graded#openai-compatible-thinking-judges) for the full metric list.
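For example, a minimal sketch of a judge override with thinking suppressed (the model name and local endpoint are placeholder assumptions for a self-hosted vLLM server):

```yaml
defaultTest:
  options:
    provider:
      # Hypothetical model served by a local vLLM instance
      id: openai:chat:qwen3-32b
      config:
        apiBaseUrl: http://localhost:8000/v1 # placeholder endpoint
        showThinking: false # keep hidden reasoning out of the graded output
```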
"content": "Du bewertest Ausgaben nach Kriterien. Antworte mit JSON: {\"reason\": \"string\", \"pass\": boolean, \"score\": number}. ALLE Antworten auf Deutsch." }, { "role": "user", // German: "Output: {{ output }}\nCriterion: {{ rubric }}" "content": "Ausgabe: {{ output }}\nKriterium: {{ rubric }}" } ] assert: - type: llm-rubric # German: "Responds politely and helpfully" value: 'Antwortet höflich und hilfreich' ``` ### Option 2: Language Instructions in Rubric ```yaml assert: - type: llm-rubric value: 'Responds politely and helpfully. Provide your evaluation reason in German.' ``` ### Option 3: Full Native Language Rubric ```yaml # German assert: - type: llm-rubric # German: "Responds politely and helpfully. Provide reasoning in German." value: "Antwortet höflich und hilfreich. Begründung auf Deutsch geben." # Japanese assert: - type: llm-rubric # Japanese: "Does not contain harmful content. Please provide evaluation reasoning in Japanese." value: "有害なコンテンツを含まない。評価理由は日本語で答えてください。" ``` **Note:** Option 1 works with `llm-rubric`, `g-eval`, and `model-graded-closedqa`. For other assertion types like `factuality` or `context-recall`, create assertion-specific prompts that match their expected formats. ### Assertion-Specific Prompts For assertions requiring specific output formats: ```yaml # factuality - requires {"category": "A/B/C/D/E", "reason": "..."} tests: - options: rubricPrompt: | [ { "role": "system", // German: "You compare factual accuracy. Respond with JSON: {\"category\": \"A/B/C/D/E\", \"reason\": \"string\"}. A=subset, B=superset, C=identical, D=contradiction, E=irrelevant. ALL responses in German." "content": "Du vergleichst Faktentreue. Antworte mit JSON: {\"category\": \"A/B/C/D/E\", \"reason\": \"string\"}. A=Teilmenge, B=Obermenge, C=identisch, D=Widerspruch, E=irrelevant. ALLE Antworten auf Deutsch." }, { "role": "user", // German: "Expert answer: {{ rubric }}\nSubmitted answer: {{ output }}" "content": "Expertenantwort: {{ rubric }}\nEingereichte Antwort: {{ output }}" } ] assert: - type: factuality # German: "Berlin is the capital of Germany" value: 'Berlin ist die Hauptstadt von Deutschland' ``` ### Object handling in rubric prompts When using `{{output}}` or `{{rubric}}` variables that contain objects, promptfoo automatically converts them to JSON strings by default to prevent display issues. If you need to access specific properties of objects in your rubric prompts, you can enable object property access: ```bash export PROMPTFOO_DISABLE_OBJECT_STRINGIFY=true promptfoo eval ``` With this enabled, you can access object properties directly in your rubric prompts: ```yaml rubricPrompt: > [ { "role": "user", "content": "Evaluate this answer: {{output.text}}\nFor the question: {{rubric.question}}\nCriteria: {{rubric.criteria}}" } ] ``` For more details, see the [object template handling guide](/docs/usage/troubleshooting#object-template-handling). ## Threshold Support The `llm-rubric` assertion type supports an optional `threshold` property that sets a minimum score requirement. When specified, the output must achieve a score greater than or equal to the threshold to pass. For example: ```yaml assert: - type: llm-rubric value: Is not apologetic and provides a clear, concise answer threshold: 0.8 # Requires a score of 0.8 or higher to pass ``` The threshold is applied to the score returned by the LLM (which ranges from 0.0 to 1.0). If the LLM returns an explicit pass/fail status, the threshold will still be enforced - both conditions must be met for the assertion to pass. 
## Pass vs. Score Semantics

- PASS is determined by the LLM's boolean `pass` field unless you set a `threshold`.
- If the model omits `pass`, promptfoo assumes `pass: true` by default.
- `score` is a numeric metric that does not affect PASS/FAIL unless you set `threshold`.
- When `threshold` is set, both must be true for the assertion to pass:
  - `pass === true`
  - `score >= threshold`

This means that without a `threshold`, a result like `{ pass: true, score: 0 }` will pass. If you want the numeric score (e.g., a 0/1 rubric) to drive PASS/FAIL, set a `threshold` accordingly or have the model return an explicit `pass`.

:::caution
If the model omits `pass` and you don't set `threshold`, the assertion passes even with `score: 0`.
:::

### Common misconfiguration

```yaml
# ❌ Problem: Returns 0/1 scores but no threshold set
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    # Missing threshold - always passes due to pass defaulting to true
```

**Fixes:**

```yaml
# ✅ Option A: Add threshold
assert:
  - type: llm-rubric
    value: |
      Return 0 if the response is incorrect
      Return 1 if the response is correct
    threshold: 1

# ✅ Option B: Control pass explicitly
assert:
  - type: llm-rubric
    value: |
      Return {"pass": true, "score": 1} if the response is correct
      Return {"pass": false, "score": 0} if the response is incorrect
```

## Negation with `not-llm-rubric`

Prepend `not-` to invert the assertion, which is useful for "must not" criteria:

```yaml
assert:
  - type: not-llm-rubric
    value: Apologizes or hedges before answering
```

`not-llm-rubric` passes when the rubric criterion does **not** match. Transport or parse failures from the grader are reported as failures in both directions: a grader error is not treated as evidence that the criterion was or was not met, so inversion never silently turns a failed grader call into a pass.

## Further reading

See [model-graded metrics](/docs/configuration/expected-outputs/model-graded) for more options.