---
sidebar_label: vLLM
description: "Run vLLM's OpenAI-compatible server with promptfoo, including local chat targets, self-hosted judges, and thinking-model grading."
---

# vLLM

[vLLM's OpenAI-compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/) implements `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, and `/v1/embeddings`. Promptfoo connects to it through the OpenAI provider by changing `apiBaseUrl`.

Use this page when:

- vLLM is the model under test
- vLLM is the local LLM-as-a-judge provider for model-graded assertions such as `llm-rubric`, `g-eval`, `factuality`, `answer-relevance`, `context-*`, or `select-best`
- your vLLM model returns a separate `reasoning` field and promptfoo should grade only the final `content`

## Start a vLLM server

```bash
vllm serve Qwen/Qwen3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name qwen3-8b \
  --api-key token-abc123
```

Then verify the server directly:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Authorization: Bearer token-abc123' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-8b",
    "messages": [{"role": "user", "content": "Reply with OK"}],
    "max_tokens": 8
  }'
```

`apiBaseUrl` should be the `/v1` root. Promptfoo appends `/chat/completions`, `/completions`, or `/embeddings` depending on the provider type.

For a wiring-only smoke test, a tiny reasoning model such as `Qwen/Qwen3-0.6B` can verify that promptfoo reaches vLLM and that `showThinking` behaves correctly. Do not use a tiny model as a real judge; use it only to test the endpoint, parser, and config shape:

```bash
vllm serve Qwen/Qwen3-0.6B \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name llm_judge \
  --reasoning-parser qwen3 \
  --max-model-len 4096 \
  --api-key token-abc123
```

### Example judge models

Keep `--served-model-name` short and stable; promptfoo references that alias in the `openai:chat:` provider ID.

#### GPT-OSS

For GPT-OSS, use the Hugging Face model name as the vLLM model and expose a short served alias. The example below uses `openai/gpt-oss-20b`; use a larger GPT-OSS checkpoint the same way when your host has enough memory:

```bash
vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name gpt-oss-20b \
  --api-key token-abc123
```

Prefer a Linux CUDA or ROCm host for GPT-OSS with vLLM. If CPU or ARM serving fails, check the [vLLM GPT-OSS recipe](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html) for the backend support notes that match your vLLM release.

#### GLM-4.7

For GLM-4.7, use a vLLM and Transformers combination that supports the exact GLM checkpoint you are serving. The [vLLM GLM recipe](https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html) keeps the install guidance for GLM releases current:

```bash
vllm serve zai-org/GLM-4.7-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name glm-4.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --api-key token-abc123
```

For `zai-org/GLM-4.7-Flash`, use a served name such as `glm-4.7-flash`. If you do not need tool calling, the tool flags are optional for ordinary model-graded assertions; keep `--reasoning-parser glm45` when you want vLLM to split reasoning from final content.
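Whichever model you serve, you can confirm the exposed aliases before writing any promptfoo config. `GET /v1/models` lists the names the server will accept in the `model` field (this assumes the host, port, and API key used throughout this page, plus `jq` for formatting):

```bash
# List served model IDs; each one is a valid value after `openai:chat:`
curl -s http://localhost:8000/v1/models \
  -H 'Authorization: Bearer token-abc123' | jq -r '.data[].id'
```

Each ID printed here is what goes after `openai:chat:` in the provider configs below.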
## Use vLLM as the target model

```yaml title="promptfooconfig.yaml"
prompts:
  - '{{question}}'

providers:
  - id: openai:chat:qwen3-8b
    label: vLLM qwen3-8b
    config:
      apiBaseUrl: http://localhost:8000/v1
      apiKey: token-abc123
      temperature: 0.2
      max_tokens: 512

tests:
  - vars:
      question: 'What is the capital of France?'
    assert:
      - type: contains
        value: Paris
```

For completions models, use `openai:completion:` instead of `openai:chat:`.

You can also put the endpoint in an environment variable:

```bash
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=token-abc123
```

## Use vLLM as an LLM judge

Model-graded assertions call a separate grading provider. Configure that provider under `defaultTest.options.provider` when every model-graded assertion should use the same vLLM judge:

```yaml title="promptfooconfig.yaml"
prompts:
  - '{{answer}}'

providers:
  # System under test. This can be any provider.
  - echo

defaultTest:
  options:
    provider:
      id: openai:chat:llm_judge
      label: 'Judge: llm_judge @ vLLM'
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        temperature: 0
        max_tokens: 10000
        showThinking: false

tests:
  - vars:
      answer: 'Use the Forgot password link and verify by email or SMS.'
    assert:
      - type: llm-rubric
        value: |
          Pass if the answer explains how to reset a password and mentions a verification step.
```

Do not repeat `provider: openai:chat:llm_judge` on an assertion when the full provider object already lives in `defaultTest.options.provider`. An assertion-level `provider` overrides the default provider object, so the `apiBaseUrl`, `apiKey`, `showThinking`, and other config values above will not be used. If only one assertion should use vLLM, put the full object on that assertion:

```yaml
assert:
  - type: llm-rubric
    value: 'Answer gives correct password reset instructions'
    provider:
      id: openai:chat:llm_judge
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        temperature: 0
        max_tokens: 10000
        showThinking: false
```

## Thinking and reasoning models

When vLLM is started with a reasoning parser, responses may include:

- `message.reasoning_content` or `message.reasoning`: hidden reasoning extracted by vLLM
- `message.content`: final answer

Promptfoo's OpenAI-compatible chat provider includes reasoning in the returned output by default:

```text
Thinking: <reasoning_content>

<content>
```

That is useful when vLLM is the target model, because assertions can inspect the full visible output. It is usually wrong when vLLM is the judge, because model-graded assertions consume the judge output as the material to parse, embed, classify, or score. If the reasoning text contains JSON-looking scratchpad content, attribution markers, candidate sentences, or numeric choices, promptfoo can use that scratchpad before the final answer.

Set `showThinking: false` on vLLM judge providers so promptfoo discards reasoning fields and parses only `content`.

This depends on vLLM successfully splitting the response. If the request stops before the model closes its thinking block, vLLM can return raw `<think>...` text in `message.content` instead of `reasoning_content`. In that case `showThinking: false` cannot distinguish scratchpad text from final content. Increase the server `--max-model-len` and provider `max_tokens`, or disable thinking for judge calls with `chat_template_kwargs.enable_thinking: false`.

`search-rubric` is special because it requires web search. A plain vLLM chat server is not a web-search-capable grader; promptfoo will prefer or load a search-capable provider instead. The `showThinking` guidance then applies to the search provider that actually grades the assertion.
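Before relying on `showThinking: false`, confirm that the server actually splits reasoning out of `content`. A quick probe against the smoke-test server from earlier (this assumes the `llm_judge` alias and `token-abc123` key used on this page):

```bash
# Show where the reasoning lands: in reasoning_content (good) or
# mixed into content (parser missing or generation truncated).
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Authorization: Bearer token-abc123' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llm_judge",
    "messages": [{"role": "user", "content": "Is 17 prime? Answer yes or no."}],
    "max_tokens": 2048
  }' | jq '.choices[0].message | {reasoning_content, content}'
```

If `reasoning_content` is `null` and `content` begins with `<think>`, fix the reasoning parser flag or raise the token limits before trusting judge output.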
The `showThinking` guidance applies to every model-graded assertion that consumes text from the judge: JSON-first metrics can parse scratchpad JSON, RAG metrics can score scratchpad sentences or attribution markers, `answer-relevance` can embed generated questions with `Thinking:` prepended, and `select-best` can read a scratchpad number as the winning index.

### Disable thinking at the vLLM API level

`showThinking: false` only changes what promptfoo reads from the response; the model may still spend tokens thinking. For small local judges and CI smoke tests, disabling thinking is often faster and avoids truncated `<think>` content.

Qwen3 and GLM chat templates support disabling thinking per request through `chat_template_kwargs`:

```yaml
defaultTest:
  options:
    provider:
      id: openai:chat:llm_judge
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        showThinking: false
        passthrough:
          chat_template_kwargs:
            enable_thinking: false
```

For GPT-OSS-style chat completions, request-level reasoning controls use different fields:

```yaml
defaultTest:
  options:
    provider:
      id: openai:chat:gpt-oss-20b
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        showThinking: false
        passthrough:
          include_reasoning: false
          reasoning_effort: low
```

Keep `showThinking: false` even when you pass model-specific controls such as `include_reasoning: false` or `chat_template_kwargs.enable_thinking: false`. Those controls save tokens when vLLM honors them; `showThinking: false` is the promptfoo-side guard.

## Provider maps for text and embeddings

Some assertions need both a text judge and an embedding model. Use a provider map when a single eval mixes text-graded assertions with embedding-based assertions such as `answer-relevance` or `similar`:

```yaml
defaultTest:
  options:
    provider:
      text:
        id: openai:chat:llm_judge
        config:
          apiBaseUrl: http://localhost:8000/v1
          apiKey: token-abc123
          temperature: 0
          showThinking: false
      embedding:
        id: openai:embedding:intfloat/e5-large-v2
        config:
          apiBaseUrl: http://localhost:8000/v1
          apiKey: token-abc123
```

## Troubleshooting

| Symptom | Fix |
| --- | --- |
| `API key is not set` | Set `apiKey` in provider config, or set `OPENAI_API_KEY`. If vLLM was started without `--api-key`, any placeholder such as `empty` is fine. |
| `ECONNREFUSED` | Use `127.0.0.1` instead of `localhost`, verify the vLLM port, and confirm Docker or a remote host exposes the port. |
| Promptfoo calls OpenAI instead of vLLM | Put `apiBaseUrl: http://.../v1` on the provider object, or set `OPENAI_BASE_URL`. Do not set `apiBaseUrl` to `/v1/chat/completions`. |
| Judge returns `Could not extract JSON`, wrong categories, odd RAG scores, or wrong `select-best` winners | Set `showThinking: false` on the judge provider and keep the full provider object in `defaultTest.options.provider` or `assert.provider`. |
| Judge output still starts with `<think>` even with `showThinking: false` | The generation was truncated before vLLM split reasoning into `reasoning_content`. Increase `--max-model-len` / `max_tokens`, or disable thinking via `passthrough.chat_template_kwargs.enable_thinking: false`. |
| `search-rubric` uses a different provider than vLLM | This is expected unless the configured provider has web-search capability. Plain vLLM chat is not a search provider; configure a web-search-capable grader for `search-rubric`. |
| `assert.provider` appears to ignore `defaultTest.options.provider.config` | This is expected precedence. Use the full provider object at the assertion level, or remove `assert.provider` so the default provider object is inherited. |
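For the connection-level rows above, rule out the server itself before touching promptfoo config. vLLM's OpenAI server exposes a `/health` endpoint (this sketch assumes the default port from this page):

```bash
# Expect HTTP 200 once the model has finished loading; a refused
# connection here is a server or networking problem, not a promptfoo one.
curl -i http://127.0.0.1:8000/health
```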
Run with `--no-cache` while debugging:

```bash
promptfoo eval -c promptfooconfig.yaml --no-cache -o results.json
```

Then inspect the judge result:

```bash
jq '.results.results[].gradingResult.componentResults[] | {pass, score, reason}' results.json
```
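To surface only the failing grades, filter the same fields on each component's `pass` flag (this assumes the `results.json` layout read by the command above):

```bash
# Keep only failed grading components and print their score and reason
jq '.results.results[].gradingResult.componentResults[]
  | select(.pass == false)
  | {score, reason}' results.json
```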