---
sidebar_label: 'GPT-5.2 vs o3'
description: 'Benchmark OpenAI o3 reasoning model against GPT-5.2 for cost, latency, and accuracy to optimize model selection decisions'
slug: gpt-vs-reasoning-model
---

# GPT-5.2 vs o3: Benchmark on Your Own Data

OpenAI's o3 is a reasoning model designed to spend more time thinking before responding, excelling at complex math and logic tasks. GPT-5.2 outperforms o3 on most general benchmarks, but o3 still leads on deep reasoning tasks like advanced math and multi-step logic problems.

This guide describes how to compare `o3` against `gpt-5.2` using promptfoo, with a focus on performance, cost, and latency.

The end result will be a side-by-side comparison that looks similar to this:

![o3 vs gpt-5.2 comparison](/img/docs/o1-vs-gpt.jpg)

## Prerequisites

Before we begin, you'll need:

- promptfoo CLI installed. If not, refer to the [installation guide](/docs/installation).
- An active OpenAI API key set as the `OPENAI_API_KEY` environment variable.

## Step 1: Setup

Create a new directory for your comparison project:

```sh
mkdir openai-o3-comparison
cd openai-o3-comparison
```

## Step 2: Configure the Comparison

Create a `promptfooconfig.yaml` file to define your comparison.

1. **Prompts**: Define the prompt template that will be used for all test cases. In this example, we're using riddles:

   ```yaml
   prompts:
     - 'Solve this riddle: {{riddle}}'
   ```

   The `{{riddle}}` placeholder will be replaced with specific riddles in each test case.

1. **Providers**: Specify the models you want to compare. In this case, we're comparing `gpt-5.2` and `o3`:

   ```yaml
   providers:
     - openai:gpt-5.2
     - openai:o3
   ```

1. **Default Test Assertions**: Set up default assertions that will apply to all test cases. Given the cost and speed of o3, we're setting thresholds for cost and latency:

   ```yaml
   defaultTest:
     assert:
       # Inference should always cost less than this (USD)
       - type: cost
         threshold: 0.02

       # Inference should always be faster than this (milliseconds)
       - type: latency
         threshold: 30000
   ```

   These assertions will flag any responses that exceed $0.02 in cost or 30 seconds in response time.

1. **Test Cases**: Now, define your test cases. In this specific example, each test case includes:
   - The riddle text (assigned to the `riddle` variable)
   - Specific assertions for that test case (optional)

   Here's an example of a test case with assertions:

   ```yaml
   tests:
     - vars:
         riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
       assert:
         - type: contains
           value: echo
         - type: llm-rubric
           value: Do not apologize
   ```

   This test case checks if the response contains the word "echo" and uses an LLM-based rubric to ensure the model doesn't apologize in its response. See [deterministic metrics](/docs/configuration/expected-outputs/deterministic/) and [model-graded metrics](/docs/configuration/expected-outputs/model-graded/) for more details.

Add multiple test cases to thoroughly evaluate the models' performance on different types of riddles or problems.

Now, let's put it all together in the final configuration:

```yaml title="promptfooconfig.yaml"
description: 'GPT-5.2 vs o3 comparison'

prompts:
  - 'Solve this riddle: {{riddle}}'

providers:
  - openai:gpt-5.2
  - openai:o3

defaultTest:
  assert:
    # Inference should always cost less than this (USD)
    - type: cost
      threshold: 0.02

    # Inference should always be faster than this (milliseconds)
    - type: latency
      threshold: 30000

tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
    assert:
      - type: contains
        value: echo
      - type: llm-rubric
        value: Do not apologize
  - vars:
      riddle: 'The more of this there is, the less you see. What is it?'
    assert:
      - type: contains
        value: darkness
  - vars:
      riddle: >-
        Suppose I have a cabbage, a goat and a lion, and I need to get them
        across a river. I have a boat that can only carry myself and a single
        other item. I am not allowed to leave the cabbage and lion alone
        together, and I am not allowed to leave the lion and goat alone
        together. How can I safely get all three across?
  - vars:
      riddle: 'The surgeon, who is the boy''s father says, "I can''t operate on this boy, he''s my son!" Who is the surgeon to the boy?'
    assert:
      - type: llm-rubric
        value: "output must state that the surgeon is the boy's father"
```

This configuration sets up a comprehensive comparison between `gpt-5.2` and `o3` using a variety of riddles, with cost and latency requirements. We strongly encourage you to revise this with your own test cases and assertions!
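Optionally, you can tune each model's behavior by passing provider options. Here's a minimal sketch: reasoning models like o3 accept a reasoning effort setting, while sampling options like `temperature` generally apply only to standard chat models. The specific values below are illustrative assumptions, so check the [OpenAI provider docs](/docs/providers/openai/) for the options each model actually supports:

```yaml
providers:
  # Standard chat model: sampling options such as temperature may apply
  - id: openai:gpt-5.2
    config:
      temperature: 0

  # Reasoning model: controls how much "thinking" the model does
  - id: openai:o3
    config:
      reasoning_effort: 'high' # illustrative; typically 'low' | 'medium' | 'high'
```

Raising the reasoning effort tends to improve accuracy on hard problems at the expense of cost and latency, which is exactly the trade-off this comparison measures.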
## Step 3: Run the Comparison

Execute the comparison using the `promptfoo eval` command:

```sh
npx promptfoo@latest eval
```

This will run each test case against both models and output the results.

To view the results in a web interface, run:

```sh
npx promptfoo@latest view
```

![o3 vs gpt-5.2 comparison](/img/docs/o1-vs-gpt.jpg)

## What's next?

By running this comparison, you'll gain insight into how o3 performs against gpt-5.2 on tasks requiring logical reasoning and problem-solving, along with the trade-offs in cost and latency.

Reasoning models like o3 excel at complex multi-step problems, but for simpler tasks the extra thinking time and cost may not be worth it. GPT-5.2 is often the better choice when speed and cost matter more than deep reasoning.

Ultimately, the best model depends heavily on your application. There's no substitute for testing these models on your own data rather than relying on general-purpose benchmarks.
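If you want to analyze the results outside of promptfoo, the `eval` command can also write them to a file with its output flag. For example (the filename is arbitrary; the extension determines the format, e.g. JSON or CSV):

```sh
# Save the full eval results for offline analysis or custom reporting
npx promptfoo@latest eval -o results.json
```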