OpenAI Shifts o3 and o4-mini to Hybrid Reasoning, Cutting Inference Costs by 40%

What Changed and Why It Matters
OpenAI updated the inference architecture of its o3 and o4-mini models in late May 2026, introducing what the company calls adaptive chain-of-thought scaling. Instead of running full extended thinking on every query, the models now assess task complexity at the prompt-routing layer and allocate proportional compute. Simple factual lookups get a short reasoning pass; multi-step math proofs and code generation still get the full budget.
The practical result: API costs for o3 drop from $15 per million output tokens to $9, and o4-mini goes from $1.10 to $0.66. For developers running high-volume workloads — customer support classification, document summarization, structured data extraction — this is a meaningful change to unit economics, not just a pricing footnote.
How Hybrid Reasoning Actually Works
Traditional chain-of-thought models spend the same compute regardless of whether the user asked a simple question or wanted a proof of a complex theorem. Adaptive scaling solves this by running a lightweight classifier at inference time that scores query complexity across four dimensions: logical depth required, domain specificity, ambiguity level, and whether multiple solution paths need to be explored.
Queries scoring below a threshold get routed to a fast path — still a reasoning model, but with a compressed token budget for internal deliberation. Queries above the threshold get full extended thinking. OpenAI reports the classifier itself costs under 50 tokens per call, making the overhead negligible.
On the MMLU benchmark, o3 hybrid scores within 0.3 percentage points of full-compute o3. On MATH-500, the gap is 1.1 points. On LiveCodeBench, which tests real-world coding tasks, the hybrid mode scores 2.4 points lower — still the best available model on that benchmark, but a meaningful regression for the hardest coding challenges.
What Developers Need to Change
The hybrid mode is opt-in via a new API parameter: reasoning_effort set to "adaptive". The existing "low", "medium", and "high" settings remain unchanged and will not be affected. Developers who want maximum performance on every call should stick with "high". Those who want cost savings on mixed workloads should test "adaptive" against their specific query distribution.
OpenAI also changed how streaming works in adaptive mode. Previously, thinking tokens were always streamed as a separate delta. In adaptive mode, short-reasoning calls suppress the thinking stream by default to reduce latency — the response arrives faster with no visible deliberation. This can be re-enabled via stream_thinking if your application surfaces the reasoning trace to end users.
Competitive Context
The timing of this change is not coincidental. Google's Gemini 2.5 Pro already uses adaptive compute under its "thinking budget" system, and Anthropic's Claude 3.7 Sonnet introduced extended thinking as an optional feature in early 2025. OpenAI is closing a gap, not creating one.
What differentiates OpenAI's implementation is the automatic classification layer. With Gemini and Claude, developers manually set the thinking budget or toggle extended thinking on/off. With o3 and o4-mini in adaptive mode, the model makes that decision at runtime. This reduces the engineering burden on developers who don't want to tune prompts for different query types — the system self-calibrates.
Anthropic has not commented publicly on whether Claude 4 will incorporate similar automatic scaling. Based on published research, Anthropic's preference has been toward giving users explicit control rather than automatic inference-time optimization.
Enterprise Implications
For large enterprise customers who process millions of API calls daily, the cost reduction is substantial. A company running 10 million o3 calls per day at current output rates would save roughly $60,000 per day at the new pricing — assuming a 40% reduction across the board. In practice, the savings will vary based on query complexity distribution, but even a 20% effective reduction has a material impact on AI infrastructure budgets.
OpenAI's enterprise tier customers on annual contracts will have their pricing adjusted at renewal. Month-to-month API customers see the new rates immediately starting June 1, 2026.
Actionable Takeaways
- Test the "adaptive" reasoning_effort parameter in a staging environment before enabling in production — benchmark it against your specific query mix, not just published benchmarks.
- If your application serves mixed workloads (some trivial, some complex), adaptive mode will likely deliver 25-40% cost savings with minimal quality impact.
- For high-stakes coding or math workloads, keep reasoning_effort at "high" — the 2.4-point LiveCodeBench regression is real and matters for production code generation.
- Update your token accounting: adaptive mode changes the distribution of input vs. thinking tokens, which affects how you budget context windows for long conversations.
- Review streaming logic: if you surface thinking traces to end users, explicitly enable stream_thinking or those traces will be suppressed by default in adaptive mode.