GPT-5.5: OpenAI's Unified Frontier Model for Agents, Code, and Long-Horizon Work
Just seven weeks after shipping GPT-5.4, OpenAI has released GPT-5.5 — a release the company is positioning as the first truly unified frontier model in the GPT-5 line. Instead of juggling Instant, Thinking, Pro, Codex, and Mini variants, GPT-5.5 collapses the lineup into a single routed family with one API surface and one pricing tier per plan. For developers who have been tracking the GPT-5.x cadence, this is the most consequential change of the cycle so far, and arguably the first release where “just use the default model” is the right answer for most production workloads.
Announced on April 24, 2026, GPT-5.5 is rolling out to ChatGPT, the API, and Codex simultaneously. Here’s what’s new, what’s actually different from GPT-5.4, and where it lands against Claude Opus 4.7 and Gemini 3.1 Pro.
From Variants to a Routed Family
The GPT-5.x line has been accumulating variants at a steady clip: Instant, Thinking, Pro, Codex, Codex-Mini, Mini, Nano. GPT-5.5 retires most of that taxonomy.
There are now two public names:
gpt-5.5— the unified model. Handles chat, reasoning, coding, and computer use. Internally routes between fast and deep paths based on task signal.gpt-5.5-pro— a higher-budget tier for Pro and Enterprise plans, with larger reasoning budgets, longer tool-use horizons, and priority on agentic tasks.
Under the hood, gpt-5.5 is a single model with a new adaptive reasoning scheduler that OpenAI describes as “sub-linear” — reasoning cost grows slower than task complexity for a large class of prompts. In practice, this means the model no longer burns thinking tokens on lookup-style questions, but ramps aggressively for multi-file refactors, proofs, or long-running agent loops.
For developers, the routing happens server-side. You don’t pick -thinking vs -chat-latest anymore — you send a message and the model decides how hard to work on it. A new optional reasoning_effort parameter (minimal, standard, high, max) lets you pin the budget when you need determinism, but OpenAI’s guidance is to leave it unset for most traffic.
2M Token Context, and It’s Actually Usable
GPT-5.4 brought OpenAI’s context window to 1M tokens, matching Gemini. GPT-5.5 doubles it to 2M tokens on the API — and, more importantly, ships the first version where long-context quality seems to hold up across the window.
OpenAI’s reported numbers on internal retrieval benchmarks:
- Needle-in-a-haystack at 1.5M tokens: 99.2% (GPT-5.4 was 94.1%)
- Multi-hop retrieval at 1M tokens: 87.6% (GPT-5.4 was 71.8%)
- Full-codebase reasoning at 750K tokens: 82.4% task accuracy (GPT-5.4 was 64.0%)
The long-context story on frontier models has historically been “big number on the box, sharp quality cliff after 200K.” GPT-5.5 is the first OpenAI model where the quality curve is comparatively flat across the full window. For codebase-scale refactors and multi-document legal or financial work, this is the headline feature.
Built-In Background Agents
GPT-5.5 ships with first-class background agents — long-running sessions that persist outside of a single chat turn. This replaces the ad-hoc “assistants + runs + threads” pattern from the API with a simpler primitive:
from openai import OpenAI
client = OpenAI()
agent = client.agents.create(
model="gpt-5.5",
instructions="Triage incoming GitHub issues and draft responses.",
tools=[{"type": "computer_use"}, {"type": "code_interpreter"}],
schedule="every 15m",
)
Agents can be suspended, resumed, and inspected. Each agent has its own context, memory store, and tool registry, and OpenAI exposes a streaming events API so you can pipe an agent’s activity into your own dashboards. This is a direct response to the way developers have been using Codex, Claude Code, and custom harnesses — a recognition that “agents that run while you sleep” is the dominant production pattern now, not interactive chat.
The pricing model is metered per active reasoning token, not wall-clock time. A suspended agent costs nothing.
Coding: Codex, Absorbed
GPT-5.4 was the first mainline model to absorb Codex’s coding capability. GPT-5.5 goes further and retires the Codex name entirely from the API. gpt-5.5 is the coding model.
The headline numbers:
- SWE-bench Verified: 78.4% (GPT-5.4 was 71.2%; Claude Opus 4.7 is at 80.1%)
- SWE-bench Multimodal: 64.7% (GPT-5.4 was 52.3%)
- LiveCodeBench: 92.1% on competition-style problems
- Aider Polyglot: 88.6% across multi-file edits in 10 languages
OpenAI still trails Anthropic on raw SWE-bench Verified, but the gap is the tightest it’s been in a year, and GPT-5.5’s multimodal coding score is now state of the art — that matters for frontend work where the model needs to look at a Figma export or a screenshot of a failing UI.
Codex-the-product (the agentic coding tool inside ChatGPT) upgrades to GPT-5.5 automatically. OpenAI is positioning the standalone codex CLI as the on-ramp for developers who want to run the same model locally against their own repos.
Native Computer Use, Now Useful for Long Sessions
GPT-5.4 introduced native computer use. GPT-5.5 focuses on making it reliable for long-horizon tasks — the 30-minute, 50-step workflows that competitors have been edging toward.
Reported improvements:
- OSWorld-Verified: 72.1% (GPT-5.4 was 58.0%)
- WebArena Long: 61.4% on sessions longer than 20 steps
- Session recovery: the model can now resume a computer-use session after a tool timeout or page navigation failure without losing state
OpenAI also shipped a sandbox provider API so enterprises can point computer-use sessions at their own VMs rather than OpenAI’s hosted environment. That unblocks the obvious compliance question that stalled a lot of computer-use pilots.
Accuracy and Hallucination
One of the more quietly important changes: GPT-5.5 claims the largest single-generation drop in hallucination rate of the GPT-5.x cycle.
- 51% fewer hallucinated claims versus GPT-5.4 on OpenAI’s internal factuality suite
- 38% fewer hallucinated tool calls in agentic settings (invented function names, wrong argument shapes)
- 27% improvement on citation accuracy when the model is given retrieval tools
The tool-call number is the one to watch. A large share of “the agent broke” incidents in production come from the model fabricating a tool name or passing malformed arguments. If OpenAI’s numbers hold up, that’s a meaningful reduction in the class of bugs that currently require a retry-and-validate harness around every agent.
Pricing
GPT-5.5 pricing is structured around the unified model:
| Tier | Input (per 1M) | Output (per 1M) | Cached input |
|---|---|---|---|
gpt-5.5 | $2.25 | $9.00 | $0.45 |
gpt-5.5-pro | $8.00 | $40.00 | $1.60 |
Input token pricing actually drops slightly from GPT-5.4’s $2.50. Output is roughly flat. Given that GPT-5.5 uses fewer tokens per task thanks to the sub-linear scheduler, OpenAI claims most existing workloads will see a 15–25% cost reduction on a per-task basis compared to GPT-5.4.
The cached input tier is aggressive — 80% off on cache hits — and is clearly aimed at agent workloads where the system prompt and tool definitions stay stable across thousands of turns.
Availability
| Plan | Access |
|---|---|
| ChatGPT Free | gpt-5.5 with daily message limits |
| ChatGPT Plus | gpt-5.5 unlimited, gpt-5.5-pro metered |
| ChatGPT Team | gpt-5.5 + gpt-5.5-pro |
| ChatGPT Pro | Full access, higher rate limits |
| Enterprise | Full access + sandbox providers + audit logs |
| API | gpt-5.5, gpt-5.5-pro (up to 2M context) |
GPT-5.4 Thinking and GPT-5.4 Pro remain available in the Legacy Models picker for three months, with retirement scheduled for July 24, 2026. GPT-5.2 is retired as previously announced on June 5, 2026.
How It Stacks Up
The frontier is tight again. A rough snapshot as of late April 2026:
- Coding (SWE-bench Verified): Claude Opus 4.7 leads at ~80%, GPT-5.5 at 78%, Gemini 3.1 Pro at ~74%.
- Long context quality: GPT-5.5 and Gemini 3.1 Pro are roughly tied at the 1M+ range. Claude Opus 4.7 tops out at 500K but stays very sharp across the full window.
- Computer use: GPT-5.5 and Claude Opus 4.7 are now essentially peers; a year ago Anthropic was alone in this space.
- Agentic reliability: Claude’s harness ecosystem (Claude Code, Cowork) is more mature. OpenAI’s new built-in agents primitive is a direct play to close that gap.
- Price-per-capability: GPT-5.5 is the most aggressive pricing of the three at the flagship tier.
The honest read: no frontier model is strictly dominant right now. Choice of model has become a function of which ecosystem you’re already in and which specific task shape matters most.
What This Means for Developers
A few practical notes if you’re planning to migrate:
Simplify your model selection logic. If you have routing code that picks between gpt-5.4, gpt-5.4-thinking, and gpt-5.3-codex, you can probably delete it. The unified model plus optional reasoning_effort replaces almost all of it.
Revisit your retrieval pipeline. With 2M tokens of usable context, a chunk of retrieval-augmented-generation work that existed to paper over context limits may no longer earn its complexity. Benchmark feeding a full repo or full contract into context against your current vector-store pipeline before assuming RAG is still the right answer.
Budget for cached input. If you’re running an agent, structure your prompts so the system prompt and tool definitions are stable and cache-able. The 5x price gap between cached and uncached input is large enough to reshape architecture decisions.
Treat gpt-5.5 agents as a real primitive, not a demo. The background agents API is production-grade on day one according to OpenAI, and the pricing favors long-running, suspended workflows. This is a category the Assistants API never quite delivered on.
Don’t migrate agentic code blindly. Tool-call hallucination is down, but it’s not zero. Keep your validation harnesses. The models get better; production hygiene does not get cheaper.
The Bottom Line
GPT-5.5 is the release that makes the GPT-5.x line feel coherent. The variant sprawl is gone. The context window is genuinely usable across its full range. Agents are a first-class primitive. Coding and computer use are competitive with the best models on the market. Pricing moved in developers’ favor.
It’s not a GPT-4-to-GPT-5 step change — we are clearly in the consolidation phase of this generation, not the breakthrough phase. But for people shipping products, consolidation is exactly what’s needed right now. Fewer knobs, better defaults, longer memory, cheaper tokens. That’s a good release.
The frontier race between OpenAI, Anthropic, and Google is not going to slow down — Claude Opus 4.8 and Gemini 3.2 are both rumored for Q2. But for the next couple of months, GPT-5.5 is the default I’d reach for on any new project that doesn’t already have a strong reason to be elsewhere.