OpenAI Just Dropped GPT-5.5. The Agentic Coding War Just Ended.
OpenAI shipped GPT-5.5 today. Terminal-Bench 2.0 score: 82.7%, a full 17 points above both GPT-5 and Claude Opus 4.6.
The agentic coding war is not heating up. It just ended.
Here is exactly what shipped, what the benchmarks actually mean, and what every team using Cursor, Claude Code, or Codex has to decide this week.
What Actually Shipped
Three things landed in the same launch:
- GPT-5.5 — the new flagship. Available today in ChatGPT (Plus, Team, Enterprise) and via API.
- GPT-5.5 Pro — a new top-tier model with extended reasoning, 200-step agent loops, and full computer-use access. Available in ChatGPT Pro ($200/mo) and as a separate API SKU.
- Codex update — OpenAI's coding agent gets native multi-file refactor, persistent project memory across sessions, and direct access to GPT-5.5 Pro on the Codex CLI.
The launch is the most aggressive product update OpenAI has shipped since GPT-4o. It is also the most directly targeted at Anthropic's coding-agent franchise.
The Benchmarks Are the Story
Five numbers that matter:
- Terminal-Bench 2.0: 82.7% — up from 65.4% on GPT-5. This is the benchmark that measures real shell-driven multi-step tasks. It is the closest proxy we have to 'is this model a useful agent?' GPT-5.5 just took the lead by a margin that is not noise.
- SWE-bench Verified: 81.9% — narrowly ahead of Claude Opus 4.6 (80.8%) and DeepSeek V4-Pro (80.6%). The three frontier models are now within two points of each other on SWE-bench. The benchmark is saturating.
- LiveCodeBench: 91.4% — strong but slightly below DeepSeek V4-Pro (93.5%). DeepSeek still wins on pure algorithmic coding.
- Computer-use task completion: 73% on OSWorld — a 14-point lift over GPT-5. Computer-use is the second axis OpenAI is pushing this year.
- 200K context with full attention — needle-in-a-haystack accuracy stays above 92% across the entire 200K window. That is the first frontier model to hold accuracy at that depth without lost-in-the-middle degradation.
Read the Terminal-Bench number twice. A 17-point jump in six months is not an iterative improvement. It is the kind of step change that resets product roadmaps.
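On the long-context claim, a needle-in-a-haystack probe is easy to reproduce against your own workload before taking the 92% number on faith. A minimal sketch of the prompt construction (the model call itself is omitted; every name here is illustrative, not an OpenAI API):

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Bury `needle` at fractional `depth` (0.0 = start, 1.0 = end)
    inside repeated `filler` text, returning a prompt of exactly total_chars."""
    assert 0.0 <= depth <= 1.0
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(depth * (total_chars - len(needle)))
    return haystack[:pos] + needle + haystack[pos + len(needle):]

prompt = build_haystack(
    needle="The magic deploy token is 7X-42.",
    filler="Lorem ipsum dolor sit amet. ",
    total_chars=10_000,
    depth=0.5,  # mid-window, where lost-in-the-middle degradation bites hardest
)
# Send `prompt` plus "What is the magic deploy token?" to the model at several
# depths and context sizes, then score exact-match recall per (depth, size) cell.
```

Sweeping depth from 0.0 to 1.0 at your real context sizes gives you the same grid the headline claim is based on, measured on your own filler text.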
The Pro Tier Is the Real Story
GPT-5.5 Pro is where OpenAI made the hardest bet. It is a separate model with extended reasoning, 200-step agent loops, and computer-use access. ChatGPT Pro subscribers get unlimited access. API users pay a $0.50 surcharge per agent step on top of normal token costs.
That pricing structure is new. OpenAI is unbundling the agent loop from the model. You are no longer paying only for tokens; you are paying for steps on top of them. A 200-step debugging session at $0.50 per step is $100 of agent fees before the tokens are even counted.
That number sounds high until you compare it to an engineer's hourly rate. If GPT-5.5 Pro completes in 30 minutes a multi-file refactor that would take a human 3 hours, the $100 in step fees buys back roughly two and a half engineer-hours, and the math clears at any fully loaded rate above about $40/hour. Every YC-backed startup will trial it this week.
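As a sanity check on that math, a small calculator for the per-step pricing described above (rates are the ones quoted in this article; the token volumes are illustrative assumptions):

```python
def pro_session_cost(steps, input_tokens, output_tokens,
                     step_fee=0.50, input_rate=5.00, output_rate=30.00):
    """Estimate a GPT-5.5 Pro agent session's cost in dollars.

    step_fee is the per-agent-step surcharge; input_rate and output_rate
    are dollars per million tokens, as quoted in this article.
    """
    token_cost = (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate
    return steps * step_fee + token_cost

# A full 200-step debugging session with illustrative token volumes:
# 200 * $0.50 = $100 in step fees, plus $10 input + $15 output in tokens.
cost = pro_session_cost(steps=200, input_tokens=2_000_000, output_tokens=500_000)
print(f"${cost:.2f}")  # $125.00
```

The point the calculator makes concrete: on long sessions the step fees dominate the token bill, which is exactly the unbundling the pricing is designed around.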
Codex Becomes the Cursor Killer
The Codex update is the part of the launch that should worry Cursor and Windsurf the most. Three new features:
- Native multi-file refactor. Codex now does what has been Cursor's standout 'Composer' feature: propose a coordinated edit across an entire codebase, show a unified diff, and apply it atomically. OpenAI shipped it as a first-party feature, not an extension.
- Persistent project memory. Codex remembers your codebase conventions, recent decisions, and unresolved issues across sessions. No more re-explaining your architecture every morning.
- GPT-5.5 Pro on the CLI. The Pro model is available via `codex --model gpt-5.5-pro`. That is the model with 200-step agent loops and computer-use access. It runs your local shell, browser, and IDE.
For Cursor, this is an existential moment. Cursor's pitch has been 'Claude in a beautiful IDE.' Codex now does multi-file refactor better, with persistent memory, with a model that beats Claude on Terminal-Bench by 17 points, at OpenAI pricing. Cursor has to ship a counter — likely tighter integration with Anthropic's computer-use features — within weeks.
The Pricing Math
API pricing for GPT-5.5: $5 per million input tokens, $30 per million output. GPT-5.5 Pro adds a $0.50 per-agent-step surcharge.
Compare that to the field:
- Claude Opus 4.6: $15 input, $25 output
- GPT-5.5: $5 input, $30 output
- DeepSeek V4-Pro: $0.50 input, $3.48 output
OpenAI cut input pricing by 67% versus Claude Opus 4.6 but charges 20% more on output. That is not an accident. Output tokens are where the agent does its work — the loop, the tool calls, the code generation. OpenAI is signaling that their value is on the output side, where the new capabilities show up.
The bigger problem is DeepSeek. V4-Pro shipped three days ago at 14% of Claude's price and benchmarks within noise of GPT-5.5 on SWE-bench. OpenAI has to defend a 9x output-token premium on capability alone. The new Terminal-Bench result is exactly the capability story they need.
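To see how that premium plays out on a real traffic mix, here is the blended per-million-token cost under the rates quoted above. The 80/20 input/output split is an assumption for illustration, not a measured workload:

```python
RATES = {  # dollars per million tokens (input, output), as quoted in this article
    "claude-opus-4.6": (15.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
    "deepseek-v4-pro": (0.50, 3.48),
}

def blended_rate(model: str, input_share: float = 0.8) -> float:
    """Blended $/M tokens, assuming input_share of traffic is input tokens."""
    inp, out = RATES[model]
    return input_share * inp + (1 - input_share) * out

for model in RATES:
    print(f"{model:18s} ${blended_rate(model):6.2f}/M blended")
```

On an input-heavy mix like this, GPT-5.5's cheap input pulls its blended rate well under Claude's despite the higher output price, while DeepSeek stays roughly an order of magnitude below both. Shift the mix toward output-heavy agent loops and the gaps narrow in Claude's favor.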
What Anthropic Has to Do This Week
Anthropic now sits in the middle of a price-quality squeeze. DeepSeek undercuts on price by 86%. OpenAI undercuts on capability by 17 Terminal-Bench points. Claude Opus 4.6 needs a response.
Three plausible moves, in order of likelihood:
- Ship Claude Opus 4.7 ahead of schedule. Anthropic was reportedly targeting a Q3 launch. That timeline is now Q2.
- Cut output pricing on Opus 4.6. Dropping output from $25 to $15 would undercut GPT-5.5's $30 output rate outright and close most of the DeepSeek gap on the compliance-friendly tier.
- Push computer-use harder. Anthropic launched computer-use in October 2024 and OpenAI just caught up. The lead has shrunk to weeks.
Expect at least two of those three by mid-May.
What You Should Change This Week
Three concrete actions:
- Run GPT-5.5 against your hardest agentic tests. If you have an internal eval suite for coding agents, run it tonight. The Terminal-Bench number is real, but your workload is what matters. Most teams will see a clear lift on multi-step debugging and shell-driven tasks.
- Try the new Codex on a real refactor. Pick a refactor you have been postponing — the kind that touches 20 files and breaks tests in three places. Hand it to Codex with GPT-5.5 Pro. Watch what happens. The whole point of the launch is that this is now plausible.
- Re-architect your cost tier. The smart 2026 stack is DeepSeek V4-Flash on the high-volume tier, GPT-5.5 or Claude Opus 4.6 on the critical path, and GPT-5.5 Pro for genuinely hard agentic work where the per-step surcharge is justified by output value. Mono-vendor stacks are over.
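The routing policy in that last bullet can start as a lookup keyed on task tier. A minimal sketch, with model names and tiers taken from this article; the three-tier taxonomy itself is an assumption you would adapt to your own workload:

```python
from typing import Literal

TaskTier = Literal["bulk", "critical", "hard_agentic"]

# Tier policy from the article: a cheap open-weight model on the high-volume
# tier, a frontier model on the critical path, and Pro only where the
# per-step surcharge is justified by output value.
ROUTES: dict[str, str] = {
    "bulk": "deepseek-v4-flash",
    "critical": "gpt-5.5",        # or "claude-opus-4.6" for compliance-bound work
    "hard_agentic": "gpt-5.5-pro",
}

def route(tier: TaskTier) -> str:
    """Pick a model for a task tier; unknown tiers fail loudly."""
    try:
        return ROUTES[tier]
    except KeyError:
        raise ValueError(f"unknown task tier: {tier!r}")

print(route("bulk"))          # deepseek-v4-flash
print(route("hard_agentic"))  # gpt-5.5-pro
```

In production this lookup usually grows a classifier in front of it and per-tier budget caps behind it, but the shape stays the same: routing is a policy table, not a model property, which is what makes the multi-vendor stack maintainable.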
Verdict
GPT-5.5 is the most consequential coding-model release of 2026 so far. It rebases the agentic-coding ceiling, hands OpenAI the Terminal-Bench crown, and turns Codex into a credible Cursor competitor. The Pro tier's per-step pricing is a bet that will reshape how every AI coding tool sells.
The era of one model winning every benchmark is over. GPT-5.5 wins agentic coding. DeepSeek V4-Pro wins price-per-token. Claude wins compliance and long-horizon planning. Pick your stack accordingly — and update it again in 30 days, because the next move from Anthropic is coming fast.
Related Resources
- Tool: Google Gemini Enterprise Agents — the enterprise-side competitor that pairs against GPT-5.5 Pro on agent deployment.
- Article: DeepSeek V4-Pro launch — the open-weight model now squeezing OpenAI's pricing from below.
- Repo: Microsoft Magentic-One — the multi-agent orchestrator you can now drive with GPT-5.5 Pro.
- MCP server: Anthropic Claude Code MCP — embed Claude Code abilities inside any agent that uses GPT-5.5.
- Skill: Anthropic Data Analysis Skills — structured data-science skills any frontier model (including GPT-5.5) can consume.
Frequently Asked Questions
What is GPT-5.5?
GPT-5.5 is OpenAI's flagship language model launched April 27, 2026. It scores 82.7% on Terminal-Bench 2.0, ships in ChatGPT and the API, and adds a new GPT-5.5 Pro tier with 200-step agent loops and computer-use access. It is the first model to hold above 92% needle-in-a-haystack accuracy across a full 200K context window.
How does GPT-5.5 compare to Claude Opus 4.6?
GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs 65.4%) and edges Claude on SWE-bench Verified by about a point (81.9% vs 80.8%). Claude still leads on long-horizon planning and compliance-bound enterprise work. Pricing is roughly comparable on output ($30 vs $25 per million tokens) but GPT-5.5 cuts input cost by 67%.
How much does GPT-5.5 cost?
API pricing is $5 per million input tokens and $30 per million output tokens. GPT-5.5 Pro adds a $0.50 surcharge per agent step on top of token costs. ChatGPT Pro subscribers ($200/month) get unlimited GPT-5.5 Pro access. Plus and Team subscribers get GPT-5.5 with usage caps.
Is GPT-5.5 Pro worth the per-step surcharge?
For genuinely hard agentic work (multi-file refactors, long-running debugging sessions, computer-use tasks) yes. A 200-step session at $0.50 per step is $100 in agent fees, roughly the cost of an hour of fully loaded senior engineer time. For routine code generation and Q&A, the standard GPT-5.5 tier is enough.
What are the best alternatives to GPT-5.5 in 2026?
Claude Opus 4.6 for compliance-bound enterprise work and long-horizon planning. DeepSeek V4-Pro for price-sensitive agentic coding at $3.48 per million output tokens. Gemini 3.1 Pro for long-context multimodal work. The smart 2026 stack uses all three with a routing policy by task type.
Skila AI Editorial Team
The Skila AI editorial team researches and writes original content covering AI tools, model releases, open-source developments, and industry analysis. Our goal is to cut through the noise and give developers, product teams, and AI enthusiasts accurate, timely, and actionable information about the fast-moving AI ecosystem.