GPT-5.4 Beat Humans at Using Computers. OpenAI's Pricing Math Should Terrify Anthropic.
GPT-5.4 Computer Use: The Numbers That Changed Everything
OpenAI's GPT-5.4 just scored 75.0% on OSWorld-Verified — a benchmark measuring how well AI models navigate real desktop environments. Human performance on the same test? 72.4%.
Read that again. A language model is now better than the average human at clicking through software, filling out forms, editing spreadsheets, and navigating operating systems. And it costs $2.50 per million input tokens, half the API input rate of Claude Opus 4.6, the model many Cursor users pay for today.
Released on March 5, 2026, GPT-5.4 isn't just an incremental upgrade. It's the first general-purpose AI model that consolidates coding, reasoning, and autonomous computer operation into a single system. OpenAI merged the programming strengths of GPT-5.3-Codex, enhanced reasoning capabilities, and native desktop interaction — then priced it to undercut every competitor.
The Benchmark Blitz: Where GPT-5.4 Stands
Here's what jumped out from the benchmark results:
OSWorld-Verified (computer use): 75.0% — up from GPT-5.2's 47.3%. That's a 59% improvement in two model generations. More critically, it surpasses the human baseline of 72.4%. The model can operate applications through both Playwright code execution and direct mouse/keyboard commands from screenshots.
GDPval (professional knowledge work): 83.0% across 44 professional occupations spanning the top 9 U.S. GDP industries. Claude Opus 4.6 scored 78.0% on the same benchmark. GPT-5.4 matches or exceeds industry professionals in 83% of comparisons — up from 70.9% with GPT-5.2.
GPQA Diamond (scientific reasoning): 92.8%. This is a graduate-level science benchmark where PhD holders frequently struggle.
SWE-Bench Pro (coding): 57.7% — the test that simulates real-world software engineering tasks from actual GitHub issues.
FrontierMath Tier 4: 27.1% (Pro variant hits 38.0%). This is the hardest mathematical reasoning tier — problems designed to stump research mathematicians.
MMMU-Pro (visual reasoning): 81.2%. The model can interpret charts, diagrams, and UI screenshots with near-expert accuracy.
The Pricing War OpenAI Just Escalated
GPT-5.4's API pricing tells the real story: $2.50 per million input tokens, $15 per million output tokens.
For context, Claude Opus 4.6, Anthropic's flagship, costs $5/$25 per million input/output tokens. That makes GPT-5.4 half the price for inputs and 40% cheaper for outputs, while outscoring Opus 4.6 on GDPval by 5 percentage points.
If you're an enterprise team evaluating AI providers right now, this pricing gap is difficult to ignore. OpenAI isn't just competing on capability — they're attacking the cost structure that keeps large-scale AI deployment expensive.
The token efficiency story gets even more interesting. GPT-5.4 introduced a feature called "tool search" that lets the model receive a lightweight tool list and look up full definitions on demand instead of loading every tool schema upfront. The result? 47% reduction in token usage for tool-heavy workflows. You're paying less per token AND burning fewer tokens per task.
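The host-side pattern behind deferred tool loading is easy to sketch. The snippet below is a minimal illustration with hypothetical tool names and a crude bytes-of-JSON proxy for token cost; it is not OpenAI's published API surface, just the shape of the idea: ship one-line stubs up front, resolve a full schema only when the model selects that tool.

```python
# Sketch of the "tool search" pattern: lightweight stubs up front,
# full JSON-Schema definitions resolved on demand.
# Tool names and schemas here are illustrative, not a real API.

import json

FULL_SCHEMAS = {
    "get_weather": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    "send_email": {
        "name": "send_email",
        "description": "Send an email to a recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
}

def lightweight_list():
    """One-line stubs the model sees in every request."""
    return [{"name": n, "description": s["description"]}
            for n, s in FULL_SCHEMAS.items()]

def lookup_tool(name):
    """Full definition, fetched only when the model selects the tool."""
    return FULL_SCHEMAS[name]

# Crude proxy for prompt cost: bytes of JSON actually sent to the model.
upfront = len(json.dumps(list(FULL_SCHEMAS.values())))
lazy = len(json.dumps(lightweight_list())) + len(json.dumps(lookup_tool("get_weather")))
print(lazy < upfront)  # the lazy path ships fewer schema bytes
```

With two tools the savings are modest; with dozens of tool schemas per request, the gap is where a figure like 47% becomes plausible.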
Computer Use: What It Actually Looks Like
When OpenAI says "computer use," they mean something specific. GPT-5.4 can:
- Navigate web browsers — clicking links, filling forms, placing orders
- Edit spreadsheets, documents, and presentations inside real applications
- Write and execute code in development environments
- Control mouse and keyboard through visual understanding of screenshots
- Handle multi-step workflows across multiple applications
This isn't theoretical. OpenAI demonstrated GPT-5.4 autonomously completing complex tasks like building financial models in Excel, navigating enterprise software dashboards, and executing multi-step research workflows across browser tabs.
The model's computer use operates through two modes: Playwright-based code execution for structured web automation, and direct screenshot interpretation with mouse/keyboard commands for everything else. The dual approach means it can handle both predictable web workflows and messy, ad-hoc desktop tasks.
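The screenshot mode boils down to a loop: the host sends a screenshot, the model replies with an action, the host executes it and repeats. Here is a minimal sketch of that loop with a stubbed model transcript; the JSON action vocabulary is an assumption for illustration, not OpenAI's published schema.

```python
# Sketch of the screenshot-driven action loop: host sends a screenshot,
# model replies with a JSON action, host executes it until "done".
# The action format below is illustrative, not a real GPT-5.4 schema.

import json
from dataclasses import dataclass

@dataclass
class Action:
    type: str           # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def parse_action(raw: str) -> Action:
    """Turn the model's JSON reply into a typed action."""
    d = json.loads(raw)
    return Action(type=d["type"], x=d.get("x", 0), y=d.get("y", 0),
                  text=d.get("text", ""))

def run_episode(model_replies, execute):
    """Execute each model-issued action until the model declares done."""
    for raw in model_replies:
        action = parse_action(raw)
        if action.type == "done":
            return "completed"
        execute(action)  # e.g. dispatch to an OS automation layer
    return "exhausted"

# Stubbed transcript: click a button, type into a field, finish.
log = []
status = run_episode(
    ['{"type": "click", "x": 120, "y": 80}',
     '{"type": "type", "text": "Q3 forecast"}',
     '{"type": "done"}'],
    execute=lambda a: log.append(a.type),
)
print(status, log)  # completed ['click', 'type']
```

In the Playwright mode, the `execute` step is replaced by model-generated browser automation code, which is why structured web tasks tend to be more reliable than free-form desktop clicking.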
The Enterprise Angle: Spreadsheets, Finance, and Moody's
OpenAI launched ChatGPT for Excel and Google Sheets in beta alongside GPT-5.4. This isn't a plugin — it's a native integration that lets the model build, analyze, and update complex financial models directly within spreadsheet applications.
New ChatGPT app integrations with FactSet, MSCI, Third Bridge, and Moody's give enterprise teams the ability to consolidate market data, company intelligence, and internal analytics into unified AI-powered workflows. This directly challenges Anthropic, which introduced similar enterprise financial products back in July 2025.
The message from OpenAI is clear: GPT-5.4 isn't an AI you talk to — it's an AI that works for you. It opens your spreadsheet, pulls the data, runs the analysis, and formats the report. You review, approve, and move on.
1 Million Tokens: Context That Actually Matters
GPT-5.4 expands the API context window to 1 million tokens — nearly a 4x increase from GPT-5.3's 272,000. That's roughly 750,000 words of context, or about 15 full novels.
For developers using Cursor or Context7, this means entire codebases can fit inside a single conversation. You can load your full project — every file, every test, every config — and ask GPT-5.4 to refactor, debug, or extend it with complete context awareness.
The practical impact? Fewer hallucinations from context truncation, better architectural decisions based on full-project understanding, and the ability to maintain coherent multi-hour coding sessions without losing thread.
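How a codebase actually ends up in one prompt is unglamorous: concatenate every file with a path header and check the result against the context budget. The sketch below demonstrates the idea on a throwaway two-file repo; the packing layout and the roughly-4-characters-per-token estimate are assumptions, not Cursor's actual format.

```python
# Sketch of packing a repo into a single prompt for a 1M-token window.
# Layout and the ~4 chars/token heuristic are illustrative assumptions.

import os, tempfile

def pack_repo(root: str, exts=(".py", ".md", ".toml")) -> str:
    """Concatenate matching files under root into one prompt string."""
    parts = []
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as fh:
                    parts.append(f"### FILE: {os.path.relpath(path, root)}\n{fh.read()}")
    return "\n\n".join(parts)

def estimated_tokens(prompt: str) -> int:
    """Crude estimate: ~4 characters per token for English text and code."""
    return len(prompt) // 4

# Demo on a throwaway two-file "repo".
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "main.py"), "w") as fh:
        fh.write("print('hello')\n")
    with open(os.path.join(repo, "README.md"), "w") as fh:
        fh.write("# Demo\n")
    prompt = pack_repo(repo)
    fits = estimated_tokens(prompt) < 1_000_000
    print(fits)  # True: a tiny repo fits with room to spare
```

At ~4 characters per token, a 1M-token window corresponds to roughly 4 MB of source text, which covers most mid-sized repositories outright.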
The Safety Tradeoff Nobody's Talking About
OpenAI classified GPT-5.4 as having "high cyber capability" — their internal designation requiring enhanced monitoring and restricted access protocols. This is the first time a general-purpose model received this classification.
The concern is real: a model that can autonomously navigate computers, execute code, and handle multi-step tasks can also be misused for automated cyberattacks, social engineering, or unauthorized system access. OpenAI introduced "chain-of-thought controllability" testing to detect whether models conceal their reasoning processes — essentially checking if the AI is hiding its intentions.
Interruptible reasoning — the ability to pause and redirect the model mid-task without restarting — is partly a usability feature, partly a safety mechanism. If the model starts doing something unexpected during autonomous computer use, you can stop it immediately.
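From the host's side, interruptibility reduces to checking a cancel flag between autonomous steps so a run stops at the next step boundary rather than restarting from scratch. The sketch below models that behavior with a `threading.Event`; it illustrates the mechanism described above, not OpenAI's internal implementation.

```python
# Sketch of host-side interruptible execution: check a cancel flag
# between agent steps so a run can be stopped mid-task.
# This models the described behavior, not OpenAI's internals.

import threading

def run_agent(steps, cancel: threading.Event):
    """Execute steps in order until done or until cancel is set."""
    completed = []
    for step in steps:
        if cancel.is_set():
            return completed, "interrupted"
        completed.append(step())
    return completed, "finished"

cancel = threading.Event()
steps = [
    lambda: "open spreadsheet",
    lambda: (cancel.set() or "pull data"),  # operator hits stop mid-run
    lambda: "run analysis",                 # never reached
]
done, status = run_agent(steps, cancel)
print(done, status)  # stops before "run analysis"
```

The important property is that `completed` survives the interruption, so a redirected run can resume from the last finished step instead of step zero.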
What This Means for the AI Landscape
GPT-5.4 consolidates three separate capabilities — coding, reasoning, and computer use — that previously required different models or awkward multi-step pipelines. Anthropic's Claude can code. Google's Gemini CLI can reason. But neither offers native computer use at this performance level.
The pricing pressure is equally significant. At half the cost of Claude Opus 4.6 for inputs, OpenAI is betting that enterprises will choose the cheaper, more capable option — even if they've already invested in Anthropic's ecosystem. The integration with FactSet, Moody's, and Excel signals that OpenAI is going directly after the enterprise finance market that Anthropic has been cultivating.
For developers building AI agents with tools like browser-use or CrewAI, GPT-5.4's native computer use could eliminate the need for custom browser automation frameworks entirely. Why build a complex agent pipeline when the model itself can navigate software?
The Catch
GPT-5.4 isn't perfect. The 75% computer use accuracy means 1 in 4 desktop tasks still fails — acceptable for assisted workflows, risky for fully autonomous operations. The "high cyber capability" classification may lead to access restrictions that limit its utility for some use cases.
And the $15 per million output tokens adds up fast for agentic workflows that generate verbose chain-of-thought reasoning. A complex 30-minute autonomous task could easily burn through $5-10 in API costs. The token efficiency improvements help, but they don't eliminate the fundamental cost of extended autonomous operation.
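The $5-10 figure is easy to sanity-check from the published rates. The back-of-envelope calculation below uses an assumed output rate of ~200 tokens per second of streamed reasoning and an assumed 500K tokens of accumulated input (screenshots, context); only the $2.50/$15 prices come from the article.

```python
# Back-of-envelope cost check for a 30-minute autonomous run.
# Prices are from the article; the token volumes are assumptions.

INPUT_RATE = 2.50 / 1_000_000    # $ per input token
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Total API cost in dollars for one run."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Assume ~200 output tokens/second of chain-of-thought for 30 minutes,
# plus ~500K input tokens of accumulated screenshots and context.
output_tokens = 200 * 60 * 30        # 360,000 tokens
cost = run_cost(500_000, output_tokens)
print(round(cost, 2))  # about $6.65, inside the $5-10 range
```

Output tokens dominate: at these rates the reasoning stream costs more than four times the input side, which is exactly why verbose agentic workflows get expensive.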
Claude Opus 4.6 still leads on some benchmarks, particularly in creative writing and nuanced instruction following. And Google's Gemini 3.1 Pro remains competitive on multimodal tasks at an even lower price point. The AI model market isn't a winner-take-all game — yet.
Key Takeaways
- GPT-5.4 scores 75.0% on OSWorld-Verified computer use, surpassing the human baseline of 72.4% — the first general-purpose model to do so.
- API pricing of $2.50/$15 per million tokens undercuts Claude Opus 4.6 ($5/$25) by 50% on inputs and 40% on outputs.
- The 1 million token context window is nearly 4x larger than GPT-5.3's 272K, enabling full-codebase analysis in a single conversation.
- Tool search reduces token usage by 47% for complex multi-tool workflows — you're paying less per token AND using fewer tokens.
- GDPval benchmark: 83% across 44 professional occupations vs. Opus 4.6's 78%, matching or exceeding human professionals in 83% of comparisons.
- OpenAI classified GPT-5.4 as "high cyber capability" — the first general-purpose model to receive enhanced monitoring requirements.
- Enterprise integrations with FactSet, MSCI, Moody's, and native Excel/Sheets support signal OpenAI is targeting Anthropic's enterprise finance market directly.
Skila AI Editorial Team
The Skila AI editorial team researches and writes original content covering AI tools, model releases, open-source developments, and industry analysis. Our goal is to cut through the noise and give developers, product teams, and AI enthusiasts accurate, timely, and actionable information about the fast-moving AI ecosystem.