OpenAI Just Admitted It: AI Hallucinations Are Mathematically Impossible to Fix
OpenAI just proved its own product can't be fixed. Not in a press release. In a math paper.
On September 4, 2025, Adam Kalai, Ofir Nachum, and Edwin Zhang from OpenAI — plus Santosh Vempala from Georgia Tech — published arXiv:2509.04664, "Why Language Models Hallucinate". The paper does something the industry has spent three years dancing around. It proves that for any large language model trained on next-token prediction, hallucinations are not a bug. They are a mathematically unavoidable feature.
Eight months later, the implications are landing. On May 12, 2026, Anthropic shipped 12 new legal Claude plugins and 20 legal connectors — deposition prep, bar-exam coaching, case-law research, file drafting, plus integrations with DocuSign, Thomson Reuters, Harvey, and Everlaw. AI is now being pushed into the single highest-stakes hallucination zone in the economy. The same week, Damien Charlotin's public legal-hallucination database ticked past 120 documented court cases where AI tools fabricated quotes, made up case names, or invented citations that don't exist.
The myth this article busts is the most expensive one in AI: "hallucinations will be fixed in the next model." They won't. And the math says why.
The Myth Everyone Believes
Walk into any room where AI buyers gather. Boardrooms. Investor calls. Procurement meetings. The same sentence shows up: "GPT-5 still hallucinates a bit, but the next version will solve it."
That sentence funds budgets. It silences risk officers. It postpones every serious workflow audit by another quarter. And it is wrong in a way that can be proven on a whiteboard.
Here is the version of the myth the industry quietly sells you:
- Hallucinations are a temporary engineering flaw.
- Better data + bigger models + more RLHF = the rate drops to zero.
- One day soon, an AI will be "trustworthy enough" to ship without human review.
OpenAI itself just published the paper that demolishes all three claims.
What the OpenAI Paper Actually Proves
The Kalai & Vempala paper has two arguments. The first is statistical. The second is sociological. Both kill the myth.
1. Generating sentences is mathematically harder than answering yes/no
Imagine a model that's wrong on a simple binary question ("Did X happen, yes or no?") 5% of the time. The paper proves something brutal: when that same model has to generate a sentence containing the same fact, its error rate is at least double the binary rate. Errors compound across every prediction it has to chain together.
This is the "Is-It-Valid" reduction. The paper formally shows that generating a valid sentence cannot be easier than checking whether one is valid. So whatever error rate your best binary classifier achieves, your generator's error rate is provably at least twice as high. No architecture choice escapes this. It's a floor, not a ceiling.
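For readers who want the bound itself, here is a compressed form. This is a simplified paraphrase of the paper's main inequality; the full statement in arXiv:2509.04664 subtracts lower-order calibration terms that are omitted here.

```latex
% Simplified form of the generative-error bound in arXiv:2509.04664.
%   err_gen : fraction of generated statements that are invalid (hallucinations)
%   err_iiv : misclassification rate on the binary "Is-It-Valid" task
% The paper's full theorem subtracts lower-order calibration terms, omitted here.
\[
  \mathrm{err}_{\mathrm{gen}} \;\gtrsim\; 2 \cdot \mathrm{err}_{\mathrm{iiv}}
\]
% Example: err_iiv = 5% on "Did X happen, yes or no?" implies err_gen of roughly
% 10% or more once the model must produce a sentence asserting the same fact.
```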
2. 9 of 10 frontier benchmarks reward guessing over honesty
The paper's second argument is the one that should haunt every CTO. The authors surveyed 10 widely used benchmarks that frontier labs train against. Nine of the ten use binary grading. Right answer = 1 point. Wrong answer = 0 points. "I don't know" = 0 points.
Under that scoring, a model that guesses confidently always beats a model that hedges. So gradient descent does what gradient descent does: it learns to guess. Confidently. Even when it shouldn't. The training signal literally rewards making things up over saying "I'm not sure."
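The incentive is visible in a two-line expected-value calculation. The sketch below is illustrative rather than taken from the paper; the 30% hit rate assigned to blind guessing is an assumed number, picked only to make the comparison concrete.

```python
# Expected benchmark score under binary grading: 1 point for a correct answer,
# 0 for a wrong answer, and 0 for "I don't know".
p_correct_when_guessing = 0.30  # assumed hit rate for blind guessing; illustrative only

score_if_guess = p_correct_when_guessing * 1 + (1 - p_correct_when_guessing) * 0
score_if_abstain = 0.0  # abstaining is scored exactly like being wrong

print(f"expected score when guessing:   {score_if_guess:.2f}")   # 0.30
print(f"expected score when abstaining: {score_if_abstain:.2f}")  # 0.00

# Guessing strictly dominates honesty whenever the hit rate is above zero,
# so a model optimized against this scoreboard learns to guess confidently.
```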
That's why even GPT-5, Claude Opus 4.7, and Gemini 3 still hallucinate. Not because the engineering is sloppy. Because the scoreboard is broken.
The 2026 Hallucination Rates Nobody Wants to Print
So how bad is it in practice, eight months after the paper dropped? Worse than the marketing implies.
Two independent 2026 benchmark studies — Suprmind's hallucination tracker and Digital Applied's 5-model study — tested every frontier model on factual tasks. The results:
- Frontier hallucination range: 3.1% to 19.1% depending on task family.
- Citation accuracy is the worst-performing task — 12.4% hallucination rate even with extended thinking enabled.
- "Extended thinking" / reasoning modes help on math, but barely move citation accuracy. The reasoning loop just produces more confidently wrong citations.
- Smaller open-source models hallucinate 2-4x more than frontier ones — but the floor is still non-zero.
Translation: if you have an AI workflow that produces a 1,000-word deliverable with 8 factual claims, you should expect 0.25 to 1.5 hallucinations per output from the best models on the market. That number is not trending to zero. It hasn't moved meaningfully since GPT-4. The paper explains why it won't.
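Here is the back-of-envelope arithmetic behind that range, under the simplifying assumption that each factual claim independently carries the per-claim benchmark rate (real claims in one document are correlated, so treat this as a rough bound rather than a measurement):

```python
# Expected hallucinated claims in a deliverable with 8 factual claims,
# using the 2026 frontier per-claim hallucination range of 3.1% to 19.1%.
claims_per_deliverable = 8
low_rate, high_rate = 0.031, 0.191

expected_low = claims_per_deliverable * low_rate    # ~0.25 hallucinated claims
expected_high = claims_per_deliverable * high_rate  # ~1.53 hallucinated claims

# Chance that at least one claim in the deliverable is hallucinated,
# assuming independence between claims.
p_any_low = 1 - (1 - low_rate) ** claims_per_deliverable    # ~22%
p_any_high = 1 - (1 - high_rate) ** claims_per_deliverable  # ~82%

print(f"expected hallucinated claims per output: {expected_low:.2f} to {expected_high:.2f}")
print(f"chance of at least one hallucination: {p_any_low:.0%} to {p_any_high:.0%}")
```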
The May 12 News That Makes This Urgent
On May 12, 2026, Anthropic announced 12 new Claude plugins and 20 connectors for legal work. Bar-exam prep. Deposition drafting. Case-law research. Direct connectors to DocuSign, Thomson Reuters, Harvey, and Everlaw.
This is the most aggressive push of generative AI into a high-stakes citation-heavy domain anyone has ever shipped. And it lands in a world where Damien Charlotin's live tracker of fabricated AI citations in real court filings just passed 120 cases. Lawyers have been sanctioned. Briefs have been thrown out. Judges have started ordering AI-disclosure declarations.
Two things are now true at the same time:
- The math says hallucinations are permanent.
- The biggest AI labs are accelerating into the domains where hallucinations cause the most damage.
This is the gap. And it's not closing.
Why "Just Use RAG" Doesn't Save You
Every time someone shows a hallucination, an AI engineer says "that's why we use RAG" (retrieval-augmented generation — let the model cite a real document). It helps. It doesn't fix the underlying problem.
The Kalai/Vempala result still applies inside a RAG pipeline. The model still has to:
- Generate a query against the retrieval index (can hallucinate the query intent)
- Decide which retrieved chunk is relevant (binary classification — still error-prone)
- Synthesize an answer from the chunks (generation — still 2x the binary error)
- Cite the chunk correctly (separate generation task — also 2x)
That's why independent audits of RAG-based legal AI still find hallucination rates of 6-17% in the wild. RAG lowers the error rate; it doesn't remove the floor.
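A rough sketch of why the stages compound, using made-up per-stage error rates (every number below is an illustrative assumption, not an audit figure):

```python
# Each stage of a RAG pipeline has a nonzero error rate; an output is clean only
# if every stage goes right, so small per-stage errors compound.
stage_error_rates = {
    "query generation": 0.02,          # assumed
    "chunk relevance decision": 0.04,  # assumed
    "answer synthesis": 0.05,          # assumed
    "citation generation": 0.05,       # assumed
}

p_all_correct = 1.0
for stage, err in stage_error_rates.items():
    p_all_correct *= 1 - err

p_output_has_error = 1 - p_all_correct
print(f"chance the final output contains an error: {p_output_has_error:.1%}")
# ~15% with these illustrative numbers, inside the 6-17% range that audits of
# RAG-based legal AI report, even though every individual stage looks strong.
```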
What This Means for Your AI Strategy
If hallucinations are permanent, your AI roadmap can't be "wait for the model to get better." That roadmap is dead. Three new principles replace it.
Principle 1: Calibrate uncertainty, don't suppress it
Models trained on binary benchmarks suppress uncertainty signals because the scoring punishes them. You can push back against this in your prompts. Force the model to output a confidence score with every claim. Reject any output below your threshold. Yes, you'll get fewer answers. The remaining answers will be 5-10x more trustworthy. IMD's 2026 strategy brief calls this the only viable path forward.
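A minimal sketch of the gating step, assuming you can get the model to return a per-claim confidence field in structured JSON. The schema and the 0.8 cutoff are assumptions to illustrate the pattern, not a published standard.

```python
import json

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune against your own error tolerance

def accept_or_reject(model_output: str) -> dict | None:
    """Parse a response shaped like
    {"claims": [{"text": "...", "confidence": 0.93}, ...]}
    and reject the whole output if any claim falls below the threshold."""
    data = json.loads(model_output)
    claims = data.get("claims", [])
    if not claims:
        return None  # nothing verifiable: route to a human by default
    if any(claim["confidence"] < CONFIDENCE_THRESHOLD for claim in claims):
        return None  # at least one shaky claim: reject instead of shipping
    return data

# This output is rejected because of the 0.55-confidence claim.
sample = '{"claims": [{"text": "Case X was decided in 2019", "confidence": 0.55}]}'
print(accept_or_reject(sample))  # None
```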
Principle 2: Make verification cheaper than generation
If your workflow generates 100 AI outputs an hour but a human can only verify 10, the other 90 are unverified slop entering production. Invert it. Let AI help with both generation and verification, but cap generation throughput at human review capacity. You'll ship slower. Nothing you ship will embarrass you in court.
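One way to enforce the cap is a review queue with a hard ceiling: generation refuses to run once the unreviewed backlog reaches what your reviewers can clear. The sketch below assumes a single-process workflow, and the capacity figure is a placeholder you would replace with your team's measured throughput.

```python
from collections import deque

HUMAN_REVIEW_CAPACITY = 10  # placeholder: measure your reviewers' real hourly throughput

review_queue: deque[str] = deque()

def try_generate(make_draft) -> bool:
    """Generate a new AI draft only if the unreviewed backlog is below human
    review capacity; otherwise refuse rather than pile up unverified output."""
    if len(review_queue) >= HUMAN_REVIEW_CAPACITY:
        return False
    review_queue.append(make_draft())
    return True

def review_next() -> str | None:
    """A human reviewer pulls the oldest draft; only reviewed drafts ever ship."""
    return review_queue.popleft() if review_queue else None

# Generation throttles itself to review speed: 25 attempts, only 10 queued.
for i in range(25):
    try_generate(lambda i=i: f"draft {i}")
print(len(review_queue))  # 10
```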
Principle 3: Buy tools that assume the math, not tools that deny it
This is the buying-decision layer. Vendors who pitch "99% accuracy" and "hallucination-free" are pitching against a published math proof. They will lose. Buy from vendors who tell you their hallucination rate, show you their human-in-the-loop workflow, and ship audit logs by default. Legora, for example, doesn't pretend its AI never makes things up — it ships citation verification as a core feature because it has to. That's the new vendor profile to look for.
The Bigger Lesson: Stop Outsourcing Math to Marketing
Three years of AI hype trained the market to believe every limitation is a roadmap item. "Context window too small? Wait six months." "Cost too high? Wait six months." "Hallucinates? Wait six months."
For context and cost, waiting worked: context windows grew and prices dropped by orders of magnitude. For hallucinations, the math is structurally different. There is no Moore's-law curve here. There is a proof.
The companies that keep pretending will burn money, shipping unreliable agents into production and paying the cleanup bill. The companies that internalize the OpenAI paper will quietly build workflows where AI does 80% of the work and humans verify the last 20%, and those companies will dominate every regulated industry over the next five years.
The myth was: "AI hallucinations will be fixed soon."
The truth is: AI hallucinations are a permanent feature of next-token prediction. Design for them. Buy around them. Ship anyway.
Related Reading on Skila AI
- Legora — legal AI platform designed around hallucination verification
- Fastest AI image generators ranked — the same generative-AI surface, ranked on a different axis
- Claude Design — the productivity surface paired with Anthropic's legal plugin wave
- Matt Pocock's Claude skills — how senior engineers structure prompts to minimize hallucination risk
- Blender MCP — another piece of the Claude expansion wave the legal plugins are part of
Frequently Asked Questions
Are AI hallucinations getting better in 2026?
Marginally. Frontier hallucination rates still range from 3.1% to 19.1% depending on task family, and citation accuracy — the most consequential category — sits at 12.4% even with extended thinking enabled. The rate has barely moved since GPT-4. The OpenAI paper explains why: the training objective rewards guessing over honesty, so the floor is structural, not engineering-driven.
Why did OpenAI say hallucinations are mathematically inevitable?
The September 2025 paper by Kalai, Nachum, Zhang, and Vempala proves that generating a valid sentence can never be easier than checking whether one is valid. So whatever a model's error rate is on the binary yes/no version of a task, its hallucination rate on sentence generation is bounded below by roughly twice that figure. Errors compound across predictions. No architecture or training trick removes the floor.
What is the hallucination rate of GPT-5 and Claude Opus 4.7 in 2026?
Independent 2026 benchmarks place frontier models in the 3.1-19.1% range, with citation tasks at the bottom (12.4%). GPT-5 and Claude Opus 4.7 perform similarly — both improved on math and code reasoning, neither meaningfully improved on citation accuracy. Suprmind and Digital Applied published the most recent comparative studies in early 2026.
How many real court cases have involved AI-hallucinated citations?
Damien Charlotin's public legal-hallucination database logs 120+ court cases as of May 2026 where AI tools produced fabricated quotes, invented case names, or made up legal citations. Lawyers have been sanctioned, briefs thrown out, and judges in multiple jurisdictions now require AI-disclosure declarations. The database is updated weekly.
Will a future GPT or Claude model eventually stop hallucinating?
No, not while next-token prediction and binary-graded benchmarks remain the training paradigm. The OpenAI paper formally proves this. A different training paradigm — one that rewards calibrated uncertainty instead of confident guessing — could lower the rate, but the labs that would have to adopt it are the same labs whose products lose on benchmark leaderboards if they say "I don't know." The incentive structure blocks the fix.
What is the best way to use AI when hallucinations can't be eliminated?
Three rules. First, force the model to output a confidence score with every claim and reject low-confidence outputs. Second, cap generation throughput at human-verification capacity — never ship more than a human can review. Third, prefer vendors who publish hallucination rates and ship citation-verification as a core feature instead of marketing "99% accuracy." Design around the math, don't fight it.
Key Takeaways
- ✓ OpenAI's Sept 4, 2025 paper (Kalai, Nachum, Zhang, Vempala) proves hallucinations are mathematically inevitable — total sentence error is at least 2x the simple yes/no error rate.
- ✓ 9 of 10 frontier AI benchmarks award zero points for 'I don't know' — they explicitly train models to guess instead of admit uncertainty.
- ✓ Frontier 2026 hallucination rates are still 3.1-19.1%. Citation accuracy is the worst — 12.4% even with extended thinking.
- ✓ Damien Charlotin's public database logs 120+ court cases with AI-fabricated citations as of May 2026.
- ✓ Anthropic shipped 12 legal Claude plugins + 20 connectors on May 12, 2026 — pushing AI deeper into the highest-stakes hallucination zone in the economy.
- ✓ Stop waiting for a model that hallucinates zero. Design workflows that assume hallucination will happen and verify outputs before they hit a deliverable.