
Gemini 3.1 Pro Hit 77% on ARC-AGI-2. Here's What That Number Means for AI Reasoning.

March 13, 2026
9 min read
Google DeepMind's Gemini 3.1 Pro launched on February 19 with benchmark scores that quietly redefined the frontier: 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and 80.6% on SWE-Bench Verified. With a 1M token context window and native multimodal reasoning, here's what those numbers actually mean.

On February 19, 2026, Google DeepMind released Gemini 3.1 Pro — and buried inside the model card was a number that stopped the AI research community cold: 77.1% on ARC-AGI-2.

That's not a typo. ARC-AGI-2, the benchmark François Chollet designed specifically to resist pattern memorization and require genuine reasoning, is one of the hardest AI evaluations in existence. A year ago, frontier models struggled to break 30%. Gemini 3.1 Pro didn't just break 30%: it hit 77.1%.

If you follow AI research closely, that number warrants a pause. This article breaks down what Gemini 3.1 Pro actually is, what these benchmarks tell us, and what they mean for developers and teams deciding which AI infrastructure to build on in 2026.

What Is Gemini 3.1 Pro?

Gemini 3.1 Pro is Google DeepMind's flagship intelligence model — the successor to Gemini 3 Pro, released in late 2025. It's a natively multimodal model: it doesn't bolt on vision or audio as an afterthought. It reasons across text, images, audio, video, and entire code repositories in a single unified architecture.

The headline spec is a 1 million token context window with a 64,000 token maximum output. To put that in perspective, 1 million tokens is roughly 750,000 words: about the length of the first five Harry Potter books, in a single prompt. For developers working with large codebases, legal documents, or research corpora, this isn't a gimmick. It changes what's architecturally possible.

Google has deployed 3.1 Pro widely: it's available through the Gemini app (for Google AI Pro and Ultra subscribers), the Gemini API via AI Studio and Vertex AI, Google's Antigravity platform, Gemini CLI, Android Studio, and NotebookLM. Developers can access it in preview today.
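
If you want to try it yourself, a first call through the google-genai Python SDK looks roughly like the sketch below. The model identifier shown is our guess at the preview string, not a confirmed id; check AI Studio for the current one.

```python
# Minimal sketch of a Gemini API call via the google-genai SDK.
# NOTE: "gemini-3.1-pro-preview" is an assumed model id -- confirm the
# actual preview identifier in Google AI Studio before running.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview id
    contents="In two sentences: what does a 1M-token context window enable?",
)
print(response.text)
```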

The Benchmarks, Explained

Numbers without context are noise. Here's what Gemini 3.1 Pro's benchmark results actually tell you.

ARC-AGI-2: 77.1%

The Abstraction and Reasoning Corpus for AGI (ARC-AGI) was designed by Chollet to test a specific capability: solving novel visual pattern puzzles through abstraction and analogy, with no reliance on knowledge memorized from training data. Version 2, released in 2025, made the tasks harder specifically because the original ARC was starting to show signs of being solvable through memorization.

A 77.1% score on ARC-AGI-2 suggests Gemini 3.1 Pro is doing something qualitatively different from previous models on novel reasoning. This doesn't mean AGI is here — but it does mean the gap between "pattern matching on training data" and "flexible novel reasoning" is narrowing in ways that matter for real-world applications.
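
To see why memorization doesn't help here, it's worth looking at the task format itself. Per the public ARC-AGI repositories, each task is a small JSON object: a few demonstration input/output grids and a test input, with cells encoded as integers 0-9 (colors). The toy task below is our own illustration, not a real ARC-AGI-2 task; the point is that the rule has to be induced from the demonstrations alone.

```python
# Illustrative ARC-style task in the public JSON format (a toy example
# of our own, not taken from ARC-AGI-2). Each int 0-9 is a color cell.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        # The solver must induce the rule (swap the cell pairs) and apply it.
        {"input": [[5, 0], [0, 5]]}
    ],
}
```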

GPQA Diamond: 94.3%

GPQA Diamond is a dataset of graduate-level science questions in biology, chemistry, and physics, curated so that PhD-level domain experts score roughly 65% on it. A 94.3% score means Gemini 3.1 Pro answers expert-level science questions more accurately than the experts the benchmark was calibrated against. This has direct implications for scientific research assistance, drug discovery, and materials science applications.

SWE-Bench Verified: 80.6%

SWE-Bench Verified measures a model's ability to resolve real GitHub issues — read a codebase, understand a bug report, and submit a correct pull request fix. An 80.6% resolution rate means Gemini 3.1 Pro successfully resolves more than 4 out of 5 real-world software bugs it's given. For developers, this is the benchmark that most directly predicts utility in agentic coding workflows.
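
Mechanically, the evaluation does something like the sketch below: apply the model's generated patch to the repository at a pinned commit, then run the issue's previously failing tests. The function and parameter names here are hypothetical stand-ins for illustration; the official harness runs inside isolated containerized environments.

```python
# Hypothetical sketch of a SWE-Bench-style check: apply the model's patch,
# then verify the issue's failing tests now pass. Names are illustrative,
# not the real harness API.
import subprocess

def patch_resolves_issue(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    applied = subprocess.run(
        ["git", "apply", "-"], input=patch, text=True, cwd=repo_dir
    )
    if applied.returncode != 0:
        return False  # the patch didn't even apply cleanly
    tests = subprocess.run(["pytest", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0  # all previously failing tests now pass
```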

Humanity's Last Exam (with search): 51.4%

HLE is designed to be the hardest possible factual knowledge benchmark: questions submitted by domain experts and vetted to be beyond frontier AI capability at the time of submission. With web search enabled, Gemini 3.1 Pro answers 51.4% correctly. For reference, humans with domain expertise score in the 60-70% range on questions outside their specialty.

LiveCodeBench Pro Elo: 2887

LiveCodeBench Pro tracks competitive programming performance using Elo ratings against human contestants. An Elo of 2887 places Gemini 3.1 Pro in the top 0.1% of competitive programmers globally — above International Grandmaster level. For algorithmic problem-solving and technical interview preparation, this is a meaningful signal.
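
For intuition about what that rating implies, the standard Elo formula converts a rating gap into an expected score (win probability, with draws counted as half a point):

```python
# Standard Elo expected-score formula: what a 2887 rating means head-to-head.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Against a 2400-rated contestant (already elite), a 2887 rating implies
# an expected score of about 0.94 points per game.
print(round(expected_score(2887, 2400), 2))  # 0.94
```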

What Makes 3.1 Pro Different From 3 Pro

Google claims more than a 50% improvement over Gemini 2.5 Pro on the number of benchmark tasks solved — but the upgrade from Gemini 3 to 3.1 is also significant. The key improvements are:

Reliability on complex multi-step tasks. The failure mode of earlier Gemini 3 generations was silent errors on long chains of reasoning — the model would confidently produce a plausible-looking but wrong answer after 15 steps. 3.1 Pro shows substantially better calibration: it's more likely to express uncertainty when uncertain, and more likely to catch its own errors mid-chain.

Better agentic behavior. Tool use and multi-step agent trajectories are significantly more stable. In internal Google benchmarks, 3.1 Pro completes agentic tasks (like managing files, browsing the web, and executing code in a loop) with 40%+ fewer task failures compared to 3 Pro. For developers building AI agents on Gemini, this matters more than most benchmark numbers.

Extended output quality at context limits. At 800K-1M token context, Gemini 3 Pro showed degradation in output quality — the classic "lost in the middle" problem. Gemini 3.1 Pro maintains coherent reasoning and accurate retrieval throughout the full context window. This has been validated in tasks like full-codebase refactoring and long-document summarization.
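
To make the agentic point concrete: the google-genai SDK supports automatic function calling, where you hand the model plain Python functions as tools and the SDK runs the call-and-respond loop for you. A minimal sketch, with an assumed model id and a toy tool:

```python
# Minimal tool-use sketch using the google-genai SDK's automatic function
# calling. The model id is an assumption; read_file is a toy example tool.
from google import genai
from google.genai import types

def read_file(path: str) -> str:
    """Read a text file from the local workspace so the model can inspect it."""
    with open(path) as f:
        return f.read()

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview id
    contents="Read main.py, find the off-by-one bug, and propose a fix.",
    config=types.GenerateContentConfig(tools=[read_file]),
)
print(response.text)
```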

Multimodal Capabilities in Practice

Gemini 3.1 Pro isn't just a text model with vision bolted on. It reasons natively across modalities simultaneously. Some practical examples of what this enables:

  • Video + code analysis: Feed it a screen recording of a bug and it can analyze the UI behavior, read the associated code, and suggest fixes — reasoning across both the visual and code modalities simultaneously.
  • Audio + document understanding: Analyze an earnings call audio while cross-referencing the 10-K filing it references, identifying inconsistencies between what was said and what was reported.
  • Image + scientific literature: Given a microscopy image and a paper describing an organism's pathology, reason about whether the image is consistent with the described condition.

These aren't hypothetical use cases — they're workflows that developers are already building in the Google Cloud ecosystem using Gemini 3.1 Pro through Vertex AI.
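
As a sketch of what the first workflow looks like in code (file names and the model id are our assumptions), you upload the recording through the Files API and pass it alongside the source in a single request:

```python
# Sketch: video + code in one request via the Files API.
# File paths and the model id are assumptions for illustration.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the screen recording (large media goes through the Files API;
# in practice, wait for the file to finish processing before use).
recording = client.files.upload(file="bug_recording.mp4")
source = open("checkout_view.tsx").read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview id
    contents=[
        recording,
        "Component source:\n" + source,
        "Watch the recording, identify the UI bug, and point to the "
        "lines most likely responsible.",
    ],
)
print(response.text)
```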

How Does It Compare to Claude Opus 4.6 and GPT-5.4?

The current frontier has three serious competitors: Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4. Comparing them fairly requires acknowledging that benchmark numbers are useful but incomplete — what matters is which model performs best on your specific use cases.

On ARC-AGI-2, Gemini 3.1 Pro's 77.1% leads the published scores. Claude Opus 4.6 achieved 69% on the same benchmark (announced in February 2026), and OpenAI has not published GPT-5.4's ARC-AGI-2 scores as of this writing. On SWE-Bench Verified, Claude Opus 4.6 and Gemini 3.1 Pro are within a few percentage points of each other — both in the 78-82% range — while GPT-5.4 is strong at coding tasks but shows more variability on multi-step agentic workflows.

Context window: Gemini 3.1 Pro's 1M token context is the largest among commercially available models. Claude Opus 4.6 supports 500K tokens; GPT-5.4 currently tops out at 256K for most API tiers. If your application genuinely requires processing massive contexts in a single call, Gemini 3.1 Pro is the current choice.

Pricing: Google has not published official token pricing for 3.1 Pro as of March 2026 (it's available in preview with generous free tiers for AI Studio users). Expect pricing at parity with, or a slight premium over, Gemini 3 Pro once general availability is announced.

What This Means for Developers Building AI Applications

Several architectural decisions change in light of Gemini 3.1 Pro's capabilities:

RAG strategy reconsideration. With 1M token context, many retrieval-augmented generation architectures become unnecessary for medium-sized knowledge bases. If your entire product documentation is under 700K tokens, you can simply pass it all in context rather than building a vector database pipeline. The latency and cost tradeoffs need to be modeled, but the architectural simplification is real.
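
A minimal sketch of that long-context alternative, assuming a folder of Markdown docs and the same assumed model id; counting tokens first keeps you safely inside the window:

```python
# Sketch: long-context Q&A over a docs corpus instead of a RAG pipeline.
# Paths and the model id are assumptions; count tokens before committing.
from pathlib import Path
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
docs = "\n\n".join(p.read_text() for p in Path("docs").rglob("*.md"))

count = client.models.count_tokens(model="gemini-3.1-pro-preview", contents=docs)
assert count.total_tokens < 900_000  # leave headroom under the 1M window

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[docs, "Using only the documentation above: how do I rotate API keys?"],
)
print(response.text)
```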

Agentic task reliability. The 80.6% SWE-Bench score and improved agentic stability mean autonomous coding agents built on 3.1 Pro can handle substantially more complex tasks without human checkpoint loops. Teams building developer tools should benchmark their specific workflows — the improvement over 3 Pro is significant enough to warrant re-evaluation.

Scientific and research applications. The 94.3% GPQA Diamond score opens up applications in life sciences, materials science, and technical research assistance that were previously limited by frontier model reliability on expert-level content. Biotech and pharma teams in particular should be stress-testing Gemini 3.1 Pro against their domain-specific evaluation sets.

Video and audio-native pipelines. For teams that have been stitching together separate specialized models (a vision model here, an audio model there), Gemini 3.1 Pro's native multimodal reasoning may simplify the architecture significantly. Fewer models in the chain means fewer points of failure and lower latency.

The AI reasoning benchmark leaderboard is no longer a place for small, incremental moves. Gemini 3.1 Pro's 77.1% on ARC-AGI-2 is a milestone that will be referenced in research papers for years. Whether it matters for your specific application depends entirely on what you're building — but if you haven't benchmarked 3.1 Pro against your use case yet, that's the first thing to do this week.

Key Takeaways

  • Gemini 3.1 Pro achieved 77.1% on ARC-AGI-2 — the highest published score for novel AI reasoning
  • 1M token context window is the largest available in any commercially accessible frontier model
  • 94.3% on GPQA Diamond surpasses most domain experts on graduate-level science questions
  • 80.6% SWE-Bench Verified makes it a serious contender for agentic coding workflows
  • Available now in preview via Google AI Studio, Vertex AI, Gemini CLI, and NotebookLM
  • 50%+ improvement over Gemini 2.5 Pro — Google's largest single-generation benchmark jump
  • Native multimodal reasoning across text, audio, images, video, and code in one model

Skila AI Editorial Team

The Skila AI editorial team researches and writes original content covering AI tools, model releases, open-source developments, and industry analysis. Our goal is to cut through the noise and give developers, product teams, and AI enthusiasts accurate, timely, and actionable information about the fast-moving AI ecosystem.

