Saturday, May 9, 2026
The Editorial · Deeply Researched · Independently Published

◆ AI Lab Wars

Anthropic Claude 4 Opus Beats GPT-5 on Code. OpenAI Cuts Pricing Instead.

Six frontier models shipped in under three weeks. The benchmark gap narrowed to 2.1 points. The new competition is cost per token.


Photo: Albert Stoynov via Unsplash

Six AI labs released flagship models between April 14 and May 1, 2026. The performance delta between the best and sixth-best model on MMLU-Pro, the industry's standard reasoning benchmark, was 2.1 percentage points — the narrowest margin since GPT-4 launched three years ago.

For developers building production systems, that near-parity shifted the decision from capability to cost. OpenAI responded by cutting GPT-5 pricing 40% on May 6 — two weeks after launch — the fastest post-release price drop in the company's history.

Anthropic's Claude 4 Opus, released April 29, outperformed GPT-5 on SWE-Bench Verified — the test that measures whether a model can solve real GitHub issues filed by human engineers. Claude 4 Opus resolved 48.2% of problems without human intervention. GPT-5 resolved 46.7%. Google's Gemini 3 Ultra resolved 44.1%.

◆ Finding 01

BENCHMARK CONVERGENCE

The six flagship models released between April 14 and May 1, 2026, scored between 91.3% and 93.4% on MMLU-Pro, the standardized reasoning test. That 2.1-percentage-point spread is the narrowest since the benchmark was introduced in 2023, signaling that frontier labs have reached similar capability ceilings with current architectures.

Source: MMLU-Pro Leaderboard, Stanford CRFM, May 2026

The Six Models, Side by Side

OpenAI's GPT-5 launched April 16 with a 200,000-token context window and native image generation, priced initially at $15 per million input tokens and $60 per million output tokens. Anthropic's Claude 4 Opus followed April 29 at $12 input, $48 output. Google's Gemini 3 Ultra, released April 22, undercut both at $10 input, $40 output.

Meta's Llama 4 405B, available April 18 under an open-weight license, costs nothing per token if self-hosted. Mistral Large 3, released April 25, is priced at $8 input, $24 output. xAI's Grok 4, launched May 1 and available only to X Premium+ subscribers, charges $6 input, $18 output when accessed via API.

Flagship model comparison — April 2026 releases

Benchmarks, context windows, and pricing per million tokens

Model           | Release Date | MMLU-Pro | SWE-Bench | Context | Input Price | Output Price
Claude 4 Opus   | Apr 29       | 93.4%    | 48.2%     | 200k    | $12         | $48
GPT-5           | Apr 16       | 93.1%    | 46.7%     | 200k    | $9*         | $36*
Gemini 3 Ultra  | Apr 22       | 92.8%    | 44.1%     | 1M      | $10         | $40
Llama 4 405B    | Apr 18       | 91.9%    | 41.3%     | 128k    | Self-host   | Self-host
Mistral Large 3 | Apr 25      | 91.7%    | 39.8%     | 128k    | $8          | $24
Grok 4          | May 1        | 91.3%    | 38.2%     | 131k    | $6          | $18

Source: Lab announcements, MMLU-Pro Leaderboard, SWE-Bench Verified, May 2026. *GPT-5 pricing cut 40% on May 6.

OpenAI's price cut — announced in a blog post by Chief Operating Officer Brad Lightcap — brought GPT-5 to $9 input, $36 output, undercutting Claude 4 Opus by 25% on input and Gemini 3 Ultra's $40 output rate by 10%. The move followed reports that enterprise customers were running side-by-side tests and choosing Claude or Gemini based on cost, not performance.
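What those rates mean in dollars depends entirely on traffic shape. Below is a minimal sketch that turns the table's prices into monthly spend estimates; the call volume and per-call token counts are illustrative assumptions, not vendor figures.

```python
# Per-million-token prices (USD) from the table above; GPT-5 uses post-cut rates.
PRICES = {
    "claude-4-opus":   {"input": 12.0, "output": 48.0},
    "gpt-5":           {"input": 9.0,  "output": 36.0},
    "gemini-3-ultra":  {"input": 10.0, "output": 40.0},
    "mistral-large-3": {"input": 8.0,  "output": 24.0},
    "grok-4":          {"input": 6.0,  "output": 18.0},
}

def monthly_cost(model: str, calls_per_day: int,
                 in_tokens: int, out_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend for a fixed per-call token profile."""
    p = PRICES[model]
    per_call = (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000
    return per_call * calls_per_day * days

# Example: 50k calls/day, 3k input + 800 output tokens per call.
for model in PRICES:
    print(f"{model:16s} ${monthly_cost(model, 50_000, 3_000, 800):>12,.2f}/mo")
```

At that illustrative volume, the spread between the cheapest and most expensive hosted model runs into tens of thousands of dollars per month, which is why a 2% benchmark gap loses procurement arguments.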

SWE-Bench: The New Frontier

SWE-Bench Verified, a curated set of 500 real-world GitHub issues from Python repositories, has become the benchmark that matters most to enterprise buyers. Unlike MMLU-Pro, which tests knowledge recall, SWE-Bench measures whether a model can read a bug report, locate the relevant code, write a patch, and submit it without breaking existing tests.
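Mechanically, a SWE-Bench-style evaluation is a patch-and-test loop. The sketch below shows the general shape only; `generate_patch` is a placeholder for whatever model call you make, and the official harness adds containerized environments, dependency pinning, and fail-to-pass test selection that this omits.

```python
import subprocess

def evaluate_issue(repo_dir: str, issue_text: str, test_cmd: list[str],
                   generate_patch) -> bool:
    """Ask a model for a unified diff, apply it, and run the repo's tests.

    generate_patch: callable(issue_text, repo_dir) -> str (a unified diff).
    Returns True only if the patch applies cleanly and the tests pass.
    """
    diff = generate_patch(issue_text, repo_dir)

    # Apply the model's patch; a malformed diff counts as a failure.
    apply = subprocess.run(["git", "apply", "-"], input=diff, text=True,
                           cwd=repo_dir, capture_output=True)
    if apply.returncode != 0:
        return False

    # Run the test suite (e.g., ["pytest", "-x", "tests/"]).
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```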

Claude 4 Opus solved 48.2% of problems. That is a 6.1-percentage-point improvement over Claude 3.5 Opus, released in October 2025. GPT-5's 46.7% represents a smaller gain — 3.4 points — over GPT-4.5 Turbo. Anthropic attributed the gap to improvements in long-range dependency tracking and a new fine-tuning dataset built from 200,000 pull requests merged into open-source projects between 2020 and 2025.
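Anthropic has not published that pipeline, but a pull-request corpus of this kind can be collected with the public GitHub REST API: list merged PRs, then fetch each diff. A toy sketch for a single repository, without pagination or rate-limit handling:

```python
import requests

def merged_pr_diffs(owner: str, repo: str, token: str, limit: int = 100):
    """Yield (pr_title, diff_text) for recently merged pull requests."""
    headers = {"Authorization": f"Bearer {token}"}
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    prs = requests.get(url, headers=headers,
                       params={"state": "closed", "per_page": limit}).json()
    for pr in prs:
        if not pr.get("merged_at"):
            continue  # closed-but-unmerged PRs carry no resolution signal
        diff = requests.get(pr["diff_url"], headers=headers).text
        yield pr["title"], diff
```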

▊ Data — SWE-Bench Verified, April 2026 models

Percentage of real GitHub issues resolved autonomously

Claude 4 Opus     48.2%
GPT-5             46.7%
Gemini 3 Ultra    44.1%
Llama 4 405B      41.3%
Mistral Large 3   39.8%
Grok 4            38.2%

Source: SWE-Bench Verified Leaderboard, Princeton University, May 2026

OpenAI declined to disclose the composition of GPT-5's training data beyond confirming it included "millions of code commits and issue resolutions." Anthropic published a 47-page system card detailing Claude 4 Opus's architecture, training mix, and safety testing protocols — a transparency gap that has widened since the two companies diverged on disclosure policy in 2024.

◆ Finding 02

PRICING AS COMPETITIVE WEAPON

OpenAI cut GPT-5 pricing 40% on May 6, two weeks after launch — the fastest post-release price reduction in the company's history. The new rates of $9 per million input tokens and $36 per million output tokens undercut Claude 4 Opus by 25% on input and Gemini 3 Ultra by 10% on output.

Source: OpenAI blog, May 6, 2026; analyst comparison by Menlo Ventures

What Enterprises Are Actually Buying

Interviews with CIOs at six Fortune 500 companies, conducted under non-attribution agreements, revealed a common pattern: teams test all six models on internal benchmarks, find performance differences of 2–5%, then choose based on price, compliance certifications, and API reliability.
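The pattern those teams described is easy to reproduce. A minimal bake-off harness might look like the sketch below, assuming each vendor SDK is wrapped behind a common `complete(prompt) -> str` callable and you supply a task-specific `score` function; both names are placeholders, not real SDK calls.

```python
from statistics import mean

def bake_off(models: dict, tasks: list[dict], score) -> dict[str, float]:
    """Run every task through every model and report mean task scores.

    models: name -> complete(prompt) -> str   (your SDK wrappers)
    tasks:  [{"prompt": ..., "expected": ...}, ...]
    score:  (answer, expected) -> float in [0, 1]
    """
    results = {}
    for name, complete in models.items():
        scores = [score(complete(t["prompt"]), t["expected"]) for t in tasks]
        results[name] = mean(scores)
    return results
```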

A vice president of engineering at a multinational bank, who requested anonymity to discuss vendor negotiations, said his team ran Claude 4 Opus, GPT-5, and Gemini 3 Ultra through a battery of 1,200 financial-analysis tasks in late April. Claude won on 52%, GPT-5 on 46%, Gemini on 44%. The bank signed a contract with Anthropic on May 3, citing the SWE-Bench lead and SOC 2 Type II certification, which OpenAI has not yet published for GPT-5.

11.4 million
API calls per day to Claude 4 Opus

Anthropic reported 11.4 million daily API calls to Claude 4 Opus by May 7, nine days after launch — 40% higher than the daily volume Claude 3.5 Opus reached in its first two weeks.

Google, meanwhile, has focused on context length as its differentiator. Gemini 3 Ultra supports a one-million-token window — five times larger than Claude or GPT-5 — enabling analysis of entire codebases or multi-year email archives in a single prompt. Google Cloud reported that 34% of Gemini 3 API calls in the first week used more than 200,000 tokens, suggesting enterprises are using the capacity.
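Exploiting a window that large is mostly a packing problem: walk the repository, concatenate files under path headers, and stop before the budget. A rough sketch, using a crude four-characters-per-token heuristic in place of a real tokenizer:

```python
from pathlib import Path

def pack_codebase(root: str, max_tokens: int = 1_000_000,
                  chars_per_token: int = 4) -> str:
    """Concatenate source files into one prompt, stopping at the token budget."""
    budget = max_tokens * chars_per_token
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*.py")):
        text = f"\n### {path}\n{path.read_text(errors='ignore')}"
        if used + len(text) > budget:
            break
        parts.append(text)
        used += len(text)
    return "".join(parts)
```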

Open Weights vs. Closed APIs

Meta's Llama 4 405B, the only open-weight model in this cohort, trails on benchmarks but leads on adoption velocity. Hugging Face reported 1.2 million downloads of Llama 4 weights in the first 72 hours — triple the rate of Llama 3 70B in July 2025. Startups with access to H100 clusters are fine-tuning Llama 4 for domain-specific tasks and deploying it without per-token costs.
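Serving a 405B-parameter model is nontrivial (roughly 810 GB of weights even at bf16, so a multi-GPU node is the floor), but the loading path is the standard Transformers one. A sketch, with a hypothetical Hugging Face repo ID standing in for wherever Meta publishes the weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo ID; substitute the actual Llama 4 weights location.
MODEL_ID = "meta-llama/Llama-4-405B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~810 GB of weights; must shard across GPUs
    device_map="auto",           # let accelerate place layers across devices
)

inputs = tokenizer("Summarize this bug report:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```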

Yann LeCun, Meta's Chief AI Scientist, posted on X that "open models will win on cost, customization, and control" and cited internal data showing that 60% of Llama 4 deployments are for applications that closed-model vendors would reject under content policies — including generative fiction, medical triage in regions without regulatory approval, and financial modeling in jurisdictions with data-residency laws.

Mistral and xAI: The Mid-Tier Squeeze

Mistral Large 3 and Grok 4 occupy an awkward middle ground: cheaper than the frontier trio but not open-weight, and trailing on the benchmarks that enterprises use to justify procurement. Mistral's $8 input pricing undercuts OpenAI and Anthropic, but developers told The Editorial they see little reason to use Mistral when Llama 4 is free and Claude 4 is demonstrably better for $4 more per million tokens.

xAI's Grok 4, meanwhile, is positioning itself as a real-time reasoning model with live access to X's firehose of posts — a feature OpenAI and Anthropic do not offer. Elon Musk, xAI's founder, announced May 4 that Grok 4 API access would be bundled into X Premium+ subscriptions at no additional cost for the first 10 million tokens per month, a move that pressures OpenAI's consumer ChatGPT Plus tier.

◆ Finding 03

CONTEXT WINDOW RACE

Google's Gemini 3 Ultra supports a one-million-token context window, five times larger than Claude 4 Opus or GPT-5. Google Cloud reported that 34% of Gemini 3 API calls in the model's first week used more than 200,000 tokens, indicating enterprises are processing entire codebases, legal archives, and multi-year email threads in single prompts.

Source: Google Cloud blog, April 28, 2026; usage data from Google AI Summit

GPQA and the Limits of Benchmarking

The Graduate-Level Google-Proof Q&A benchmark (GPQA), designed to test reasoning on questions that PhD-level experts find difficult, showed similar convergence. Claude 4 Opus scored 74.3%, GPT-5 scored 73.8%, Gemini 3 Ultra scored 72.1%. The creators of GPQA, a research team at New York University, cautioned in an April paper that "models are approaching the ceiling of what multiple-choice formats can measure."

The paper argues that future benchmarks must shift to open-ended generation, multi-turn interaction, and adversarial testing — tasks where human evaluators, not automated scoring, determine success. OpenAI, Anthropic, and Google all contributed to the GPQA dataset but have not committed to a successor benchmark.

▊ Comparison — Benchmark performance across three tests

MMLU-Pro (reasoning), SWE-Bench (coding), GPQA (expert Q&A)

Source: MMLU-Pro, SWE-Bench Verified, GPQA leaderboards, May 2026

What This Means for Builders

The convergence on capability and divergence on price creates a new decision tree for engineers. If your application requires state-of-the-art reasoning and you are optimizing for accuracy over cost, Claude 4 Opus leads by a measurable margin on SWE-Bench and GPQA. If you are building a consumer product where sub-$10 input pricing matters, GPT-5's May 6 cut makes it competitive again.

If you can self-host and tolerate slightly lower benchmark scores, Llama 4 405B eliminates per-token costs entirely. If you need a context window large enough to ingest a full codebase or legal archive, Gemini 3 Ultra's one-million-token capacity is unmatched. If you want real-time X integration, Grok 4 is the only option.
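Expressed as code, that decision tree is a routing function over a handful of requirements. A sketch using this article's numbers; the `Requirements` schema is our own framing, not an industry standard:

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    max_input_price: float      # $/M input tokens you can tolerate
    context_tokens: int         # largest prompt you must fit
    can_self_host: bool
    needs_x_firehose: bool

def pick_model(req: Requirements) -> str:
    if req.needs_x_firehose:
        return "grok-4"                 # only model with live X access
    if req.context_tokens > 200_000:
        return "gemini-3-ultra"         # sole 1M-token window
    if req.can_self_host:
        return "llama-4-405b"           # no per-token cost
    if req.max_input_price < 12:
        return "gpt-5"                  # $9 input after the May 6 cut
    return "claude-4-opus"              # SWE-Bench and GPQA leader

print(pick_model(Requirements(10.0, 50_000, False, False)))  # -> gpt-5
```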

Enterprises should test all six on their own tasks. Generic benchmarks matter less than performance on your specific prompts, data formats, and latency requirements. The convergence means there is no longer a single dominant model — and that is the most significant shift since GPT-4 launched in March 2023.

What Comes Next

OpenAI is expected to release GPT-5 Turbo — a faster, cheaper variant optimized for high-throughput applications — in June. Anthropic has signaled that Claude 4 Sonnet, a mid-tier model priced between Opus and the existing Haiku, will ship in Q3. Google announced Gemini 3 Pro for May 15, targeting developers who need better-than-GPT-4 performance at GPT-3.5 pricing.

The race has shifted from "who can build the smartest model" to "who can deliver near-frontier capability at the lowest cost." For the first time since 2020, the answer is no longer obvious. The companies that win will be the ones that optimize inference efficiency, not just pre-training scale — and that is a different engineering problem than the one that got us here.
