
The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an independent AI benchmarking organization whose rankings are closely watched by developers and enterprise buyers, released a major overhaul to its Intelligence Index that fundamentally changes how the industry measures AI progress.
The new Intelligence Index v4.0 incorporates 10 evaluations spanning agents, coding, scientific reasoning, and general knowledge. But the changes go far deeper than shuffling test names. The organization removed three staple benchmarks — MMLU-Pro, AIME 2025, and LiveCodeBench — that have long been cited by AI companies in their marketing materials. In their place, the new index introduces evaluations designed to measure whether AI systems can complete the kind of work that people actually get paid to do.
"This index shift reflects a broader transition: intelligence is being measured less by recall and more by economically useful action," observed Aravind Sundar, a researcher who responded to the announcement on X (formerly Twitter).
Why AI benchmarks are breaking: The problem with tests that top models have already mastered
The benchmark overhaul addresses a growing crisis in AI evaluation: the leading models have become so capable that traditional tests can no longer meaningfully differentiate between them. The new index deliberately makes the curve harder to climb. According to Artificial Analysis, top models now score 50 or below on the new v4.0 scale, compared to 73 on the previous version — a recalibration designed to restore headroom for future improvement.
This saturation problem has plagued the industry for months. When every frontier model scores in the 90th percentile on a given test, the test loses its usefulness as a decision-making tool for enterprises trying to choose which AI system to deploy. The new methodology attempts to solve this by weighting four categories equally (Agents, Coding, Scientific Reasoning, and General Knowledge) while introducing evaluations where even the most advanced systems still struggle.
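As a rough illustration of that equal weighting, the aggregate can be thought of as a 25% contribution from each category's score. The sketch below uses placeholder category scores rather than Artificial Analysis's actual results, and omits how the 10 individual evaluations roll up into each category.

```python
# Sketch of the v4.0 aggregation scheme: four categories, each weighted 25%.
# Category scores below are placeholders for illustration only; the real index
# rolls 10 individual evaluations up into these categories.
CATEGORY_WEIGHTS = {
    "Agents": 0.25,
    "Coding": 0.25,
    "Scientific Reasoning": 0.25,
    "General Knowledge": 0.25,
}

def intelligence_index(category_scores: dict[str, float]) -> float:
    """Weighted sum of category scores on a 0-100 scale."""
    return sum(CATEGORY_WEIGHTS[name] * score for name, score in category_scores.items())

# Even strong coding and knowledge scores get pulled down by harder agent and
# science evaluations, which is how top models land around 50 on the new scale.
print(intelligence_index({
    "Agents": 45.0,
    "Coding": 60.0,
    "Scientific Reasoning": 30.0,
    "General Knowledge": 65.0,
}))  # 50.0
```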
The results under the new framework show OpenAI's GPT-5.2 with extended reasoning effort claiming the top spot, followed closely by Anthropic's Claude Opus 4.5 and Google's Gemini 3 Pro. OpenAI describes GPT-5.2 as "the most capable model series yet for professional knowledge work," while Anthropic's Claude Opus 4.5 scores higher than GPT-5.2 on SWE-Bench Verified, a test set evaluating software coding abilities.
GDPval-AA: The new benchmark testing whether AI can do your job
The most significant addition to the new index is GDPval-AA, an evaluation based on OpenAI's GDPval dataset that tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries. Unlike traditional benchmarks that ask models to solve abstract math problems or answer multiple-choice trivia, GDPval-AA measures whether AI can produce the deliverables that professionals actually create: documents, slides, diagrams, spreadsheets, and multimedia content.
Models receive shell access and web browsing capabilities through what Artificial Analysis calls "Stirrup," its reference agentic harness. Scores are derived from blind pairwise comparisons, with Elo ratings frozen at the time of evaluation to ensure index stability.
Under this framework, OpenAI's GPT-5.2 with extended reasoning leads with an Elo score of 1442, while Anthropic's Claude Opus 4.5 non-thinking variant follows at 1403. Claude Sonnet 4.5 trails at 1259.
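Artificial Analysis hasn't published its exact rating update, but Elo scores derived from blind pairwise comparisons generally follow the standard chess-style rule. The sketch below is a minimal illustration under that assumption; the 1200 starting rating and K-factor of 16 are arbitrary choices, not GDPval-AA parameters.

```python
# Minimal sketch: deriving Elo ratings from blind pairwise comparisons.
# Starting rating and K-factor are illustrative assumptions, not the values
# Artificial Analysis actually uses.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 16.0) -> None:
    """Shift both ratings toward the observed outcome of one blind comparison."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Every model starts from the same baseline; comparisons come from blind judgments.
ratings = {"model_a": 1200.0, "model_b": 1200.0, "model_c": 1200.0}
comparisons = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

print(ratings)  # ratings are then frozen at evaluation time for index stability
```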
On the original GDPval evaluation, GPT-5.2 beat or tied top industry professionals on 70.9% of well-specified tasks, according to OpenAI. The company claims GPT-5.2 "outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations," with companies including Notion, Box, Shopify, Harvey, and Zoom observing "state-of-the-art long-horizon reasoning and tool-calling performance."
The emphasis on economically measurable output is a philosophical shift in how the industry thinks about AI capability. Rather than asking whether a model can pass a bar exam or solve competition math problems — achievements that generate headlines but don't necessarily translate to workplace productivity — the new benchmarks ask whether AI can actually do jobs.
Graduate-level physics problems expose the limits of today's most advanced AI models
While GDPval-AA measures practical productivity, another new evaluation called CritPT reveals just how far AI systems remain from true scientific reasoning. The benchmark tests language models on unpublished, research-level reasoning tasks across modern physics, including condensed matter, quantum physics, and astrophysics.
CritPT was developed by more than 50 active physics researchers from over 30 leading institutions. Its 71 composite research challenges simulate full-scale research projects at the entry level — comparable to the warm-up exercises a hands-on principal investigator might assign to junior graduate students. Every problem is hand-curated to produce a guess-resistant, machine-verifiable answer.
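CritPT's grading pipeline isn't detailed in the announcement, but "machine-verifiable" typically means a final answer that can be checked programmatically rather than judged subjectively. A minimal sketch for the numeric case, with an arbitrary 1% relative tolerance that should not be read as CritPT's actual threshold:

```python
import math

# Sketch of machine-verifiable grading: a model's final numeric answer is
# compared against a hand-curated reference value within a relative tolerance.
# The 1% tolerance is an illustrative assumption, not CritPT's real threshold.

def verify_numeric_answer(model_answer: str, reference: float, rel_tol: float = 0.01) -> bool:
    try:
        value = float(model_answer.strip())
    except ValueError:
        return False  # unparseable answers score zero rather than partial credit
    return math.isclose(value, reference, rel_tol=rel_tol)

print(verify_numeric_answer("6.63e-34", 6.626e-34))              # True
print(verify_numeric_answer("roughly a small number", 6.626e-34))  # False
```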
The results are sobering. Current state-of-the-art models remain far from reliably solving full research-scale challenges. GPT-5.2 with extended reasoning leads the CritPT leaderboard with a score of just 11.5%, followed by Google's Gemini 3 Pro Preview and Anthropic's Claude Opus 4.5 Thinking variant. These scores suggest that despite remarkable progress on consumer-facing tasks, AI systems still struggle with the kind of deep reasoning required for scientific discovery.
AI hallucination rates: Why the most accurate models aren't always the most trustworthy
Perhaps the most revealing new evaluation is AA-Omniscience, which measures factual recall and hallucination across 6,000 questions covering 42 economically relevant topics within six domains: Business, Health, Law, Software Engineering, Humanities & Social Sciences, and Science/Engineering/Mathematics.
The evaluation produces an Omniscience Index that rewards precise knowledge while penalizing hallucinated responses — providing insight into whether a model can distinguish what it knows from what it doesn't. The findings expose an uncomfortable truth: high accuracy does not guarantee low hallucination. Models with the highest accuracy often fail to lead on the Omniscience Index because they tend to guess rather than abstain when uncertain.
Google's Gemini 3 Pro Preview leads the Omniscience Index with a score of 13, followed by Claude Opus 4.5 Thinking and Gemini 3 Flash Reasoning, both at 10. However, the breakdown between accuracy and hallucination rates reveals a more complex picture.
On raw accuracy, Google's two models lead with scores of 54% and 51% respectively, followed by Claude Opus 4.5 Thinking at 43%. But Google's models also post higher hallucination rates than peer models, at 88% and 85%. Anthropic's Claude Sonnet 4.5 Thinking and Claude Opus 4.5 Thinking show hallucination rates of 48% and 58% respectively, while GPT-5.1 with high reasoning effort comes in at 51%, the second-lowest hallucination rate tested.
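The article doesn't reproduce the exact scoring formula, but an index that rewards precise knowledge while penalizing hallucinations can be sketched as correct answers minus incorrect ones, with abstentions counted as neutral. The equal penalty weight, the hallucination-rate definition, and the sample counts below are illustrative assumptions, not the published AA-Omniscience methodology.

```python
# Sketch of an omniscience-style score: reward correct answers, penalize
# confident wrong answers (hallucinations), leave abstentions neutral.
# The equal penalty weight is an assumption, not the published formula.

def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

def hallucination_rate(incorrect: int, abstained: int) -> float:
    """One plausible definition: the share of non-correct questions the model
    answered wrongly instead of declining. An assumption, not AA's formula."""
    not_correct = incorrect + abstained
    return 100.0 * incorrect / not_correct if not_correct else 0.0

# A model that guesses aggressively can post higher accuracy yet a lower index:
print(omniscience_index(correct=54, incorrect=40, abstained=6))   # 14.0
print(omniscience_index(correct=43, incorrect=25, abstained=32))  # 18.0
print(hallucination_rate(incorrect=40, abstained=6))              # ~87.0
print(hallucination_rate(incorrect=25, abstained=32))             # ~43.9
```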
Both Omniscience Accuracy and Hallucination Rate contribute 6.25% weighting each to the overall Intelligence Index v4.
Inside the AI arms race: How OpenAI, Google, and Anthropic stack up under new testing
The benchmark reshuffling arrives at an especially turbulent moment in the AI industry. All three leading frontier model developers have launched major new models within just a few weeks of one another, and Gemini 3 still holds the top spot on many of the leaderboards at LMArena, a widely cited benchmarking platform used to compare LLMs.
Google's November release of Gemini 3 prompted OpenAI to declare a "code red" effort to improve ChatGPT. OpenAI is counting on its GPT family of models to justify its $500 billion valuation and more than $1.4 trillion in planned spending. "We announced this code red to really signal to the company that we want to marshal resources in one particular area," said Fidji Simo, CEO of applications at OpenAI. Sam Altman told CNBC he expected OpenAI to exit its code red by January.
Anthropic responded with Claude Opus 4.5 on November 24, achieving an SWE-Bench Verified accuracy score of 80.9% — reclaiming the coding crown from both GPT-5.1-Codex-Max and Gemini 3. The launch marked Anthropic's third major model release in two months. Microsoft and Nvidia have since announced multi-billion-dollar investments in Anthropic, boosting its valuation to about $350 billion.
How Artificial Analysis tests AI models: A look at the independent benchmarking process
Artificial Analysis emphasizes that all evaluations are run independently using a standardized methodology. The organization states that its "methodology emphasizes fairness and real-world applicability," and estimates the 95% confidence interval for the Intelligence Index at less than ±1%, based on experiments with more than 10 repeats on certain models.
The organization's published methodology defines key terms that enterprise buyers should understand. According to the methodology documentation, Artificial Analysis considers an "endpoint" to be a hosted instance of a model accessible via an API — meaning a single model may have multiple endpoints across different providers. A "provider" is a company that hosts and provides access to one or more model endpoints or systems. Critically, Artificial Analysis distinguishes between "open weights" models, whose weights have been released publicly, and truly open-source models, noting that many open LLMs have been released with licenses that do not meet the full definition of open-source software.
The methodology also clarifies how the organization standardizes token measurement: it uses OpenAI tokens as measured with OpenAI's tiktoken package as a standard unit across all providers to enable fair comparisons.
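In practice, that standardization amounts to counting every provider's output with the same tiktoken tokenizer. The snippet below uses the cl100k_base encoding purely as an illustrative choice; the announcement doesn't specify which OpenAI encoding Artificial Analysis standardizes on.

```python
import tiktoken  # pip install tiktoken

# Count tokens with OpenAI's tokenizer so the same unit applies across providers.
# cl100k_base is an illustrative encoding choice, not necessarily the one
# Artificial Analysis uses in its methodology.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Number of OpenAI tokens in a string, regardless of which provider served it."""
    return len(encoding.encode(text))

print(count_tokens("Summarize the quarterly revenue figures in one paragraph."))
```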
What the new AI Intelligence Index means for enterprise technology decisions in 2026
For technical decision-makers evaluating AI systems, the Intelligence Index v4.0 provides a more nuanced picture of capability than previous benchmark compilations. The equal weighting across agents, coding, scientific reasoning, and general knowledge means that enterprises with specific use cases may want to examine category-specific scores rather than relying solely on the aggregate index.
The introduction of hallucination measurement as a distinct, weighted factor addresses one of the most persistent concerns in enterprise AI adoption. A model that appears highly accurate but frequently hallucinates when uncertain poses significant risks in regulated industries like healthcare, finance, and law.
The Artificial Analysis Intelligence Index is described as "a text-only, English language evaluation suite." The organization benchmarks models for image inputs, speech inputs, and multilingual performance separately.
The response to the announcement has been largely positive. "It is great to see the index evolving to reduce saturation and focus more on agentic performance," wrote one commenter in an X.com post. "Including real-world tasks like GDPval-AA makes the scores much more relevant for practical use."
Others struck a more ambitious note. "The new wave of models that is just about to come will leave them all behind," predicted one observer. "By the end of the year the singularity will be undeniable."
But whether that prediction proves prophetic or premature, one thing is already clear: the era of judging AI by how well it answers test questions is ending. The new standard is simpler and far more consequential — can it do the work?
