Up To Date Time

Did xAI lie about Grok 3’s benchmarks?

Posted on February 23, 2025 By rehan.rafique

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? It’s short for “consensus@64”: the model gets 64 tries at each problem in a benchmark, and the answer it generates most frequently is taken as its final answer for that problem. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality that isn’t the case.
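To make the difference concrete, here is a minimal Python sketch of the two scoring rules, assuming exact-match grading. The function names, the data layout, and the toy answers are hypothetical and purely illustrative; they are not any lab's actual evaluation harness and not real AIME results. The sketch uses k=4 instead of 64 to keep the data short.

```python
from collections import Counter


def score_at_1(samples, answer):
    """@1: grade only the first answer the model produced."""
    return 1.0 if samples[0] == answer else 0.0


def score_cons_at_k(samples, answer, k=64):
    """cons@k: majority vote over the first k samples, then grade the winner.

    Ties go to whichever tied answer appeared first (Counter ordering);
    a real harness may break ties differently.
    """
    votes = Counter(samples[:k])
    top_answer, _count = votes.most_common(1)[0]
    return 1.0 if top_answer == answer else 0.0


def benchmark(problems, scorer, **kwargs):
    """Average a per-problem scorer over all problems."""
    total = sum(scorer(p["samples"], p["answer"], **kwargs) for p in problems)
    return total / len(problems)


# Made-up toy data: two problems, four sampled answers each.
problems = [
    {"answer": "204", "samples": ["113", "204", "204", "204"]},  # first try wrong
    {"answer": "70", "samples": ["70", "70", "12", "70"]},       # first try right
]

print(benchmark(problems, score_at_1))            # 0.5
print(benchmark(problems, score_cons_at_k, k=4))  # 1.0
```

On this toy data the same model scores 50% at @1 and 100% at cons@4, which is why a chart that mixes the two settings across models is not an apples-to-apples comparison.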

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.
