Debates over AI Benchmarks Spill into Public View
Debates over AI benchmarks and how they’re reported by AI labs are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.
The Dispute
The truth lies somewhere in between. xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. However, OpenAI employees pointed out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at "cons@64."
What is Cons@64?
Cons@64, short for "consensus@64," gives a model 64 tries to answer each problem in a benchmark and takes the answer it generates most frequently for each problem as its final answer. This tends to boost models’ benchmark scores, and omitting it from a graph can make it appear as though one model surpasses another when that isn’t actually the case.
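To make the mechanics concrete, here is a minimal Python sketch of the consensus@k idea. The `sample_fn` callable and `toy_model` stub below are hypothetical stand-ins for sampling a real model at nonzero temperature; actual evaluation harnesses also typically normalize answers (stripping whitespace, canonicalizing numbers) before voting.

```python
import random
from collections import Counter

def consensus_at_k(sample_fn, problem, k=64):
    """Query the model k times on one problem and return the most
    frequent answer (the "consensus") as the final answer."""
    answers = [sample_fn(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a stochastic model: correct 45% of the
# time on any single sample, otherwise one of five wrong answers.
def toy_model(problem):
    return "42" if random.random() < 0.45 else str(random.randint(1, 5))

print(consensus_at_k(toy_model, "What is 6 * 7?"))  # almost always "42"
```

Note that majority voting only helps when wrong answers are scattered across many alternatives; if a model consistently makes the same mistake, consensus locks that mistake in.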
A More Accurate Picture
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at "@1" (the score from a model’s first attempt at each problem) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to "medium" computing effort. Yet xAI is advertising Grok 3 as the "world’s smartest AI."
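For intuition on how wide the gap between the two settings can be, here is a hedged simulation. The 45% per-sample accuracy and the five-way spread of wrong answers are invented figures for illustration only, not measurements of any of the models discussed here.

```python
import random
from collections import Counter

random.seed(0)
N_PROBLEMS, K = 30, 64
P_CORRECT = 0.45  # assumed per-sample accuracy; not a real model's figure

def sample_answer():
    # Correct with probability P_CORRECT; otherwise one of five wrong answers.
    return "correct" if random.random() < P_CORRECT else f"wrong{random.randint(1, 5)}"

# Score at @1: one sample per problem, graded as-is.
at_1 = sum(sample_answer() == "correct" for _ in range(N_PROBLEMS)) / N_PROBLEMS

# Score at cons@64: majority vote over 64 samples per problem.
cons = sum(
    Counter(sample_answer() for _ in range(K)).most_common(1)[0][0] == "correct"
    for _ in range(N_PROBLEMS)
) / N_PROBLEMS

print(f"@1:      {at_1:.2f}")   # hovers around 0.45
print(f"cons@64: {cons:.2f}")   # typically close to 1.00
```

The same model can look mediocre at @1 and near-perfect at cons@64, which is why putting one model’s cons@64 score next to another’s @1 score on a graph is an apples-to-oranges comparison.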
A Neutral Party’s Take
A neutral third party put together a more "accurate" graph showing nearly every model’s performance at cons@64. As Twitter user Teortaxes pointed out, "Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok while in reality it’s DeepSeek propaganda."
The Importance of Transparency
As AI researcher Nathan Lambert noted, the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. This goes to show how little most AI benchmarks communicate about models’ limitations and strengths.
Conclusion
The debate over AI benchmarks highlights the need for transparency and clear communication in the AI community. As the field continues to evolve, it’s essential to ensure that benchmark results are accurately reported and that model comparisons are made on a level playing field.
FAQs
Q: What is cons@64?
A: Cons@64, short for "consensus@64," gives a model 64 tries to answer each problem in a benchmark and takes the answer it generates most frequently for each problem as its final answer.
Q: Why is cons@64 important?
A: Cons@64 tends to boost models’ benchmark scores, and omitting it from a graph can make it appear as though one model surpasses another when that isn’t actually the case.
Q: What is the most important metric in AI benchmarking?
A: According to AI researcher Nathan Lambert, it’s the computational (and monetary) cost it took each model to achieve its best score, a figure that labs rarely disclose.

