Debates over AI Benchmarks Spill into Public View
Debates over AI benchmarks and how they’re reported by AI labs are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right.
The Dispute
The truth lies somewhere in between. xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. However, OpenAI employees pointed out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at "cons@64."
What is Cons@64?
Cons@64, short for "consensus@64," gives a model 64 tries to answer each problem in a benchmark and takes the answer it generates most frequently for each problem as its final answer. This tends to boost models’ benchmark scores, and omitting it from a graph can make it appear as though one model surpasses another when that isn’t actually the case.
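To make the mechanics concrete, here is a minimal Python sketch of the consensus@k idea. The `sample_fn` callable and `toy_model` stub below are hypothetical stand-ins for sampling a real model at nonzero temperature; actual evaluation harnesses also typically normalize answers (stripping whitespace, canonicalizing numbers) before voting.

```python
import random
from collections import Counter

def consensus_at_k(sample_fn, problem, k=64):
    """Query the model k times on one problem and return the most
    frequent answer (the "consensus") as the final answer."""
    answers = [sample_fn(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a stochastic model: correct 45% of the
# time on any single sample, otherwise one of five wrong answers.
def toy_model(problem):
    return "42" if random.random() < 0.45 else str(random.randint(1, 5))

print(consensus_at_k(toy_model, "What is 6 * 7?"))  # almost always "42"
```

Note that majority voting only helps when wrong answers are scattered across many alternatives; if a model consistently makes the same mistake, consensus locks that mistake in.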
A More Accurate Picture
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s scores for AIME 2025 at "@1" (the score from a model’s first attempt at each problem) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model set to "medium" computing effort. Yet xAI is advertising Grok 3 as the "world’s smartest AI."
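For intuition on how wide the gap between the two settings can be, here is a hedged simulation. The 45% per-sample accuracy and the five-way spread of wrong answers are invented figures for illustration only, not measurements of any of the models discussed here.

```python
import random
from collections import Counter

random.seed(0)
N_PROBLEMS, K = 30, 64
P_CORRECT = 0.45  # assumed per-sample accuracy; not a real model's figure

def sample_answer():
    # Correct with probability P_CORRECT; otherwise one of five wrong answers.
    return "correct" if random.random() < P_CORRECT else f"wrong{random.randint(1, 5)}"

# Score at @1: one sample per problem, graded as-is.
at_1 = sum(sample_answer() == "correct" for _ in range(N_PROBLEMS)) / N_PROBLEMS

# Score at cons@64: majority vote over 64 samples per problem.
cons = sum(
    Counter(sample_answer() for _ in range(K)).most_common(1)[0][0] == "correct"
    for _ in range(N_PROBLEMS)
) / N_PROBLEMS

print(f"@1:      {at_1:.2f}")   # hovers around 0.45
print(f"cons@64: {cons:.2f}")   # typically close to 1.00
```

The same model can look mediocre at @1 and near-perfect at cons@64, which is why putting one model’s cons@64 score next to another’s @1 score on a graph is an apples-to-oranges comparison.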
A Neutral Party’s Take
A neutral third party put together a more "accurate" graph showing nearly every model’s performance at cons@64. As Twitter user Teortaxes pointed out, "Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok while in reality it’s DeepSeek propaganda."
The Importance of Transparency
As AI researcher Nathan Lambert noted, the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. This goes to show how little most AI benchmarks communicate about models’ limitations and strengths.
Conclusion
The debate over AI benchmarks highlights the need for transparency and clear communication in the AI community. As the field continues to evolve, it’s essential to ensure that benchmark results are accurately reported and that model comparisons are made on a level playing field.
FAQs
Q: What is cons@64?
A: Cons@64, short for "consensus@64," gives a model 64 tries to answer each problem in a benchmark and takes the answer it generates most frequently for each problem as its final answer.
Q: Why is cons@64 important?
A: Cons@64 tends to boost models’ benchmark scores, and omitting it from a graph can make it appear as though one model surpasses another when that isn’t actually the case.
Q: What is the most important metric in AI benchmarking?
A: According to AI researcher Nathan Lambert, it’s the computational (and monetary) cost it took each model to achieve its best score, a figure that labs rarely disclose.

