Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Hao AI Lab’s Mario Challenge
Hao AI Lab, a research organization at the University of California San Diego, recently threw AI models into live Super Mario Bros. games. The results were surprising: Anthropic’s Claude 3.7 performed best, followed by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.
The Game
It’s worth noting that the game was not the original 1985 release, but rather a version that ran in an emulator and integrated with a framework called GamingAgent, which gave the AIs control over Mario.
GamingAgent and the AIs
GamingAgent, developed in-house at Hao AI Lab, fed each AI in-game screenshots along with basic instructions, such as "If an obstacle or enemy is near, move/jump left to dodge." The AI then generated inputs in the form of Python code to control Mario.
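The loop described above can be sketched roughly as follows. This is a minimal illustration, not GamingAgent's actual API: the function names (`agent_step`, `press`), the prompt text, and the stubbed model are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a GamingAgent-style perception-action loop.
# Each step: capture a frame, send it to the model with brief
# instructions, and execute the Python snippet the model returns.

INSTRUCTIONS = (
    "You control Mario. If an obstacle or enemy is near, "
    "move/jump left to dodge. Reply with one Python call, "
    "e.g. press('right') or press('jump')."
)

def agent_step(screenshot, query_model, controller):
    """One cycle: ask the model for an action snippet and run it."""
    reply = query_model(INSTRUCTIONS, screenshot)  # e.g. "press('jump')"
    # Execute the returned snippet, exposing only the controller.
    exec(reply, {"press": controller})

# Stub model and controller so the sketch runs without an emulator.
pressed = []
agent_step(
    screenshot=b"<fake frame>",
    query_model=lambda instructions, frame: "press('jump')",
    controller=pressed.append,
)
print(pressed)  # ['jump']
```

Executing model-generated code with `exec` is the simplest way to turn a text reply into button presses, though a real framework would sandbox it far more carefully.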
The Challenge
The game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that so-called reasoning models like OpenAI’s o1, which "think" through problems step by step to arrive at solutions, performed worse than "non-reasoning" models, despite being generally stronger on most benchmarks.
The Reasoning
The main reason reasoning models have trouble playing real-time games like this is that they take a while – usually seconds – to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to Mario’s death.
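A quick back-of-envelope calculation shows why those seconds matter. The numbers here are illustrative assumptions: the NES runs at roughly 60 frames per second, and the latencies are made up for the sake of the example.

```python
# How many frames elapse while a model is still "thinking"?
# Assumes ~60 fps, the approximate frame rate of the NES.
FPS = 60

def frames_missed(decision_latency_s):
    """Frames that pass before the model's chosen action lands."""
    return int(decision_latency_s * FPS)

print(frames_missed(0.2))  # quick non-reasoning reply: 12 frames
print(frames_missed(3.0))  # slow step-by-step reasoning: 180 frames
```

Three seconds of deliberation means the game state the model reasoned about is 180 frames stale by the time it acts, which in a platformer is often the difference between landing a jump and falling into a pit.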
An Evaluation Crisis
Games have been used to benchmark AI for decades. However, some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.
As Andrej Karpathy, a research scientist and founding member of OpenAI, wrote, "I don’t really know what [AI] metrics to look at right now. TLDR my reaction is I don’t really know how good these models are right now."
Conclusion
At least we can watch AI play Mario.
Frequently Asked Questions
Q: What is the purpose of the Mario challenge?
A: To test the capabilities of AI models in real-time, complex environments.
Q: Which AI models performed well in the challenge?
A: Anthropic’s Claude 3.7 and Claude 3.5.
Q: Why did reasoning models struggle in the challenge?
A: Reasoning models take too long to decide on actions, making them less effective in real-time games.
Q: What is the evaluation crisis in AI?
A: A crisis in which experts are questioning the relevance of AI gaming benchmarks to real-world advancements.