Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Hao AI Lab’s Mario Challenge
Hao AI Lab, a research organization at the University of California San Diego, recently threw AI models into live Super Mario Bros. games. The results were surprising: Anthropic’s Claude 3.7 performed best, followed by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.
The Game
It’s worth noting that the game was not the original 1985 release, but rather a version that ran in an emulator and integrated with a framework called GamingAgent, which gave the AIs control over Mario.
GamingAgent and the AIs
GamingAgent, developed in-house at Hao AI Lab, fed each AI in-game screenshots along with basic instructions, such as "If an obstacle or enemy is near, move/jump left to dodge." The AI then generated inputs in the form of Python code to control Mario.
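The loop described above can be sketched roughly as follows. This is a minimal illustration, not GamingAgent's actual API: the function names (`agent_step`, `press`), the prompt text, and the stubbed model are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a GamingAgent-style perception-action loop.
# Each step: capture a frame, send it to the model with brief
# instructions, and execute the Python snippet the model returns.

INSTRUCTIONS = (
    "You control Mario. If an obstacle or enemy is near, "
    "move/jump left to dodge. Reply with one Python call, "
    "e.g. press('right') or press('jump')."
)

def agent_step(screenshot, query_model, controller):
    """One cycle: ask the model for an action snippet and run it."""
    reply = query_model(INSTRUCTIONS, screenshot)  # e.g. "press('jump')"
    # Execute the returned snippet, exposing only the controller.
    exec(reply, {"press": controller})

# Stub model and controller so the sketch runs without an emulator.
pressed = []
agent_step(
    screenshot=b"<fake frame>",
    query_model=lambda instructions, frame: "press('jump')",
    controller=pressed.append,
)
print(pressed)  # ['jump']
```

Executing model-generated code with `exec` is the simplest way to turn a text reply into button presses, though a real framework would sandbox it far more carefully.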
The Challenge
The game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that so-called reasoning models like OpenAI’s o1, which "think" through problems step by step to arrive at solutions, performed worse than "non-reasoning" models, despite being generally stronger on most benchmarks.
The Reasoning
The main reason reasoning models have trouble playing real-time games like this is that they take a while – usually seconds – to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to Mario’s death.
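A quick back-of-envelope calculation shows why those seconds matter. The numbers here are illustrative assumptions: the NES runs at roughly 60 frames per second, and the latencies are made up for the sake of the example.

```python
# How many frames elapse while a model is still "thinking"?
# Assumes ~60 fps, the approximate frame rate of the NES.
FPS = 60

def frames_missed(decision_latency_s):
    """Frames that pass before the model's chosen action lands."""
    return int(decision_latency_s * FPS)

print(frames_missed(0.2))  # quick non-reasoning reply: 12 frames
print(frames_missed(3.0))  # slow step-by-step reasoning: 180 frames
```

Three seconds of deliberation means the game state the model reasoned about is 180 frames stale by the time it acts, which in a platformer is often the difference between landing a jump and falling into a pit.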
An Evaluation Crisis
Games have been used to benchmark AI for decades. However, some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.
As Andrej Karpathy, a research scientist and founding member of OpenAI, wrote, "I don’t really know what [AI] metrics to look at right now. TLDR my reaction is I don’t really know how good these models are right now."
Conclusion
At least we can watch AI play Mario.
Frequently Asked Questions
Q: What is the purpose of the Mario challenge?
A: To test the capabilities of AI models in real-time, complex environments.
Q: Which AI models performed well in the challenge?
A: Anthropic’s Claude 3.7 and Claude 3.5.
Q: Why did reasoning models struggle in the challenge?
A: Reasoning models take too long to decide on actions, making them less effective in real-time games.
Q: What is the evaluation crisis in AI?
A: A crisis in which experts are questioning the relevance of AI gaming benchmarks to real-world advancements.