Debates over AI benchmarking have reached Pokémon

A Viral Claim: Google’s Gemini Model Surpasses Anthropic’s Claude Model

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

The Viral Tweet

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025

The Hidden Advantage

But what the post failed to mention is that Gemini had an advantage. As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify in-game “tiles,” such as cuttable trees. This reduces how much screenshot analysis Gemini has to do before it makes gameplay decisions.
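To make that advantage concrete, here is a minimal sketch of what this kind of scaffolding could look like: the harness pre-parses the screen into a tile grid and hands the model a short text summary, so the model never has to hunt for cuttable trees or doors in a raw screenshot. The tile legend and the summarize_minimap helper are hypothetical, for illustration only; the actual stream’s code isn’t published in this article.

```python
# Hypothetical sketch of a "minimap" scaffold. TILE_LEGEND and
# summarize_minimap are illustrative names, not the developer's real code.

TILE_LEGEND = {
    "P": "player",
    "#": "wall",
    ".": "walkable",
    "T": "cuttable tree",
    "D": "door / warp",
}

def summarize_minimap(tile_grid):
    """Turn a pre-parsed tile grid into a compact text description the
    model can act on without analyzing a raw screenshot."""
    lines = ["Minimap (rows top to bottom):"]
    for row in tile_grid:
        lines.append("".join(row))
    # Call out notable tiles so the model does not have to find them itself.
    notable = []
    for y, row in enumerate(tile_grid):
        for x, tile in enumerate(row):
            if tile in ("T", "D"):
                notable.append(f"{TILE_LEGEND[tile]} at (x={x}, y={y})")
    if notable:
        lines.append("Notable tiles: " + "; ".join(notable))
    return "\n".join(lines)

# Example: a small patch of the overworld around the player.
patch = [
    list("#####"),
    list("#.T.#"),
    list("#.P.#"),
    list("#.D.#"),
]
print(summarize_minimap(patch))
```

Compared with sending the model a screenshot and asking it to figure out where the trees are, a summary like this hands over most of the perception work up front, which is exactly the kind of help a model running without it doesn’t get.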

A Benchmarking Controversy

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

Custom Implementations Can Skew Results

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on the benchmark, but 70.3% with a “custom scaffold” that Anthropic developed.
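The article doesn’t spell out what that scaffold does, but one common pattern behind this kind of boost is test-time selection: sample several candidate patches and keep the one that scores best against the repository’s tests. The sketch below illustrates that general idea with hypothetical helpers (generate_patch, score_patch); it is not Anthropic’s actual scaffold.

```python
# Minimal sketch of a best-of-N scaffold. generate_patch() and score_patch()
# are hypothetical placeholders, shown only to illustrate why a custom
# harness can lift a model's benchmark score.

def generate_patch(model, issue, attempt):
    # Ask the model for one candidate fix. A real harness would call the
    # model's API with the repository context and the issue description.
    return model(issue, seed=attempt)

def score_patch(patch, test_suite):
    # Rank candidates, e.g. by how many of the repo's tests they pass.
    return sum(1 for test in test_suite if test(patch))

def best_of_n(model, issue, test_suite, n=8):
    # More attempts means more compute per problem, and usually a higher
    # score than a single-shot run of the same underlying model.
    candidates = [generate_patch(model, issue, i) for i in range(n)]
    return max(candidates, key=lambda p: score_patch(p, test_suite))
```

The underlying model is identical in both numbers Anthropic reported; the gap comes from how much extra work the harness does around it.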

Meta’s LM Arena: A Recent Example

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Implications for AI Benchmarking

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.

Conclusion

In the age of AI benchmarking, it’s essential to consider how each model was actually run. A fair comparison requires standardizing the testing environment and flagging any custom scaffolds or tools that can skew results. As the field evolves, transparency and consistency in benchmarking methods will only become more important.
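As one illustration of what that transparency could look like, a reported score could always travel with the harness details needed to interpret it. The BenchmarkResult structure below is a hypothetical sketch, not an existing standard; the numbers are the SWE-bench Verified figures quoted above.

```python
# Hypothetical reporting format: every score records the harness it was
# produced with, so "62.3%" and "70.3%" are never compared as if they came
# from the same setup.

from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    model: str
    benchmark: str
    score: float
    harness: str = "standard"                         # e.g. "custom scaffold"
    extra_tools: list = field(default_factory=list)   # minimaps, verifiers, etc.

results = [
    BenchmarkResult("Claude 3.7 Sonnet", "SWE-bench Verified", 62.3),
    BenchmarkResult("Claude 3.7 Sonnet", "SWE-bench Verified", 70.3,
                    harness="custom scaffold"),
]

for r in results:
    print(f"{r.model} on {r.benchmark}: {r.score}% ({r.harness})")
```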

FAQs
Q: What is AI benchmarking?

A: AI benchmarking is the process of evaluating the performance of artificial intelligence models using standardized tests or tasks.

Q: Why is custom implementation a concern in AI benchmarking?

A: Custom implementations can influence the results, making it challenging to compare models accurately. This can lead to misleading conclusions about a model’s capabilities.

Q: What are some examples of imperfect AI benchmarks?

A: The original Pokémon games, SWE-bench Verified, and LM Arena are all examples of imperfect AI benchmarks. Each has its own limitations and potential biases.

Q: How can we improve AI benchmarking?

A: Standardizing testing environments, maintaining transparency, and avoiding custom solutions can help improve the accuracy of AI benchmarking.
