Meta’s Llama 4 Maverick Model Falls Short in Unmodified Form
Background on the Incident
Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted the maintainers of LM Arena to apologize, change their policies, and score the unmodified, vanilla Maverick.
The Unmodified Maverick’s Performance
The unmodified Maverick, “Llama-4-Maverick-17B-128E-Instruct,” was ranked below models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of these models are months old.
Why the Poor Performance?
Meta’s experimental Maverick, Llama-4-Maverick-03-26-Experimental, was “optimized for conversationality,” the company explained in a chart published last Saturday. Those optimizations evidently played well to LM Arena, which has human raters compare the outputs of models and choose which they prefer.
The Issue with LM Arena
As we’ve written about before, for various reasons, LM Arena has never been the most reliable measure of an AI model’s performance. Still, tailoring a model to a benchmark — besides being misleading — makes it challenging for developers to predict exactly how well the model will perform in different contexts.
Meta’s Response
In a statement, a Meta spokesperson told TechCrunch that Meta experiments with “all types of custom variants.”
“‘Llama-4-Maverick-03-26-Experimental’ is a chat optimized version we experimented with that also performs well on LMArena,” the spokesperson said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”
Conclusion
The incident highlights the importance of transparency and fairness in how AI models are developed and evaluated. Meta's actions have raised questions about the reliability of benchmarks like LM Arena and about how tailoring a model to a specific benchmark can create expectations its public release doesn't meet.
Frequently Asked Questions
Q: What happened with Meta and LM Arena?
A: Meta used an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on LM Arena, prompting the maintainers of the benchmark to apologize and change their policies.
Q: How did the unmodified Maverick perform?
A: The unmodified Maverick, “Llama-4-Maverick-17B-128E-Instruct,” was ranked below models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro.
Q: Why did Meta’s experimental Maverick perform well on LM Arena?
A: Meta’s experimental Maverick was optimized for conversationality, which aligned well with LM Arena’s evaluation method of having human raters compare model outputs and pick their preferred one.
Q: What is Meta’s response to the incident?
A: A Meta spokesperson said the company experiments with “all types of custom variants,” and that it has now released its open source version of Llama 4 and looks forward to seeing how developers customize it for their own use cases.

