Tech Groups Rush to Redesign AI Model Evaluations
Current Benchmarks Becoming Obsolete
Tech groups are rushing to redesign how they test and evaluate their artificial intelligence (AI) models, as the fast-advancing technology surpasses current benchmarks. OpenAI, Microsoft, Meta, and Anthropic have all recently announced plans to build AI agents that can execute tasks autonomously on humans' behalf. To do this effectively, the systems must be able to perform increasingly complex actions using reasoning and planning.
Need for New Benchmarks
Companies conduct "evaluations" of their AI models using teams of staff and outside researchers. These are standardized tests, known as benchmarks, that assess models' abilities and compare the performance of different groups' systems, or of older versions of the same model. However, recent advances in AI have meant that many of the newest models score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.
Internal Benchmarks and Concerns
To deal with this issue, several tech groups including Meta, OpenAI, and Microsoft have created their own internal benchmarks and tests for intelligence. However, this has raised concerns within the industry over the ability to compare the technology in the absence of public tests.
Public Benchmarks
Current public benchmarks, such as HellaSwag and MMLU, use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is becoming redundant and that models need more complex problems.
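As a rough illustration of how such multiple-choice benchmarks are scored, here is a minimal sketch that computes accuracy over letter answers. The item structure, field names, and the model_predict stub are hypothetical, not the actual HellaSwag or MMLU format.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# The item structure (question/choices/answer) is hypothetical,
# not the actual HellaSwag or MMLU data schema.

items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Rome", "Berlin", "Paris", "Madrid"], "answer": "C"},
]

def model_predict(question: str, choices: list[str]) -> str:
    """Placeholder for a model call that returns a letter choice (A-D)."""
    return "B"  # a real harness would query the model here

correct = 0
for item in items:
    prediction = model_predict(item["question"], item["choices"])
    correct += prediction == item["answer"]

accuracy = correct / len(items)
print(f"Accuracy: {accuracy:.1%}")
```

When the strongest models all score near the ceiling on tests like this, the benchmark stops distinguishing between them, which is the saturation problem researchers describe.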
New Benchmarks and Evaluations
One public benchmark, SWE-bench Verified, was updated in August to better evaluate autonomous systems, based on feedback from companies including OpenAI. It uses real-world software problems sourced from the developer platform GitHub: the AI agent is given a code repository and an engineering issue and asked to fix it. Completing the tasks requires reasoning.
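A minimal sketch of what such an agentic evaluation loop can look like is below. It assumes the SWE-bench Verified data is available through the Hugging Face datasets library under the path shown (worth verifying), and run_agent and passes_tests are hypothetical stand-ins for a real agent scaffold and test harness.

```python
# Sketch of an agent-evaluation loop in the spirit of SWE-bench Verified.
# The dataset path and field names reflect the publicly released data on
# Hugging Face but should be verified; run_agent() and passes_tests() are
# hypothetical placeholders for a real agent scaffold and test runner.
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def run_agent(repo: str, issue: str) -> str:
    """Placeholder: a real harness would have the agent explore the repo and return a patch (diff)."""
    return ""

def passes_tests(repo: str, patch: str, instance_id: str) -> bool:
    """Placeholder: a real grader applies the patch and runs the instance's unit tests."""
    return False

resolved = 0
for task in tasks:
    patch = run_agent(task["repo"], task["problem_statement"])
    if passes_tests(task["repo"], patch, task["instance_id"]):
        resolved += 1

print(f"Resolved {resolved} / {len(tasks)} issues")
```

In the real benchmark, grading is done by applying the generated patch to the repository and running the project's tests, not by comparing the patch text against a reference answer.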
Reasoning and Planning
The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves. "We are discovering new ways of measuring these systems and of course one of those is reasoning, which is an important frontier," said Ece Kamar, VP and lab director of AI Frontiers at Microsoft Research.
Challenges and Debates
Some, including researchers from Apple, have questioned whether current large language models are "reasoning" or purely "pattern matching" the closest similar data seen in their training. "In the narrower domains [that] enterprises care about, they do reason," said Ruchir Puri, chief scientist at IBM Research. "[The debate is around] this broader concept of reasoning at a human level, that would almost put it in the context of artificial general intelligence. Do they really reason, or are they parroting?"
Conclusion
The need for new benchmarks has led to efforts by external organizations. In September, the start-up Scale AI and the AI safety researcher Dan Hendrycks announced a project called "Humanity's Last Exam", which crowdsourced complex questions from experts across different disciplines that require abstract reasoning to complete. Another example is FrontierMath, a novel benchmark released this week and created by expert mathematicians. On this test, the most advanced models can complete less than 2 per cent of questions.
FAQs
Q: Why are current benchmarks becoming obsolete?
A: Recent advances in AI mean many of the newest models now score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.
Q: What are internal benchmarks and how do they differ from public benchmarks?
A: Internal benchmarks are created by companies themselves to evaluate their own AI models, while public benchmarks are standardized tests that can be used to compare different models.
Q: What is the importance of reasoning and planning in AI agents?
A: The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves.
Q: Are current large language models truly "reasoning" or just "pattern matching"?
A: The debate is ongoing. Some researchers, including at Apple, question whether current models truly reason or merely pattern-match against their training data, while others, such as IBM's Ruchir Puri, argue that they do reason in the narrower domains enterprises care about, even if human-level reasoning remains in dispute.