Tech Groups Rush to Redesign AI Model Evaluations
Current Benchmarks Becoming Obsolete
Tech groups are rushing to redesign how they test and evaluate their artificial intelligence (AI) models, as the fast-advancing technology surpasses current benchmarks. OpenAI, Microsoft, Meta, and Anthropic have all recently announced plans to build AI agents that can execute tasks autonomously on humans' behalf. To do this effectively, the systems must be able to perform increasingly complex actions using reasoning and planning.
Need for New Benchmarks
Companies conduct "evaluations" of their AI models using teams of staff and outside researchers. These are standardized tests, known as benchmarks, that assess models' abilities and compare the performance of different groups' systems, or of older versions of the same model. However, recent advances in AI have meant that many of the newest models score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.
Internal Benchmarks and Concerns
To deal with this issue, several tech groups including Meta, OpenAI, and Microsoft have created their own internal benchmarks and tests for intelligence. However, this has raised concerns within the industry over the ability to compare the technology in the absence of public tests.
Public Benchmarks
Current public benchmarks, such as HellaSwag and MMLU, use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is becoming redundant and that models need more complex problems.
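As a rough illustration of how such multiple-choice benchmarks are scored, here is a minimal sketch that computes accuracy over letter answers. The item structure, field names, and the model_predict stub are hypothetical, not the actual HellaSwag or MMLU format.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# The item structure (question/choices/answer) is hypothetical,
# not the actual HellaSwag or MMLU data schema.

items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Capital of France?", "choices": ["Rome", "Berlin", "Paris", "Madrid"], "answer": "C"},
]

def model_predict(question: str, choices: list[str]) -> str:
    """Placeholder for a model call that returns a letter choice (A-D)."""
    return "B"  # a real harness would query the model here

correct = 0
for item in items:
    prediction = model_predict(item["question"], item["choices"])
    correct += prediction == item["answer"]

accuracy = correct / len(items)
print(f"Accuracy: {accuracy:.1%}")
```

When the strongest models all score near the ceiling on tests like this, the benchmark stops distinguishing between them, which is the saturation problem researchers describe.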
New Benchmarks and Evaluations
One public benchmark, SWE-bench Verified, was updated in August to better evaluate autonomous systems, based on feedback from companies including OpenAI. It uses real-world software problems sourced from the developer platform GitHub: the AI agent is given a code repository and an engineering issue and asked to fix it. Completing the tasks requires reasoning.
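A minimal sketch of what such an agentic evaluation loop can look like is below. It assumes the SWE-bench Verified data is available through the Hugging Face datasets library under the path shown (worth verifying), and run_agent and passes_tests are hypothetical stand-ins for a real agent scaffold and test harness.

```python
# Sketch of an agent-evaluation loop in the spirit of SWE-bench Verified.
# The dataset path and field names reflect the publicly released data on
# Hugging Face but should be verified; run_agent() and passes_tests() are
# hypothetical placeholders for a real agent scaffold and test runner.
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def run_agent(repo: str, issue: str) -> str:
    """Placeholder: a real harness would have the agent explore the repo and return a patch (diff)."""
    return ""

def passes_tests(repo: str, patch: str, instance_id: str) -> bool:
    """Placeholder: a real grader applies the patch and runs the instance's unit tests."""
    return False

resolved = 0
for task in tasks:
    patch = run_agent(task["repo"], task["problem_statement"])
    if passes_tests(task["repo"], patch, task["instance_id"]):
        resolved += 1

print(f"Resolved {resolved} / {len(tasks)} issues")
```

In the real benchmark, grading is done by applying the generated patch to the repository and running the project's tests, not by comparing the patch text against a reference answer.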
Reasoning and Planning
The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves. "We are discovering new ways of measuring these systems and of course one of those is reasoning, which is an important frontier," said Ece Kamar, VP and lab director of AI Frontiers at Microsoft Research.
Challenges and Debates
Some, including researchers from Apple, have questioned whether current large language models are "reasoning" or purely "pattern matching" the closest similar data seen in their training. "In the narrower domains [that] enterprises care about, they do reason," said Ruchir Puri, chief scientist at IBM Research. "[The debate is around] this broader concept of reasoning at a human level, that would almost put it in the context of artificial general intelligence. Do they really reason, or are they parroting?"
Conclusion
The need for new benchmarks has led to efforts by external organizations. In September, the start-up Scale AI and the AI safety researcher Dan Hendrycks announced a project called "Humanity's Last Exam", which crowdsourced complex questions from experts across different disciplines that require abstract reasoning to complete. Another example is FrontierMath, a novel benchmark released this week and created by expert mathematicians. On this test, the most advanced models can complete less than 2 per cent of questions.
FAQs
Q: Why are current benchmarks becoming obsolete?
A: Recent advances in AI mean many of the newest models now score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.
Q: What are internal benchmarks and how do they differ from public benchmarks?
A: Internal benchmarks are created by companies themselves to evaluate their own AI models, while public benchmarks are standardized tests that can be used to compare different models.
Q: What is the importance of reasoning and planning in AI agents?
A: The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves.
Q: Are current large language models truly "reasoning" or just "pattern matching"?
A: The debate is ongoing. Some researchers, including at Apple, question whether current models truly reason or merely pattern-match against their training data, while others, such as IBM's Ruchir Puri, argue that they do reason in the narrower domains enterprises care about, even if human-level reasoning remains in dispute.