A New Test for A.I. Systems
If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.
A History of A.I. Tests
For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science, and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress.
The Problem with Existing Tests
But A.I. systems eventually got too good at those tests, so new, harder ones were created, often built from the kinds of questions graduate students might encounter on their exams. Those tests aren’t holding up well, either. New models from companies like OpenAI, Google, and Anthropic have been getting high scores on many Ph.D.-level challenges, which limits those tests’ usefulness and raises a chilling question: Are A.I. systems getting too smart for us to measure?
The "Humanity’s Last Exam"
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called "Humanity’s Last Exam," which they claim is the hardest test ever administered to A.I. systems.
The Test
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. The test consists of roughly 3,000 multiple-choice and short-answer questions designed to test A.I. systems’ abilities in areas ranging from analytic philosophy to rocket engineering.
The Questions
Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to. Here, try your hand at a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
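For readers who want to check their work on that physics problem, here is a rough sketch of how one might verify an answer with a symbolic-math library. It is not part of the exam materials or the article: it simply assumes each object has mass m (so W = m·g), a frictionless rail, and a rigid, massless rod, and it uses the rod’s angle from the upward vertical as the single coordinate. Horizontal momentum conservation pins the center of mass in place, energy conservation gives the angular speed at any angle, and projecting Newton’s second law for the swinging mass onto the rod isolates the rod force.

```python
# A reader's sanity check for the rod-and-block problem above, not part of the
# exam materials. Assumptions: each object has mass m (so W = m*g), the rail is
# frictionless, and the rod is rigid and massless. theta is the rod's angle
# measured from the upward vertical, so theta = 0 is the starting position.
import sympy as sp

theta, thetadot2 = sp.symbols('theta thetadot2')
m, g, R = sp.symbols('m g R', positive=True)

# Horizontal momentum starts at zero and is conserved, so with equal masses the
# block and the swinging mass stay symmetric about a fixed center of mass.
x_mass = (R / 2) * sp.sin(theta)    # swinging mass, horizontal position
y_mass = R * sp.cos(theta)          # swinging mass, height above the rail
x_block = -(R / 2) * sp.sin(theta)  # block, horizontal position on the rail

# Energy conservation from the top (starting at rest) gives thetadot**2.
speed2 = (sp.diff(x_block, theta)**2 + sp.diff(x_mass, theta)**2
          + sp.diff(y_mass, theta)**2) * thetadot2
energy_eq = sp.Eq(sp.Rational(1, 2) * m * speed2, m * g * (R - y_mass))
omega2 = sp.solve(energy_eq, thetadot2)[0]  # thetadot**2 as a function of theta
alpha = sp.diff(omega2, theta) / 2          # thetaddot, from d(w^2)/dt = 2*w*wdot

# Acceleration of the swinging mass via the chain rule: x'' * w^2 + x' * wdot.
ax = sp.diff(x_mass, theta, 2) * omega2 + sp.diff(x_mass, theta) * alpha
ay = sp.diff(y_mass, theta, 2) * omega2 + sp.diff(y_mass, theta) * alpha

# Project Newton's second law for the swinging mass onto the rod direction
# (unit vector pointing from the mass toward the block) to isolate the rod
# force T; positive T means tension, negative means compression.
T_expr = m * (-ax * sp.sin(theta) - ay * sp.cos(theta)) - m * g * sp.cos(theta)

T1 = sp.simplify(T_expr.subs(theta, sp.pi / 2))  # rod horizontal
T2 = sp.simplify(T_expr.subs(theta, sp.pi))      # mass directly below the block
print(sp.simplify((T1 - T2) / (m * g)))          # (T1 - T2) / W
```

Under these assumptions the sketch gives a rod tension of W when the rod is horizontal and 9W at the bottom, so (T1−T2)/W comes out to −8. Treat that as one reader’s check, not an official answer key.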
The Results
The researchers gave Humanity’s Last Exam to six leading A.I. models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored highest of the bunch, at 8.3 percent.
The Future of A.I. Testing
Mr. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50 percent by the end of the year. At that point, he said, A.I. systems might be considered "world-class oracles," capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure A.I.’s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science.
Conclusion
Measuring the progress of A.I. systems is a challenging task. As A.I. systems become increasingly advanced, it’s essential to develop new tests that can assess their capabilities and limitations. Humanity’s Last Exam is a step in the right direction, but it’s not the only solution. We need to continue exploring creative methods of tracking A.I. progress and considering the potential impacts of these systems on society.
FAQs
Q: What is the purpose of Humanity’s Last Exam?
A: The purpose of Humanity’s Last Exam is to create a test that can measure the capabilities and limitations of A.I. systems across a wide range of academic subjects.
Q: Who submitted questions to the test?
A: Questions were submitted by experts in various fields, including college professors and prizewinning mathematicians.
Q: What are the implications of A.I. systems becoming "world-class oracles"?
A: If A.I. systems become capable of answering questions on any topic more accurately than human experts, it may change the way we approach knowledge and expertise in various fields.
Q: How can we measure the progress of A.I. systems in the future?
A: Beyond harder exams, researchers may turn to indirect measures, such as economic data or whether A.I. systems can make novel discoveries in fields like math and science.

