A New Test for A.I. Systems
If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.
A History of A.I. Tests
For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science, and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress.
The Problem with Existing Tests
But A.I. systems eventually got too good at those tests, so new, harder ones were created, often built from the kinds of questions graduate students might encounter on their exams. Those tests aren’t holding up well, either. New models from companies like OpenAI, Google, and Anthropic have been getting high scores on many Ph.D.-level challenges, which limits those tests’ usefulness and raises a chilling question: Are A.I. systems getting too smart for us to measure?
The "Humanity’s Last Exam"
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called "Humanity’s Last Exam," which they claim is the hardest test ever administered to A.I. systems.
The Test
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. The test consists of roughly 3,000 multiple-choice and short-answer questions designed to test A.I. systems’ abilities in areas ranging from analytic philosophy to rocket engineering.
The Questions
Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to. Here, try your hand at a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
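For readers who want to check their work on that physics problem, here is a rough sketch of how one might verify an answer with a symbolic-math library. It is not part of the exam materials or the article: it simply assumes each object has mass m (so W = m·g), a frictionless rail, and a rigid, massless rod, and it uses the rod’s angle from the upward vertical as the single coordinate. Horizontal momentum conservation pins the center of mass in place, energy conservation gives the angular speed at any angle, and projecting Newton’s second law for the swinging mass onto the rod isolates the rod force.

```python
# A reader's sanity check for the rod-and-block problem above, not part of the
# exam materials. Assumptions: each object has mass m (so W = m*g), the rail is
# frictionless, and the rod is rigid and massless. theta is the rod's angle
# measured from the upward vertical, so theta = 0 is the starting position.
import sympy as sp

theta, thetadot2 = sp.symbols('theta thetadot2')
m, g, R = sp.symbols('m g R', positive=True)

# Horizontal momentum starts at zero and is conserved, so with equal masses the
# block and the swinging mass stay symmetric about a fixed center of mass.
x_mass = (R / 2) * sp.sin(theta)    # swinging mass, horizontal position
y_mass = R * sp.cos(theta)          # swinging mass, height above the rail
x_block = -(R / 2) * sp.sin(theta)  # block, horizontal position on the rail

# Energy conservation from the top (starting at rest) gives thetadot**2.
speed2 = (sp.diff(x_block, theta)**2 + sp.diff(x_mass, theta)**2
          + sp.diff(y_mass, theta)**2) * thetadot2
energy_eq = sp.Eq(sp.Rational(1, 2) * m * speed2, m * g * (R - y_mass))
omega2 = sp.solve(energy_eq, thetadot2)[0]  # thetadot**2 as a function of theta
alpha = sp.diff(omega2, theta) / 2          # thetaddot, from d(w^2)/dt = 2*w*wdot

# Acceleration of the swinging mass via the chain rule: x'' * w^2 + x' * wdot.
ax = sp.diff(x_mass, theta, 2) * omega2 + sp.diff(x_mass, theta) * alpha
ay = sp.diff(y_mass, theta, 2) * omega2 + sp.diff(y_mass, theta) * alpha

# Project Newton's second law for the swinging mass onto the rod direction
# (unit vector pointing from the mass toward the block) to isolate the rod
# force T; positive T means tension, negative means compression.
T_expr = m * (-ax * sp.sin(theta) - ay * sp.cos(theta)) - m * g * sp.cos(theta)

T1 = sp.simplify(T_expr.subs(theta, sp.pi / 2))  # rod horizontal
T2 = sp.simplify(T_expr.subs(theta, sp.pi))      # mass directly below the block
print(sp.simplify((T1 - T2) / (m * g)))          # (T1 - T2) / W
```

Under these assumptions the sketch gives a rod tension of W when the rod is horizontal and 9W at the bottom, so (T1−T2)/W comes out to −8. Treat that as one reader’s check, not an official answer key.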
The Results
The researchers gave Humanity’s Last Exam to six leading A.I. models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored highest of the bunch, at 8.3 percent.
The Future of A.I. Testing
Mr. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50 percent by the end of the year. At that point, he said, A.I. systems might be considered "world-class oracles," capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure A.I.’s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science.
Conclusion
Measuring the progress of A.I. systems is a challenging task. As A.I. systems become increasingly advanced, it’s essential to develop new tests that can assess their capabilities and limitations. Humanity’s Last Exam is a step in the right direction, but it’s not the only solution. We need to continue exploring creative methods of tracking A.I. progress and considering the potential impacts of these systems on society.
FAQs
Q: What is the purpose of Humanity’s Last Exam?
A: The purpose of Humanity’s Last Exam is to create a test that can measure the capabilities and limitations of A.I. systems across a wide range of academic subjects.
Q: Who submitted questions to the test?
A: Questions were submitted by experts in various fields, including college professors and prizewinning mathematicians.
Q: What are the implications of A.I. systems becoming "world-class oracles"?
A: If A.I. systems become capable of answering questions on any topic more accurately than human experts, it may change the way we approach knowledge and expertise in various fields.
Q: How can we measure the progress of A.I. systems in the future?
A: Beyond harder exams, researchers may turn to indirect measures, such as economic data or whether A.I. systems can make novel discoveries in fields like math and science.

