These Researchers Used NPR Sunday Puzzle Questions to Benchmark AI ‘Reasoning’ Models

Benchmarking AI Models with Public Radio Quizzes

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle editor, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. Though written to be solvable without much specialized knowledge, the brainteasers are usually challenging even for skilled contestants.

A New Benchmark for AI Models

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities. A recent study by a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models – OpenAI’s o1, among others – sometimes "give up" and provide answers they know aren’t correct.

The Sunday Puzzle Benchmark

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks – even those released relatively recently – are quickly approaching saturation. The advantage of a public radio quiz like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on "rote memory" to solve them, explained Arjun Guha, a computer science professor at Northeastern University and one of the study’s co-authors.
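The article doesn’t reproduce the researchers’ evaluation code, but the core mechanic of a benchmark like this – pose each riddle to a model, then score exact-match accuracy against the known answer – can be sketched as follows. Everything here is hypothetical: `toy_model`, the sample puzzle, and the normalization choices are illustrative stand-ins, not the study’s actual harness.

```python
def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so 'Lettuce!' matches 'lettuce'."""
    cleaned = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def score(model, puzzles) -> float:
    """Fraction of puzzles answered correctly (exact match after normalization)."""
    correct = sum(
        1 for question, gold in puzzles
        if normalize(model(question)) == normalize(gold)
    )
    return correct / len(puzzles)

# Stub standing in for a real reasoning-model API call (hypothetical).
def toy_model(question: str) -> str:
    canned = {"Name a common salad green.": "Lettuce!"}
    return canned.get(question, "I don't know")

puzzles = [
    ("Name a common salad green.", "lettuce"),
    ("Name a fruit that is also a color.", "orange"),
]
print(score(toy_model, puzzles))  # 0.5
```

A real harness would replace `toy_model` with an API call to the model under test, and exact-match scoring would likely need an answer-extraction step, since reasoning models wrap their final answer in lengthy explanations.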

Insights from the Study

The team’s research revealed that the models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempting to tease out a better one, and failing again. They also get stuck "thinking" forever and give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternatives for no obvious reason.

Conclusion

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help pinpoint where these models can be improved.

FAQs

Q: What is the Sunday Puzzle?
A: The Sunday Puzzle is a long-running segment on NPR where listeners are quizzed on various riddles and puzzles.

Q: Why is the Sunday Puzzle a good benchmark for AI models?
A: The Sunday Puzzle is a good benchmark for AI models because it doesn’t test for esoteric knowledge and the challenges are phrased such that models can’t draw on "rote memory" to solve them.

Q: What are the results of the study?
A: The study found that reasoning models like OpenAI’s o1 and DeepSeek’s R1 far outperform other models on the benchmark, and that they sometimes "give up" and provide answers they know aren’t correct.

Q: What’s the next step for the researchers?
A: The researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.
