These Researchers Used NPR Sunday Puzzle Questions to Benchmark AI ‘Reasoning’ Models

Benchmarking AI Models with Public Radio Quizzes

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle editor, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. Though written to be solvable without much specialized knowledge, the brainteasers are usually challenging even for skilled contestants.

A New Benchmark for AI Models

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities. A recent study by a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models – OpenAI’s o1, among others – sometimes "give up" and provide answers they know aren’t correct.

The Sunday Puzzle Benchmark

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks – even those released relatively recently – are quickly approaching saturation. The advantage of a public radio quiz like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on "rote memory" to solve them, explained Arjun Guha, a computer science professor at Northeastern University and one of the study’s co-authors.
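The article doesn’t reproduce the researchers’ evaluation code, but the core mechanic of a benchmark like this – pose each riddle to a model, then score exact-match accuracy against the known answer – can be sketched as follows. Everything here is hypothetical: `toy_model`, the sample puzzle, and the normalization choices are illustrative stand-ins, not the study’s actual harness.

```python
def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so 'Lettuce!' matches 'lettuce'."""
    cleaned = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def score(model, puzzles) -> float:
    """Fraction of puzzles answered correctly (exact match after normalization)."""
    correct = sum(
        1 for question, gold in puzzles
        if normalize(model(question)) == normalize(gold)
    )
    return correct / len(puzzles)

# Stub standing in for a real reasoning-model API call (hypothetical).
def toy_model(question: str) -> str:
    canned = {"Name a common salad green.": "Lettuce!"}
    return canned.get(question, "I don't know")

puzzles = [
    ("Name a common salad green.", "lettuce"),
    ("Name a fruit that is also a color.", "orange"),
]
print(score(toy_model, puzzles))  # 0.5
```

A real harness would replace `toy_model` with an API call to the model under test, and exact-match scoring would likely need an answer-extraction step, since reasoning models wrap their final answer in lengthy explanations.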

Insights from the Study

The team’s research revealed that the models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempting to tease out a better one, and failing again. They also get stuck "thinking" forever and give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternatives for no obvious reason.

Conclusion

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help pinpoint where these models can be improved.

FAQs

Q: What is the Sunday Puzzle?
A: The Sunday Puzzle is a long-running segment on NPR where listeners are quizzed on various riddles and puzzles.

Q: Why is the Sunday Puzzle a good benchmark for AI models?
A: The Sunday Puzzle is a good benchmark for AI models because it doesn’t test for esoteric knowledge and the challenges are phrased such that models can’t draw on "rote memory" to solve them.

Q: What are the results of the study?
A: The study found that reasoning models like OpenAI’s o1 and DeepSeek’s R1 far outperform other models on the benchmark, and that they sometimes "give up" and provide answers they know aren’t correct.

Q: What’s the next step for the researchers?
A: The researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.
