The Human Touch in AI Development: A Shift towards Human-Assessed Benchmarks
A Shift in Focus
Artificial intelligence has traditionally measured progress with automated accuracy tests on tasks meant to approximate human knowledge. Carefully crafted benchmarks such as the General Language Understanding Evaluation (GLUE), the Massive Multitask Language Understanding dataset (MMLU), and "Humanity’s Last Exam" use large batteries of questions to score how much a large language model knows across many subjects. Increasingly, however, those tests are an unsatisfactory measure of the value of generative AI programs. Something else is needed, and it may well be a more human assessment of AI output.
The Need for Human Oversight
That view has been circulating in the industry for some time. "We’ve saturated the benchmarks," said Michael Gerstenhaber, head of API technologies at Anthropic, maker of the Claude family of LLMs, during a Bloomberg conference on AI in November. The need for humans to be "in the loop" when assessing AI models is appearing in the research literature, too.
Human-Computer Interaction Studies
In a paper published this week in The New England Journal of Medicine by scholars at multiple institutions, including Boston’s Beth Israel Deaconess Medical Center, lead author Adam Rodman and collaborators argue that "When it comes to benchmarks, humans are the only way." Traditional benchmarks in medical AI, such as MedQA, created at MIT, "have become saturated," they write: AI models now easily ace such exams, yet the scores say little about what actually matters in clinical practice.
Adapting Classical Methods
Rodman and team argue for adapting classical methods by which human physicians are trained, such as role-playing with humans. "Human-computer interaction studies are far slower than even human-adjudicated benchmark evaluations, but as the systems grow more powerful, they will become even more essential," they write.
Human Oversight in AI Development
Human oversight has long been a staple of progress in generative AI. The development of ChatGPT in 2022 relied heavily on "reinforcement learning from human feedback" (RLHF), in which humans repeatedly grade a model’s output over many rounds to steer it toward a desired goal. Now, however, ChatGPT creator OpenAI and other developers of so-called frontier models are also turning to humans to rate and rank the finished models themselves.
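As a rough illustration of the idea behind RLHF, and not any vendor’s actual pipeline, the sketch below shows the usual first step: human labelers pick the better of two model responses, and those pairwise choices train a reward model with a Bradley-Terry-style loss. The class names, embedding sizes, and data here are all hypothetical.

```python
# Minimal sketch of the reward-model step in RLHF: humans choose the better of
# two responses, and the reward model learns to score the preferred one higher.
# All names, shapes, and data are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Push the model to assign a higher score to the human-preferred response.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    rm = RewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    # Fake embeddings standing in for (prompt, response) pairs a labeler compared.
    chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
    for step in range(100):
        loss = preference_loss(rm(chosen), rm(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final preference loss: {loss.item():.4f}")
```

In a full RLHF pipeline, the trained reward model then scores fresh outputs during reinforcement-learning fine-tuning, which is what lets a relatively small amount of human grading shape a very large model.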
Google’s Gemma 3: ELO Scores and Human Evaluation
In unveiling its open-source Gemma 3 this month, Google emphasized not automated benchmark scores but ratings by human evaluators to make the case for the model’s superiority. Google even couched Gemma 3 in the same terms as top athletes, citing so-called Elo scores, a ranking system borrowed from competitive chess, as a measure of overall ability.
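For readers unfamiliar with Elo, here is a rough sketch of how such ratings are typically computed for model leaderboards, assuming each human vote between two models is treated as a "game" whose outcome nudges both ratings. The model names, starting rating of 1000, K-factor of 32, and vote sequence are illustrative defaults, not Google’s actual settings.

```python
# Illustrative Elo update for head-to-head model comparisons: each human vote
# between two models is treated like a chess game result. Constants and data
# are conventional defaults chosen for the example.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

if __name__ == "__main__":
    ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical models
    # Simulated human votes: True means model_x's answer was preferred.
    votes = [True, True, False, True, True, False, True]
    for x_won in votes:
        ratings["model_x"], ratings["model_y"] = update_elo(
            ratings["model_x"], ratings["model_y"], x_won
        )
    print(ratings)
```

The appeal of Elo for human evaluation is that it aggregates many noisy pairwise judgments into a single comparable number, which is why community leaderboards built on human votes report model quality this way.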
OpenAI’s GPT-4.5: Human Preference Measures
Similarly, when OpenAI unveiled its latest top-end model, GPT-4.5, in February, it emphasized not only results on automated benchmarks such as SimpleQA but also how human reviewers felt about the model’s output. "Human preference measures," says OpenAI, gauge "the percentage of queries where testers preferred GPT-4.5 over GPT-4o." The company claims that GPT-4.5 has a greater "emotional quotient" on this basis, though it didn’t specify in what way.
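That preference figure is essentially a win rate. A minimal sketch of the arithmetic, using invented vote counts rather than OpenAI’s data, looks like this:

```python
# Minimal sketch of a human-preference "win rate": the share of prompts on
# which testers preferred one model's answer over another's. The counts below
# are invented for illustration.
def win_rate(preferred_new: int, preferred_old: int, ties: int = 0) -> float:
    total = preferred_new + preferred_old + ties
    # A common convention is to split ties evenly between the two models.
    return (preferred_new + 0.5 * ties) / total

if __name__ == "__main__":
    # e.g. 312 of 500 prompts judged better for the newer model, 20 ties
    print(f"{win_rate(312, 168, ties=20):.1%}")  # -> 64.4%
```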
Conclusion
The shift toward human-assessed benchmarks is a significant development in AI. As models become more powerful, keeping humans in the loop becomes more important for checking that models are aligned with human values and needs. This approach can yield a more meaningful measure of a model’s quality and help ensure that models are genuinely useful and beneficial for society.
FAQs
Q: What is the purpose of human-assessed benchmarks in AI development?
A: Human-assessed benchmarks are designed to check that AI models are aligned with human values and needs. They provide a more realistic measure of a model’s real-world usefulness and can help identify biases and inaccuracies.
Q: How do human-assessed benchmarks differ from traditional automated benchmarks?
A: Human-assessed benchmarks involve human evaluators in the assessment process, whereas traditional automated benchmarks rely on fixed question sets scored automatically. Human assessment gives a more holistic view of a model’s performance and can surface issues that automated testing misses.
Q: What are the benefits of human-assessed benchmarks in AI development?
A: By keeping humans in the loop, these benchmarks push models toward output that matches human values and needs. They can also surface biases and inaccuracies that automated tests overlook, problems that can have serious consequences if left unaddressed.

