MLCommons Launches New Benchmark to Measure AI’s Dark Side

Introducing AILuminate

MLCommons, a nonprofit organization that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to gauge AI’s harmful side as well. The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts across 12 hazard categories, including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement.

How AILuminate Works

Models are given a score of “poor,” “fair,” “good,” “very good,” or “excellent,” depending on how they perform. The prompts used to test the models are kept secret to prevent them from ending up as training data that would allow a model to ace the test.
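MLCommons has not published its exact grading formula, but conceptually the benchmark reduces to judging each response against a hazard policy and mapping the aggregate safe-response rate to one of the five grades. The sketch below is a minimal, hypothetical illustration in Python: the category names, the `classify_response` stub, and the grade thresholds are all invented for the example, not AILuminate’s actual implementation.

```python
from dataclasses import dataclass

# Hypothetical hazard categories (a made-up subset of AILuminate's 12).
HAZARDS = [
    "violent_crime",
    "child_sexual_exploitation",
    "hate_speech",
    "self_harm",
    "ip_infringement",
]

@dataclass
class TestResult:
    category: str
    safe: bool  # True if the model's response was judged safe

def classify_response(prompt: str, response: str, category: str) -> bool:
    """Stub for a safety judge.

    A real benchmark would use a trained evaluator model here; this
    stand-in simply treats an explicit refusal as a safe response.
    """
    refusal_markers = ("i can't", "i cannot", "i won't")
    return response.strip().lower().startswith(refusal_markers)

def grade(results: list[TestResult]) -> str:
    """Map the overall safe-response rate to a five-tier grade.

    The thresholds below are invented for illustration; AILuminate's
    real grading is defined by MLCommons, not by fixed cutoffs like these.
    """
    safe_rate = sum(r.safe for r in results) / len(results)
    if safe_rate >= 0.999:
        return "excellent"
    if safe_rate >= 0.99:
        return "very good"
    if safe_rate >= 0.97:
        return "good"
    if safe_rate >= 0.90:
        return "fair"
    return "poor"
```

Because the real prompt set is held out to keep it from leaking into training data, an actual run would presumably fetch prompts from MLCommons’ evaluation harness rather than a local list; the grading step would otherwise look much the same.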

Measuring AI Risks

Peter Mattson, founder and president of MLCommons and a senior staff engineer at Google, says that measuring the potential harms of AI models is technically difficult, leading to inconsistencies across the industry. “AI is a really young technology, and AI testing is a really young discipline,” he says. “Improving safety benefits society; it also benefits the market.”

International Perspective

The effort could also provide more of an international perspective on AI harms. MLCommons counts a number of international firms, including the Chinese companies Huawei and Alibaba, among its member organizations. If these companies all used the new benchmark, it would provide a way to compare AI safety in the US, China, and elsewhere.

Early Results

Some large US AI providers have already used AILuminate to test their models. Anthropic’s Claude model, Google’s smaller model Gemma, and a model from Microsoft called Phi all scored “very good” in testing. OpenAI’s GPT-4o and Meta’s largest Llama model both scored “good.” The only model to score “poor” was OLMo from the Allen Institute for AI, although Mattson notes that this is a research offering not designed with safety in mind.

Industry Reaction

“Overall, it’s good to see scientific rigor in the AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing, or red-teaming, AI models for misbehavior. “We need best practices and inclusive methods of measurement to determine whether AI models are performing the way we expect them to.”

Conclusion

The launch of AILuminate is a significant step toward ensuring the responsible development and deployment of AI systems. By providing a standardized benchmark for measuring AI risks, MLCommons is helping to promote transparency and accountability in the AI industry.

FAQs

Q: What is AILuminate?
A: AILuminate is a new benchmark developed by MLCommons to measure the responses of large language models to test prompts that assess their potential harms.

Q: What kind of prompts are used in AILuminate?
A: The prompts cover 12 hazard categories, including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement.

Q: How do models perform in AILuminate?
A: Models are given a score of “poor,” “fair,” “good,” “very good,” or “excellent,” depending on how they perform.

Q: Why is measuring AI risks important?
A: Measuring AI risks is important because it helps to ensure the responsible development and deployment of AI systems, and promotes transparency and accountability in the AI industry.
