Executive Summary
Artificial intelligence (AI) companies often claim that Artificial General Intelligence (AGI) is just around the corner, yet the latest models still require additional training to reach their full potential. Scale AI, a company that has played a key role in building advanced AI models, has developed a platform that can automatically test a model across thousands of benchmarks and tasks, pinpoint its weaknesses, and flag the additional training data needed to improve its skills.
The Challenges of Training AI Models
Training AI models requires a significant amount of data and human labor. Large language models (LLMs) are trained on vast amounts of text scraped from books, the web, and other sources. Turning these models into helpful, coherent, and well-mannered chatbots, however, requires additional "post-training", in which humans provide feedback on a model's output.
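Scale has not published the exact format of this feedback, but a minimal sketch in Python suggests what one unit of post-training preference data might look like. The PreferenceExample schema and its field names are illustrative assumptions, not Scale's actual format.

    from dataclasses import dataclass

    @dataclass
    class PreferenceExample:
        """One unit of human post-training feedback (hypothetical schema)."""
        prompt: str        # what the model was asked
        response_a: str    # one candidate output from the model
        response_b: str    # a second candidate output
        preferred: str     # "a" or "b", as judged by a human rater

    # A rater compares two candidate answers and records which one is better;
    # collections of such judgments are then used to fine-tune the model.
    example = PreferenceExample(
        prompt="Explain photosynthesis to a ten-year-old.",
        response_a="Photosynthesis is how plants turn sunlight into food...",
        response_b="Photosynthesis: 6CO2 + 6H2O -> C6H12O6 + 6O2.",
        preferred="a",  # the friendlier answer wins for this audience
    )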
Introducing Scale Evaluation
Scale AI has developed a new tool called Scale Evaluation, which automates some of this work using Scale's own machine learning algorithms. The tool lets model makers sift through the results, slicing and dicing them to understand where a model is underperforming, and then use those insights to target data-gathering campaigns for improvement.
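Scale has not disclosed the tool's internals, but the core idea of slicing evaluation results by category can be sketched in a few lines of Python. The evaluate_by_slice function and its task schema below are assumptions made for illustration, not Scale's implementation.

    from collections import defaultdict

    def evaluate_by_slice(model_fn, tasks):
        """Run a model over tagged benchmark tasks and report accuracy per
        slice (e.g. language or topic), so that weak areas stand out.

        model_fn: callable mapping a prompt string to the model's answer
        tasks: iterable of dicts with "prompt", "expected", and "tags" keys
        """
        totals = defaultdict(int)
        correct = defaultdict(int)
        for task in tasks:
            answer = model_fn(task["prompt"])
            for tag in task["tags"]:
                totals[tag] += 1
                if answer.strip() == task["expected"].strip():
                    correct[tag] += 1
        # Lowest-scoring slices first: candidates for targeted training data.
        return sorted(
            ((tag, correct[tag] / totals[tag]) for tag in totals),
            key=lambda pair: pair[1],
        )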
Case Study: Reasoning Capabilities
In one instance, Scale Evaluation revealed that a model’s reasoning skills fell off when it was fed non-English prompts. The tool highlighted the issue and allowed the company to gather additional training data to address it.
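Continuing the illustrative sketch above, a hypothetical run might tag the same reasoning problem by prompt language; the tasks and the model_fn wrapper here are invented for this example.

    # Hypothetical usage of the sketch above: the same reasoning problem,
    # tagged by prompt language. `model_fn` is any callable wrapping the
    # model under test.
    tasks = [
        {"prompt": "If x + 3 = 7, what is x?", "expected": "4",
         "tags": ["reasoning", "lang:en"]},
        {"prompt": "Si x + 3 = 7, ¿cuánto vale x?", "expected": "4",
         "tags": ["reasoning", "lang:es"]},
    ]
    for tag, accuracy in evaluate_by_slice(model_fn, tasks):
        print(f"{tag}: {accuracy:.0%}")
    # A markedly lower score on the "lang:es" slice than on "lang:en" would
    # flag the kind of multilingual reasoning gap described above.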
Industry Reaction
Jonathan Frankle, chief AI scientist at Databricks, a company that builds large AI models, says that being able to test one foundation model against another sounds useful in principle. "Anyone who moves the ball forward on evaluation is helping us to build better AI," Frankle says.
The Future of AI Testing
Scale’s new tool offers a more comprehensive picture by combining many different benchmarks, and it can be used to devise custom tests of a model’s abilities, like probing its reasoning in different languages. The company’s AI can take a given problem and generate more examples, allowing for a more thorough test of a model’s skills.
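Scale has not described how this generation step works. The sketch below shows one plausible approach, in which a generator model is prompted to produce variants of a seed problem; the expand_problem function and the generate callable are hypothetical stand-ins for any text-generation API.

    def expand_problem(generate, seed_problem, n=5):
        """Ask a generator model for variants of one benchmark problem,
        broadening coverage of the same underlying skill.

        generate: callable mapping a prompt string to generated text
                  (a stand-in for any LLM text-generation API)
        """
        prompt = (
            "Rewrite the following problem with different wording and, where "
            "possible, in a different language, keeping the answer unchanged:\n\n"
            + seed_problem
        )
        return [generate(prompt) for _ in range(n)]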
Conclusion
The development of Scale Evaluation highlights the need for more comprehensive testing of AI models. As AI continues to advance, it is crucial that we have tools in place to ensure that these models are safe, trustworthy, and effective.
Frequently Asked Questions
Q: What is Scale Evaluation?
A: Scale Evaluation is a new tool developed by Scale AI that automates some of the work involved in testing and evaluating AI models.
Q: How does Scale Evaluation work?
A: The tool uses Scale’s own machine learning algorithms to test a model across thousands of benchmarks and tasks, pinpoint its weaknesses, and flag the additional training data needed to improve its skills.
Q: What are the benefits of Scale Evaluation?
A: The tool lets model makers sift through the results, slicing and dicing them to understand where a model is underperforming, and then use those insights to target data-gathering campaigns for improvement.
Q: How will Scale Evaluation impact the development of AI?
A: Scale’s new tool offers a more comprehensive picture by combining many different benchmarks, and it can be used to devise custom tests of a model’s abilities, like probing its reasoning in different languages. The company’s AI can take a given problem and generate more examples, allowing for a more thorough test of a model’s skills.

