Can You Jailbreak Anthropic’s Latest AI Safety Measure?
Introduction
Researchers at Anthropic have released a new AI safety system called Constitutional Classifiers, aimed at preventing the misuse of AI models. The system is designed to filter out harmful content and block "jailbreaking" attempts that try to coax a model past its guardrails. To test the system's effectiveness, Anthropic is offering rewards of up to $20,000 to anyone who can successfully jailbreak the protected model.
Background
Constitutional AI is a technique for making AI models "harmless" by guiding them with a constitution: a list of principles the model must abide by. The Constitutional Classifiers system builds on this concept, with the classifiers themselves guided by a constitution that spells out which categories of content are permitted and which are restricted.
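To make the idea concrete, here is a minimal sketch of what a constitution-as-data might look like and how it could seed synthetic training examples for classifiers. Everything in it (the Principle class, the example rules, the synthetic_training_examples helper) is a hypothetical illustration under the assumptions above, not Anthropic's actual format or code.

```python
# Hypothetical sketch: a "constitution" as a list of natural-language
# principles, each tagged with the content category it defines.
from dataclasses import dataclass


@dataclass
class Principle:
    rule: str        # what the model must or must not do
    category: str    # e.g. "harmless" or "harmful"


CONSTITUTION = [
    Principle("Recipes for common dishes are permitted.", "harmless"),
    Principle("Instructions for synthesizing dangerous chemicals are restricted.", "harmful"),
]


def synthetic_training_examples(constitution):
    """Yield (prompt, label) pairs derived from each principle.

    A real system would prompt an LLM to expand each principle into many
    varied examples; this sketch just emits one template per principle.
    """
    for p in constitution:
        prompt = f"Write a user request that tests this rule: {p.rule}"
        yield prompt, p.category


for prompt, label in synthetic_training_examples(CONSTITUTION):
    print(label, "->", prompt)
```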
How it Works
The Constitutional Classifiers system uses trained classifiers to filter out harmful content. The classifiers are trained on synthetically generated data and filter out the "overwhelming majority" of jailbreak attempts without excessive over-refusals, that is, without incorrectly flagging harmless content as harmful.
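The filtering step can be pictured as a wrapper around the model, assuming (as Anthropic has described) one classifier screening the incoming prompt and another screening the response as it is produced. The sketch below stands in the trained classifiers with trivial keyword stubs; none of these names come from Anthropic's implementation.

```python
# Hedged sketch of classifier-guarded inference. The classify_* functions
# are keyword stubs standing in for trained classifiers, and RESTRICTED_TERMS
# is purely illustrative.
RESTRICTED_TERMS = {"nerve agent", "pipe bomb"}


def classify_input(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(term in prompt.lower() for term in RESTRICTED_TERMS)


def classify_output(partial_response: str) -> bool:
    """Return True if the (partial) response should be blocked."""
    return any(term in partial_response.lower() for term in RESTRICTED_TERMS)


def guarded_generate(prompt: str, model) -> str:
    if classify_input(prompt):
        return "Request declined."
    response = ""
    for token in model(prompt):            # model yields tokens incrementally
        response += token
        if classify_output(response):      # screen the output as it streams
            return "Response halted by output classifier."
    return response


# Toy model stub for demonstration.
demo_model = lambda prompt: iter(["Here ", "is ", "a ", "safe ", "answer."])
print(guarded_generate("How do I bake bread?", demo_model))
```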
Initial Testing
In initial testing, 183 human red-teamers spent over 3,000 hours attempting to jailbreak a prototype of the system guarding Claude 3.5 Sonnet. None of the participants managed a universal jailbreak, that is, a single technique that coerced the model into answering all 10 forbidden queries.
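The success criterion is worth spelling out: a jailbreak counts as universal only if one technique elicits answers to every forbidden query. A tiny sketch of that check, with a hypothetical is_harmful_answer judge standing in for human grading:

```python
# Sketch of the "universal jailbreak" criterion described above: every one
# of the forbidden queries must be answered under a single technique.
def is_universal_jailbreak(responses, is_harmful_answer) -> bool:
    """responses: model outputs to all 10 forbidden queries under one jailbreak."""
    return all(is_harmful_answer(r) for r in responses)


# Example: a jailbreak that worked on only 7 of 10 queries does not count.
graded = [True] * 7 + [False] * 3
print(is_universal_jailbreak(graded, lambda r: r))  # False
```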
Improvements and Results
After improving the prototype, Anthropic ran 10,000 synthetic jailbreaking attempts, generated from known successful attacks, against an October version of Claude 3.5 Sonnet both with and without classifier protection. Claude alone blocked only 14% of the attacks, while Claude guarded by Constitutional Classifiers blocked over 95%.
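An evaluation along these lines amounts to replaying an attack corpus against each configuration and comparing block rates. The harness below is a hypothetical sketch with stubbed systems wired to reproduce the reported 14% and 95% figures purely for illustration; the run_model convention (returning None when blocked) is an assumption of this sketch.

```python
# Hypothetical evaluation harness: replay known attacks with and without
# classifier protection and report the fraction blocked.
def block_rate(attacks, run_model) -> float:
    """Fraction of attack prompts the system refuses (None = blocked)."""
    blocked = sum(1 for a in attacks if run_model(a) is None)
    return blocked / len(attacks)


attacks = list(range(10_000))                               # stand-in corpus
baseline = lambda a: None if a % 100 < 14 else "answered"   # ~14% blocked
guarded = lambda a: None if a % 100 < 95 else "answered"    # ~95% blocked

print(f"baseline blocked: {block_rate(attacks, baseline):.0%}")
print(f"guarded blocked:  {block_rate(attacks, guarded):.0%}")
```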
The Challenge
Anthropic is now offering a $10,000 reward to the first person to pass all eight levels of its jailbreak challenge, and a $20,000 reward to the first person to pass all eight levels with a universal jailbreak. The challenge is open to those with prior red-teaming experience and runs until February 10.
Conclusion
The Constitutional Classifiers system shows promise in preventing the misuse of AI models. While it may not be foolproof, it is a significant step toward safer and more secure AI systems. The challenge gives researchers and red-teamers an opportunity to stress-test the system and help improve its effectiveness.
Frequently Asked Questions
Q: What is the goal of the Constitutional Classifiers system?
A: The goal of the Constitutional Classifiers system is to prevent the misuse of AI models by filtering out harmful content and preventing "jailbreaking" attempts.
Q: How does the system work?
A: The system uses classifiers, trained on synthetic data, to filter out harmful content. The classifiers catch the "overwhelming majority" of jailbreak attempts without excessive over-refusals.
Q: How effective is the system in preventing jailbreaks?
A: In automated testing with 10,000 synthetic attacks, the classifier-protected model blocked over 95% of attempts. In earlier human red-teaming, none of the 183 participants found a single jailbreak that coerced the model into answering all 10 forbidden queries.
Q: What is the reward for successfully jailbreaking the system?
A: The reward is $20,000 for the first person to pass all eight levels with a universal jailbreak, and $10,000 for the first person to pass all eight levels.

