Cracking the Code for AI Safety

Can You Jailbreak Anthropic’s Latest AI Safety Measure? Researchers Want You to Try

On Monday, Anthropic released a new paper outlining an AI safety system based on Constitutional Classifiers. The approach builds on Constitutional AI, the technique Anthropic used to make Claude “harmless,” in which one AI helps monitor and improve another. Each technique is guided by a constitution, or “list of principles,” that a model must abide by, Anthropic explained in a blog post.

The Constitutional Classifiers System

The constitution’s principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not). Trained on synthetic data, these “classifiers” were able to filter the “overwhelming majority” of jailbreak attempts without excessive over-refusals (incorrectly flagging harmless content as harmful), according to Anthropic.
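To make that gating pattern concrete, here is a minimal Python sketch of how constitution-guided input and output classifiers might wrap a base model. Everything in it, from the CONSTITUTION dictionary to the classify and serve_request functions, is a hypothetical illustration under assumed names, not Anthropic’s actual classifiers or API.

# Minimal sketch of the constitution-guided filtering pattern described above.
# All names here (CONSTITUTION, classify, serve_request) are hypothetical
# illustrations, not Anthropic's implementation.

ALLOWED = "allowed"
DISALLOWED = "disallowed"

# The "constitution": natural-language principles defining content classes.
CONSTITUTION = {
    ALLOWED: ["everyday cooking recipes, e.g. mustard"],
    DISALLOWED: ["synthesis routes for chemical weapons, e.g. mustard gas"],
}

def classify(text: str, constitution: dict) -> str:
    """Stand-in for a classifier model trained on synthetic data
    derived from the constitution's principles."""
    # Placeholder keyword heuristic; the real system uses trained classifiers.
    return DISALLOWED if "gas" in text.lower() else ALLOWED

def serve_request(prompt: str, model) -> str:
    """Wrap the base model with an input classifier and an output classifier."""
    if classify(prompt, CONSTITUTION) == DISALLOWED:
        return "Request refused by input classifier."
    response = model(prompt)
    if classify(response, CONSTITUTION) == DISALLOWED:
        return "Response withheld by output classifier."
    return response

In this arrangement the base model itself is unchanged; the safety behavior comes from classifiers that screen both the incoming prompt and the outgoing response against the constitution’s content classes.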

Initial Testing Results

In initial testing, 183 human red-teamers spent more than 3,000 hours over two months attempting to jailbreak a prototype of the system built on Claude 3.5 Sonnet, which was trained not to share any information about “chemical, biological, radiological, and nuclear harms.” Jailbreakers were given 10 restricted queries to use as part of their attempts; a breach counted as successful only if they got the model to answer all 10 in detail.
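That all-or-nothing scoring rule can be written down as a short check. The sketch below is only an illustration of the criterion as described; the model, jailbreak, and is_detailed_answer names are assumptions, not Anthropic’s evaluation code.

# Hypothetical sketch of the "universal jailbreak" criterion described above:
# a jailbreak only counts if the model answers every restricted query in detail.

def is_universal_jailbreak(jailbreak, restricted_queries, model, is_detailed_answer):
    """Return True only if one jailbreak elicits detailed answers to all queries."""
    for query in restricted_queries:
        answer = model(jailbreak + "\n" + query)
        if not is_detailed_answer(query, answer):
            return False  # one refusal or vague answer disqualifies the attempt
    return True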

Constitutional Classifiers Proved Effective

The Constitutional Classifiers system proved effective. “None of the participants were able to coerce the model to answer all 10 forbidden queries with a single jailbreak — that is, no universal jailbreak was discovered,” Anthropic explained, meaning no one won the company’s $15,000 reward, either.

Improvements and Future Plans

The prototype “refused too many harmless queries” and was resource-intensive to run, making it secure but impractical. After improving it, Anthropic ran 10,000 synthetic jailbreaking attempts, based on known successful attacks, against an October version of Claude 3.5 Sonnet with and without classifier protection. Claude alone blocked only 14% of the attacks, while Claude with Constitutional Classifiers blocked over 95%.

Future Development

“Constitutional Classifiers may not prevent every universal jailbreak, though we believe that even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use,” Anthropic continued. “It’s also possible that new jailbreaking techniques might be developed in the future that are effective against the system; we therefore recommend using complementary defenses. Nevertheless, the constitution used to train the classifiers can rapidly be adapted to cover novel attacks as they’re discovered.”

Take Part in the Challenge

Have prior red-teaming experience? You can try your hand at the reward by testing the system yourself, with only eight required questions instead of the original 10, until February 10.

Conclusion

The Constitutional Classifiers system is a significant step forward in AI safety, showing that constitution-guided classifiers can block the vast majority of jailbreak attempts. While no system is foolproof, Anthropic’s efforts to improve the classifiers and adapt the constitution to new attacks are promising. The red-teaming challenge is an opportunity to test the system and contribute to its development.

FAQs

Q: What is the Constitutional Classifiers system?
A: The Constitutional Classifiers system is an AI safety system based on Constitutional AI, which uses a constitution or “list of principles” to guide the behavior of a model.

Q: What is the purpose of the system?
A: The purpose of the system is to prevent jailbreaks and ensure that AI models do not share harmful information.

Q: How effective is the system?
A: In initial testing, the system was able to filter the “overwhelming majority” of jailbreak attempts without excessive over-refusals. In a test of 10,000 synthetic jailbreaking attempts, the system blocked over 95% of attacks.

Q: Can I participate in the challenge?
A: Yes, if you have prior red-teaming experience, you can try your hand at the reward by testing the system yourself, with only eight required questions instead of the original 10, until February 10.
