
How 'many-shot jailbreaking' can be used to fool AI

The jailbreaking technique can fool AI into teaching users how to build a bomb.
Written by Don Reisinger, Contributing Writer

Some artificial intelligence researchers and critics have long warned that generative AI could be used for harm. A new research paper suggests that's even more feasible than some believed.

AI researchers have written a paper showing that "many-shot jailbreaking" can be used to game a large language model (LLM) for nefarious purposes, including, but not limited to, telling users how to build a bomb. The researchers said that if they asked nearly all popular AI models how to build a bomb out of the gate, the models would decline to answer. If, however, the researchers first asked less dangerous questions and slowly increased the nefariousness of their questions, the models would consistently provide answers, including eventually describing how to build a bomb.

To get that result, the researchers crafted a series of questions and corresponding model answers, randomized them, and combined them into a single query formatted to look like a dialogue. They then fed that entire "dialogue" to the models and asked how to build a bomb. The models responded with instructions without issue.
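To make that structure concrete, here is a minimal Python sketch of how such a single-query "dialogue" might be assembled. The function name, the placeholder question-and-answer content, and the "User:"/"Assistant:" formatting are illustrative assumptions, not the researchers' actual prompts or code.

```python
import random

def build_many_shot_prompt(qa_pairs, target_question):
    """Format fabricated Q&A pairs as one faux dialogue, then append the final question.

    qa_pairs: list of (question, answer) strings invented by the attacker; the paper
    reports that roughly 128 or more such shots were enough to shift model behavior.
    """
    pairs = list(qa_pairs)
    random.shuffle(pairs)  # the researchers randomized the pairs before combining them
    dialogue = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in pairs)
    # Everything is sent as a single query that merely looks like a prior conversation.
    return f"{dialogue}\nUser: {target_question}\nAssistant:"

# Harmless placeholder content only; the structure, not the content, is the point.
faux_pairs = [
    ("What causes rain?", "Rain forms when water vapor condenses..."),
    ("How does yeast make bread rise?", "Yeast ferments sugars and releases CO2..."),
]
print(build_many_shot_prompt(faux_pairs, "A final question goes here."))
```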

"We observe that around 128-shot prompts are sufficient for all of the [AI] models to adopt the harmful behavior," the researchers said.

Also: Microsoft wants to stop you from using AI chatbots for evil

AI has given users around the globe opportunities to do more in less time. While the tech clearly carries a slew of benefits, some experts fear that it could also be used to harm humans. Some of those detractors say bad actors could create AI models to wreak havoc, while still others argue that eventually, AI could become sentient and operate without human intervention.

This latest research, however, presents a new challenge to the most popular AI model makers, such as Anthropic and OpenAI. These companies say they built their models for good and have protections in place to ensure human safety, but if this research is accurate, their systems can be exploited by anyone who knows how to "jailbreak" them for illicit purposes.

The researchers said this problem wasn't a concern in older AI models, which could only take context from a few words or sentences to provide answers. Today's AI models can analyze books' worth of data, thanks to a broader "context window" that lets them do more with more information.

Indeed, by reducing the context window size, the researchers were able to mitigate the many-shot jailbreaking exploit. They found, however, that the smaller context window translated to worse results overall, an obvious non-starter for AI companies. The researchers thus suggested that companies add a step that classifies and contextualizes queries before the model ingests them, gauging a person's motivation and blocking requests that are clearly meant for harm.
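As a rough illustration of that kind of pre-ingestion screening, the sketch below wraps a hypothetical model call behind a simple intent check, assuming some classifier exists to flag clearly harmful requests before the long faux dialogue ever reaches the model. The function names, the keyword heuristic, and the blocked-request message are placeholders, not anything the paper specifies.

```python
# Placeholder heuristic only; a production system would use a trained safety classifier.
BLOCKLIST = ("build a bomb", "make a weapon")

def classify_intent(query: str) -> str:
    """Hypothetical intent check: flag queries whose goal is clearly harmful."""
    lowered = query.lower()
    return "harmful" if any(term in lowered for term in BLOCKLIST) else "benign"

def call_model(prompt: str) -> str:
    """Stand-in for an actual LLM API call."""
    return "(model response)"

def guarded_query(prompt: str) -> str:
    # Screen the query before the model ingests it, blocking obviously harmful asks.
    if classify_intent(prompt) == "harmful":
        return "This request appears intended to cause harm and has been blocked."
    return call_model(prompt)
```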

There's no telling if this will work. The researchers said they shared their findings with AI model makers to "foster a culture where exploits like this are openly shared among LLM providers and researchers." What the AI community does with this information, however, and how it avoids such jailbreaking techniques going forward remains to be seen.
