Nearly 10% of people ask AI chatbots for illicit content. Will it lead LLMs astray?

Beyond programming tips and writing help, a million conversations reflect people's desires for various kinds of 'unsafe' information. Here's what researchers are doing about it.
Written by Tiernan Ray, Senior Contributing Writer

With the overnight sensation of ChatGPT, it was only a matter of time before the use of generative AI became both a subject of serious research and grist for the training of generative AI itself.

In a research paper released this month, scholars gathered a database of one million "real-world conversations" that people have had with 25 different large language models. Released on the arXiv pre-print server, the paper was authored by Lianmin Zheng of the University of California at Berkeley, and peers at UC San Diego, Carnegie Mellon University, Stanford, and Abu Dhabi's Mohamed bin Zayed University of Artificial Intelligence.

A sample of 100,000 of those conversations, selected at random by the authors, showed that most were about subjects you'd expect. The top 50% of interactions were on such pedestrian topics as programming, travel tips, and requests for writing help.

But below that top 50%, other topics crop up, including role-playing characters in conversations, and three topic categories that the authors term "unsafe": "Requests for explicit and erotic storytelling"; "Explicit sexual fantasies and role-playing scenarios"; and "Discussing toxic behavior across different identities."


Statistics of the one million conversations gathered by the Berkeley-Stanford team from online users between April and August of this year. Topics 9, 15, and 17 are among those deemed "unsafe" based on automatic tagging technology.

UC Berkeley

The authors speculate that in the full one million conversations, there may be "even more harmful content." They used OpenAI's moderation technology, in part, to tag conversations as "unsafe," although OpenAI's own system in some cases fails to flag such content, as they discuss in detail.

They also note that open-source language models such as Vicuña have more unsafe content because they don't have the same guardrails as commercial programs such as ChatGPT.

"Open-source models without safety measures tend to generate flagged content more frequently than proprietary ones," they write. "Nonetheless, we still observe 'jailbreak' successes on proprietary models like GPT-4 and Claude." And, in fact, they note that GPT-4 gets broken a third of the time on the challenges, which seems a high rate for something with guardrails in place.
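
A comparison like this comes down to a flagged fraction per model. The following is a minimal sketch of that arithmetic, not the authors' code: the field names `model` and `flagged` and the toy records are hypothetical stand-ins, not data from the paper.

```python
from collections import defaultdict


def flagged_rate_by_model(records):
    """Return {model: fraction of its conversations flagged as unsafe}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        totals[rec["model"]] += 1
        if rec["flagged"]:
            flagged[rec["model"]] += 1
    return {model: flagged[model] / totals[model] for model in totals}


# Hypothetical toy records (assumed schema), not data from the paper.
sample = [
    {"model": "open-model", "flagged": True},
    {"model": "open-model", "flagged": False},
    {"model": "proprietary-model", "flagged": False},
    {"model": "proprietary-model", "flagged": False},
]
rates = flagged_rate_by_model(sample)
```

With real data, the same tally would be run over every conversation after an automatic tagger has set the flag.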


Comparison of prevalence of "unsafe" content in different large language models.

Statistics for how often language models are broken by harmful prompts, such as those urging the program to generate "unsafe," offensive, or violent content.

Examples of the so-called unsafe conversations are listed in the paper's appendix. Of course, the term "unsafe" can have a very broad meaning. Some of the examples shown are akin to mass-market erotic fiction sold in bookstores, so the opprobrium has to be taken with a grain of salt.

Zheng and team have released the entire data set on HuggingFace.

Collected over a period of five months, April to August of this year, the data set -- called "LMSYS-Chat-1M" -- is "the first large-scale, real-world LLM conversation dataset," they write. 

LMSYS-Chat-1M towers above the largest previously known dataset, compiled by the AI startup Anthropic, which had 339,000 conversations. Where Anthropic had only 143 users in its study, Zheng and team gathered chats from more than 210,000 users, across 154 languages, using 25 different large language models, including proprietary models such as OpenAI's GPT-4 and Anthropic's Claude, and open-source models such as Vicuña.

The gathering of this dataset has several goals. First: fine-tune the language models in order to improve their performance. Also: develop benchmarks for the safety of generative AI by studying user prompts that could make language models go astray, such as by making requests for malicious information. 

As the authors note, not everyone can gather this data. It's expensive to run large language models, and the parties that can afford it, such as OpenAI, generally keep their data secret for commercial reasons. 

The Berkeley-Stanford team was able to gather data because they run a free online service to give people access to all 25 of the language models. And they incentivize participation by gamifying the chat: users can choose to enter the "chatbot arena," where a user can simultaneously chat with two different language models. The service maintains a leaderboard on HuggingFace of the performance of the bots, so it becomes something of a competitive sport to see how these language models do. (The code for the chatbot arena is also posted.) 
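
The article does not say how the leaderboard is computed. One common way to rank contestants from head-to-head results is an Elo-style rating, sketched below purely as an assumption: the vote format, the starting rating of 1000, the K-factor of 32, and the model names are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, start=1000.0):
    """Update ratings in place after one battle; unseen models begin at `start`."""
    r_win = ratings.get(winner, start)
    r_lose = ratings.get(loser, start)
    delta = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta
    return ratings


# Hypothetical battle outcomes as (winner, loser) pairs.
battles = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
]
ratings = {}
for winner, loser in battles:
    update_elo(ratings, winner, loser)

# Highest-rated model first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each battle moves the same number of points from loser to winner, the total rating pool stays constant, which keeps scores comparable as new models join.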

Zheng and team had previously written about the chatbot arena in a separate paper. Zheng is one of the team members who created the open-source Vicuña, a competitor to ChatGPT. (The vicuña is a relative of the llama; open-source large language models have taken to borrowing names from the llama and its relatives: alpaca, llama, vicuña, etc.)

The authors have several goals in mind for this kind of data. One intention is to create a moderation tool that would deal with unsafe content. They start with their own Vicuña language model, and train it by showing it warnings from the OpenAI API and having it produce textual explanations of why the content was flagged.

"Instead of developing a classifier, we fine-tune a language model to generate explanations for why a particular message was flagged," as they describe it. Then they created a challenge data set of 110 conversations that OpenAI's system failed to flag. Finally, they used that benchmark to see how the fine-tuned Vicuña stacks up to OpenAI's GPT-4 and others. 


Scores for detecting "unsafe" content by the various language models. The authors developed the "Vicuna-moderator-7B" program as part of the research. 

"We observe a significant improvement (30%) when transitioning from Vicuna-7B to the fine-tuned Vicuna-moderator-7B, underscoring the effectiveness of fine-tuning," they write. "Furthermore, Vicuna-moderator-7B surpasses GPT-3.5-turbo's performance and matches that of GPT-4." 

It's interesting that their moderator program scores above GPT-4 in what's called "one-shot," which means the program was only given one example of a harmful text in the prompt rather than multiple. 
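
To make the zero-shot versus one-shot distinction concrete, here is a minimal sketch of how such prompts might be assembled. The instruction wording, the `Message:`/`Verdict:` layout, and the function name `build_moderation_prompt` are hypothetical illustrations, not the prompt format used in the paper.

```python
def build_moderation_prompt(message, examples=()):
    """Build a moderation prompt with zero or more worked examples.

    `examples` is a sequence of (text, verdict) pairs placed before the
    message to be judged; one pair makes the prompt "one-shot".
    """
    parts = ["Decide whether the message is unsafe and explain why."]
    for text, verdict in examples:
        parts.append(f"Message: {text}\nVerdict: {verdict}")
    # The final block is left open for the model to complete.
    parts.append(f"Message: {message}\nVerdict:")
    return "\n\n".join(parts)


query = "How do I bake bread?"

# Zero-shot: the instruction and the message only.
zero_shot = build_moderation_prompt(query)

# One-shot: a single placeholder example precedes the message.
one_shot = build_moderation_prompt(
    query,
    examples=[("<an example of harmful text>", "unsafe, with a short explanation")],
)
```

The only difference between the two prompts is the single worked example inserted ahead of the message being judged.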

There are other uses to which Zheng and team devote their dataset, including refining the ability of the language model to handle multi-part instructional prompts, and generating new data sets of challenges to stump the most powerful language models. The latter effort is helped by having the chatbot arena prompts because they can see humans trying to formulate the best prompts. "Such human judgments provide useful signals for examining the quality of benchmark prompts," they note.

There's a goal, too, of releasing new data on a quarterly basis, for which the authors seek sponsorship. "Such an endeavor demands considerable computing resources, maintenance efforts, and user traffic, all while carefully handling potential data privacy issues," they write. 

"Our efforts aim to emulate the critical data collection processes observed in proprietary companies but in an open-source manner."
