With the overnight sensation of ChatGPT, it was only a matter of time before the use of generative AI became both a subject of serious research and also grist for the training of generative AI itself.
In a research paper released this month, scholars gathered a database of one million "real-world conversations" that people have had with 25 different large language models. Released on the arXiv pre-print server, the paper was authored by Lianmin Zheng of the University of California at Berkeley, along with colleagues at UC San Diego, Carnegie Mellon University, Stanford, and Abu Dhabi's Mohamed bin Zayed University of Artificial Intelligence.
A sample of 100,000 of those conversations, selected at random by the authors, showed that most were about subjects you'd expect. The top 50% of interactions were on such pedestrian topics as programming, travel tips, and requests for writing help.
But below that top 50%, other topics crop up, including role-playing characters in conversations, and three topic categories that the authors term "unsafe": "Requests for explicit and erotic storytelling"; "Explicit sexual fantasies and role-playing scenarios"; and "Discussing toxic behavior across different identities."
The authors speculate that the full one million conversations may contain "even more harmful content." They used OpenAI's moderation API, in part, to tag conversations as "unsafe," although, as they discuss in detail, OpenAI's own system sometimes falls down on the job.
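Tagging a conversation from a moderation check can be sketched roughly as follows. The response shape below mirrors the general form of OpenAI's moderation output (category names mapped to booleans), but the helper name and the example data are hypothetical, not taken from the paper.

```python
# Sketch: extract the "unsafe" category labels from a moderation-style
# response. The response format imitates OpenAI's moderation API
# (a "categories" dict of booleans); the function name is illustrative.

def unsafe_labels(moderation_result: dict) -> list[str]:
    """Return the category names a moderation check flagged, sorted."""
    categories = moderation_result.get("categories", {})
    return sorted(name for name, flagged in categories.items() if flagged)

# Abbreviated example response with two categories:
result = {"flagged": True,
          "categories": {"sexual": True, "hate": False}}
print(unsafe_labels(result))  # → ['sexual']
```

A real pipeline would run each of the million conversations through the moderation endpoint and store these labels alongside the chat transcript.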
"Open-source models without safety measures tend to generate flagged content more frequently than proprietary ones," they write. "Nonetheless, we still observe 'jailbreak' successes on proprietary models like GPT-4 and Claude." And, in fact, they note that GPT-4 gets broken a third of the time on the challenges, which seems a high rate for something with guardrails in place.
Examples of the so-called unsafe conversations are listed in the paper's appendix. Of course, the term "unsafe" can have a very broad meaning. Some of the examples shown are akin to mass-market erotic fiction sold in bookstores, so the opprobrium has to be taken with a grain of salt.
Collected over a period of five months, April to August of this year, the data set -- called "LMSYS-Chat-1M" -- is "the first large-scale, real-world LLM conversation dataset," they write.
LMSYS-Chat-1M towers above the previously largest-known dataset, compiled by the AI startup Anthropic, which had 339,000 conversations. Where Anthropic had only 143 users in its study, Zheng and team gathered chats from more than 210,000 users, across 154 languages, using 25 different large language models, including proprietary models such as OpenAI's GPT-4 and Anthropic's Claude, as well as open-source models such as Vicuna.
The dataset serves several goals. The first is to fine-tune language models in order to improve their performance. Another is to develop safety benchmarks for generative AI by studying user prompts that could lead language models astray, such as requests for malicious information.
As the authors note, not everyone can gather this data. It's expensive to run large language models, and the parties that can afford it, such as OpenAI, generally keep their data secret for commercial reasons.
One concrete application the authors have in mind is a moderation tool to deal with unsafe content. They start with their own Vicuna language model, and train it by showing it warnings from the OpenAI API and having it produce textual explanations of why the content was flagged.
"Instead of developing a classifier, we fine-tune a language model to generate explanations for why a particular message was flagged," as they describe it. Then they created a challenge data set of 110 conversations that OpenAI's system failed to flag. Finally, they used that benchmark to see how the fine-tuned Vicuña stacks up to OpenAI's GPT-4 and others.
"We observe a significant improvement (30%) when transitioning from Vicuna-7B to the fine-tuned Vicuna-moderator-7B, underscoring the effectiveness of fine-tuning," they write. "Furthermore, Vicuna-moderator-7B surpasses GPT-3.5-turbo's performance and matches that of GPT-4."
It's interesting that their moderator program scores above GPT-4 in what's called the "one-shot" setting, meaning the program was given only one example of harmful text in the prompt rather than several.
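A one-shot prompt of the kind described above can be sketched as follows. The wording and labels are illustrative only, not the paper's evaluation template.

```python
# Sketch: a "one-shot" moderation prompt, in which the model sees a
# single labeled example before the message it must judge. All wording
# here is an assumption for illustration.

def one_shot_prompt(example_text: str, example_label: str,
                    query: str) -> str:
    """Build a prompt with one worked example followed by the query."""
    return (f"Message: {example_text}\nVerdict: {example_label}\n\n"
            f"Message: {query}\nVerdict:")

print(one_shot_prompt("You are an idiot.", "flagged: harassment",
                      "What's the weather today?"))
```

In a "few-shot" variant, several such labeled examples would precede the query instead of one.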
There are other uses to which Zheng and team put their dataset, including refining a language model's ability to handle multi-part instructional prompts, and generating new challenge sets to stump the most powerful language models. The latter effort is helped by having the prompts from Chatbot Arena, the team's public model-comparison site, because they can see humans trying to formulate the best prompts. "Such human judgments provide useful signals for examining the quality of benchmark prompts," they note.
There's a goal, too, of releasing new data on a quarterly basis, for which the authors seek sponsorship. "Such an endeavor demands considerable computing resources, maintenance efforts, and user traffic, all while carefully handling potential data privacy issues," they write.
"Our efforts aim to emulate the critical data collection processes observed in proprietary companies but in an open-source manner."