There's big risk in not knowing what OpenAI is building in the cloud, warn Oxford scholars

The language-model-as-a-service industry is concealing critical details about reliability and trustworthiness, warns a report by the University of Oxford and its collaborators.
Written by Tiernan Ray, Senior Contributing Writer

One of the seminal events in artificial intelligence (AI) in 2023 was the decision by OpenAI, the creator of ChatGPT, to disclose almost no information about its latest large language model (LLM), GPT-4, when the company introduced the program in March.

That sudden swing to secrecy is becoming a major ethical issue for the tech industry because no one knows, outside OpenAI and its partner Microsoft, what is going on in the black box in their computing cloud. 

Also: With GPT-4, OpenAI opts for secrecy versus disclosure

The obfuscation is the subject of a report this month by scholar Emanuele La Malfa of the University of Oxford and collaborators at The Alan Turing Institute and the University of Leeds. 

In a paper posted on the arXiv pre-print server, La Malfa and colleagues explore the phenomenon of "Language-Models-as-a-Service" (LMaaS), referring to LLMs that are hosted online, either behind a user interface, or via an API. The primary examples of that approach are OpenAI's ChatGPT and GPT-4. 

"Commercial pressure has led to the development of large, high-performance LMs [language models], accessible exclusively as a service for customers, that return strings or tokens in response to a user's textual input -- but for which information on architecture, implementation, training procedure, or training data is not available, nor is the ability to inspect or modify its internal states offered," write the authors.

Diagram of differences between open-source language models and LMaaS

Differences between open-source language models and LMaaS. A user of open-source programs has complete control, while customers of an LMaaS service have to make do with what they get through a browser or an API. 

University of Oxford

Those access restrictions "inherent to LMaaS, combined with their black-box nature, are at odds with the need of the public and the research community to understand, trust, and control them better," they observe. "This causes a significant problem at the field's core: the most potent and risky models are also the most difficult to analyze."

The problem is one that has been pointed out by many parties, including competitors to OpenAI, especially those banking on open-source code to beat out closed-source code. For example, Emad Mostaque, CEO of generative AI startup Stability.ai, which produces tools such as the image generator Stable Diffusion, has said that enterprises cannot trust closed-source programs such as GPT-4. 

"Open models will be essential for private data," said Mostaque during a small meeting of press and executives in May. "You need to know everything that's inside it; these models are so powerful." 

Also: GPT-3.5 vs GPT-4: Is ChatGPT Plus worth its subscription fee?

La Malfa and team review the literature on the various language models, and identify how obfuscation prevents an audit of the programs along four critical factors: accessibility, replicability, comparability, and trustworthiness. 

The authors note that these concerns are a new development in AI ethics: "These issues are specific to the LMaaS paradigm and distinct from preexisting concerns related to language models."

Also: Why open source is essential to allaying AI fears, according to Stability.ai founder

Accessibility concerns the issue of keeping code secret, which disproportionately benefits huge companies with huge R&D budgets, the writers allege. 

"With the computational power distributed unevenly and concentrated in a tiny number of companies," they write, "those with a technological, yet not computational, advantage face a dilemma: While open-sourcing their LMaaS would benefit them in terms of market exposure and contribution to their codebase by the community, releasing the code that powers a model may rapidly burn their competitive advantage in favour of players with higher computational resources."

In addition, the uniform pricing of the LMaaS programs means people in less developed economies are at a disadvantage in accessing the tools. "A starting point to mitigate these issues is thus analyzing the impact of LMaaS and, more generally, pay-per-usage artificial intelligence services as a standalone, pervasive, and disruptive technology," they suggest.

Another issue is the increasing gap in how LLMs are trained: the commercial ones can re-use customer prompts and thereby set themselves apart from programs that use only public data, the authors observe. 

Also: How does ChatGPT work?

LMaaS' commercial licenses, they write, "grant companies the right to use prompts to provide, maintain, and improve their services," so that there's no common baseline of training data from which everyone draws. 

They offer a chart (below) that assesses the disparity in whether language models gather customer prompts for training and "fine-tuning" -- a stage that can enhance a language model's abilities -- and whether they let users opt out.

Chart of different language models

Comparison of whether language models offer opt-outs to their customers with respect to data, and whether they use the data for training and fine-tuning their black-box models. 

University of Oxford

After describing at length the various risks, La Malfa and team propose "a tentative agenda" to address the four areas, urging, "we need to work as a community to find solutions that enable researchers, policymakers, and members of the public to trust LMaaS."

For one, they recommend that "companies should release the source code" of their LMaaS programs, if not to the general public, then "LMaaS should at least be available to auditors/evaluators/red teams with restrictions on sharing."

Also: AI bots have been acing medical school exams, but should they become your doctor?

Companies, they propose, should not totally do away with older language models as they roll out new ones. Or, at least, "all the parameters that make up a model should be hashed, and a log of 'model commits' should be offered by model maintainers to the user, as the maintainer updates the model." And the field, including journals and conferences, should "discourage the usage of models" that don't pursue such precautions. 
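To make the "model commits" idea concrete, here is a minimal sketch of what such a log might look like. This is not from the paper itself: the parameter format, the use of SHA-256, and the `ModelCommitLog` class are all illustrative assumptions about how a maintainer could hash a model's parameters and publish a verifiable record of each update.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# hash all model parameters deterministically and keep a log of
# "model commits" that users can consult as the model is updated.
import hashlib
import json
from datetime import datetime, timezone


def hash_parameters(params: dict) -> str:
    """Produce a deterministic digest of a parameter dictionary.

    Sorting the keys before serializing makes the hash stable
    regardless of insertion order.
    """
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class ModelCommitLog:
    """A hypothetical append-only log of model updates."""

    def __init__(self):
        self.commits = []

    def commit(self, params: dict, note: str = "") -> str:
        digest = hash_parameters(params)
        self.commits.append({
            "hash": digest,
            "note": note,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return digest


# A maintainer records each update; any parameter change yields a
# new hash, so users can tell which version served their requests.
log = ModelCommitLog()
v1 = log.commit({"layer1.weight": [0.1, 0.2]}, note="initial release")
v2 = log.commit({"layer1.weight": [0.15, 0.2]}, note="fine-tuned update")
assert v1 != v2
```

Because the digest covers every parameter, even a black-box provider could publish such a log without revealing the weights themselves, letting auditors verify when a hosted model has silently changed.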

For benchmarking, tools need to be developed to test which elements of its prompts an LMaaS has absorbed, so that the baseline can be set accurately.  

Clearly, with LMaaS, the topic of AI ethics has entered a new phase, one in which critical information is kept under lock and key, making ethical choices a more difficult matter for everyone than they have been in the past. 
