Open source isn't ready for generative AI. How stakeholders are changing this light bulb together

Open-source licenses, already stretched thin by software-as-a-service and the cloud, are an even worse fit for AI's large language models. What's an open source leader to do?
Written by Steven Vaughan-Nichols, Senior Contributing Editor

Without open source, there is no AI. It's that simple. But open-source licenses have been showing their age: The GNU General Public License (GPL), Apache License, and Mozilla Public License don't fit well with software-as-a-service or cloud services. AI poses even larger problems. Open-source licenses, with their copyright law foundations, aren't a good fit for AI's large language models (LLMs).

This isn't just some theoretical techno-legal issue, either. It's already showing up in the courts. 

In J. Doe 1 et al. vs GitHub, the plaintiffs allege that Microsoft, OpenAI, and GitHub -- via their commercial AI-based systems, OpenAI's Codex and GitHub's Copilot -- stole their open-source code. The class action suit claims that code "suggested" by AI often consists of near-identical strings of code scraped from public GitHub repositories -- but without the required open-source license attributions.

On a related issue, two groups of writers, including George R.R. Martin, Michael Chabon, and John Grisham, are suing Microsoft and OpenAI for taking their work and using it in their LLMs. Copyright, the legal foundation of open source, is at the heart of this issue.  

But this isn't simply a Microsoft problem.

As Sean O'Brien, Yale Law School lecturer in cybersecurity and founder of the Yale Privacy Lab, told my ZDNET colleague David Gewirtz: "I believe there will soon be an entire sub-industry of trolling that mirrors patent trolls, but this time surrounding AI-generated works. A feedback loop is created as more authors use AI-powered tools to ship code under proprietary licenses. Software ecosystems will be polluted with proprietary code that will be the subject of cease-and-desist claims by enterprising firms."

Others, like German researcher and politician Felix Reda, claim that all AI-produced code is public domain.

US attorney Richard Santalesa, a founding member of the SmartEdgeLaw Group, told Gewirtz that there exist both contract and copyright law issues -- and they're not the same thing. Santalesa believes companies producing AI-generated code will "as with all of their other IP, deem their provided materials – including AI-generated code – as their property." Besides, public domain code is not the same thing as open-source code.

So, what's to be done? Simply claiming your AI is open source is a nonstarter. Meta, for example, claims Llama 2 is open source. It's not. 

As Erica Brescia, a managing director at RedPoint, the open source-friendly venture capital firm, asked on Twitter: "Can someone please explain to me how Meta and Microsoft can justify calling Llama 2 open source if it doesn't actually use an OSI [Open Source Initiative]-approved license or comply with the OSD [Open Source Definition]? Are they intentionally challenging the definition of OSS [Open Source Software]?" 

Here's the short explanation: Meta is using open source as a marketing term, not a legal one. That usage won't fly once the lawsuits mount up.

The problem with Llama 2 specifically is that its license blocks extremely profitable companies from using it. According to Stephen O'Grady, open-source licensing expert and RedMonk co-founder, the problem is that restrictions like this don't work in open source. "Imagine if Linux was open source unless you worked at Facebook," he said.

At the same time, as OpenUK CEO Amanda Brock observed, "I don't think we're going to see going forward any LLM or any significant AI being able to be licensed as open source, because the key to open source is the Open Source Definition."  

And the road to that Definition was a long and bumpy one.

The first free software licenses date back to the early 1980s, when MIT Lab programmer Richard M. Stallman couldn't get an early laser printer, the Xerox 9700, to produce error messages. The problem? Stallman couldn't read or change its source code. At the time, this was a new development. Although we now think of proprietary software as the default, it wasn't then.

So, Stallman created the GNU General Public License (GPL). While not the first Free Software license (that honor belongs to the Berkeley Software Distribution (BSD) license), the GPL would prove to be very influential. In no small part, that's because Linus Torvalds chose to use the GPLv2 as Linux's license.

The GPL is based on two principles. First, software code can be copyrighted. Second, anyone is free to read and edit the code so long as these freedoms aren't taken away from anyone else.

By 1985, Free Software was becoming popular, but it also had become clear that the word "free" was too ambiguous. After Netscape released Mozilla's source code -- which became the basis of the Firefox web browser -- several leading Free Software luminaries, including Eric S. Raymond, Bruce Perens, Michael Tiemann, Jon "Maddog" Hall, and Christine Peterson, coined the phrase open source to describe this kind of license. In 1998, Perens and Raymond went on to found the OSI, which drafted the Open Source Definition (OSD) and used this as the general guide to defining all open-source licenses. 

All open-source licenses must comply with the OSD. For AI and LLMs, that's much easier said than done. 

True, there are open LLMs such as Falcon, FastChat-T5, and OpenLLaMA. But most LLMs contain proprietary, copyrighted, or simply unknown information that their owners won't tell you about. The Electronic Frontier Foundation (EFF) says it well: "Garbage In, Gospel Out." 

We've seen this problem coming for a while. At Open Source Europe in Bilbao, Spain, last month, I spoke with Stefano Maffulli, executive director of the Open Source Initiative (OSI), the organization that defines and manages open-source licenses. "The process started two years ago when GitHub Copilot came out," Maffulli told me. "It was a watershed moment. All of a sudden, code you wrote as a human for humans, everything we have produced and put on the Internet was being harvested for machine learning."

So, what can we do? Maffulli and other open-source and AI leaders are working on combining AI with open-source licenses in sensible ways. 

Maffulli observed that combining AI with open-source licenses is as hard as, if not harder than, applying software copyright to source code was in the 1980s (when Free Software and open source were first defined). True, open-source AI programs -- such as TensorFlow, PyTorch, and Hugging Face -- work well with old-style licenses. But old-style software isn't the problem. It's where software and data mix that the existing open-source licenses begin to break down. Specifically, trouble emerges where all that data and code merge together in AI/ML artifacts, such as datasets, models, and weights. "Therefore," said Maffulli, "we need to make a new definition for open-source AI."

This must be a definition that all stakeholders can agree upon and work with. Free software and open source are no longer just matters for developers. The goals of open-source savvy programmers and lawyers aren't the same as those of AI companies. To address this, Maffulli, together with Google, Microsoft, GitHub, Open Forum Europe, Creative Commons, Wikimedia Foundation, Hugging Face, the Linux Foundation, ACLU, Mozilla, and the Internet Archive, is working on a draft for defining a common understanding of open-source AI. In other words, all the AI players are working on the definition.

If all goes well, we can expect to see the fruits of their labor as early as this month. And while this will only be the first draft of the AI Open Source Definition, I expect that it will be finalized as quickly as possible. Everyone involved knows that AI is advancing rapidly and the sooner we get an open-source framework around it, the better.
