Generative AI enables developers and business users to generate code and narratives at the push of a button. The bad news is that "most of the words and images in datasets behind artificial intelligent agents like ChatGPT and DALL-E are copyrighted," the paper's authors point out. "Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation."
The legal doctrine that provides cover for the use of limited segments of the code, narratives, and images is "fair use." However, the study's authors assert, "If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. Fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use."
AI and machine learning practitioners "aren't necessarily aware of the nuances of fair use," Peter Henderson, a co-author of the paper, pointed out in a related interview. "At the same time, the courts have ruled that certain high-profile real-world examples are not protected fair use, yet those very same examples look like things AI is putting out. There's uncertainty about how lawsuits will come out in this area."
The co-authors predict an upcoming wave of lawsuits stemming from uncompensated use of code and content with generative AI. If courts eventually rule that AI does not meet the criteria of fair use, it "could dramatically curtail how generative AI is trained and used. As AI tools continue to advance in capabilities and scale, they challenge the traditional understanding of fair use, which has been well-defined for news reporting, art, teaching, and more. New AI tools -- both their capability and scale -- complicate this definition."
The IP and copyright implications for code also need to be weighed. "Like in natural language text cases, in software cases, literal infringement -- verbatim copying -- is unlikely to be fair use when it comprises a large portion of the code base," the co-authors explain. "And when the amount copied is small, the overall product is sufficiently different from the original one, or the code is sufficiently transformative, then fair use may be indicated under current standards."
Such standards may be more permissive than those for text or music. "Functional aspects of code are not protected by copyright, meaning that copying larger segments of code verbatim might be allowed in cases where the same level of similarity would not be permissible for text or music. Nonetheless, for software generated by foundation models, the more the generated content can be transformed from the original structure, sequence, and organization, the better."
The report's co-authors recommend establishing technical guardrails. "Install fair use filters that try to determine when the generated work -- a chapter in the style of J.K. Rowling, for instance, or a song reminiscent of Taylor Swift -- is a little too much like the original and begins to infringe on fair use."