These authors are suing OpenAI and Meta for copyright infringement now

This could catalyze stricter regulation on using copyrighted work to train AI models.
Written by Maria Diaz, Staff Writer
NEW YORK, NEW YORK - MAY 05: Sarah Silverman speaks on stage at Variety's 2022 Power Of Women: New York Event Presented By Lifetime at The Glasshouse on May 05, 2022 in New York City. (Photo by Cindy Ord/Getty Images for Variety)


Sarah Silverman joined forces with fellow authors Richard Kadrey and Christopher Golden to sue Meta and OpenAI in dual claims of copyright infringement.

The suits are separate, each against one of the companies, and the authors claim they never consented to the use of their copyrighted books as training material for the large language models (LLMs) behind OpenAI's ChatGPT and Meta's LLaMA.


An LLM is a type of artificial intelligence algorithm trained using massive amounts of information from books and texts from the internet to learn language patterns, grammar, and context until it can generate human-like text and have chat interactions with users. 

According to the lawsuits, the models "remix the copyrighted works of thousands of book authors -- and many others -- without consent, compensation, or credit." 

Copyright infringement has been one of the many concerns of AI skeptics since ChatGPT became widely available in November 2022, triggering the generative AI boom and raising questions about how AI will affect creative work and copyright.


The lawsuits claim the LLMs were trained on illegally acquired materials, such as those found on "shadow library" websites. According to the OpenAI suit:

"The OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only 'internet-based books corpora' that have ever offered that much material are notorious 'shadow library' websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems."

The Meta suit makes similar claims and identifies the sources from which the books in the training data were gathered. It divides them into two groups: the first is Project Gutenberg, an online archive of books that are out of copyright, and the second is the "Books3 section of ThePile", a dataset that is available on the popular AI project hosting site Hugging Face and appears to represent all of Bibliotik, mentioned above.


The plaintiffs are represented by lawyers Joseph Saveri and Matthew Butterick, who also represent authors Mona Awad and Paul Tremblay in a copyright infringement lawsuit filed against OpenAI in June.
