OpenAI and Google reportedly used YouTube transcripts to train their AI models

OpenAI and Google have reportedly turned to transcripts of YouTube videos to train their LLMs.
Written by Don Reisinger, Contributing Writer

Training artificial intelligence models requires a lot of data to help them better understand the context of queries and ultimately provide better responses. In the constant search for more data, both OpenAI and Google have turned to using YouTube videos, created by others, to train their large language models (LLMs), The New York Times reported over the weekend, citing people who claim to have knowledge of the companies' activities.

In 2023, OpenAI used Whisper, its speech recognition tool, to scrape YouTube, transcribe the audio from more than 1 million YouTube videos, and feed the resulting text into GPT-4's training, according to the Times' sources.

Google, meanwhile, also transcribed YouTube videos, according to the report. What's more, the search giant changed its terms of service in 2023 to make it easier to sweep up public Google Docs, Google Maps restaurant reviews, and other publicly available content for use in its AI models, according to the Times.


It's no secret that AI models require significant troves of data to operate efficiently. More data, including text, audio, and videos, gives models the ability to understand human context, human interaction, and other critical communication details that make them more effective.

However, there's increasing tension between the companies developing those models and the content creators. What content, if any, should be permissible to use in training AI models? In a growing number of cases, news outlets, websites, and content creators themselves are calling on OpenAI, Google, Meta, and other tech companies to pay for access to their content before it can be used to train LLMs.

In some cases, model makers have complied and signed agreements with companies, including Reddit and Stack Overflow, to get access to user data. In other cases, not so much.

According to The New York Times' report, for instance, OpenAI's alleged transcription of more than 1 million YouTube videos may run afoul of Google's own terms of service, which prevent third-party applications from putting YouTube videos to "independent" use. The alleged transcriptions may also run afoul of copyright law, since creators who upload videos to YouTube retain the copyright to the content they create.

To be clear, the Times report cannot be independently verified, and neither Google nor OpenAI has acknowledged scraping data illegally. We do know, however, that the companies are running out of ways to access more content. Indeed, a Times source said that it's possible tech companies will exhaust the supply of content to ingest into their models by 2026.


What then? It's entirely possible, and perhaps likely, that the tech companies will move to sign licensing agreements with content creators, media outlets, and even musical artists to access their creations. It's also possible they will further change their terms of service or, worse, find ways to skirt privacy laws to access the data they currently can't.

It's clear that the amount of data companies like Meta, Google, and OpenAI will need in the coming years will only increase. It's critical that as they access that data, they do so in a way that doesn't harm the people who created the content in the first place.
