Research shows companies are training AI models with YouTube content without permission

Artificial intelligence models need as much useful data as possible to work, but some of the biggest AI developers rely in part on transcribed YouTube videos without the creators’ permission, in violation of YouTube’s own rules, according to an investigation by Proof News and Wired.

The two media outlets revealed that Apple, Nvidia, Anthropic and other major AI companies trained their models on a dataset called “YouTube Subtitles,” which contains transcripts of nearly 175,000 videos from 48,000 channels — all without the knowledge of the video creators.

The YouTube Subtitles dataset includes the text of video subtitles, often with translations into multiple languages. The dataset was created by EleutherAI, which says its goal is to lower the barriers to AI development for people outside of large tech companies. It is just one component of a much larger EleutherAI dataset called “the Pile.” In addition to the YouTube transcripts, the Pile includes Wikipedia articles, speeches from the European Parliament, and even emails from Enron, according to the report.

However, the Pile also has many fans among the big tech companies. For example, Apple used the Pile to train its OpenELM AI model, and a Salesforce AI model released two years ago was trained on the Pile and has since been downloaded more than 86,000 times.

The YouTube Subtitles dataset includes a number of popular channels across news, education and entertainment, including content from major YouTube stars such as MrBeast and Marques Brownlee, all of whom have had their videos used to train AI models. Proof News has set up a search tool that scans the collection to see if a particular video or channel is included. There are even some TechRadar videos in the collection, as seen below.

YouTube subtitle dataset

(Image credit: Proof News)

The YouTube Subtitles dataset appears to violate YouTube’s terms of service, which explicitly prohibit automated scraping of its videos and related data. Yet that’s exactly what the dataset relied on: a script that downloaded subtitles via YouTube’s API. The investigation found that the automated download selected videos using nearly 500 search terms.

The discovery sparked widespread surprise and anger among the YouTube creators Proof News and Wired interviewed. Concerns about unauthorized use of content are valid, and some of the creators were upset at the idea that their work could be used in AI models without payment or permission. That was especially true for those who found that the dataset contains transcripts of videos they had since deleted, and in one case, the data comes from a creator who has since removed their entire online presence.

The report did not include any comment from EleutherAI. However, it noted that the organization describes its mission as democratizing access to AI technologies through the publication of trained models. If this dataset is anything to go by, that mission could conflict with the interests of content creators and platforms. The legal and regulatory battles surrounding AI were already complex, and this kind of revelation will likely make the ethical and legal landscape of AI development even more treacherous. It’s easy to call for a balance between innovation and ethical responsibility in AI, but striking it will be much harder.
