Anthropic, Nvidia, Apple, and Salesforce Utilize YouTube Transcripts for AI Training: Report

  • Companies including Anthropic PBC, Nvidia Corp., Apple Inc., and Salesforce Inc. reportedly trained AI models on YouTube video subtitles without the creators' permission.
  • The subtitles came from a dataset provided by EleutherAI, a nonprofit focused on AI model transparency, covering 173,536 videos from more than 48,000 channels.
  • The dataset includes transcripts from prominent YouTubers such as MrBeast and David Pakman; some creators, including Pakman, have expressed concerns over the unauthorized use of their transcripts.
  • The issue echoes earlier lawsuits against major tech firms over whether publicly available data used in AI training is protected by copyright.
  • U.S. case law establishes that facts themselves cannot be copyrighted.

Main AI News:

A new report sheds light on the contentious practice of training artificial intelligence systems on YouTube video subtitles without explicit consent from content creators. According to Proof News, several major companies, including Anthropic PBC, Nvidia Corp., Apple Inc., and Salesforce Inc., used subtitles sourced from 173,536 YouTube videos across more than 48,000 channels. These companies are not accused of scraping the content directly; they are reported to have used a dataset provided by EleutherAI, a nonprofit organization focused on enhancing the interpretability and alignment of large AI models.

Founded in 2020, EleutherAI aims to democratize access to advanced AI technologies through the development and open-source release of models like GPT-Neo and GPT-J. The organization also advocates for open science norms in natural language processing, enabling independent researchers to study and audit AI technologies, thereby promoting transparency and ethical practices in AI development.

The dataset in question, labeled as “YouTube Subtitles,” includes transcripts extracted from educational and online learning channels, as well as media outlets and prominent YouTube personalities such as MrBeast, Marques Brownlee, PewDiePie, and David Pakman. Despite the dataset’s public availability, some content creators, including Pakman, have expressed concerns about the potential impact on their livelihoods due to the unauthorized use of their transcripts.

This issue is not isolated: similar legal challenges have already arisen over the use of publicly available data to train AI models. Previous lawsuits involving Microsoft Corp., OpenAI, Google LLC, and Meta Platforms Inc. have raised questions about the copyrightability of publicly stated facts used in AI training data. These cases draw on U.S. case law holding that facts themselves cannot be copyrighted, as established in landmark decisions such as Feist Publications Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), and International News Service v. Associated Press, 248 U.S. 215 (1918).

Conclusion:

The revelation that major tech companies have utilized YouTube subtitles for AI training without explicit consent underscores ongoing ethical dilemmas in the technology sector. This practice raises significant legal and ethical questions about the use of publicly available content for commercial AI applications, particularly regarding the rights of content creators and the boundaries of permissible data usage in AI development.
