Tech Giants Face Legal Battle Over Alleged Use of Pirated Books in AI Development

TL;DR:

  • Microsoft, Meta, and Bloomberg face a class-action lawsuit for allegedly using copyrighted books in AI development without permission.
  • The lawsuit centers on the use of a plaintext dataset called Books3, containing data from 197,000 books, to train AI systems.
  • Large Language Models (LLMs) play a pivotal role in AI development by understanding and generating human language.
  • Tech companies like Microsoft and Meta benefit from LLMs by reducing reliance on human labor and creating personalized offerings.
  • The dataset Books3 was later incorporated into “The Pile” and hosted by EleutherAI, a grassroots collective.
  • Meta’s Large Language Model Meta AI (LLaMa) and Bloomberg’s AI system allegedly used the dataset without acknowledging copyrighted content.
  • The lawsuit includes claims of copyright infringement, conversion, and negligence, seeking class certification and damages.

Main AI News:

A recent lawsuit filed in the US District Court for the Southern District of New York has sent shockwaves through the tech industry. Microsoft, Meta (formerly Facebook), and Bloomberg stand accused of utilizing copyrighted books without permission in the development of their Artificial Intelligence (AI) technology. This class action, led by prominent authors including former Arkansas Governor and presidential candidate Mike Huckabee, alleges that their intellectual property was employed in training AI systems, using a plaintext dataset known as Books3.

Books3, a collection of data scraped from approximately 197,000 nonfiction books and novels published within the last two decades, was initially compiled by independent developer Shawn Presser. Presser, along with a team of developers, intended to make this dataset available for the creation of generative AI tools. These tools, known as Large Language Models (LLMs), are instrumental in understanding and generating human language, making them indispensable in AI development.

LLMs have become a cornerstone of AI technology, giving businesses the means to enhance profitability through personalized offerings and reduced reliance on human labor. These models are trained on extensive datasets that draw from various sources, including the internet, books, and articles.

Books3 eventually became a component of a more extensive dataset called “The Pile,” which was hosted on the internet by EleutherAI, a grassroots collective of natural language processing researchers. EleutherAI, also named in the complaint, presented the dataset as a “free, open-source data set for the training of LLMs.”

Meta, through its Large Language Model Meta AI (LLaMa), incorporated “The Pile” into its original dataset. However, according to the lawsuit, Meta failed to acknowledge the presence of copyrighted works within the dataset. The initial release of LLaMa proved powerful, and in July 2023 Meta announced a partnership with Microsoft with the specific goal of competing with OpenAI’s ChatGPT.

In August 2023, Meta unveiled LLaMa 2, an upgraded version of the model. Notably, the company made no mention that “The Pile” dataset contained copyrighted works. The authors’ attorneys argue that Microsoft and Meta significantly benefited from earlier iterations of the LLM, as they did not have to invest additional resources in training their LLM from scratch with their own content or properly licensed material.

Following the release of OpenAI’s GPT-3, Bloomberg also entered the arena by working on its own LLM, announced in March 2023. The lawsuit alleges that Bloomberg used “Books3” to train its LLM to understand and respond in natural language. Although Bloomberg has stated that it will not use the dataset in future versions of “BloombergGPT,” the authors argue that their copyrighted works have already been integrated into Bloomberg’s LLM and all subsequent versions.

The lawsuit asserts that the copyright-protected works now serve as a foundation for all future LLMs developed by these companies. Furthermore, it claims that none of the defendants sought or obtained licenses to use the copyrighted works from Books3, and that they were aware that those responsible for assembling Books3 lacked the necessary licenses for dissemination.

The authors, therefore, allege that Microsoft, Meta, and Bloomberg deliberately used pirated and stolen works to profit from AI development, causing harm to the plaintiffs and the class. The lawsuit includes claims of direct and vicarious copyright infringement, conversion, and negligence. The authors seek class certification, an injunction to prevent the defendants from using their works in AI training, and damages.

Conclusion:

This legal battle highlights the significance of copyright protection in the AI industry. It could influence how companies source data for AI development and have broader implications for the market’s approach to intellectual property rights in AI technology. Companies may need to be more cautious about using datasets without proper licensing, potentially impacting their AI innovation strategies.
