Meta Faces Legal Storm Over Alleged Use of Pirated Books in AI Training

TL;DR:

  • Meta Platforms faces allegations of using pirated books to train its AI models despite legal warnings.
  • Prominent authors, including Sarah Silverman and Michael Chabon, accuse Meta of using their works without permission for its AI language model, Llama.
  • A California judge partially dismissed the lawsuit but allowed authors to amend their claims.
  • Meta’s legal department chat logs reveal concerns about the legality of using book files for training.
  • The chat logs indicate that Meta may have been aware that its use of copyrighted books might not be protected by U.S. copyright law.
  • Tech companies are facing lawsuits for using copyrighted works to build generative AI models, potentially increasing costs and legal risks.
  • New regulations in Europe may require AI companies to disclose their data sources for training models.
  • Meta’s release of Llama 2 could disrupt the AI market, especially for companies like OpenAI and Google.

Main AI News:

Meta Platforms, embroiled in a copyright infringement lawsuit, stands accused by prominent authors of using thousands of pirated books to train its AI models despite explicit warnings from its legal team. Comedian Sarah Silverman and Pulitzer Prize winner Michael Chabon, among others, are behind the legal action, alleging that Meta’s artificial intelligence language model, Llama, was trained on their copyrighted works without permission.

In a recent ruling, a California judge partially dismissed the Silverman lawsuit but signaled a willingness to let the authors amend their claims, further intensifying the legal battle. Meta has yet to comment on the allegations.

A new complaint filed on Monday includes chat logs of a Meta-affiliated researcher discussing the acquisition of the contentious dataset on a Discord server. These records could serve as crucial evidence that Meta was aware of potential problems under U.S. copyright law.

In the chat logs, researcher Tim Dettmers describes going back and forth with Meta’s legal department over whether the book files could legally be used as training data. Writing in 2021, Dettmers pointed to uncertainty within Meta about using the data: “At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons.” The dataset, known as “ThePile,” was later used by Meta to train the first version of Llama.

Dettmers added that Meta’s lawyers had expressed reservations, telling him that “the data cannot be used or models cannot be published if they are trained on that data.” Although the specifics of those concerns were not spelled out, the chat participants identified “books with active copyrights” as the primary source of apprehension and argued that training on such data should fall under “fair use,” the U.S. legal doctrine that protects certain unlicensed uses of copyrighted material.

Dettmers, a doctoral student at the University of Washington, did not immediately comment on the allegations.

In a year marked by a barrage of lawsuits, tech companies are grappling with content creators who accuse them of appropriating copyrighted material to build the generative AI models that have taken the world by storm. These legal battles could raise the cost of building data-hungry AI models, as companies may be compelled to compensate artists, authors, and other content creators for the use of their intellectual property.

At the same time, new European regulations on artificial intelligence may require companies to disclose the data used to train their models, exposing them to additional legal risk.

Meta released the first version of its Llama large language model in February, disclosing the datasets used to train it, including the controversial “Books3 section of ThePile,” which reportedly contains 196,640 books. The company did not, however, disclose the training data for its successor, Llama 2, which was released for commercial use over the summer. That release was closely watched in the tech industry, since a freely available model could disrupt the dominance of players like OpenAI and Google, which charge for access to their AI models.

Conclusion:

The legal challenges faced by Meta and the broader implications for the AI market underscore the need for tech companies to prioritize copyright compliance and transparency in their AI development processes. These legal battles and regulatory changes could reshape the landscape, potentially affecting the market dominance of established players like OpenAI and Google.

Source