In-Depth Analysis Reveals Rampant Unauthorized Use of Publisher Content Fueling Generative AI Technologies

TL;DR:

  • News/Media Alliance uncovers widespread unauthorized use of publisher content by generative AI developers.
  • The Alliance raises concerns about the sustainability of high-quality content and the legal implications.
  • GAI systems extensively rely on copied journalistic content for training, potentially harming publishers.
  • Recommendations include recognizing copyright infringement, transparency requirements, and international cooperation.
  • The Alliance emphasizes the importance of enforcing copyright protections and maintaining high-quality standards.

Main AI News:

The News/Media Alliance has recently published a comprehensive White Paper with an accompanying technical analysis and has submitted comments to the U.S. Copyright Office on the use of publisher content to power generative artificial intelligence (GAI) technologies. Together, these three publications shed light on the pervasive, unauthorized use of publisher content by GAI developers. This unregulated practice not only threatens the sustainability and availability of high-quality original content but also raises serious legal concerns.

GAI systems have proliferated by copying vast amounts of expressive material from the Alliance’s member publications, almost always without obtaining authorization or providing fair compensation to the original creators. The result is the emergence of new products and services that directly compete with the offerings of Alliance member publishers.

The Alliance acknowledges the immense potential of GAI models and applications to enhance various aspects of our daily lives. However, it firmly advocates that this development should not come at the expense of publishers and journalists who invest substantial time and resources in producing content that informs, protects, entertains, and holds our government officials and decision makers accountable.

The Alliance and its members are open to collaboration with GAI developers to foster the sustainable and responsible growth of these transformative technologies.

While the Copyright Office submission and White Paper discuss the broader landscape of publishers in the face of the GAI revolution, the accompanying technical analysis delves into the extent to which GAI developers rely on high-quality journalistic content to fuel their models. Key findings include:

  • GAI developers have extensively copied and employed news, magazine, and digital media content to train their large language models (LLMs).
  • Curated datasets underpinning LLMs significantly overweight publisher content, by a factor ranging from over five to nearly 100, compared with the generic web content collected by Common Crawl.
  • News and digital media content ranks third among all source categories in Google’s C4 training set, a foundational element in the development of Google’s GAI-powered products like Bard. Furthermore, half of the top ten sites represented in this dataset are news outlets.
  • LLMs not only copy publisher content but also use it in their outputs. These models can reproduce the content on which they were trained, demonstrating their ability to memorize and replicate expressive content.

Danielle Coffey, President & CEO of the Alliance, emphasized, “Our research and analysis reveal that AI companies and developers are not only engaging in unauthorized copying of our members’ content for product training but are doing so extensively, more so than other sources. This underscores their recognition of the unique value we bring, yet most developers fail to secure proper permissions through licensing agreements or provide fair compensation to publishers. This not only harms publishers but also endangers the sustainability of AI models and the availability of reliable information.”

The Copyright Office comments and White Paper provide a range of recommendations for policymakers, including:

  • Recognizing that unauthorized use of publishers’ expressive content for commercial GAI training infringes copyright and directly competes with and harms publisher businesses.
  • Establishing transparency requirements that mandate disclosure of the use of copyright-protected content in training.
  • Encouraging and facilitating effective licensing solutions.
  • Promoting international cooperation and harmonization of GAI regulations.
  • Adopting legislation to rectify existing market imbalances that hinder publishers from engaging in fair negotiations for the use of their content on dominant platforms.

Coffey concluded, “Generative AI systems should be held to the same standards of responsibility and accountability as any other business. This White Paper highlights the reliance of these systems on journalistic and creative content, which represents an investment in quality. Publishers are also bound by law to take responsibility for the content they share with the public. Continued unauthorized use jeopardizes markets that acknowledge the value of archived and real-time quality content, ultimately leading to the deterioration of GAI models themselves. Quality in, quality out. It is imperative that we rigorously enforce copyright protections and uphold high standards of quality and accountability as the cornerstones of these and other emerging technologies.”

The News/Media Alliance is a nonprofit organization representing over 2,200 news and magazine media organizations and their multiplatform businesses in the United States and worldwide. Its membership includes print and digital publishers committed to original journalism. Headquartered just outside Washington, D.C., the association is dedicated to ensuring the future of journalism through communication, research, advocacy, and innovation.

Conclusion:

The unauthorized use of publisher content to fuel generative AI technologies presents a significant challenge for the market. It threatens the sustainability of high-quality content, endangers publishers, and underscores the need for stronger copyright enforcement. Policymakers and industry players must work together to protect the interests of publishers and maintain the integrity of AI-driven innovations.
