Google Confirms Training Bard on Web Scraped Data

TL;DR:

  • Google confirms using publicly available web data, including social media posts, to train language models like Bard.
  • The updated privacy policy broadens the scope of data use to cover AI models generally, a change Google frames as benefiting users and the public.
  • The policy lacks specifics on preventing copyrighted materials from being included in the data pool.
  • Lawsuits and calls for stricter regulation are emerging as the origins of AI training data become more opaque.
  • The fair use doctrine’s application in this context remains uncertain, prompting legal and regulatory debates.
  • Gannett sues Google, alleging a monopoly in the digital ad market due to AI advancements.
  • Twitter and Reddit implement measures to restrict data harvesting, facing criticism from their communities.

Main AI News:

In a recent statement, Google confirmed that it uses publicly available web data to train its language models, including the highly anticipated Bard. Company spokesperson Christa Muldoon emphasized that Google’s privacy policy has always been transparent about this practice, which is now explicitly extended to newer services like Bard. Google maintains that privacy principles and safeguards are incorporated into the development of its AI technologies, in line with its AI Principles.

Google’s updated privacy policy, effective July 1st, 2023, further elaborates on the use of information to enhance services and develop new products, features, and technologies for the benefit of users and the public. The policy explicitly states that publicly available information is leveraged to train Google’s AI models and create innovative offerings such as Google Translate, Bard, and Cloud AI capabilities.

Analyzing the revision history of the policy reveals additional insights into the services that will be trained using the collected data. Notably, the information may now be utilized for “AI Models” beyond just “language models.” This expanded scope grants Google more flexibility in training and building systems based on public data. Interestingly, this detail is tucked behind an embedded link for “publicly accessible sources” under the “Your Local Information” tab, requiring users to click through to the relevant section.

While Google’s updated policy specifies the use of “publicly available information” for training AI products, it remains silent on the specific mechanisms employed to keep copyrighted materials out of the data pool. Several publicly accessible websites already have policies in place that prohibit data collection or web scraping for the training of large language models and AI toolsets. It will be intriguing to observe how this approach squares with regulations such as the EU’s General Data Protection Regulation (GDPR), which safeguard individuals against unauthorized use of their data.
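
One common, machine-readable way websites express such scraping policies is a robots.txt file. As a rough illustration only (not a description of Google’s or any other company’s actual crawling pipeline), the Python sketch below uses the standard library’s urllib.robotparser to check whether a crawler, identified here by the made-up user agent “ExampleLLMBot” on an example domain, is permitted to fetch a given path.

```python
# Minimal sketch: consulting a site's robots.txt before crawling, using only
# the Python standard library. The domain and the "ExampleLLMBot" user agent
# are hypothetical placeholders, not any real crawler's identifier.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt directives

# can_fetch() returns False if the site's policy disallows this user agent
# from accessing the given path.
if rp.can_fetch("ExampleLLMBot", "https://example.com/articles/"):
    print("Crawling permitted by robots.txt for this path.")
else:
    print("Site policy disallows crawling this path; skip it.")
```

Honoring such directives is voluntary rather than legally binding, which is part of why the questions about enforceable rules and GDPR compliance raised above remain open.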

The emergence of stricter laws, coupled with intensified market competition, has led major generative AI system developers, like OpenAI with their GPT-4, to be cautious about disclosing the origins of their training data. Questions have arisen regarding the inclusion of social media posts and copyrighted works by human artists and authors. Currently, the fair use doctrine’s applicability to this context remains uncertain, occupying a legal gray area. This ambiguity has prompted lawsuits and calls from lawmakers for more stringent regulations to govern how AI companies collect and employ training data. Additionally, concerns have been raised about how this data is processed and filtered to prevent harmful failures in AI systems, work that often leaves data workers facing long hours and difficult conditions.

In a related development, Gannett, the largest newspaper publisher in the United States, has filed a lawsuit against Google and its parent company, Alphabet, alleging that advancements in AI technology have enabled the search giant to establish a monopoly in the digital advertising market. Furthermore, Google’s AI search beta and similar products have been labeled “plagiarism engines” and criticized for diverting traffic away from the websites whose content they draw on.

Concurrently, popular social platforms Twitter and Reddit, known for hosting substantial amounts of public information, have recently implemented stringent measures to curtail unrestricted data harvesting by other companies. These changes to their application programming interfaces (APIs) and the limitations imposed on data scraping have faced backlash from their respective communities, as they also degrade core user experiences on both platforms.

Conclusion:

Google’s utilization of web data for AI training, as clarified in its updated privacy policy, signifies a strategic move to enhance its language models and develop innovative offerings. However, concerns surrounding data privacy, copyright infringement, and the fair use doctrine persist. The legal gray area surrounding training data underscores the need for more robust regulation and stricter compliance measures. The lawsuit filed by Gannett against Google highlights the potential market impact of AI advancements, while social platforms like Twitter and Reddit face challenges in striking a balance between data protection and user experience. Companies must navigate these complexities to ensure responsible AI development while meeting regulatory requirements and user expectations.
