Unpaid Contributors: Australian Universities and Government Provide Massive Data Sources for AI Chatbots without Compensation

TL;DR:

  • Australian universities and the New South Wales government contribute significant data to train AI chatbots but receive no compensation.
  • The massive volumes of data used to train AI chatbots remain secret.
  • Google and Stability AI extract information from the Common Crawl project.
  • Among Australian sources, New South Wales government web pages contribute the most to the Common Crawl, followed by Australian universities.
  • Social media services and media companies demand payment for the use of their data in training AI systems.
  • Australian public institutions are still grappling with the implications of their data use in AI.
  • Italy’s privacy watchdog initially blocked ChatGPT due to concerns about data processing and inaccurate information but restored access after changes were made.
  • Copyright and AI were discussed in a roundtable led by the federal Attorney-General.
  • Open data advocates fear restrictive regulations may impede the development of powerful tools.
  • The NSW Department of Customer Service follows an Open Data Policy.
  • The University of Adelaide declined to comment, while Google, Common Crawl, and Stability AI did not respond to requests for comment.
  • The University of Melbourne expressed a desire to share research globally.
  • The National Tertiary Education Union plans to address the impact of AI on workloads and plagiarism in upcoming meetings.

Main AI News:

Australian universities and the New South Wales government are significant contributors to the vast data used to train AI chatbots like ChatGPT, yet they receive no compensation for their materials. The arrangement draws attention to how little is disclosed about the data that fuels these generative AI chatbots, which are poised to transform white-collar industries such as media and education.

Notably, Google and Stability AI, two major AI companies, source some information from the Common Crawl, a non-profit project that scours the internet, extracting text from billions of web pages.

Among the Australian sites in the top 500 registered domains in the Common Crawl’s database, the web pages of the New South Wales government rank highest, encompassing thousands of sites from schools, hospitals, and local councils across the state.

The Australian National University, the University of Adelaide and the University of Melbourne follow closely. Individually, these sites account for a small fraction of the overall Common Crawl database, which measures thousands of terabytes, and they rank lower than prominent sources like Wikipedia and Amazon-hosted pages.

Nevertheless, their inclusion shows how websites created by millions of people, including Australians, for entirely different purposes have been folded into AI systems that have generated substantial profits for a small number of owners.

In contrast, social media services like Reddit and Twitter, international media conglomerate News Corp, and photo library Getty Images are demanding payment from AI companies that use their data to train generative image and text systems. Meanwhile, Australian public institutions are just beginning to grasp the implications of their data’s use in AI. A spokesperson for the Australian National University said the institution is closely examining the issue but has not yet settled on a position.

One reason for the delay is the jurisdiction in which these companies operate. If they fall under US legislation, they align their use of website content with the laws of that country rather than Australia's. The spokesperson said that many technological advancements have been impeded in Australia because its legislation is rooted in fair-dealing principles rather than the fair-use principles applied in the United States.

Under a fair-dealing system, copyrighted material can be used freely only for specific purposes set out in the law. AI training is not among them, so compensation must be provided. Fair use, the broader principle followed in the US, can accommodate novel technologies.

As Australian public institutions grapple with the implications of their data’s use in AI systems, the conversation around compensation and legal frameworks gains importance. The transformative potential of AI in various industries necessitates a balanced approach that ensures fairness for all stakeholders involved.

Conclusion:

The use of data from Australian universities and the New South Wales government to train AI chatbots without compensation shows how unsettled the market for training data remains. Major AI companies like Google and Stability AI draw information from sources like the Common Crawl, and the lack of compensation raises questions about the ownership and value of that data.

Additionally, the demands for payment from social media services and media companies indicate the growing awareness of the commercial value of data used in training AI systems. As governments and public institutions navigate the challenges surrounding data use in AI, it becomes crucial for market stakeholders to find a balance that fosters innovation, respects intellectual property rights, and ensures fair compensation for data providers. These developments underscore the need for robust regulatory frameworks and open discussions to address the evolving landscape of AI and data markets.
