New Concerns Emerge Over the Origins of AI Training Data

TL;DR:

  • Concerns have arisen about the sources of training data for some of the world’s most advanced AI models, with an investigation highlighting the questionable origins of the Colossal Clean Crawled Corpus (C4) compiled by Google.
  • C4 serves as training material for Google’s LaMDA and Meta’s LLaMA; while touted as a “clean” version of Common Crawl, it has come under scrutiny for including questionable sources such as the white nationalist site VDARE and the far-right news outlet Breitbart.
  • Many of the sites included in the C4 database never gave explicit consent, and some push the boundaries of fair use; the contents of a seized pirate ebook repository remain in the database.
  • Acquiring the vast amounts of text required to train AI models can be challenging, leading some researchers to rely on the “fair use” defense.
  • The London-based AI company Stability recently released StableLM, trained on the Pile, a massive dataset that includes uncleaned Common Crawl and other sources such as pirate ebooks and internal emails from Enron.
  • The version of the dataset used by Stability is reportedly “three times larger,” but the company has not disclosed further details. It open-sources its models to promote transparency and to let organizations fine-tune them for their own use without sharing sensitive data.

Main AI News:

Amid growing concerns over the sources of training data used to develop some of the world’s most advanced AI models, a new investigation has shed light on the questionable origins of the Colossal Clean Crawled Corpus (C4). Compiled by Google from over 15 million websites, C4 serves as the training material for both the tech giant’s LaMDA AI and Meta’s GPT competitor, LLaMA.

While C4 is publicly accessible, its sheer size has made it challenging to thoroughly examine its contents. The dataset, touted as a “clean” version of Common Crawl, with potentially offensive language and racist slurs removed, has come under scrutiny for its inclusion of questionable sources.
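
For readers curious how that “cleaning” works in practice, C4’s filter is largely keyword-based: the pipeline drops any page containing a term from a widely circulated blocklist of profanity and slurs. Below is a minimal sketch of that style of filter, not C4’s actual code; the blocklist file name and sample pages are hypothetical.

```python
# Sketch of blocklist-style page filtering, as used by pipelines like C4.
# "badwords.txt" is a hypothetical stand-in for the real blocklist.
import re

def load_blocklist(path: str) -> set[str]:
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def is_clean(page_text: str, blocklist: set[str]) -> bool:
    # Tokenize crudely on word characters; drop the page if any token
    # appears in the blocklist.
    tokens = re.findall(r"[a-z']+", page_text.lower())
    return not any(tok in blocklist for tok in tokens)

blocklist = load_blocklist("badwords.txt")
pages = ["an innocuous page about gardening", "a page with a banned term"]
kept = [p for p in pages if is_clean(p, blocklist)]
```

The limitation follows directly from the design: keyword filtering judges individual words, not the trustworthiness of the source, which is why pages from dubious outlets can pass through untouched.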

According to the Washington Post, C4’s “cleanliness” is superficial at best. While reputable sites such as the Guardian, Wikipedia, Google Patents, and PLOS contribute to the dataset, it also features less trustworthy sources, including the white nationalist site VDARE and the far-right news outlet Breitbart. The Russian state-backed propaganda site RT is among the largest providers of training data to C4.

Few of the sites included in the database gave explicit consent to be part of C4. While Common Crawl, the non-profit organization that compiled the scraped data, claims to respect opt-out requests, some of the sources push the boundaries of fair use. For example, b-ok.org, formerly known as Bookzz, was a major repository of pirated ebooks until the FBI seized it in 2022; despite this, its contents remain in the C4 database.
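
The opt-out mechanism Common Crawl honors is the standard robots exclusion protocol: site owners publish a robots.txt file that crawlers are expected to check before fetching pages. A minimal sketch of that check using Python’s standard library is shown below; “CCBot” is Common Crawl’s published crawler user-agent, and the target URL is illustrative.

```python
# Sketch: does a site's robots.txt allow Common Crawl's CCBot to fetch a page?
# The page URL is illustrative.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_crawl(page_url: str, user_agent: str = "CCBot") -> bool:
    parts = urlsplit(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser(robots_url)
    rp.read()  # fetch and parse the site's live robots.txt
    return rp.can_fetch(user_agent, page_url)

print(may_crawl("https://example.com/some-article"))
```

The catch, as the b-ok.org case illustrates, is that opting out only prevents future crawling; content already captured in a published snapshot persists in downstream datasets unless it is actively removed.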

It’s worth noting that vast collections of data like C4 play a crucial role in the creation of AI: large language models (LLMs) such as the one behind ChatGPT require enormous volumes of text to train and improve.

Acquiring the hundreds of gigabytes of text required to train AI models can be a daunting task, leading some researchers to take an “ask for forgiveness, not permission” approach and rely on the “fair use” defense to copyright infringement.

This approach was taken by the London-based AI company Stability, which recently released its new LLM, StableLM. The model was trained on the Pile, a massive 850GB dataset that includes the entire uncleaned Common Crawl database, pirated ebooks from Bibliotik, data scraped from GitHub, and other sources such as internal Enron emails and the proceedings of the European Parliament.
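
For anyone who wants to inspect what such a corpus actually contains, the Pile is distributed as zstandard-compressed JSON-lines shards, with each record carrying a text field and a meta field naming the source subset. The sketch below tallies subsets in one shard; the shard path is illustrative, and the zstandard package is a third-party dependency.

```python
# Sketch: count which Pile subsets (Common Crawl, GitHub, Enron emails, etc.)
# appear in one shard. The shard path is illustrative.
import io
import json
from collections import Counter

import zstandard  # third-party: pip install zstandard

def count_sources(shard_path: str) -> Counter:
    counts: Counter = Counter()
    with open(shard_path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            counts[record["meta"]["pile_set_name"]] += 1
    return counts

print(count_sources("00.jsonl.zst").most_common(10))
```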

The Pile is publicly hosted by a group of anonymous “data enthusiasts” known as the Eye, whose copyright takedown policy is linked to a controversial video. The version used by Stability is said to be “three times larger” than the original Pile, but the company has not disclosed any further details about the additional content. Despite this, Stability claims that the extra data gives StableLM “surprisingly high performance” in conversational and coding tasks.

“We open-source our models to promote transparency and foster trust,” the company stated. “Researchers can ‘look under the hood’ to verify performance, work on interpretability techniques, identify potential risks, and help develop safeguards. Organizations across the public and private sectors can adapt (‘fine-tune’) these open-source models for their own applications without sharing their sensitive data or giving up control of their AI capabilities.”
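
As a concrete example of what “looking under the hood” means, the released checkpoints can be loaded with the Hugging Face transformers library. A minimal sketch follows, using the publicly listed stabilityai/stablelm-base-alpha-7b checkpoint; note that running it downloads several gigabytes of weights.

```python
# Sketch: load a released StableLM checkpoint and generate a continuation.
# Downloading the model pulls several GB of weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "The provenance of training data matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This openness is also what made the Washington Post’s C4 analysis possible in the first place: a dataset or model that can be downloaded can be audited.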

Conclusion:

The recent investigation into the questionable origins of the Colossal Clean Crawled Corpus (C4) highlights the need for greater transparency and accountability in the AI industry. The use of questionable sources, including sites with explicit racist or far-right agendas, in training data raises serious ethical and legal issues that cannot be ignored.

Companies like Google and Stability that open-source their models must ensure that their training data is carefully curated to avoid such controversies. The market for AI systems and language models will continue to grow, and it’s crucial that the industry adopts best practices to maintain public trust and ensure the responsible development of AI.
