New Concerns Emerge Over the Origins of AI Training Data

TL;DR:

  • Concerns have arisen about the sources of training data for some of the world’s most advanced AI models, with an investigation highlighting the questionable origins of the Colossal Clean Crawled Corpus (C4) compiled by Google.
  • C4 serves as training material for Google’s LaMDA and Meta’s LLaMA; while touted as a “clean” version of Common Crawl, it has come under scrutiny for including questionable sources such as the white nationalist site VDARE and the far-right news outlet Breitbart.
  • Many of the sites included in the C4 database never gave explicit consent, and some push the boundaries of fair use; the contents of a seized pirate ebook repository remain in the database.
  • Acquiring the vast amounts of text required to train AI models can be challenging, leading some researchers to rely on the “fair use” defense.
  • The London-based AI company Stability recently released StableLM, trained on the Pile, a massive dataset that includes uncleaned Common Crawl and other sources such as pirate ebooks and internal emails from Enron.
  • The version of the dataset used by Stability is reportedly “three times larger,” but the company has not disclosed further details. It open-sources its models to promote transparency and to let organizations fine-tune them for their own use without sharing sensitive data.

Main AI News:

Amid growing concerns over the sources of training data used to develop some of the world’s most advanced AI models, a new investigation has shed light on the questionable origins of the Colossal Clean Crawled Corpus (C4). Compiled by Google from over 15 million websites, C4 serves as the training material for both the tech giant’s LaMDA AI and Meta’s GPT competitor, LLaMA.

While C4 is publicly accessible, its sheer size has made it challenging to thoroughly examine its contents. The dataset, touted as a “clean” version of Common Crawl, with potentially offensive language and racist slurs removed, has come under scrutiny for its inclusion of questionable sources.
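
For readers curious how that “cleaning” works in practice, C4’s filter is largely keyword-based: the pipeline drops any page containing a term from a widely circulated blocklist of profanity and slurs. Below is a minimal sketch of that style of filter, not C4’s actual code; the blocklist file name and sample pages are hypothetical.

```python
# Sketch of blocklist-style page filtering, as used by pipelines like C4.
# "badwords.txt" is a hypothetical stand-in for the real blocklist.
import re

def load_blocklist(path: str) -> set[str]:
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def is_clean(page_text: str, blocklist: set[str]) -> bool:
    # Tokenize crudely on word characters; drop the page if any token
    # appears in the blocklist.
    tokens = re.findall(r"[a-z']+", page_text.lower())
    return not any(tok in blocklist for tok in tokens)

blocklist = load_blocklist("badwords.txt")
pages = ["an innocuous page about gardening", "a page with a banned term"]
kept = [p for p in pages if is_clean(p, blocklist)]
```

The limitation follows directly from the design: keyword filtering judges individual words, not the trustworthiness of the source, which is why pages from dubious outlets can pass through untouched.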

According to the Washington Post, C4’s “cleanliness” is superficial at best. While reputable sites such as the Guardian, Wikipedia, Google Patents, and PLOS contribute to the dataset, it also features less trustworthy sources, including the white nationalist site VDARE and the far-right news outlet Breitbart. The Russian state-backed propaganda site RT is among the largest providers of training data to C4.

Few of the sites included in the database gave explicit consent to be part of C4. While Common Crawl, the non-profit organization that compiled the scraped data, claims to respect opt-out requests, some of the sources push the boundaries of fair use. For example, b-ok.org, formerly known as Bookzz, was a major repository of pirated ebooks until the FBI seized it in 2022; despite this, its contents remain in the C4 database.
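
The opt-out mechanism Common Crawl honors is the standard robots exclusion protocol: site owners publish a robots.txt file that crawlers are expected to check before fetching pages. A minimal sketch of that check using Python’s standard library is shown below; “CCBot” is Common Crawl’s published crawler user-agent, and the target URL is illustrative.

```python
# Sketch: does a site's robots.txt allow Common Crawl's CCBot to fetch a page?
# The page URL is illustrative.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_crawl(page_url: str, user_agent: str = "CCBot") -> bool:
    parts = urlsplit(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser(robots_url)
    rp.read()  # fetch and parse the site's live robots.txt
    return rp.can_fetch(user_agent, page_url)

print(may_crawl("https://example.com/some-article"))
```

The catch, as the b-ok.org case illustrates, is that opting out only prevents future crawling; content already captured in a published snapshot persists in downstream datasets unless it is actively removed.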

It’s worth noting that vast collections of data like C4 play a crucial role in the creation of AI: large language models (LLMs) such as the one behind ChatGPT require enormous volumes of text to train and improve.

Acquiring the hundreds of gigabytes of text required to train AI models can be a daunting task, leading some researchers to take an “ask for forgiveness, not permission” approach and rely on the “fair use” defense to copyright infringement.

This approach was taken by the London-based AI company Stability, which recently released its new LLM, StableLM. The model was trained on the Pile, a massive 850GB dataset that includes the entire uncleaned Common Crawl database, pirated ebooks from Bibliotik, data scraped from GitHub, and other sources such as internal Enron emails and the proceedings of the European Parliament.
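
For anyone who wants to inspect what such a corpus actually contains, the Pile is distributed as zstandard-compressed JSON-lines shards, with each record carrying a text field and a meta field naming the source subset. The sketch below tallies subsets in one shard; the shard path is illustrative, and the zstandard package is a third-party dependency.

```python
# Sketch: count which Pile subsets (Common Crawl, GitHub, Enron emails, etc.)
# appear in one shard. The shard path is illustrative.
import io
import json
from collections import Counter

import zstandard  # third-party: pip install zstandard

def count_sources(shard_path: str) -> Counter:
    counts: Counter = Counter()
    with open(shard_path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            counts[record["meta"]["pile_set_name"]] += 1
    return counts

print(count_sources("00.jsonl.zst").most_common(10))
```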

The Pile is publicly hosted by a group of anonymous “data enthusiasts” known as the Eye, whose copyright takedown policy is linked to a controversial video. The version used by Stability is said to be “three times larger” than the original Pile, but the company has not disclosed any further details about the additional content. Despite this, Stability claims that the extra data gives StableLM “surprisingly high performance” in conversational and coding tasks.

“We open-source our models to promote transparency and foster trust,” the company stated. “Researchers can ‘look under the hood’ to verify performance, work on interpretability techniques, identify potential risks, and help develop safeguards. Organizations across the public and private sectors can adapt (‘fine-tune’) these open-source models for their own applications without sharing their sensitive data or giving up control of their AI capabilities.”
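
As a concrete example of what “looking under the hood” means, the released checkpoints can be loaded with the Hugging Face transformers library. A minimal sketch follows, using the publicly listed stabilityai/stablelm-base-alpha-7b checkpoint; note that running it downloads several gigabytes of weights.

```python
# Sketch: load a released StableLM checkpoint and generate a continuation.
# Downloading the model pulls several GB of weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "The provenance of training data matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This openness is also what made the Washington Post’s C4 analysis possible in the first place: a dataset or model that can be downloaded can be audited.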

Conclusion:

The recent investigation into the questionable origins of the Colossal Clean Crawled Corpus (C4) highlights the need for greater transparency and accountability in the AI industry. The use of questionable sources, including sites with explicit racist or far-right agendas, in training data raises serious ethical and legal issues that cannot be ignored.

Companies like Google and Stability that open-source their models must ensure that their training data is carefully curated to avoid such controversies. The market for AI systems and language models will continue to grow, and it’s crucial that the industry adopts best practices to maintain public trust and ensure the responsible development of AI.
