Degenerative AI: The Perils of Machine-Generated Data in Training AI Models

TL;DR:

  • Training AI models on machine-generated data leads to model collapse.
  • Model collapse occurs when models trained on synthetic data forget the true underlying data distribution, even without a shift in distribution over time.
  • Errors resulting from optimization imperfections, limited models, and finite data contribute to model degradation.
  • Fair representation of minority groups in the original data is crucial to prevent model collapse.
  • Preserving human-generated training data, including less likely occurrences, holds value for future models.
  • Access to data from the “tails of the distribution” is essential to avoid model collapse.

Main AI News:

The use of machine-generated data to train artificial intelligence (AI) large language models has been found to result in a phenomenon known as model collapse, as highlighted in a recent study conducted by researchers from the United Kingdom and Canada. This concerning issue has significant implications for the future of training generative AI systems, as the prevalence of AI-generated text and synthetic data continues to rise in online content.

Originally, prominent language models such as OpenAI’s ChatGPT and Alphabet’s Bard were trained using predominantly human-generated text obtained from various sources on the Internet. These models were then fine-tuned with additional human input. However, with the growing prominence of AI models themselves in content creation, an alarming problem has emerged.

Ilia Shumailov and Zakhar Shumaylov, the authors of the study, embarked on an exploration of the potential challenges arising from the increased reliance on machine-generated data for training AI models. Their investigation quickly revealed that such reliance would indeed pose significant obstacles.

Shumailov explained that when AI models are trained on machine-generated data instead of human-created data, severe degradation occurs within a few iterations, even when some of the original data is preserved. He attributed this degradation to errors resulting from optimization imperfections, limited models, and finite data. Over time, these mistakes accumulate, causing models that learn from generated data to progressively misperceive reality.
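
To see how such small errors compound, consider a deliberately simplified sketch (this is not code from the study, and the sample size and generation count are arbitrary): a "model" reduced to a Gaussian fitted to a finite sample, refit at every generation only on data drawn from the previous generation's fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "true" distribution the first generation learns from: a standard normal.
mean, std = 0.0, 1.0
n_samples = 100        # finite data at every generation
n_generations = 1000

for gen in range(1, n_generations + 1):
    # Each generation sees only samples produced by the previous generation's model...
    samples = rng.normal(mean, std, size=n_samples)
    # ...and "training" is simply re-estimating the parameters from that finite sample.
    mean, std = samples.mean(), samples.std()
    if gen % 200 == 0:
        print(f"generation {gen:4d}: mean = {mean:+.3f}, std = {std:.3f}")

# Sampling error accumulates across generations: the estimated spread tends to
# shrink, so the tails of the original distribution are gradually forgotten.
```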

This issue affects all forms of generative AI, and the authors emphasized that model collapse is an inherent phenomenon associated with models trained on synthetic data. Shumailov clarified that learning from data produced by other models is the primary cause of model collapse, even in the absence of a shift in the data distribution over time.

To illustrate this concept, Shumailov employed an analogy involving dog pictures. Imagine a model that generates dog images, trained on an initial dataset of 10 dogs with blue eyes and 90 dogs with yellow eyes. The first model learns this data reasonably well, albeit with imperfections: because yellow-eyed dogs dominate the training set, it unintentionally renders the blue eyes as slightly more greenish.

Subsequently, if this model is used to generate new dogs shared on social media, and someone scrapes the Internet for dog images, including the generated ones, the resulting dataset will contain 10 blue-eyed dogs that now appear less blue and more green, along with 90 yellow-eyed dogs. Training a new model on this data leads to a similar outcome and further exacerbates the issue: the model becomes more skilled at representing yellow-eyed dogs but gradually loses its ability to understand and represent blue-eyed dogs accurately.

Over time, the understanding of the minority group (blue-eyed dogs in this case) deteriorates, progressing from blue to blue-green, then green, and eventually yellow-green. Ultimately, the model completely loses or distorts its perception of this minority group. This phenomenon is what researchers refer to as model collapse.
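
The analogy can be mimicked with a toy simulation (entirely hypothetical numbers, not taken from the study), in which each generation's dataset is produced by the previous model and the minority class is slightly under-represented each time:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy version of the dog-eye analogy: the "model" is reduced to the fraction of
# blue-eyed dogs it generates, and the blue-to-green drift is stood in for by a
# bias that under-represents the minority class at every generation.
blue_fraction = 0.10     # 10 blue-eyed dogs out of 100 in the original data
dataset_size = 100
minority_bias = 0.9      # hypothetical: only ~90% of the minority signal survives each step

for gen in range(1, 21):
    # Generate a new dataset with the current model, then "retrain" on it by
    # re-estimating the blue-eyed fraction from the generated images.
    generated_blue = rng.binomial(dataset_size, blue_fraction * minority_bias)
    blue_fraction = generated_blue / dataset_size
    print(f"generation {gen:2d}: blue-eyed fraction = {blue_fraction:.2f}")

# The blue-eyed minority shrinks generation after generation; once a generation
# produces none at all, it can never reappear in later datasets.
```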

Shumailov emphasized that preventing model collapse requires ensuring fair representation of minority groups from the original data in subsequent datasets. This representation should consider not only the quantity of data but also the distinctive attributes of the minority group, such as their blue eyes in the dog image analogy.

The study suggests that preserving human-generated training data, specifically data collected from the Internet before the widespread adoption of AI technology, holds value for future models to learn from. This data, which may include less likely occurrences, can contribute to preventing model collapse.

Shumailov highlighted that the critical factor in averting model collapse is access to data from the “tails of the distribution.” Consequently, companies and entities seeking to train AI models in the future must allocate sufficient resources to data collection and annotation, ensuring that their upcoming models can learn effectively.
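
As a rough illustration of why retaining such data helps (again only a sketch, with an arbitrarily chosen 30% retention share rather than anything prescribed by the study), extending the earlier Gaussian loop so that every generation keeps a slice of the original, pre-AI samples keeps the fit from collapsing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption for illustration only: each generation's training set mixes
# model-generated samples with a fixed share of the original, pre-AI data.
original = rng.normal(0.0, 1.0, size=1000)   # stands in for preserved human-generated data
n_samples = 100
keep_share = 0.3                             # arbitrary retention share
n_keep = int(keep_share * n_samples)

mean, std = 0.0, 1.0
for gen in range(1, 1001):
    generated = rng.normal(mean, std, size=n_samples - n_keep)
    kept = rng.choice(original, size=n_keep, replace=False)
    mixed = np.concatenate([generated, kept])
    mean, std = mixed.mean(), mixed.std()

print(f"after 1000 generations, keeping {keep_share:.0%} original data: "
      f"mean = {mean:+.3f}, std = {std:.3f}")
# Unlike the all-synthetic loop above, the estimated spread stays in the
# vicinity of 1.0 instead of collapsing: the retained "tail" samples keep
# re-anchoring each new model.
```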

Conclusion:

The reliance on machine-generated data for training AI models poses significant challenges, including the risk of model collapse. This phenomenon, in which a model's perception of minority groups in the data is lost or distorted, can undermine the accuracy and effectiveness of AI models. To mitigate these risks, businesses and entities in the market must allocate sufficient resources to data collection and annotation, prioritize fair representation of minority groups, and preserve human-generated training data. By doing so, they can ensure that future AI models learn effectively and accurately interpret the true underlying data distribution, fostering better outcomes in the evolving market landscape.

Source