Degenerative AI: The Perils of Machine-Generated Data in Training AI Models

TL;DR:

  • Training AI models on machine-generated data leads to model collapse.
  • Model collapse occurs when models trained on synthetic data forget the true underlying data distribution, even without a shift in distribution over time.
  • Errors resulting from optimization imperfections, limited models, and finite data contribute to model degradation.
  • Fair representation of minority groups in the original data is crucial to prevent model collapse.
  • Preserving human-generated training data, including less likely occurrences, holds value for future models.
  • Access to data from the “tails of the distribution” is essential to avoid model collapse.

Main AI News:

The use of machine-generated data to train artificial intelligence (AI) large language models has been found to result in a phenomenon known as model collapse, as highlighted in a recent study conducted by researchers from the United Kingdom and Canada. This concerning issue has significant implications for the future of training generative AI systems, as the prevalence of AI-generated text and synthetic data continues to rise in online content.

Originally, prominent language models such as OpenAI’s ChatGPT and Alphabet’s Bard were trained using predominantly human-generated text obtained from various sources on the Internet. These models were then fine-tuned with additional human input. However, with the growing prominence of AI models themselves in content creation, an alarming problem has emerged.

Ilia Shumailov and Zakhar Shumaylov, the authors of the study, embarked on an exploration of the potential challenges arising from the increased reliance on machine-generated data for training AI models. Their investigation quickly revealed that such reliance would indeed pose significant obstacles.

Shumailov explained that when AI models are trained on machine-generated data instead of human-created data, severe degradation occurs within a few iterations, even when some of the original data is preserved. He attributed this degradation to errors resulting from optimization imperfections, limited models, and finite data. Over time, these mistakes accumulate, causing models that learn from generated data to progressively misperceive reality.
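
To see how such small errors compound, consider a deliberately simplified sketch (this is not code from the study, and the sample size and generation count are arbitrary): a "model" reduced to a Gaussian fitted to a finite sample, refit at every generation only on data drawn from the previous generation's fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "true" distribution the first generation learns from: a standard normal.
mean, std = 0.0, 1.0
n_samples = 100        # finite data at every generation
n_generations = 1000

for gen in range(1, n_generations + 1):
    # Each generation sees only samples produced by the previous generation's model...
    samples = rng.normal(mean, std, size=n_samples)
    # ...and "training" is simply re-estimating the parameters from that finite sample.
    mean, std = samples.mean(), samples.std()
    if gen % 200 == 0:
        print(f"generation {gen:4d}: mean = {mean:+.3f}, std = {std:.3f}")

# Sampling error accumulates across generations: the estimated spread tends to
# shrink, so the tails of the original distribution are gradually forgotten.
```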

This issue affects all forms of generative AI, and the authors emphasized that model collapse is an inherent phenomenon associated with models trained on synthetic data. Shumailov clarified that learning from data produced by other models is the primary cause of model collapse, even in the absence of a shift in the data distribution over time.

To illustrate this concept, Shumailov employed an analogy involving dog pictures. Imagine a model that generates dog images, trained on an initial dataset of 10 dogs with blue eyes and 90 dogs with yellow eyes. The first model learns this data reasonably well, albeit with imperfections: because yellow-eyed dogs dominate the training set, it unintentionally renders the blue eyes as slightly more greenish.

Subsequently, if this model is used to generate new dogs shared on social media, and someone scrapes the Internet for dog images, including the generated ones, the resulting dataset will contain 10 blue-eyed dogs that now appear less blue and more green, along with 90 yellow-eyed dogs. Training a new model on this data leads to a similar outcome and further exacerbates the issue: the model becomes more skilled at representing yellow-eyed dogs but gradually loses its ability to understand and represent blue-eyed dogs accurately.

Over time, the understanding of the minority group (blue-eyed dogs in this case) deteriorates, progressing from blue to blue-green, then green, and eventually yellow-green. Ultimately, the model completely loses or distorts its perception of this minority group. This phenomenon is what researchers refer to as model collapse.
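
The analogy can be mimicked with a toy simulation (entirely hypothetical numbers, not taken from the study), in which each generation's dataset is produced by the previous model and the minority class is slightly under-represented each time:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy version of the dog-eye analogy: the "model" is reduced to the fraction of
# blue-eyed dogs it generates, and the blue-to-green drift is stood in for by a
# bias that under-represents the minority class at every generation.
blue_fraction = 0.10     # 10 blue-eyed dogs out of 100 in the original data
dataset_size = 100
minority_bias = 0.9      # hypothetical: only ~90% of the minority signal survives each step

for gen in range(1, 21):
    # Generate a new dataset with the current model, then "retrain" on it by
    # re-estimating the blue-eyed fraction from the generated images.
    generated_blue = rng.binomial(dataset_size, blue_fraction * minority_bias)
    blue_fraction = generated_blue / dataset_size
    print(f"generation {gen:2d}: blue-eyed fraction = {blue_fraction:.2f}")

# The blue-eyed minority shrinks generation after generation; once a generation
# produces none at all, it can never reappear in later datasets.
```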

Shumailov emphasized that preventing model collapse requires ensuring fair representation of minority groups from the original data in subsequent datasets. This representation should consider not only the quantity of data but also the distinctive attributes of the minority group, such as their blue eyes in the dog image analogy.

The study suggests that preserving human-generated training data, specifically data collected from the Internet before the widespread adoption of AI technology, holds value for future models to learn from. This data, which may include less likely occurrences, can contribute to preventing model collapse.

Shumailov highlighted that the critical factor in averting model collapse is access to data from the “tails of the distribution.” Consequently, companies and entities seeking to train AI models in the future must allocate sufficient resources to data collection and annotation, ensuring that their upcoming models can learn effectively.
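
As a rough illustration of why retaining such data helps (again only a sketch, with an arbitrarily chosen 30% retention share rather than anything prescribed by the study), extending the earlier Gaussian loop so that every generation keeps a slice of the original, pre-AI samples keeps the fit from collapsing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption for illustration only: each generation's training set mixes
# model-generated samples with a fixed share of the original, pre-AI data.
original = rng.normal(0.0, 1.0, size=1000)   # stands in for preserved human-generated data
n_samples = 100
keep_share = 0.3                             # arbitrary retention share
n_keep = int(keep_share * n_samples)

mean, std = 0.0, 1.0
for gen in range(1, 1001):
    generated = rng.normal(mean, std, size=n_samples - n_keep)
    kept = rng.choice(original, size=n_keep, replace=False)
    mixed = np.concatenate([generated, kept])
    mean, std = mixed.mean(), mixed.std()

print(f"after 1000 generations, keeping {keep_share:.0%} original data: "
      f"mean = {mean:+.3f}, std = {std:.3f}")
# Unlike the all-synthetic loop above, the estimated spread stays in the
# vicinity of 1.0 instead of collapsing: the retained "tail" samples keep
# re-anchoring each new model.
```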

Conclusion:

The reliance on machine-generated data for training AI models poses significant challenges, including the risk of model collapse. This phenomenon, in which a model's perception of minority groups in the data is lost or distorted, can undermine the accuracy and effectiveness of AI models. To mitigate these risks, businesses and entities in the market must allocate sufficient resources to data collection and annotation, prioritize fair representation of minority groups, and preserve human-generated training data. By doing so, they can ensure that future AI models learn effectively and accurately interpret the true underlying data distribution, fostering better outcomes in the evolving market landscape.

Source