New method from Databricks, MIT, and DatologyAI uses small reference models to compute text sample perplexity for data pruning

  • Data pruning is crucial for enhancing large language models (LLMs) by selecting high-quality subsets from extensive datasets.
  • Traditional pruning methods are limited, leading to the development of advanced techniques like neural network-based heuristics.
  • Researchers from Databricks, MIT, and DatologyAI propose a novel approach using small reference models to compute text sample perplexity for data pruning.
  • The method involves training a small model on a random data subset to evaluate perplexity, then pruning the dataset to retain only high-quality data.
  • Perplexity-based data pruning significantly improves LLM performance on downstream tasks and reduces pretraining steps.
  • This method offers promise for data researchers, demonstrating improved performance for diverse datasets like the Pile and Dolma.

Main AI News:

In today’s machine learning landscape, optimizing large language models (LLMs) to deliver exceptional performance while managing training costs is paramount. One crucial aspect in this pursuit is the enhancement of pretraining data quality, as it directly influences training efficiency and model effectiveness. A widely acknowledged strategy to achieve this is data pruning, a process that involves selecting high-quality subsets from extensive datasets to streamline model training. By doing so, models are shielded from noisy and irrelevant data, leading to improved overall performance.

Training LLMs is challenging because the massive datasets involved are often noisy. Subpar data can significantly hamper model performance, which makes techniques for filtering out low-quality samples necessary. The goal is to retain only the most pertinent, high-quality information: effective data pruning ensures that only top-tier data is used during training, improving both model accuracy and training efficiency.

Conventional data pruning methods rely on simple rule-based filtering and basic classifiers to identify high-quality samples. While beneficial, these methods exhibit limitations in handling large and diverse datasets. Advanced approaches leveraging neural network-based heuristics have emerged, assessing data quality based on metrics like feature similarity or sample difficulty. Despite their advantages, these techniques can be computationally demanding and lack consistency across data domains, highlighting the need for more efficient and universally applicable methods.

Researchers from Databricks, MIT, and DatologyAI have introduced a novel data pruning approach utilizing small reference models to compute text sample perplexity. This method entails training a small model on a random data subset, which then evaluates the perplexity of each sample. Perplexity gauges how well a probability model predicts a sample, with lower scores indicating higher-quality data. By focusing on samples with the lowest perplexity scores, researchers can prune the dataset, retaining only the most relevant data and thus enhancing the performance of larger models trained on this pruned data.
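
To make the scoring step concrete, here is a minimal sketch of computing per-sample perplexity with a small causal language model via Hugging Face transformers. The checkpoint name is only a stand-in for the paper's own 125-million-parameter reference model, and the tokenization settings are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Stand-in for the small reference model; the paper trains its own ~125M model
# on a random subset of the corpus rather than using an off-the-shelf checkpoint.
REF_MODEL = "EleutherAI/pythia-160m"

tokenizer = AutoTokenizer.from_pretrained(REF_MODEL)
model = AutoModelForCausalLM.from_pretrained(REF_MODEL)
model.eval()

@torch.no_grad()
def sample_perplexity(text: str) -> float:
    """Perplexity of one text sample under the reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = enc["input_ids"]
    # The causal-LM loss is the mean next-token cross-entropy over the sample;
    # exponentiating it gives the sample's perplexity.
    loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(sample_perplexity("Data pruning selects high-quality subsets for pretraining."))
```

In practice this function would be run over every sample in the corpus to produce the score used for pruning.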

The proposed method involves dividing the dataset into training and validation sets for the small reference model. This model is trained with the standard next-token prediction objective and then computes a perplexity score for each sample in the dataset. The dataset is pruned based on these scores, keeping only samples that fall within a specific perplexity range; for instance, under a low selection criterion, only the samples with the lowest perplexity scores are retained. The pruned dataset is then used to train the final, larger model, which benefits from the higher-quality data. The efficacy of this approach is demonstrated across different dataset compositions, including the domain-diverse Pile and Dolma, which is derived primarily from web scrapes.
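
The selection step itself reduces to sorting by score and keeping a band of the distribution. The sketch below assumes perplexity scores have already been computed by the reference model; the keep fraction and the criterion names are illustrative, with "medium" and "high" included only as plausible variants of the "specific perplexity range" described above.

```python
import numpy as np

def prune_by_perplexity(samples, perplexities, keep_frac=0.5, criterion="low"):
    """Keep a fraction of samples according to a perplexity selection criterion.

    'low' keeps the lowest-perplexity samples (the example used in the article);
    'medium' and 'high' keep the middle band and the highest scores, respectively.
    """
    order = np.argsort(perplexities)  # indices sorted by ascending perplexity
    k = int(len(samples) * keep_frac)
    if criterion == "low":
        keep = order[:k]
    elif criterion == "high":
        keep = order[-k:]
    elif criterion == "medium":
        start = (len(samples) - k) // 2
        keep = order[start:start + k]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return [samples[i] for i in keep]

# Hypothetical usage: scores come from the small reference model, and the
# returned pruned set is what the final, larger model is pretrained on.
# pruned = prune_by_perplexity(corpus, scores, keep_frac=0.5, criterion="low")
```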

Perplexity-based data pruning yields significant performance gains for LLMs on downstream tasks. For example, pruning based on perplexity scores computed with a 125-million-parameter reference model improved the average downstream task performance of a 3-billion-parameter model by up to 2.04%. It also achieved up to a 1.45× reduction in the number of pretraining steps required to reach comparable baseline performance. The method remains effective in other scenarios, including over-trained and data-constrained regimes: in over-training scenarios, the absolute gain in average downstream normalized accuracy was consistent across compute-optimal and over-trained models, underscoring the method's robustness.

This research underscores the value of small reference models in perplexity-based data pruning and marks a meaningful advance in LLM training optimization. By leveraging smaller models to filter out low-quality data, researchers can improve both model performance and training efficiency. The method is a promising tool for data researchers, demonstrating downstream performance improvements of 1.89 for the Pile and 1.51 for Dolma when training for a compute-optimal duration. It elevates the performance of large-scale language models while reducing computational resource requirements, making it a valuable addition to the modern data researcher's arsenal.

Conclusion:

This innovative approach to data pruning, as proposed by Databricks, MIT, and DatologyAI, signifies a substantial advancement in optimizing language model training. By effectively filtering out low-quality data using small reference models and focusing on perplexity scores, researchers can significantly enhance model performance and training efficiency. This method holds promise for the market, offering improved downstream performance for large-scale language models and reducing computational resource requirements, thus providing a valuable asset for modern data researchers and businesses alike.

Source