Exploring the Power of MINILLM for Smaller Language Models

TL;DR:

  • Knowledge distillation aids in reducing computational resource demand for large language models.
  • Black-box KD and white-box KD are the two commonly used strategies, and both have shown promising outcomes in optimizing smaller models.
  • White-box KD becomes more valuable with the development of open-source language models, leading to improved performance.
  • Conventional knowledge distillation methods, which minimize the forward KLD, may be poorly suited to generative language models.
  • Minimizing the reverse Kullback-Leibler divergence (KLD) instead addresses the issue of the student producing samples that are improbable under the teacher in open text generation.
  • Policy Gradient optimization is used to compute the gradient of this objective.
  • Techniques such as single-step decomposition, teacher-mixed sampling, and length normalization address challenges during training.
  • MINILLM, a novel technique, successfully scales up from 120M to 13B parameter models and outperforms baseline KD models.
  • MINILLM generates lengthier and more diverse responses while mitigating exposure bias and improving calibration.

Main AI News:

In the rapidly evolving landscape of large language models, the demand for computational resources has become a significant challenge. To address this issue, knowledge distillation has emerged as a strategic approach, involving the training of smaller student models under the guidance of larger teacher models. This article delves into the potential of MINILLM, a novel technique that leverages knowledge distillation to unlock the capabilities of smaller language models.

There are two primary types of knowledge distillation: black-box KD and white-box KD. Black-box KD relies solely on the teacher’s predictions, while white-box KD also utilizes the teacher’s parameters. Recently, black-box KD has shown promising outcomes in optimizing small models with prompt-response pairs generated by LLM APIs. White-box KD, in turn, has become increasingly valuable to research communities and industry as more open-source LLMs appear: with access to a white-box teacher model, student models receive richer training signals, leading to improved performance.

While white-box KD has been extensively explored for small language-understanding models with fewer than 1 billion parameters, its application to generative LLMs remains relatively uncharted. This paper examines white-box KD for LLMs and argues that conventional knowledge distillation methods are poorly suited to LLMs performing generative tasks. These methods minimize the approximated forward Kullback-Leibler divergence (KLD) between the teacher and student distributions, KL(p‖q), which forces the student distribution q(y|x) to cover all the modes of the teacher distribution p(y|x). In text classification problems, where the output space typically consists of a finite number of classes, the forward KLD performs well since both p(y|x) and q(y|x) have a limited number of modes.
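
To make the objective concrete, here is a minimal PyTorch sketch of a per-token forward-KLD distillation loss; the function name, tensor shapes, and toy batch are assumptions for illustration, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def forward_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Approximate forward KLD KL(p || q) between teacher p and student q.

    Both tensors have shape (batch, seq_len, vocab_size); the divergence
    is computed per token and averaged over the batch and sequence.
    """
    p = F.softmax(teacher_logits, dim=-1)          # teacher distribution p(y|x)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)  # student distribution q(y|x)
    # KL(p || q) = sum_y p(y) * (log p(y) - log q(y))
    return (p * (log_p - log_q)).sum(dim=-1).mean()

# Toy usage: a distillation loss over a small random batch.
teacher_logits = torch.randn(2, 8, 32000)
student_logits = torch.randn(2, 8, 32000, requires_grad=True)
loss = forward_kld(teacher_logits, student_logits)
loss.backward()
```

Because p sits outside the sum, q is penalized wherever the teacher has mass that the student fails to cover, which is exactly the mode-covering behavior described above.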

However, in open text generation problems with far more complex output spaces, p(y|x) may encompass many more modes than q(y|x) can express. When minimizing the forward KLD, q(y|x) may then assign excessively high probabilities to the low-density regions of p(y|x), producing samples that are highly improbable under p. To address this issue, the authors propose minimizing the reverse KLD, KL(q‖p), an objective commonly used in computer vision and reinforcement learning. In a pilot experiment, they observed that minimizing KL(q‖p) prompts q to focus on the major modes of p, assigning low probabilities to its vacant regions.
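
Swapping the arguments of the divergence yields the mode-seeking objective. A sketch under the same assumed tensor shapes as above:

```python
import torch
import torch.nn.functional as F

def reverse_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Approximate reverse KLD KL(q || p): the student q is penalized for
    placing mass where the teacher p assigns low probability, so q
    concentrates on the teacher's major modes rather than covering all of them.
    """
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    # KL(q || p) = sum_y q(y) * (log q(y) - log p(y))
    return (q * (log_q - log_p)).sum(dim=-1).mean()
```

Note that this exact per-token sum is only a local approximation: at the sequence level, the expectation over y ~ q cannot be enumerated, which is why the optimization described next resorts to sampling and Policy Gradient.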

In the realm of language generation with LLMs, this approach ensures that the student model does not expend capacity on the long tail of the teacher distribution and instead concentrates on producing correct responses, a crucial requirement in real-world scenarios where honesty and dependability are paramount. To minimize KL(q‖p), the authors employ Policy Gradient methods to derive the gradient of the objective (a simplified sketch follows the list below). Previous studies have demonstrated the effectiveness of policy optimization in enhancing pre-trained language models (PLMs). However, the authors also identify challenges during training, such as high variance, reward hacking, and generation length bias. To mitigate these issues, they introduce the following techniques:

  1. Single-step decomposition to reduce variance.
  2. Teacher-mixed sampling to address reward hacking.
  3. Length normalization to alleviate length bias.
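
As promised above, the following is a simplified REINFORCE-style surrogate for the sequence-level reverse-KLD gradient, incorporating length normalization; every name and shape is an assumption for exposition, and MINILLM’s actual estimator (with single-step decomposition and teacher-mixed sampling) is more involved.

```python
import torch
import torch.nn.functional as F

def reverse_kld_pg_loss(teacher_logits: torch.Tensor,
                        student_logits: torch.Tensor,
                        sampled_ids: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Policy-gradient surrogate for minimizing sequence-level KL(q || p).

    sampled_ids: responses sampled from the student q, shape (batch, seq_len).
    mask: float tensor marking non-padding positions, same shape.
    """
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    # Log-probabilities of the sampled tokens under student and teacher.
    tok_log_q = log_q.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1) * mask
    tok_log_p = log_p.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1) * mask
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Length-normalized "reward": how much more probable the teacher finds
    # the sample than the student does. Detached so it acts as a scalar
    # coefficient, as in REINFORCE. Without dividing by `lengths`, longer
    # samples would dominate the gradient (the length bias noted above).
    reward = ((tok_log_p - tok_log_q).sum(dim=-1) / lengths).detach()
    # Push q toward samples the teacher favors, i.e. descend KL(q || p).
    return -(reward * tok_log_q.sum(dim=-1) / lengths).mean()
```

In the paper, reward hacking is curbed by drawing samples from a mixture of the teacher and student distributions rather than from q alone; that step is omitted here for brevity.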

In the instruction-following setting, which encompasses a wide range of NLP tasks, researchers from The CoAI Group, Tsinghua University, and Microsoft Research present MINILLM, a novel technique applied to generative language models ranging from 120 million to 13 billion parameters. Five instruction-following datasets are used, and outputs are assessed with Rouge-L and GPT-4 feedback. The experiments demonstrate that MINILLM scales successfully across model sizes and consistently outperforms the baseline standard-KD models on all datasets. Further analysis reveals that MINILLM excels at generating longer, more varied replies while mitigating exposure bias and improving calibration. The models and related resources can be accessed on GitHub.
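
For readers who want to reproduce the automatic metric, Rouge-L between a reference and a model reply can be computed with Google’s rouge-score package; the two strings below are made-up examples.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "The cat sat on the mat.",        # reference answer
    "A cat was sitting on the mat.",  # model reply
)
# Rouge-L is based on the longest common subsequence; fmeasure is its F1.
print(scores["rougeL"].fmeasure)
```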

Conclusion:

MINILLM, and knowledge distillation more broadly, carries significant implications for the market. Smaller language models can now leverage the power of larger models, reducing computational resource requirements. This development enhances the accuracy and reliability of AI-driven language processing applications, opening new opportunities for businesses to improve customer interactions, generate high-quality content, and strengthen overall language understanding capabilities. The availability of MINILLM on GitHub fosters accessibility and collaboration within the AI community, further driving advances in natural language processing. As the market demands more efficient and scalable language models, adopting techniques like MINILLM can position businesses at the forefront of AI innovation.

Source