Princeton Introduces MeZO: A Groundbreaking Zeroth-Order Optimizer for Memory-Efficient Fine-Tuning of Large Language Models

TL;DR:

  • Large Language Models (LLMs) have witnessed immense success in AI, with ChatGPT leading the way.
  • Fine-tuning LLMs is crucial for specialized tasks, but it becomes memory-intensive as models scale up.
  • Princeton researchers introduce MeZO, a memory-efficient zeroth-order optimizer for fine-tuning LLMs.
  • MeZO adapts the ZO-SGD method, estimating gradients with minimal memory overhead.
  • MeZO can optimize non-differentiable objectives and is compatible with both full-parameter and parameter-efficient tuning.
  • MeZO outperforms existing methods while consuming significantly less memory.
  • It enables efficient fine-tuning of LLMs with billions of parameters.

Main AI News:

In the realm of Generative Artificial Intelligence (AI), Large Language Models (LLMs) have witnessed unprecedented growth, spearheading remarkable economic and societal transformations. A prime example is the widely acclaimed ChatGPT developed by OpenAI, which has amassed millions of users since its inception. Leveraging the power of Natural Language Processing (NLP) and Natural Language Understanding (NLU), this chatbot possesses the uncanny ability to generate human-like text, provide insightful answers to queries, summarize extensive paragraphs, complete code snippets, and even compose emails. Alongside ChatGPT, other LLMs such as PaLM, Chinchilla, and BERT have also showcased exceptional prowess in the field of AI.

Fine-tuning pre-trained language models has emerged as a prominent strategy for a wide range of language-related tasks. By adapting these models to specialized domains, incorporating human instructions, and catering to individual preferences, fine-tuning enables them to deliver optimal performance. However, as language models grow in size, fine-tuning becomes computationally demanding and memory-intensive, particularly during gradient computation in backpropagation. Memory consumption escalates because activations, gradients, and optimizer state (the gradient history kept by optimizers such as Adam) must all be stored, far exceeding the memory needed for inference.
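
As a rough, back-of-the-envelope illustration of that gap (assuming fp16 weights and gradients plus two fp32 Adam moment buffers, and ignoring activations entirely; these numbers are not from the paper), the difference scales with model size roughly as follows:

```python
# Rough, illustrative memory estimate: why gradient-based fine-tuning dwarfs inference.
# Assumptions (not from the paper): fp16 weights and gradients, fp32 Adam moments,
# activation memory ignored.
def inference_vs_finetune_gb(n_params: float) -> tuple[float, float]:
    weights = n_params * 2            # fp16 weights, 2 bytes per parameter
    gradients = n_params * 2          # fp16 gradients
    adam_moments = n_params * 4 * 2   # two fp32 moment buffers ("gradient history")
    inference_gb = weights / 1e9
    finetune_gb = (weights + gradients + adam_moments) / 1e9
    return inference_gb, finetune_gb

# A 13B-parameter model: roughly 26 GB just to run it vs. roughly 156 GB to fine-tune it,
# before activations are even counted.
print(inference_vs_finetune_gb(13e9))
```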

Addressing this memory challenge head-on, a team of researchers from Princeton University has introduced an ingenious solution: MeZO, a memory-efficient zeroth-order optimizer. MeZO adapts the classical Zeroth-Order Stochastic Gradient Descent (ZO-SGD) method, which estimates gradients solely from differences in loss values, and runs it in-place, enabling fine-tuning of language models with a memory footprint equivalent to that of inference. The researchers focused on zeroth-order approaches because they can estimate gradients using just two forward passes, making them highly memory-efficient.
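
To make the two-forward-pass idea concrete, here is a minimal sketch of a MeZO-style update, assuming a PyTorch model; the helper names (`perturb_`, `mezo_step`), the toy linear model, and the hyperparameters are illustrative rather than the authors' implementation. The memory trick is that only a random seed is stored, so the same perturbation can be regenerated on demand instead of being cached.

```python
# Minimal sketch of a MeZO-style zeroth-order update (illustrative, not the official code).
# Parameters are perturbed in-place with seeded Gaussian noise, the loss is measured twice,
# and the same seed regenerates the noise for the update, so no gradients or noise tensors
# are ever stored.
import torch
import torch.nn as nn

def perturb_(model, seed, scale):
    """Add scale * z to every parameter in-place, with z ~ N(0, I) regenerated from `seed`."""
    torch.manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            p.add_(scale * torch.randn_like(p))

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-4):
    seed = torch.randint(0, 2**31 - 1, (1,)).item()   # remember only the seed, not the noise
    perturb_(model, seed, +eps)                        # theta + eps * z
    loss_plus = loss_fn(model, batch)                  # first forward pass
    perturb_(model, seed, -2 * eps)                    # theta - eps * z
    loss_minus = loss_fn(model, batch)                 # second forward pass
    perturb_(model, seed, +eps)                        # restore theta
    grad_proj = (loss_plus - loss_minus) / (2 * eps)   # scalar projected-gradient estimate
    perturb_(model, seed, -lr * grad_proj)             # SGD step along the same z
    return loss_plus

# Toy usage on a linear regression problem (purely illustrative).
model = nn.Linear(16, 1)
x, y = torch.randn(64, 16), torch.randn(64, 1)
mse = lambda m, b: nn.functional.mse_loss(m(b[0]), b[1]).item()
for _ in range(200):
    mezo_step(model, mse, (x, y))
```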

The MeZO algorithm has been meticulously crafted to optimize Large Language Models containing billions of parameters. The team’s contributions are manifold:

  1. MeZO has been developed by modifying the ZO-SGD method, incorporating a few variations to facilitate in-place operations on models of arbitrary sizes with minimal memory overhead.
  2. MeZO has demonstrated compatibility with both full-parameter tuning and parameter-efficient fine-tuning (PEFT) techniques, including LoRA and prefix tuning.
  3. MeZO can optimize non-differentiable objectives such as accuracy or F1 score while still operating within the same memory budget as inference (see the sketch after this list).
  4. Given adequate pre-training, MeZO’s per-step optimization rate and global convergence rate depend on a condition number of the loss landscape, the effective local rank, rather than on the sheer number of parameters. This contrasts with classical ZO lower bounds, which suggest that the convergence rate can slow down in proportion to the number of parameters.
  5. Experimental results indicate that MeZO performs well across diverse model types, including masked language models and autoregressive language models, at scales from 350 million to 66 billion parameters, and on downstream tasks spanning classification, multiple-choice question answering, and text generation.
  6. MeZO surpasses zero-shot prompting, in-context learning (ICL), and linear probing across a range of experiments, matching or exceeding fine-tuning on 7 out of 11 tasks with OPT-13B while consuming roughly 12 times less memory than standard fine-tuning.

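Because the update above uses only loss values, the objective plugged into it does not have to be differentiable. As a purely illustrative example (the toy classifier, data, and `error_rate` function below are assumptions, not the paper's code), a plain error rate can serve as the objective for the `mezo_step` sketch shown earlier:

```python
# Illustrative only: a non-differentiable objective (error rate = 1 - accuracy) that can
# drive a zeroth-order step, since only forward-pass evaluations are needed.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                                   # toy 4-way classifier
x, y = torch.randn(128, 16), torch.randint(0, 4, (128,))

def error_rate(m, batch):
    xb, yb = batch
    with torch.no_grad():
        preds = m(xb).argmax(dim=-1)
    return 1.0 - (preds == yb).float().mean().item()       # no gradient exists for this

# Can be dropped into the earlier sketch, e.g. mezo_step(model, error_rate, (x, y)).
print("initial error rate:", error_rate(model, (x, y)))
```
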
Upon evaluation, MeZO successfully trained a 30-billion parameter model using a single Nvidia A100 80GB GPU, a feat that eludes traditional backpropagation, which can only train a 2.7-billion parameter LM under the same memory constraints. In summary, MeZO represents a groundbreaking memory-efficient zeroth-order optimizer, empowering effective fine-tuning of large language models.

Conclusion:

The introduction of MeZO, a memory-efficient zeroth-order optimizer, represents a significant advancement in the market for large language models. This groundbreaking solution addresses the memory challenges faced during fine-tuning, enabling more efficient and scalable optimization. With the ability to fine-tune LLMs with billions of parameters while consuming significantly less memory, MeZO opens up new possibilities for training and adapting language models to specialized domains. This breakthrough will likely drive further innovation and progress in the AI industry, unlocking enhanced performance and expanding the applications of large language models in various sectors.

Source