New Transformer architecture could enable powerful LLMs without GPUs

  • New architecture eliminates MatMul operations in large language models (LLMs).
  • Achieves comparable performance to traditional Transformers with reduced memory usage.
  • Uses ternary weights, restricted to the values -1, 0, and +1, together with additive operations instead of 16-bit floating-point weights and MatMul.
  • Employs MatMul-free components, including a MatMul-free Linear Gated Recurrent Unit (MLGRU) and a modified Gated Linear Unit (GLU), for efficient token and channel mixing.
  • Outperforms Transformer++ on certain language tasks with lower GPU memory usage and latency.

Main AI News:

Matrix multiplications (MatMul) are the most computationally expensive operations in large language models (LLMs) that use the Transformer architecture. As LLMs scale to larger sizes, the cost of MatMul grows significantly, increasing memory usage and latency during training and inference. Now, researchers at the University of California, Santa Cruz; Soochow University; and the University of California, Davis have developed a novel architecture that completely eliminates matrix multiplications from language models while maintaining strong performance at large scales.

In their paper, the researchers introduce MatMul-free language models that achieve performance on par with state-of-the-art Transformers while requiring far less memory during inference. Matrix multiplication is a fundamental operation in deep learning, where it is used to combine data and weights in neural networks. MatMul is crucial for tasks like transforming input data through layers of a neural network to make predictions during training and inference. GPUs are designed to perform many MatMul operations simultaneously, thanks to their highly parallel architecture. This parallelism allows GPUs to handle the large-scale computations required in deep learning much faster than traditional CPUs, making them essential for training and running complex neural network models efficiently.
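
To make that cost concrete, the hypothetical sketch below (not from the paper, with invented layer sizes) shows how a single dense layer reduces to one large matrix multiplication whose multiply-accumulate count grows with the product of its dimensions:

```python
import numpy as np

# Invented sizes, for illustration only.
batch, d_in, d_out = 32, 4096, 4096

x = np.random.randn(batch, d_in).astype(np.float16)   # input activations
W = np.random.randn(d_in, d_out).astype(np.float16)   # 16-bit floating-point weights

# A standard dense layer is a single MatMul: every output feature is a
# weighted sum over all input features, costing roughly
# batch * d_in * d_out multiply-accumulate operations.
y = x @ W

print(y.shape)                   # (32, 4096)
print(batch * d_in * d_out)      # ~537 million multiply-accumulates for one layer
```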

However, with LLMs scaling to hundreds of billions of parameters, MatMul operations have become a bottleneck, requiring very large GPU clusters during both training and inference phases. Replacing MatMul with a simpler operation can result in huge savings in memory and computation. But previous efforts to replace MatMul operations have produced mixed results, reducing memory consumption but slowing down operations because they do not perform well on GPUs.

Replacing MatMul with ternary operations

In the new paper, the researchers suggest replacing the traditional 16-bit floating-point weights used in Transformers with ternary weights that can take one of three values: -1, 0, or +1. They also replace MatMul with additive operations that provide equally good results at a much lower computational cost. The models are composed of “BitLinear layers” that use ternary weights.

“By constraining the weights to the set {−1, 0, +1} and applying additional quantization techniques, MatMul operations are replaced with addition and negation operations,” the researchers write.
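
As a rough illustration of the idea (a minimal sketch with a made-up threshold rule, not the paper’s BitLinear layer, which applies further quantization steps), the snippet below ternarizes a weight matrix and then computes the layer output using only additions and negations:

```python
import numpy as np

def ternarize(W, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1} (illustrative rule only)."""
    T = np.zeros_like(W, dtype=np.int8)
    T[W > threshold] = 1
    T[W < -threshold] = -1
    return T

def ternary_linear(x, T):
    """With ternary weights, each output feature is a sum of some inputs
    minus a sum of others -- no multiplications are required."""
    cols = []
    for j in range(T.shape[1]):
        pos = x[:, T[:, j] == 1].sum(axis=1)   # add inputs with weight +1
        neg = x[:, T[:, j] == -1].sum(axis=1)  # subtract inputs with weight -1
        cols.append(pos - neg)
    return np.stack(cols, axis=1)

x = np.random.randn(4, 8)
T = ternarize(np.random.randn(8, 3))
print(np.allclose(ternary_linear(x, T), x @ T))  # matches the MatMul result
```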

They also make more profound changes to the language model architecture. Transformer blocks consist of two main components: a token mixer and a channel mixer. The token mixer is responsible for integrating information across different tokens in a sequence. In traditional Transformer models, this is typically achieved using self-attention mechanisms, which use MatMul operations to compute relationships between all pairs of tokens to capture dependencies and contextual information. However, in the MatMul-free architecture described in the paper, the token mixer is implemented using a MatMul-free Linear Gated Recurrent Unit (MLGRU). The GRU is a deep learning architecture for sequence modeling that was popular before the advent of Transformers. The MLGRU processes the sequence of tokens by updating hidden states through simple ternary operations without the need for expensive matrix multiplications.
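
The sketch below gives a rough sense of how a GRU-style, MatMul-free token mixer can work; the gate names and structure are simplifying assumptions for illustration, not the paper’s exact MLGRU formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tlin(x, T):
    # T holds only {-1, 0, +1}; written as a matmul for brevity, but in a
    # MatMul-free setting it reduces to additions and negations.
    return x @ T

def mlgru(x_seq, T_f, T_c, T_g):
    """GRU-style token mixer: a hidden state carries information across
    tokens through element-wise gating, with no attention MatMul."""
    batch, seq_len, _ = x_seq.shape
    d_hidden = T_f.shape[1]
    h = np.zeros((batch, d_hidden))
    outputs = []
    for t in range(seq_len):
        x_t = x_seq[:, t, :]
        f = sigmoid(tlin(x_t, T_f))      # forget gate
        c = np.tanh(tlin(x_t, T_c))      # candidate hidden state
        h = f * h + (1.0 - f) * c        # element-wise state update
        g = sigmoid(tlin(x_t, T_g))      # output gate
        outputs.append(g * h)
    return np.stack(outputs, axis=1)     # (batch, seq_len, d_hidden)

x = np.random.randn(2, 5, 8)             # (batch, seq, d_model)
T_f, T_c, T_g = (np.random.choice([-1, 0, 1], size=(8, 8)) for _ in range(3))
print(mlgru(x, T_f, T_c, T_g).shape)     # (2, 5, 8)
```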

The channel mixer is responsible for integrating information across different feature channels within a single token’s representation. The researchers implemented their channel mixer using a Gated Linear Unit (GLU), which is also used in Llama-2 and Mistral. However, they modified the GLU to work with ternary weights instead of MatMul operations. This enabled them to reduce computational complexity and memory usage while maintaining the effectiveness of feature integration.

“By combining the MLGRU token mixer and the GLU channel mixer with ternary weights, our proposed architecture relies solely on addition and element-wise products,” the researchers write.
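
The following sketch shows one plausible shape for a GLU-style channel mixer with ternary projections; the sigmoid gating and projection names are assumptions for illustration, not the paper’s exact layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_glu(x, T_gate, T_up, T_down):
    """GLU-style channel mixer: mixes features within each token.
    All three projections use {-1, 0, +1} weights (written as matmuls here
    for brevity, but realizable with additions and negations); the gating
    itself is an element-wise product."""
    gate = sigmoid(x @ T_gate)     # gating branch
    up = x @ T_up                  # value branch
    return (gate * up) @ T_down    # gate, then project back down

x = np.random.randn(4, 8)                            # (tokens, d_model)
T_gate = np.random.choice([-1, 0, 1], size=(8, 16))
T_up = np.random.choice([-1, 0, 1], size=(8, 16))
T_down = np.random.choice([-1, 0, 1], size=(16, 8))
print(ternary_glu(x, T_gate, T_up, T_down).shape)    # (4, 8)
```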

Evaluating MatMul-free language models

The researchers compared two variants of their MatMul-free LM against the advanced Transformer++ architecture, used in Llama-2, across multiple model sizes. Interestingly, their scaling projections show that the MatMul-free LM uses additional compute more efficiently than the Transformer++ architecture to improve performance.

The researchers also evaluated the quality of the models on several language tasks. The 2.7B MatMul-free LM outperformed its Transformer++ counterpart on two advanced benchmarks, ARC-Challenge and OpenBookQA, while maintaining comparable performance on the other tasks.

“These results highlight that MatMul-free architectures are capable of achieving strong zero-shot performance on a diverse set of language tasks, ranging from question answering and commonsense reasoning to physical understanding,” the researchers write.

As expected, the MatMul-free LM has lower memory usage and latency than Transformer++, and its memory and latency advantages become more pronounced as the model size increases. For the 13B model, the MatMul-free LM used only 4.19 GB of GPU memory at a latency of 695.48 ms, whereas Transformer++ required 48.50 GB of memory at a latency of 3183.10 ms.

Optimized implementations

The researchers created an optimized GPU implementation and a custom FPGA configuration for MatMul-free language models. With the GPU implementation of the ternary dense layers, they were able to accelerate training by 25.6% and reduce memory consumption by up to 61.0% over an unoptimized baseline implementation.

“This work goes beyond software-only implementations of lightweight models and shows how scalable, yet lightweight, language models can both reduce computational demands and energy use in the real world,” the researchers write.

The researchers believe their work can pave the way for the development of more efficient and hardware-friendly deep learning architectures.

Due to computational constraints, they were not able to test the MatMul-free architecture on very large models with more than 100 billion parameters. However, they hope their work will serve as a call to action for institutions and organizations that have the resources to build the largest language models to invest in accelerating lightweight models.

Ideally, this architecture will make language models far less dependent on high-end GPUs like those from Nvidia and will enable researchers to run powerful models on other, less expensive and less supply-constrained types of processors. The researchers have released the code for the algorithm and models for the research community to build on.

“By prioritizing the development and deployment of MatMul-free architectures such as this one, the future of LLMs will only become more accessible, efficient, and sustainable,” the researchers write.

Conclusion:

The introduction of MatMul-free architecture marks a significant advancement in the efficiency of large language models (LLMs). By reducing reliance on costly GPU clusters and optimizing memory usage, this innovation opens doors for wider accessibility and sustainability in deploying powerful LLMs. As organizations seek more economical and scalable deep learning solutions, MatMul-free architectures present a compelling opportunity to reshape the landscape of AI-driven applications, potentially lowering barriers to entry and accelerating innovation across industries.

Source