Huawei Researchers Unveil Pangu-Σ: Trillion-Parameter Language Model with Sparse Architecture

TL;DR:

  • Huawei researchers have developed Pangu-Σ, a groundbreaking language model with 1.085 trillion parameters.
  • Pangu-Σ utilizes a sparse architecture, incorporating a Transformer decoder with Random Routed Experts (RRE) to achieve exceptional performance.
  • The Expert Computation and Storage Separation (ECSS) mechanism ensures efficient and scalable training systems, significantly reducing host-to-device and device-to-host communication.
  • Pangu-Σ surpasses previous state-of-the-art models on multiple downstream tasks, including conversation, translation, and code generation, without requiring multitask finetuning or instruction tuning.
  • The model’s versatility, covering over 40 natural and programming languages, and its impressive training throughput make it a significant milestone in the field of large language models.

Main AI News:

In the realm of natural language processing, large language models (LLMs) have consistently proven their mettle, showcasing remarkable capabilities in areas such as language understanding, generation, and reasoning. These models have steadily evolved over time, with successive iterations like GPT-3 pushing the boundaries of what’s possible. The latest breakthrough comes in the form of Pangu-Σ, a gargantuan language model developed by Huawei researchers, boasting a staggering 1.085 trillion parameters.

The journey to constructing Pangu-Σ has seen the emergence of various influential language models, each attempting to outdo its predecessors. Among these contenders are renowned names such as Megatron-Turing NLG, PanGu, ERNIE 3.0 Titan, Gopher, PaLM, OPT, Bloom, and GLM-130B, all wielding over one hundred billion parameters. The push beyond that scale has prompted researchers to explore sparsely-activated models, particularly the Mixture-of-Experts (MoE) approach, as a means to reach trillion-parameter counts.

There have been several notable advancements in the realm of trillion-parameter models, including Switch-C, GLaM, MoE-1.1T, Wu Dao 2.0, and M6-10T. However, only a select few have managed to achieve the desired performance levels while providing comprehensive assessment findings across a wide range of applications. Scaling efficiency emerges as the primary challenge: current research on the scaling laws of language models underscores the need for a substantial amount of training data and a reasonable compute budget to unlock the true potential of LLMs. Consequently, developing a scalable model architecture and an effective distributed training system is a pivotal motivation driving this ambitious undertaking.

Scaling the model itself presents a multifaceted endeavor. As the size of the language model increases, so does its anticipated performance. Sparse architectures, such as Mixture of Experts (MoE), emerge as an alluring avenue for scaling models without incurring a linear rise in computational costs—a drawback associated with training dense Transformer models. However, hurdles like imbalanced workloads and global communication delays continue to plague MoE models. Additionally, integrating MoE into an existing dense model and determining the optimal number of experts per layer remain unresolved challenges. Addressing these obstacles and achieving a trillion-parameter sparse model with exceptional performance and training efficiency represents a critical yet arduous task for researchers.
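To make the contrast concrete, here is a minimal sketch of the conventional learned routing the text describes: an MoE layer with a learnable top-1 gate, in the style of Switch Transformer. All names and shapes are illustrative assumptions, not Pangu-Σ's implementation.

```python
import torch
import torch.nn as nn

class Top1MoELayer(nn.Module):
    """Minimal sketch of a conventional MoE layer with a learnable
    top-1 gate (Switch-style). Illustrative only -- not Pangu-Sigma's code."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learnable router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token goes to its single best expert.
        scores = self.gate(x).softmax(dim=-1)   # (num_tokens, num_experts)
        weight, idx = scores.max(dim=-1)        # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                     # tokens routed to expert e
            if mask.any():
                # Scaling by the gate value keeps routing differentiable.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```

Note that nothing in this gate balances per-expert token counts, which is why auxiliary load-balancing losses and capacity limits are typically bolted on; as described later, RRE sidesteps the issue by removing the learnable gate altogether.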

Equally critical is scaling the system that supports these massive language models. To enable training trillion-parameter models, frameworks like DeepSpeed have been proposed. However, the primary constraint often lies in the available compute budget, particularly the number of accelerating devices (e.g., GPUs, NPUs, and TPUs) at one’s disposal. Mitigating this constraint involves employing strategies such as tensor parallelism, pipeline parallelism, the zero redundancy optimizer, and rematerialization across thousands of accelerating devices, enabling practitioners to train trillion-parameter models with viable batch sizes. Additionally, heterogeneous computing strategies, such as offloading a portion of the processing to host machines, can help minimize the strain on computing resources.
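As a concrete illustration of how such strategies are combined in practice, below is a plausible DeepSpeed-style configuration enabling optimizer-state sharding (ZeRO), CPU offloading, and activation rematerialization. The values are assumptions for a hypothetical setup; Pangu-Σ itself was trained with MindSpore, not DeepSpeed.

```python
# Illustrative DeepSpeed configuration combining the strategies above.
# All values are assumptions for a hypothetical run, not Pangu-Sigma's setup.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 64,
    "zero_optimization": {
        "stage": 3,                          # shard params, grads, optimizer state
        "offload_optimizer": {               # keep optimizer state in host memory
            "device": "cpu",
            "pin_memory": True,
        },
        "offload_param": {"device": "cpu"},  # offload parameters to the host too
    },
    "activation_checkpointing": {            # rematerialize activations on backward
        "partition_activations": True,
    },
    "fp16": {"enabled": True},
}

# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```

The trade-off such a configuration makes explicit is exactly the one the next paragraph raises: host offloading buys memory headroom at the price of host-device bandwidth.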

Nevertheless, the existing methodologies suffer from poor bandwidth between hosts and devices, along with limited computational power on CPUs compared to accelerating devices. These factors hamper the ability to feed copious amounts of data to large language models, thereby hindering their optimal performance. Effectively scaling system performance within the confines of a restricted computing budget becomes a critical factor in harnessing the potential of these mammoth language models.

Enter Pangu-Σ, the brainchild of Huawei researchers. In their groundbreaking paper, they introduce this colossal language model featuring a sparse architecture and an awe-inspiring 1.085 trillion parameters. Developed within the MindSpore framework, Pangu-Σ underwent rigorous training for over 100 days on a cluster equipped with 512 Ascend 910 AI Accelerators, consuming a staggering 329 billion tokens.

Central to Pangu-Σ’s design is the integration of a Transformer decoder architecture with Random Routed Experts (RRE), which expands the model’s built-in parameters. Unlike traditional MoE architectures, RRE adopts a two-level routing approach. At the first level, experts are organized by task or domain; at the second level, tokens are randomly assigned to an expert within each group, without any learnable gating function of the kind seen in MoE models. Leveraging the RRE architecture empowers researchers to extract sub-models from Pangu-Σ for diverse downstream applications such as conversation, translation, code generation, and general natural language understanding.
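The following sketch shows how such two-level routing could look in code. The domain grouping and the hash-based token assignment are assumptions made for illustration; the paper specifies only that the second level assigns tokens randomly, with no learnable gate.

```python
import hashlib

def rre_route(token_id: int, domain: str,
              domain_groups: dict[str, list[int]]) -> int:
    """Sketch of two-level Random Routed Experts (RRE) routing.

    Level 1: select the expert group for the token's task/domain.
    Level 2: pick one expert inside that group with no learnable gate --
    here via a stable hash, an assumption about how a fixed pseudo-random
    assignment could be realized.
    """
    group = domain_groups[domain]                  # level 1: domain -> expert group
    digest = hashlib.md5(str(token_id).encode()).digest()
    slot = int.from_bytes(digest[:4], "little") % len(group)
    return group[slot]                             # level 2: fixed random pick

# Hypothetical grouping of expert ids by domain.
domain_groups = {
    "dialogue":    [0, 1, 2, 3],
    "translation": [4, 5, 6, 7],
    "code":        [8, 9, 10, 11],
}
print(rre_route(token_id=42, domain="code", domain_groups=domain_groups))
```

Because no gate weights tie the groups together, the experts serving one domain can be sliced out along with the shared backbone to form a standalone sub-model, which is what enables the extraction of task-specific models described above.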

To ensure an efficient and scalable training system, the researchers propose the Expert Computation and Storage Separation (ECSS) mechanism. This approach achieves an observed throughput of 69,905 tokens/s while training the 1.085 trillion-parameter Pangu-Σ on a cluster of 512 Ascend 910 accelerators, and it significantly reduces host-to-device and device-to-host communication during optimizer update computation. Overall, training throughput is 6.3 times that of a model trained with identical hyperparameters but a conventional MoE architecture.
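A toy sketch of the underlying idea follows: optimizer state stays resident on the host, and only the experts activated in the current step cross the host-device boundary. The momentum-SGD update rule and the data layout are assumptions for illustration, not the paper's actual mechanism.

```python
import torch

def sparse_expert_update(params: dict[int, torch.Tensor],
                         grads: dict[int, torch.Tensor],
                         host_momentum: dict[int, torch.Tensor],
                         activated: set[int],
                         lr: float = 1e-4) -> None:
    """Toy illustration of the ECSS idea: optimizer state lives on the
    host, and only experts activated this step generate host<->device
    traffic. Update rule and layout are assumptions, not the paper's code."""
    for e in activated:                        # untouched experts move no data at all
        grad = grads[e].to("cpu")              # device -> host: one expert's gradient
        buf = host_momentum[e]                 # momentum buffer resident on the host
        buf.mul_(0.9).add_(grad)               # optimizer math runs on the CPU
        step = -lr * buf
        params[e].add_(step.to(params[e].device))  # host -> device: the update only
```

With many experts per layer but only a small subset activated in any given step, most parameters never touch the host-device link, which is the kind of communication saving the paper attributes to ECSS.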

In the Chinese domain, a sub-model of Pangu-Σ surpasses the previous state-of-the-art models on multiple downstream tasks across six categories, even in zero-shot settings without multitask finetuning or instruction tuning. Trained on 329 billion tokens covering more than 40 natural and programming languages, Pangu-Σ exhibits superior performance in the relevant domains. Moreover, the researchers conducted extensive evaluations across application domains such as conversation, machine translation, and code generation, further substantiating the versatility and effectiveness of Pangu-Σ.

Conclusion:

The introduction of Huawei’s Pangu-Σ represents a remarkable development in the market for large language models. With its sparse architecture and staggering parameter count, Pangu-Σ showcases the potential for enhanced language understanding and generation. This breakthrough will likely drive increased competition among industry players and spur further advancements in natural language processing. Organizations across various sectors can anticipate improved conversational AI, machine translation, and code generation capabilities, leading to enhanced user experiences and expanded applications for language-based technologies.
