TL;DR:
- Argonne National Laboratory (ANL) is training a colossal AI model named AuroraGPT with one trillion parameters.
- AuroraGPT is being trained on ANL’s high-performance Aurora supercomputer, powered by Intel’s Ponte Vecchio GPUs.
- Intel and ANL are collaborating with global research labs to accelerate scientific AI development.
- AuroraGPT, also known as “ScienceGPT,” will feature a chatbot interface for researchers to seek answers and insights.
- Potential applications of this scientific AI model include biology, cancer research, and climate change.
- The training process has begun and may take months to complete, scaling from 256 to 10,000 nodes.
- Challenges in training large language models, such as their steep memory requirements, are addressed with Microsoft’s Megatron-DeepSpeed framework.
- Intel aims for linear scaling to boost performance as the number of nodes increases.
Main AI News:
In a groundbreaking development, Argonne National Laboratory (ANL) has begun training a massive AI model with one trillion parameters. This ambitious project, named AuroraGPT, is poised to reshape the landscape of scientific computing and become an indispensable resource for researchers worldwide.
AuroraGPT’s training takes place on ANL’s cutting-edge Aurora supercomputer, which delivers roughly half an exaflop of performance. This computational powerhouse is equipped with Intel’s Ponte Vecchio GPUs, providing the essential muscle for this monumental endeavor.
Collaboration is at the heart of this initiative, as Intel and ANL join forces with other research institutions in the United States and around the globe to usher in the era of scientific AI. The objective? To fuse an extensive corpus of text, code, scientific findings, and research papers into a versatile model capable of accelerating scientific discoveries.
Ogi Brkic, Intel’s Vice President and General Manager for Data Center and HPC Solutions, remarked, “It combines all the text, codes, specific scientific results, papers, into the model that science can use to speed up research.” Brkic also hinted at the model’s forthcoming designation as “ScienceGPT,” indicating an accessible chatbot interface where researchers can pose inquiries and receive prompt responses.
The implications of this development span a broad spectrum of scientific domains, ranging from biology to cancer research and climate change. Chatbots integrated into the scientific research process could streamline and enhance the quest for knowledge across these critical fields.
Training such a complex model demands significant time and computing resources. ANL and Intel are currently in the preliminary stages of hardware testing before initiating full-scale training. While the model is expected to function much like ChatGPT, its potential for generating images and videos remains uncertain. Inference capability will also play a pivotal role as scientists interact with the chatbot and continually feed it new information.
The commencement of AuroraGPT’s training marks the start of a journey that could span several months. The initial phase involves training on 256 nodes, with plans to scale up to the Aurora supercomputer’s full complement of 10,000 nodes.
It is worth noting that OpenAI has yet to disclose the training duration of GPT-4, which was trained on Nvidia GPUs. In parallel, Google has been actively training its forthcoming large language model, Gemini, presumably on its TPUs.
One of the most formidable challenges in training large language models is their memory footprint, which often forces a model to be distributed across numerous GPUs. AuroraGPT leverages Microsoft’s Megatron-DeepSpeed framework, which pairs DeepSpeed’s memory-partitioning optimizations with NVIDIA’s Megatron-LM parallelism, to parallelize training and optimize performance.
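To make the memory-partitioning idea concrete, here is a minimal, hypothetical sketch of a DeepSpeed ZeRO training setup. None of the values reflect Argonne’s actual configuration; the toy model, batch sizes, and offload settings are placeholder assumptions for illustration only.

```python
# Minimal sketch of memory-partitioned training with DeepSpeed (ZeRO stage 3).
# The real AuroraGPT setup is not public: the model, batch sizes, and ZeRO
# options below are illustrative placeholders, not Argonne's configuration.
import torch.nn as nn
import deepspeed

# Toy stand-in for a transformer; a trillion-parameter model would be
# sharded across thousands of GPUs rather than instantiated on one device.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer state
        "offload_param": {"device": "cpu"},  # optional CPU offload to cut GPU memory
    },
}

# deepspeed.initialize wraps the model for distributed execution; run the
# script with the `deepspeed` launcher so ranks and communication are set up.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

The key point this illustrates is that ZeRO shards the parameters, gradients, and optimizer state across all participating devices, so no single GPU ever has to hold the full trillion-parameter model in memory.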
Intel and ANL are conducting initial testing of the one-trillion-parameter model on a cluster of 64 Aurora nodes. Notably, this node count is lower than typical for large language models because of Aurora’s dense node design: each node carries six Ponte Vecchio GPUs. Intel has collaborated closely with Microsoft to fine-tune both software and hardware, with the ultimate goal of extending training across the entire 10,000-plus-node system. Linear scaling remains a key aspiration, with throughput ideally growing in proportion to the number of nodes, as sketched below.
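As a back-of-the-envelope illustration of what “linear scaling” means in practice, the following sketch computes scaling efficiency from hypothetical throughput measurements; Intel and ANL have not published per-node figures for the AuroraGPT runs.

```python
# Quick check of how close a run comes to linear scaling: with perfect
# scaling, throughput grows in direct proportion to the node count.
# The throughput figures below are hypothetical, for illustration only.
def scaling_efficiency(base_nodes: int, base_throughput: float,
                       nodes: int, throughput: float) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    ideal_speedup = nodes / base_nodes
    actual_speedup = throughput / base_throughput
    return actual_speedup / ideal_speedup

# Example: scaling from 64 to 256 nodes. Linear scaling would quadruple
# throughput (efficiency 1.0); a 3.6x gain corresponds to 90% efficiency.
print(scaling_efficiency(64, 1.0, 256, 3.6))  # 0.9
```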
Brkic emphasized that Intel’s Ponte Vecchio GPUs have outperformed Nvidia’s A100 GPUs in another Argonne supercomputer, Theta, which has a peak performance of 11.7 petaflops. This underscores the potential of AuroraGPT and the significance of the Intel-ANL collaboration in the realm of scientific AI.
Conclusion:
Argonne National Laboratory’s AuroraGPT, built in collaboration with Intel and global research institutions, marks a significant leap in the world of scientific AI. This venture has the potential to reshape research methodologies and accelerate scientific discoveries across multiple domains, presenting lucrative opportunities for businesses and institutions in the AI and high-performance computing markets.