TL;DR:
- Nvidia’s GPUs are crucial for cloud providers due to customer preference for the CUDA programming model.
- Google’s A3 VMs enable running AI applications and utilizing AI development and management services.
- A3 supercomputer allows renting GPUs for training large-scale models and updating them with new data.
- A3 supercomputer combines various technologies to enhance GPU-to-GPU communications and network performance.
- A3 VMs are based on Intel’s Sapphire Rapids chips, accompanied by H100 GPUs and DDR5 memory.
- Training models on Nvidia H100 GPUs is faster and more cost-effective than on previous-generation A100 GPUs.
- A3 VMs are suitable for inference workloads, providing up to 30x performance boost compared to A2 VMs with A100 GPUs.
- The A3 VMs utilize the infrastructure processing unit (IPU) called Mount Evans for networking, storage, and security offloading.
- The IPU enables data transfers at 200Gbps, offering significant network bandwidth improvement.
- Microsoft’s upcoming AI supercomputer may challenge the IPU’s throughput with Quantum-2 400Gbps networking capabilities.
- A3 supercomputer utilizes the Jupiter datacenter networking fabric, connecting GPU clusters via optical links.
- Each A3 server packs eight H100 GPUs interconnected using Nvidia’s switching and chip interconnect technology.
- The GPUs are linked through NVSwitch and NVLink interconnects, delivering approximately 3.6TBps of aggregate bandwidth.
- Azure’s AI supercomputer offers similar speeds, as it uses the same Nvidia board designs.
- Multiple IPUs on the Jupiter DC network fabrics enable communication between GPU servers.
- Nvidia’s DGX Superpod follows a similar setup with 127 nodes, each equipped with eight H100 GPUs.
Main AI News:
Cloud providers are strategically strengthening their AI capabilities by assembling vast armies of GPUs. In a significant move, Google unveiled its latest creation, the Compute Engine A3 supercomputer, boasting an astounding fleet of 26,000 GPUs. This resounding display of power serves as yet another testament to Google’s unwavering commitment to secure a dominant position in the fierce AI race, a contest in which it squares off against its formidable adversary, Microsoft.
At the heart of this mammoth supercomputer lie the state-of-the-art Nvidia H100 Hopper GPUs, a fleet that approaches the scale of the world’s fastest public supercomputer, Frontier, which is equipped with roughly 37,000 AMD Instinct MI250X GPUs. The implications of such colossal computational prowess are immense.
“For our largest customers, we can build A3 supercomputers up to 26,000 GPUs in a single cluster and are working to build multiple clusters in our largest regions,” revealed a Google spokeswoman via email. However, it’s worth noting that not all of their locations will witness the deployment of such colossal supercomputing power.
Google chose its esteemed Google I/O developer conference, held in Mountain View, California, as the ideal platform to announce this groundbreaking creation. Over the years, this conference has become a veritable showcase for Google, spotlighting the extraordinary capabilities of its AI software and hardware. It was at this very event that Google’s resolve to accelerate its AI development grew stronger, prompted by Microsoft’s integration of OpenAI technologies into its Bing search engine and office productivity applications.
The A3 supercomputer’s primary focus is to cater to customers in need of training large language models. Recognizing this need, Google also introduced A3 virtual machine instances, designed specifically for enterprises seeking to leverage the incredible capabilities of this supercomputer. It’s worth noting that numerous cloud providers are now deploying H100 GPUs, with Nvidia itself launching its DGX Cloud service in March, though that offering is priced considerably higher than renting previous-generation A100 GPUs.
With the A3 supercomputer, Google takes a significant leap beyond the computational capabilities of its existing A2 virtual machines featuring Nvidia’s A100 GPUs. Taken together, the A3 computing instances spread across different geographical regions amount to a remarkable upgrade in available computing power.
“The A3 supercomputer’s scale provides up to 26 exaflops of AI performance, significantly enhancing the efficiency and cost-effectiveness of training large ML models,” explained Roy Kim, Director at Google, and Chris Kleban, Product Manager, in a compelling blog entry.
While exaflops is a commonly employed metric for gauging the raw power of AI systems, critics urge caution in reading it. Google’s figure counts TF32 Tensor Core performance, a lower-precision format targeted at training that yields “exaflops” nearly 30 times higher than the double-precision (FP64) floating point math still utilized by most classic high-performance computing applications.
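As a rough sanity check, the 26-exaflop headline is consistent with Nvidia’s published H100 SXM spec-sheet peaks. The per-GPU figures below are assumptions drawn from that spec sheet (TF32 with sparsity), not numbers released by Google:

```python
# Back-of-the-envelope check on the 26-exaflop figure, using Nvidia's
# published H100 SXM peaks (assumed values, not Google's numbers).
H100_TF32_TFLOPS = 989   # TF32 Tensor Core peak with sparsity, per GPU
H100_FP64_TFLOPS = 34    # standard double-precision peak, per GPU
GPU_COUNT = 26_000

tf32_total = GPU_COUNT * H100_TF32_TFLOPS / 1e6   # TFLOPS -> exaflops
fp64_total = GPU_COUNT * H100_FP64_TFLOPS / 1e6

print(f"TF32 aggregate: {tf32_total:.1f} exaflops")   # ~25.7
print(f"FP64 aggregate: {fp64_total:.2f} exaflops")   # ~0.88
print(f"TF32/FP64 ratio: {H100_TF32_TFLOPS / H100_FP64_TFLOPS:.0f}x")  # ~29x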
Nevertheless, the number of GPUs has emerged as a crucial indicator for cloud providers to showcase their AI computing services. In Microsoft’s Azure, the collaborative AI supercomputer built in partnership with OpenAI flaunts an impressive lineup of 285,000 CPU cores and 10,000 GPUs, with plans underway for an even more GPU-rich next-generation model. Oracle’s cloud service grants access to clusters equipped with 512 GPUs, and the company is actively developing technologies to amplify GPU communication speed.
Google has also been generating considerable buzz around its TPU v4 artificial intelligence chips, which run internal AI applications such as the innovative Google Bard. Its DeepMind unit has made it clear that these swift TPUs play a vital role in advancing AI across various domains, both general and scientific.
The dominance of Nvidia’s GPUs in the realm of cloud providers is undeniable, as customers heavily rely on Nvidia’s proprietary parallel programming model, CUDA, for developing AI applications. This software toolkit leverages the accelerated performance delivered by the specialized AI and graphics cores of the H100 GPUs, producing unparalleled speed and efficiency.
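To make that concrete, here is a minimal sketch of how an AI framework typically reaches the CUDA stack. PyTorch (one common path among many) dispatches the matrix multiply below to the GPU through Nvidia’s CUDA libraries, and the TF32 switch engages the Tensor Cores on Ampere- and Hopper-class hardware. The matrix sizes are arbitrary:

```python
# Minimal sketch: an AI framework (PyTorch here) reaching the GPU through
# the CUDA stack. The matmul below is dispatched to cuBLAS on the device,
# and the TF32 switch lets Ampere/Hopper Tensor Cores accelerate it.
import torch

assert torch.cuda.is_available(), "requires an Nvidia GPU with CUDA"
print(torch.cuda.get_device_name(0))          # e.g. "NVIDIA H100 80GB HBM3"

torch.backends.cuda.matmul.allow_tf32 = True  # opt in to TF32 Tensor Core math

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                     # runs on the GPU, not the host CPU
torch.cuda.synchronize()      # wait for the asynchronous kernel to finish
```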
With Google’s A3 virtual machines (VMs), customers can seamlessly run their AI applications and harness the full potential of Google’s AI development and management services, accessible through Vertex AI, Google Kubernetes Engine, and Google Compute Engine. These services provide a comprehensive ecosystem for companies to leverage the power of GPUs on the A3 supercomputer, enabling them to train large-scale models, including large language models. What’s impressive is that a model can be updated with new data without requiring a complete retraining from scratch.
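That workflow is, at its core, checkpoint-and-resume. Below is a minimal PyTorch sketch of the pattern; the model, file name, and data are hypothetical placeholders, not Google’s or any customer’s actual setup:

```python
# Sketch of the checkpoint-and-resume pattern behind "update with new data
# without retraining from scratch". Model, file name, and data are
# hypothetical placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# ... initial large-scale training would happen here ...
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "ckpt.pt")

# Later: restore the trained state and continue on the fresh data only.
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])

new_data = TensorDataset(torch.randn(256, 512), torch.randint(0, 10, (256,)))
loss_fn = nn.CrossEntropyLoss()
for inputs, labels in DataLoader(new_data, batch_size=32):
    optimizer.zero_grad()
    loss_fn(model(inputs), labels).backward()
    optimizer.step()
```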
Google’s A3 supercomputer integrates various cutting-edge technologies to optimize GPU-to-GPU communications and network performance. The A3 VMs pair Intel’s fourth-generation Xeon chips (codenamed Sapphire Rapids) with the H100 GPUs and DDR5 memory, though it remains unclear whether the VMs’ virtual CPUs expose the inferencing accelerators embedded in the Sapphire Rapids chips.
Training models on Nvidia H100 GPUs offers significant advantages in terms of speed and cost compared to the previous-generation A100 GPUs, which are widely available in the cloud. MosaicML, an AI services company, conducted a study that found the H100 GPUs to be 30% more cost-effective and three times faster than the Nvidia A100 GPUs when training its seven-billion-parameter MosaicGPT large language model.
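The two MosaicML numbers fit together via simple arithmetic: training cost is hourly price times hours, so a large speedup can more than offset a higher rental price. The hourly prices below are illustrative assumptions, not quoted cloud rates:

```python
# How "3x faster" and "30% cheaper" fit together: training cost is
# (hourly price) x (hours), so a speedup can offset a higher rental price.
# The hourly prices are illustrative assumptions, not quoted cloud rates.
speedup = 3.0                 # MosaicML: H100 trains ~3x faster than A100
a100_price = 2.0              # assumed A100 rate, $/GPU-hour
h100_price = 4.2              # assumed H100 rate, $/GPU-hour

a100_hours = 1000.0           # assumed duration of the A100 training run
h100_hours = a100_hours / speedup

cost_ratio = (h100_price * h100_hours) / (a100_price * a100_hours)
print(f"H100 cost vs A100: {cost_ratio:.2f}")   # 0.70 -> ~30% cheaper
```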
Although the H100 GPUs can handle inferencing tasks, their sheer processing power may be considered excessive. For inferencing workloads, Google Cloud provides Nvidia’s L4 GPUs, while Intel incorporates inferencing accelerators into its Sapphire Rapids CPUs.
According to Google’s Kim and Kleban, the A3 VMs deliver a substantial boost in inference performance, reaching up to 30 times the performance of the A2 VMs equipped with A100 GPUs.
A notable feature of the A3 VMs is their utilization of the infrastructure processing unit (IPU) called Mount Evans, jointly developed by Google and Intel. This IPU offloads networking, storage management, and security functions traditionally performed by virtual CPUs, enabling data transfers at a remarkable rate of 200Gbps.
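For a sense of scale, here is a quick conversion of that line rate into bytes per second and an idealized transfer time; the shard size is hypothetical and protocol overhead is ignored:

```python
# Putting 200 Gbps in perspective: the line rate in bytes per second, and
# an idealized transfer time for a large training shard (hypothetical
# size; protocol overhead ignored).
link_gbps = 200
bytes_per_sec = link_gbps * 1e9 / 8                 # 25 GB/s
shard_gb = 500
seconds = shard_gb * 1e9 / bytes_per_sec
print(f"{bytes_per_sec / 1e9:.0f} GB/s; {shard_gb} GB in {seconds:.0f} s")
```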
While Google’s A3 supercomputer boasts impressive IPU capabilities, Microsoft’s upcoming AI supercomputer, equipped with Nvidia’s H100 GPUs, will introduce the chipmaker’s Quantum-2 400Gbps networking technology, potentially challenging the IPU’s throughput. However, Microsoft has yet to disclose the number of H100 GPUs in its next-generation AI supercomputer.
The backbone of Google’s A3 supercomputer relies on the Jupiter data center networking fabric, facilitating the interconnection of geographically dispersed GPU clusters through optical links. Google asserts that its workload bandwidth achieves levels comparable to more expensive off-the-shelf non-blocking network fabrics for almost every workload structure.
Additionally, the A3 supercomputer is built from server blocks of eight H100 GPUs apiece, interconnected by Nvidia’s proprietary switching and chip interconnect technology. Within each block, the GPUs are linked via the NVSwitch and NVLink interconnects, providing aggregate bandwidth of approximately 3.6TBps. Azure’s AI supercomputer reaches the same speed, as both Google and Microsoft deploy Nvidia’s board designs.
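That 3.6TBps figure is consistent with simple arithmetic on Nvidia’s published per-GPU NVLink bandwidth. The sketch below assumes the figure refers to bisection bandwidth, with 900 GB/s of NVLink per H100:

```python
# Where ~3.6 TB/s plausibly comes from: each H100 exposes 900 GB/s of
# NVLink bandwidth, and a bisection cut of an 8-GPU server leaves four
# GPUs on each side of the cut (interpretation assumed here).
nvlink_gbps_per_gpu = 900      # H100 NVLink aggregate per GPU, GB/s
gpus_per_server = 8

bisection_gbps = (gpus_per_server // 2) * nvlink_gbps_per_gpu
print(f"bisection bandwidth: {bisection_gbps / 1000:.1f} TB/s")  # 3.6 TB/s
```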
To establish communication between GPU servers, Google employs multiple IPUs on its Jupiter DC network fabrics. This setup resembles Nvidia’s DGX SuperPOD, which features 127 nodes, each DGX node equipped with eight H100 GPUs.
Source: HPCwire
Conclusion:
The emergence of Google’s A3 supercomputer with its massive fleet of Nvidia H100 GPUs, along with the growing adoption of GPUs by cloud providers, underscores the escalating demand for powerful AI computing capabilities. This trend highlights the pivotal role that GPUs play in driving AI applications and their increasing significance in the market. With Google and Microsoft vying for AI supremacy through their respective supercomputers powered by advanced GPU technologies, the market is witnessing intensified competition, pushing the boundaries of AI performance and efficiency.
This heightened focus on GPU-based computing solutions bodes well for businesses seeking to leverage AI advancements, as it offers expanded opportunities for training large-scale models and accelerating inferencing workloads. As the market evolves, we can expect further innovations in GPU technology and heightened competition among cloud providers, ultimately driving the proliferation of AI capabilities across various industries.