Meta Unveils Details of Two New 24k GPU AI Clusters

  • Meta reveals details of two new 24,000-GPU data center scale clusters.
  • Clusters based on Meta’s AI Research SuperCluster (RSC), geared towards training the Llama 3 large language AI model.
  • Each cluster houses 24,576 Nvidia Tensor Core H100 GPUs, an upgrade from the previous 16,000 Nvidia A100 GPUs.
  • Meta aims to expand its infrastructure to 350,000 Nvidia H100s by the end of 2024.
  • Clusters differ in network infrastructure, one utilizing RDMA over converged Ethernet (RoCE), the other featuring Nvidia Quantum2 InfiniBand fabric.
  • Built using Meta’s Grand Teton GPU hardware platform and Open Rack v3 architecture.
  • Customized rack configurations for optimal throughput capacity, rack count reduction, and power efficiency.
  • Storage uses the Linux Filesystem in Userspace (FUSE) API backed by Meta’s ‘Tectonic’ distributed storage solution; Meta is partnering with Hammerspace on a parallel NFS deployment.
  • Storage servers are based on the YV3 Sierra Point server platform with high-capacity E1.S SSDs.
  • Meta continues to evolve the PyTorch framework for scalable GPU training.

Main AI News:

Meta has shared intricate specifics of its latest hardware and software infrastructure, unveiling two groundbreaking 24,000-GPU data center clusters. These clusters are pivotal components in Meta’s ongoing endeavor to train its expansive Llama 3 large language AI model, and the disclosure sheds light on a meticulously crafted ecosystem designed to foster breakthroughs in AI research and development.

Crafted on the foundation of Meta’s AI Research SuperCluster (RSC), initially introduced in 2022, these clusters represent a significant leap forward in computational prowess. Each cluster is outfitted with a staggering 24,576 Nvidia Tensor Core H100 GPUs, a notable upgrade from the preceding configuration, which housed 16,000 Nvidia A100 GPUs. This augmentation in GPU capacity is poised to usher in an era of enhanced capabilities, empowering Meta to explore more expansive and intricate AI models.

Meta envisions a future where these clusters pave the path for monumental advancements in generative AI product development. By the close of 2024, the company plans to grow its infrastructure to 350,000 Nvidia H100s, an expansion forecast to give Meta compute equivalent to nearly 600,000 H100s, amplifying its capacity for innovation.

Despite the uniformity in GPU count across both clusters, each diverges in network infrastructure design. While both solutions feature 400 Gbps endpoints, Meta built one with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric, leveraging the Arista 7800 alongside Wedge400 and Minipack2 OCP rack switches. The second cluster, by contrast, incorporates an Nvidia Quantum2 InfiniBand fabric.
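
To make the distinction concrete, the sketch below shows how a PyTorch training job might steer NCCL toward one fabric or the other. The environment variables are real NCCL settings, but the values shown (adapter names, GID index) are placeholders rather than details Meta has published.

```python
# Illustrative sketch only: steering NCCL toward a given fabric from a
# PyTorch job. The environment variables are real NCCL settings, but the
# values (NIC names, GID index) are placeholders, not Meta's configuration.
import os

import torch.distributed as dist

# On an RoCE fabric, NCCL still drives the NICs through the InfiniBand
# verbs transport; NCCL_IB_HCA names which adapters to use, and the GID
# index selects the RoCE v2 address.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"  # placeholder adapter names
os.environ["NCCL_IB_GID_INDEX"] = "3"        # commonly the RoCE v2 GID index

# On a native InfiniBand fabric, the defaults usually suffice: NCCL
# autodetects the HCAs and uses verbs directly, so no overrides are needed.

# Either way, the training code itself does not change. Assumes a launch
# via torchrun, which supplies rank and rendezvous environment variables.
dist.init_process_group(backend="nccl")
```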

Furthermore, both clusters are underpinned by Meta’s proprietary open GPU hardware platform, Grand Teton. This cutting-edge platform, succeeding the Zion-EX, offers quadruple the host-to-GPU bandwidth, coupled with double the compute and data network bandwidth and power envelope, revolutionizing large-scale AI workloads.

The clusters’ development also hinges on Meta’s Open Rack power and rack architecture, meticulously designed to provide unparalleled flexibility in data center environments. Notably, the Open Rack v3 hardware facilitates the installation of power shelves anywhere in the rack, enabling customizable configurations tailored to specific requirements.

For storage, the clusters rely on the Linux Filesystem in Userspace (FUSE) API backed by Meta’s ‘Tectonic’ distributed storage solution. Collaborating with Hammerspace, Meta is also developing a parallel network file system (NFS) deployment, underscoring its commitment to innovation through strategic partnerships.
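
In practical terms, exposing Tectonic through the FUSE API means a training job can read and write checkpoints with ordinary file I/O. The snippet below is a minimal illustration of that idea; the mount point and filename are hypothetical, not details from Meta’s deployment.

```python
# Illustrative sketch only: because Tectonic is exposed through the Linux
# FUSE API, a training job can checkpoint with ordinary file I/O. The mount
# point and filename below are hypothetical, not Meta deployment details.
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)  # stand-in for a real model

# The FUSE mount makes the distributed store look like a local filesystem,
# so the standard torch.save / torch.load calls work unchanged.
checkpoint_path = "/mnt/tectonic/checkpoints/step_1000.pt"  # hypothetical path
torch.save(model.state_dict(), checkpoint_path)

# Restoring is the mirror image; in a multi-rank job, typically only rank 0
# writes the checkpoint while every rank reads it back.
model.load_state_dict(torch.load(checkpoint_path))
```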

The clusters’ storage servers are built on the YV3 Sierra Point server platform, fitted with high-capacity E1.S SSDs. To optimize network utilization, Meta has implemented a series of enhancements to network topology and routing, alongside deploying the Nvidia Collective Communications Library (NCCL), ensuring efficient communication between GPUs across the networking infrastructure.
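
The collective operations NCCL provides, such as the gradient all-reduce at the heart of data-parallel training, are precisely the traffic these fabrics are tuned to carry. The following is a generic PyTorch example of such a collective, assuming a job launched with torchrun; it is not Meta’s training code.

```python
# Illustrative sketch only: a minimal PyTorch job using the NCCL backend.
# A gradient all-reduce like this is the collective traffic these network
# fabrics are tuned for. Assumes a launch via torchrun, which sets
# LOCAL_RANK and the rendezvous environment variables.
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one GPU per process

# Each rank contributes its own tensor; NCCL sums them across all GPUs.
grad = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

dist.destroy_process_group()
```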

Meta’s commitment to innovation extends beyond hardware, as evidenced by its continuous refinement of PyTorch, its foundational AI framework. This ongoing work is geared toward supporting training runs that span hundreds of thousands of GPUs, reaffirming Meta’s dedication to driving progress in the AI landscape.
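
One publicly documented PyTorch capability relevant at this scale is FullyShardedDataParallel (FSDP), which shards parameters, gradients, and optimizer state across GPUs so that models too large for a single device can still be trained. The sketch below shows the general pattern; it is illustrative only and makes no claim about how Meta actually trains Llama 3.

```python
# Illustrative sketch only: PyTorch's FullyShardedDataParallel (FSDP) shards
# parameters, gradients, and optimizer state across GPUs, one documented
# route to training models too large for a single device. This is a generic
# example launched via torchrun, not Meta's Llama 3 training stack.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A toy model standing in for a large transformer.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)  # parameters are now sharded across the process group

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()   # gradients are reduce-scattered across ranks
optimizer.step()

dist.destroy_process_group()
```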

In a collaborative blog post authored by Kevin Lee, Adi Gangidi, and Mathew Oldham, Meta emphasizes its steadfast commitment to open innovation in both software and hardware realms. The establishment of the AI Alliance underscores Meta’s ethos of fostering transparency, scrutiny, and trust within the AI community, ultimately cultivating an ecosystem conducive to widespread innovation and responsible development.

Looking ahead, Meta remains resolute in its pursuit of innovation, recognizing the imperative of adaptability in a landscape defined by rapid evolution. As the company continues to refine its infrastructure across physical, virtual, and software layers, it endeavors to cultivate systems that are not only resilient but also primed to accommodate the dynamic demands of tomorrow’s AI landscape.

Conclusion:

The unveiling of Meta’s advanced 24k GPU AI clusters marks a significant leap forward in AI infrastructure. With enhanced capabilities to support larger models and complex computations, Meta’s commitment to innovation and open collaboration through the AI Alliance sets a benchmark for the industry. This development underscores the increasing demand for scalable AI solutions and the imperative for companies to invest in flexible and reliable infrastructure to keep pace with evolving technology trends.