Optical Interconnects: The Key to Meeting the Future Demands of AI Training

TL;DR:

  • Artificial intelligence (AI) has disrupted the datacenter, requiring a reevaluation of computing, storage, and networking.
  • Noam Mizrahi, CTO at Marvell, discusses the impact of AI on network architectures.
  • GPT-4 was trained on a massive cluster of 10,000 Nvidia A100 GPUs, while GPT-5 is expected to be trained on an even larger cluster of 25,000 H100 GPUs.
  • The effective performance increase between the GPT-4 and GPT-5 training clusters is estimated at 15X.
  • The scale of these AI training clusters rivals that of exascale supercomputers.
  • Optical interconnects are crucial for future GPU clusters due to their size and low-latency requirements.
  • Marvell may explore developing optical switches similar to the “Apollo” switches Google uses in its TPUv4 clusters.
  • Disaggregated and composable infrastructure may be a side benefit of a shift to optical switching and interconnects.
  • The role of the CXL protocol in this evolving landscape is yet to be determined.

Main AI News:

Artificial intelligence (AI) has caused a paradigm shift in data centers, compelling companies to reevaluate the balance between computing, storage, and networking as demand curves turn hyper-exponential. To gain insight into how AI is shaping network architectures, we spoke with Noam Mizrahi, corporate chief technology officer at Marvell, a leading chip manufacturer.

Mizrahi began his career at Marvell as a verification engineer. Apart from a brief tenure at Intel starting in 2013, where he contributed to product definition and future CPU strategy, he has spent his entire career designing chips at Marvell. Starting with CPU interfaces for diverse PowerPC and MIPS controllers, he progressed to architect for the controller line and ultimately to chief architect for the Armada XP Arm-based system-on-chip designs. Mizrahi was named a Technology Fellow in 2017 and later appointed Senior Fellow and the company’s CTO in 2020, just as the global coronavirus pandemic took hold.

To grasp the magnitude of the subject at hand, consider the training of the GPT-4 generative AI platform. Microsoft and OpenAI trained it on a cluster of 10,000 Nvidia “Ampere” A100 GPUs and 2,500 CPUs. Reports circulating within the industry indicate that GPT-5 will be trained on an even larger cluster of 25,000 “Hopper” H100 GPUs, likely accompanied by approximately 3,125 CPUs. Each H100 is expected to deliver roughly 3 times the performance of an A100 at FP16 precision and 6 times when operating at reduced precision (FP8). With 2.5 times as many GPUs, each up to 6 times faster, the effective performance gain between the GPT-4 and GPT-5 training clusters is projected to reach a factor of 15X.
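As a back-of-the-envelope check, the 15X figure falls directly out of the numbers above: 2.5X more GPUs multiplied by the 6X per-GPU speedup at FP8. The short sketch below reproduces the arithmetic (the figures are the ones quoted in this article, not disclosed benchmarks):

```python
# Back-of-the-envelope estimate of the GPT-4 -> GPT-5 training-cluster gain,
# using only the figures quoted above; illustrative arithmetic, not a benchmark.

gpt4_gpus = 10_000   # Nvidia "Ampere" A100 GPUs reportedly used for GPT-4
gpt5_gpus = 25_000   # Nvidia "Hopper" H100 GPUs reportedly planned for GPT-5

cluster_scale = gpt5_gpus / gpt4_gpus   # 2.5X more GPUs

per_gpu_fp16 = 3.0   # expected H100-vs-A100 speedup at FP16 precision
per_gpu_fp8 = 6.0    # expected H100-vs-A100 speedup at FP8 (reduced precision)

# The effective gain depends on which precision the training run can tolerate.
print(f"Cluster scale-up:      {cluster_scale:.1f}X")
print(f"Effective gain @ FP16: {cluster_scale * per_gpu_fp16:.1f}X")  # 7.5X
print(f"Effective gain @ FP8:  {cluster_scale * per_gpu_fp8:.1f}X")   # 15.0X
```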

This level of computational power rivals the scale of the exascale supercomputers currently under construction in the United States, Europe, and China. While Nvidia employs high-speed NVLink ports and NVSwitch memory switch chips to tightly couple multiple Ampere or Hopper GPUs on HGX system boards, extending that GPU memory interconnect domain by two orders of magnitude remains impractical. And since the scale required to train these large language models will only continue to grow, innovative solutions are needed.
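To put that gap in perspective, here is a minimal sketch, assuming the commonly cited domain sizes of 8 GPUs per HGX board and 256 GPUs per NVLink Switch System domain (neither figure is stated in the interview):

```python
import math

# Rough scale comparison: tightly coupled NVLink domains vs. a reported
# 25,000-GPU training cluster. Domain sizes below are commonly cited Nvidia
# figures, assumed here purely for illustration.

cluster_gpus = 25_000  # reported GPT-5 training cluster size

domains = {
    "HGX board (NVSwitch)": 8,     # GPUs coupled on a single HGX board
    "NVLink Switch System": 256,   # max GPUs in one Hopper NVLink domain
}

for name, size in domains.items():
    ratio = cluster_gpus / size
    print(f"{name}: cluster is {ratio:,.0f}X larger "
          f"(~{math.log10(ratio):.1f} orders of magnitude)")
```

Even against the largest NVLink domain, the cluster is roughly two orders of magnitude bigger, which is why the interconnect between domains, rather than within them, becomes the bottleneck.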

Given the physical size and low-latency requirements of current and future GPU clusters, exploring optical interconnects becomes imperative. Could Marvell venture into developing a technology akin to Google’s “Apollo” optical switches, which are integral to its TPUv4 clusters? Or does Marvell have alternative approaches that deliver comparable results without such a drastic step? Furthermore, how does disaggregated and composable infrastructure align with the potential benefits of a transition to optical switching and interconnects? Lastly, where does the Compute Express Link (CXL) protocol fit into this ever-evolving landscape?

These pressing questions underscore the urgency for businesses to adapt their network architectures to accommodate the unprecedented growth of AI and the computational resources it demands. Marvell, with its extensive expertise and unwavering commitment to innovation, remains at the forefront of addressing these challenges, empowering organizations to embrace the future of AI-driven technologies.

Conclusion:

The rapid advancement of artificial intelligence (AI) and its impact on data center architectures, as discussed by Noam Mizrahi, CTO at Marvell, carry significant implications for the market. The training of AI models such as GPT-5 on massive clusters of GPUs and CPUs signals the escalating demand for computational resources.

This trend not only underscores the need for continued innovation in network architectures but also highlights the potential opportunities for companies specializing in optical interconnects and advanced switching technologies. As the market adapts to the evolving requirements of AI-driven technologies, organizations like Marvell are well-positioned to capitalize on these developments and drive the future of the industry.
