Intel’s Gaudi2 Chip: A Game-Changer in LLM Training, Challenging NVIDIA GPUs, According to MLPerf Benchmarks

TL;DR:

  • Intel’s Gaudi2 chip emerges as the sole alternative to NVIDIA GPUs for training LLMs, according to MLPerf benchmarks.
  • Gaudi2 achieves comparable price/performance to NVIDIA’s A100 (FP16) and aims to surpass H100 in FP8 workloads by September.
  • Impressive benchmarks demonstrate Gaudi2’s time-to-train, scalability, and performance in computer vision and natural language processing models.
  • Intel’s Xeon Platinum CPUs also exhibit exceptional performance in LLM training, showcasing their capabilities in diverse models and use cases.

Main AI News:

The latest MLPerf training benchmarks released by Intel and Habana point to a notable development: Intel’s Gaudi2 chip has emerged as the sole viable alternative to NVIDIA GPUs for training large language models (LLMs). Amid the current AI frenzy, NVIDIA’s stock price has skyrocketed on the back of the extensive use of its GPUs to train popular LLMs such as ChatGPT. Intel’s Gaudi2 chip, however, now poses formidable competition, as the newly released benchmarks show.

Intel claims that the Gaudi2 chip achieves a price/performance ratio similar to NVIDIA’s A100 (FP16), and it aims to surpass NVIDIA’s H100 in FP8 workloads by September. While that goal may appear ambitious, Intel has substantial benchmark data to back up its claims. Here is a high-level overview of those results.

Impressively, Gaudi2 delivers strong time-to-train on the GPT-3 model, completing the benchmark in just 311 minutes using 384 accelerators. Scalability is also noteworthy: near-linear 95% scaling was observed when growing from 256 to 384 accelerators on GPT-3. The chip posts strong training results across domains, including computer vision and natural language processing. For instance, it performs well on computer vision models such as ResNet-50 and Unet3D with 8 accelerators, and on natural language processing models such as BERT with 8 and 64 accelerators. Notably, compared with the November submission, Gaudi2 shows performance increases of 10% and 4% for the BERT and ResNet models, respectively, reflecting the growing maturity of Gaudi2’s software stack.
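The scaling claim above can be expressed as a simple calculation. The sketch below is illustrative only: the helper function and the 256-accelerator time are assumptions (the latter is back-derived from the ~95% efficiency figure); only the 384-accelerator time of 311 minutes is a published number.

```python
def scaling_efficiency(n1: int, t1: float, n2: int, t2: float) -> float:
    """Strong-scaling efficiency when going from n1 to n2 accelerators.

    1.0 means perfectly linear scaling (1.5x the accelerators
    cuts wall-clock time by 1.5x).
    """
    speedup = t1 / t2   # observed speedup from adding accelerators
    ideal = n2 / n1     # speedup if scaling were perfectly linear
    return speedup / ideal

# Published figure: 311 minutes on 384 accelerators.
t_384 = 311.0
# Hypothetical 256-accelerator time implied by ~95% scaling efficiency.
t_256 = t_384 * (384 / 256) * 0.95

eff = scaling_efficiency(256, t_256, 384, t_384)
print(f"scaling efficiency 256 -> 384: {eff:.0%}")  # ~95%
```

In other words, going from 256 to 384 accelerators (1.5x the hardware) yields roughly 1.43x the speedup, which is what "near-linear 95% scaling" means here.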

It is worth emphasizing that these results were achieved “out of the box,” meaning customers can expect comparable performance when deploying Gaudi2 either on-premises or in the cloud. By comparison, NVIDIA’s entry can train GPT-3 in a mere 45 minutes, albeit using a significantly larger number of GPUs. A fair comparison must account for factors such as Total Cost of Ownership (TCO) as well as precise cost and TDP/heat constraints. These considerations may matter less, however, given the overwhelming demand for LLM training capacity: with NVIDIA GPUs in limited supply, the market faces a shortage of silicon for training LLMs. It is in this context that Intel’s Gaudi2 chip has the potential to be the much-needed alternative.
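The TCO point above comes down to comparing accelerator-hours, not wall-clock time alone. The sketch below illustrates that arithmetic with a deliberately simple model; the hourly rates and the GPU-cluster size are hypothetical placeholders, not figures from the benchmarks or from either vendor.

```python
def training_cost(num_accels: int, hourly_rate: float, train_minutes: float) -> float:
    """Accelerator cost for one training run, under a simplified model:
    cost = accelerators x hourly rate x wall-clock hours.
    (Ignores networking, power, and host costs for clarity.)"""
    return num_accels * hourly_rate * (train_minutes / 60.0)

# Hypothetical hourly rates and cluster sizes -- placeholders for illustration.
runs = {
    "Gaudi2 (384 accelerators, 311 min)": training_cost(384, 4.0, 311),
    "GPU cluster (3072 accelerators, 45 min)": training_cost(3072, 6.0, 45),
}
for name, cost in runs.items():
    print(f"{name}: ${cost:,.0f}")
```

The faster run is not automatically the cheaper one: a shorter time-to-train bought with many more accelerators can still cost more in total accelerator-hours, which is why the article stresses TCO alongside raw speed.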

In addition to Gaudi2, Intel has also disclosed results for its Xeon Platinum CPUs, which currently power the highest-performing MLPerf submission for LLM training, achieving just over 10 hours for GPT-3. Here are the key highlights of those results:

  • In the closed division, 4th Gen Xeon CPUs can train BERT and ResNet-50 models in under 50 minutes (47.93 mins.) and under 90 minutes (88.17 mins.), respectively.
  • When scaling out to 16 nodes, Xeon demonstrated its capability to train the BERT model in approximately 30 minutes (31.06 mins.) in the open division.
  • For larger models like RetinaNet, Xeon achieved a time of 232 minutes (on 16 nodes), meaning customers can use off-peak Xeon cycles to run training jobs during morning, overnight, or other idle windows.

Intel’s 4th Gen Xeon CPUs, equipped with Intel Advanced Matrix Extensions (Intel AMX), deliver remarkable out-of-the-box performance improvements across multiple frameworks, end-to-end data science tools, and a wide ecosystem of smart solutions.

Conclusion:

The MLPerf benchmarks reveal a significant development in the LLM training market. Intel’s Gaudi2 chip has emerged as a formidable alternative to NVIDIA GPUs, providing comparable price/performance ratios and demonstrating remarkable capabilities in terms of time-to-train, scalability, and performance across various models. This breakthrough creates a competitive landscape that offers businesses and organizations a viable choice for training LLMs, driving innovation and productivity. Additionally, Intel’s Xeon Platinum CPUs further contribute to the market’s diversity, showcasing their prowess in delivering exceptional performance and supporting a wide range of applications. With Intel’s advancements, the market for LLM training is poised for increased competition and accelerated growth, empowering industries to harness the power of AI and drive meaningful transformations.
