NVIDIA Sets New Standards in Generative AI with MLPerf Training v4.0

  • NVIDIA achieves unprecedented performance and scalability in generative AI, as evidenced by its submission to MLPerf Training v4.0.
  • MLPerf Training v4.0 introduces new tests reflecting industry workloads, including LLM fine-tuning and graph neural network training.
  • NVIDIA’s hardware and software solutions, coupled with optimizations, enable significant speedups in AI training, notably in GPT-3 175B and Llama 2 70B models.
  • Advancements in visual generative AI and graph neural network training further solidify NVIDIA’s leadership position.
  • The NVIDIA Blackwell platform, announced at GTC 2024, promises revolutionary advancements in trillion-parameter AI training and inference.

Main AI News:

NVIDIA has once again demonstrated its unparalleled expertise in the realm of generative AI, achieving groundbreaking performance and scalability milestones, as showcased in its recent submission to MLPerf Training v4.0. This accomplishment underscores NVIDIA’s enduring leadership in AI training benchmarks, particularly in the domain of large language models (LLMs) and generative AI applications.

MLPerf Training v4.0 Highlights

MLPerf Training, an initiative spearheaded by the MLCommons consortium, remains the gold standard for assessing end-to-end AI training performance. With the latest iteration, v4.0, the benchmark has evolved to encompass the latest industry demands. Notably, two new tests have been introduced to mirror prevalent industry workloads. The first evaluates the fine-tuning efficiency of Llama 2 70B utilizing the low-rank adaptation (LoRA) technique, while the second centers on graph neural network (GNN) training, featuring an implementation of the relational graph attention network (RGAT).

The updated test suite spans diverse workloads, including LLM pre-training (GPT-3 175B), LLM fine-tuning (Llama 2 70B with LoRA), text-to-image generation (Stable Diffusion v2), and GNN training (RGAT), among others, offering a comprehensive evaluation across a spectrum of AI applications.

NVIDIA’s Unprecedented Performance

In the latest MLPerf Training round, NVIDIA has once again raised the bar for AI training performance, leveraging its full suite of hardware and software solutions. Noteworthy components include NVIDIA Hopper GPUs, the fourth-generation NVLink interconnect paired with the third-generation NVSwitch chip, NVIDIA Quantum-2 InfiniBand networking, and a meticulously optimized NVIDIA software stack. These components, refined since the previous round, enabled NVIDIA to shatter prior records. For instance, NVIDIA cut the training time for GPT-3 175B from 10.9 minutes on 3,584 H100 GPUs to just 3.4 minutes on 11,616 H100 GPUs, demonstrating nearly linear performance scaling.
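For readers who want to sanity-check the scaling claim, a quick back-of-the-envelope calculation using the figures reported above looks like this (a minimal Python sketch; only the cited numbers are used):

```python
# Back-of-the-envelope check of the "nearly linear scaling" claim,
# using the GPT-3 175B training times reported in the submission.
prev_gpus, prev_minutes = 3584, 10.9
new_gpus, new_minutes = 11616, 3.4

speedup = prev_minutes / new_minutes   # ~3.21x faster
gpu_ratio = new_gpus / prev_gpus       # ~3.24x more GPUs
efficiency = speedup / gpu_ratio       # ~0.99 -> near-linear scaling

print(f"speedup {speedup:.2f}x on {gpu_ratio:.2f}x the GPUs "
      f"(scaling efficiency {efficiency:.0%})")
```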

Advancements in LLM Fine-Tuning

NVIDIA has also made strides in LLM fine-tuning, particularly with the Llama 2 70B model developed by Meta. Leveraging the LoRA technique, a single DGX H100 with eight H100 GPUs completed fine-tuning in just over 28 minutes, a time further reduced to 24.7 minutes with NVIDIA H200 Tensor Core GPUs. Demonstrating remarkable scalability, NVIDIA achieved fine-tuning in just 1.5 minutes using 1,024 H100 GPUs. These gains were enabled by the context parallelism capabilities of the NVIDIA NeMo framework and an FP8 implementation of self-attention in cuDNN, the latter contributing a 15% performance improvement at the 8-GPU scale.
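To illustrate the idea behind LoRA, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. This is a generic illustration of the technique, not NVIDIA's NeMo implementation; the layer sizes and rank are illustrative assumptions:

```python
# Minimal LoRA sketch: the frozen weight W is augmented with a trainable
# low-rank update scaling * (B @ A), so only r * (d_in + d_out) parameters
# are trained instead of the full d_in * d_out matrix.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # y = x @ W^T + scaling * x @ A^T @ B^T
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrap one projection of a (hypothetical) pretrained layer.
layer = nn.Linear(4096, 4096)
lora_layer = LoRALinear(layer, r=16)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # ~131K vs ~16.8M in the base layer
```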

Breakthroughs in Visual Generative AI

MLPerf Training v4.0 also introduced a benchmark for text-to-image generative AI based on Stable Diffusion v2. NVIDIA’s submissions delivered up to 80% higher performance at equivalent scales through extensive software optimizations, including the adoption of full-iteration CUDA Graphs and an optimized distributed optimizer for Stable Diffusion.
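Full-iteration CUDA Graphs capture an entire training step (forward pass, backward pass, and optimizer update) once and then replay it as a single launch, cutting per-kernel CPU launch overhead. The sketch below shows the general pattern with PyTorch's torch.cuda.CUDAGraph API; it is an illustrative assumption following the standard whole-iteration capture recipe, not the optimized Stable Diffusion code used in the submission:

```python
# Sketch: capture one full training iteration into a CUDA graph and replay it.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Static buffers: graph replays always read from / write to these tensors.
static_input = torch.randn(32, 1024, device="cuda")
static_target = torch.randn(32, 1024, device="cuda")

# Warm up on a side stream so one-time setup work is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(static_input), static_target).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward + backward + optimizer step as a single graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_input), static_target)
    static_loss.backward()
    opt.step()

# Replay: copy each batch into the static buffers, then launch the whole graph.
batches = [(torch.rand_like(static_input), torch.rand_like(static_target)) for _ in range(10)]
for data, target in batches:
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
```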

Graph Neural Network Training Milestones

In the domain of graph neural network training, NVIDIA also set new records. The company submitted results using 8, 64, and 512 H100 GPUs, attaining a record time of just 1.1 minutes in the largest-scale configuration. A submission using eight H200 Tensor Core GPUs delivered a 47% performance boost over the H100 submission at the same scale.
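As a rough illustration of what an RGAT workload looks like, the following sketch trains a tiny relational graph attention network using PyTorch Geometric's RGATConv layer. The graph, feature dimensions, and training loop are illustrative assumptions and do not reflect the benchmark's dataset or NVIDIA's optimized implementation:

```python
# Tiny relational graph attention (RGAT) example on a random toy graph.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGATConv

class TinyRGAT(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, num_relations):
        super().__init__()
        # Each relation type gets its own attention-weighted message passing.
        self.conv1 = RGATConv(in_dim, hidden_dim, num_relations)
        self.conv2 = RGATConv(hidden_dim, num_classes, num_relations)

    def forward(self, x, edge_index, edge_type):
        x = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(x, edge_index, edge_type)

# Toy heterogeneous graph: 100 nodes, 400 edges, 3 relation types, 10 classes.
x = torch.randn(100, 32)
edge_index = torch.randint(0, 100, (2, 400))
edge_type = torch.randint(0, 3, (400,))
labels = torch.randint(0, 10, (100,))

model = TinyRGAT(32, 64, 10, num_relations=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):  # a few illustrative node-classification steps
    opt.zero_grad()
    loss = F.cross_entropy(model(x, edge_index, edge_type), labels)
    loss.backward()
    opt.step()
```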

NVIDIA’s Continued Leadership

NVIDIA remains at the forefront of AI training performance, showcasing unparalleled versatility and efficiency across a myriad of AI workloads. The ongoing optimization of its software stack ensures maximized performance per GPU, thereby lowering training costs and facilitating the training of increasingly complex models. Looking ahead, the recently unveiled NVIDIA Blackwell platform, announced at GTC 2024, promises to democratize trillion-parameter AI, delivering up to 30x faster real-time trillion-parameter inference and up to 4x faster trillion-parameter training compared to NVIDIA Hopper GPUs.

Conclusion:

NVIDIA’s exceptional performance in MLPerf Training v4.0 underscores its continued dominance in the generative AI landscape. The company’s ability to consistently push the boundaries of AI training, coupled with forthcoming innovations such as the Blackwell platform, cements its position as a market leader. This not only reaffirms NVIDIA’s relevance but also sets a high benchmark for competitors, driving further innovation and advancement in the AI market.
