NVIDIA’s Remarkable Performance in MLPerf Benchmarks Sets New Standards for Generative AI

TL;DR:

  • NVIDIA’s Eos AI supercomputer achieves a nearly 3x boost, completing the MLPerf GPT-3 training benchmark (175 billion parameters) in just 3.9 minutes.
  • Reduced training times translate to cost savings, energy efficiency, and faster time-to-market for large language models.
  • NVIDIA NeMo framework enables businesses to harness large language models effectively.
  • 1,024 NVIDIA Hopper GPUs set a new benchmark, completing training for a text-to-image model in 2.5 minutes.
  • NVIDIA reaffirms its leadership in AI performance, particularly in generative AI, within the MLPerf benchmarks.
  • Unprecedented scaling with 10,752 H100 GPUs showcases NVIDIA’s ability to meet growing demands in AI training.
  • NVIDIA’s full-stack platform innovations, employed by Eos and Microsoft Azure, drive efficiency and performance.
  • Collaboration with systems manufacturers underlines the significance of MLPerf as a tool for evaluating AI platforms and vendors.
  • H100 GPUs outperform A100 GPUs in MLPerf HPC benchmarks, achieving up to twice the performance and significant gains in drug discovery simulations.
  • The broader industry support for MLPerf benchmarks ensures transparency and objectivity for users.

Main AI News:

NVIDIA’s AI platform continues to redefine the boundaries of AI training and high-performance computing, as demonstrated in the latest MLPerf industry benchmarks. Among numerous breakthroughs, one achievement in generative AI stands out prominently: NVIDIA Eos, an AI supercomputer boasting a staggering 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking, completed a training benchmark based on a GPT-3 model with 175 billion parameters, trained on one billion tokens, in a mere 3.9 minutes.

This remarkable feat represents a nearly 3x performance gain over the previous record of 10.9 minutes, set less than six months ago by NVIDIA. The benchmark leverages a subset of the comprehensive GPT-3 dataset behind the popular ChatGPT service. Extrapolating from this achievement, Eos could now train on the entire dataset in just eight days, 73x faster than a previous state-of-the-art system employing 512 A100 GPUs.

This significant acceleration in training time not only drives cost efficiencies but also contributes to energy savings and accelerates time-to-market. This monumental achievement paves the way for large language models to become more accessible to businesses, thanks to tools like NVIDIA NeMo, a framework for tailoring these models to specific requirements.

In a separate generative AI test, 1,024 NVIDIA Hopper architecture GPUs completed a training benchmark based on the Stable Diffusion text-to-image model in an astonishing 2.5 minutes, further raising the bar for this emerging workload.

By embracing these two groundbreaking tests, MLPerf reaffirms its position as the industry’s gold standard for measuring AI performance, recognizing the profound impact that generative AI has on today’s technology landscape.

The Impressive Scaling of Systems

These exceptional results owe much to the unprecedented use of the largest number of accelerators ever applied to an MLPerf benchmark. With 10,752 H100 GPUs, NVIDIA far exceeded the scaling seen in AI training just months earlier when they employed 3,584 Hopper GPUs. This 3x scaling in GPU numbers translated into a remarkable 2.8x increase in performance, achieving an impressive 93% efficiency rate, largely attributed to innovative software optimizations.
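The efficiency figure above can be sanity-checked with a quick back-of-the-envelope calculation. The GPU counts and speedup come from the results quoted in this article; the helper function itself is purely illustrative:

```python
def scaling_efficiency(gpus_before: int, gpus_after: int, speedup: float) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    ideal_speedup = gpus_after / gpus_before  # perfect linear scaling
    return speedup / ideal_speedup

# Eos grew from 3,584 to 10,752 Hopper GPUs (3x) and ran 2.8x faster.
eff = scaling_efficiency(3584, 10752, 2.8)
print(f"{eff:.0%}")  # → 93%
```

In other words, tripling the GPU count delivered 93% of the speedup that perfect linear scaling would predict.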

Efficient scaling is of paramount importance in generative AI, especially as large language models continue to grow exponentially year by year. These latest results underscore NVIDIA’s ability to meet the unprecedented challenges faced by the world’s largest data centers.

This monumental accomplishment can be attributed to a comprehensive platform of innovations spanning accelerators, systems, and software, which both Eos and Microsoft Azure leveraged in their latest submissions.

Eos and Azure, each equipped with 10,752 H100 GPUs, achieved performance levels within 2% of each other, underscoring the efficiency and versatility of NVIDIA AI in data center and public-cloud deployments.

NVIDIA’s reliance on Eos spans a wide array of critical tasks, from advancing initiatives like NVIDIA DLSS, AI-powered software for cutting-edge computer graphics, to supporting NVIDIA Research projects such as ChipNeMo, generative AI tools that aid in the design of next-generation GPUs.

Advances Across Various Workloads

NVIDIA achieved several new records in addition to its milestones in generative AI during this benchmark round. H100 GPUs delivered 1.6x faster performance than in the previous round when training recommender models, which are commonly used for online content recommendations. Performance also increased by 1.8x on RetinaNet, a computer vision model.

These improvements stemmed from a combination of software enhancements and the utilization of expanded hardware resources. NVIDIA once again distinguished itself as the sole company to complete all MLPerf tests. H100 GPUs consistently delivered the fastest performance and the most significant scaling across all nine benchmarks.

These performance enhancements translate into faster time-to-market, reduced costs, and energy savings for users engaged in training massive large language models or customizing them using frameworks like NeMo to cater to their unique business needs.

A Collaborative Effort

Eleven systems manufacturers embraced the NVIDIA AI platform in their submissions for this benchmark round, including ASUS, Dell Technologies, Fujitsu, GIGABYTE, Lenovo, QCT, and Supermicro. These partnerships underscore the significance of MLPerf as a valuable tool for customers evaluating AI platforms and vendors.

HPC Benchmarks and Achievements

In the realm of MLPerf HPC, a distinct benchmark dedicated to AI-assisted simulations on supercomputers, H100 GPUs delivered up to twice the performance of NVIDIA A100 Tensor Core GPUs, a substantial leap from the previous HPC round. These results marked up to 16x gains since the inaugural MLPerf HPC round in 2019.

This benchmark introduced a novel test involving the training of OpenFold, a model designed to predict the 3D structure of proteins from their amino acid sequences. OpenFold can accomplish in minutes critical work for healthcare that once took researchers weeks or months.

Understanding the structure of proteins is instrumental in expediting drug discovery, as most pharmaceuticals target proteins, essential components of cellular processes. In the MLPerf HPC test, H100 GPUs impressively trained OpenFold in just 7.5 minutes, a stark contrast to the 11 days required two years ago using 128 accelerators.
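For a rough sense of scale, simple arithmetic on the two training times quoted above puts the implied speedup at over three orders of magnitude:

```python
# Training times quoted above: 11 days on 128 accelerators two years ago,
# versus 7.5 minutes on H100 GPUs in this round.
old_minutes = 11 * 24 * 60   # 11 days expressed in minutes
new_minutes = 7.5
speedup = old_minutes / new_minutes
print(f"~{speedup:,.0f}x faster")  # → ~2,112x faster
```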

Soon, a version of the OpenFold model and the software used by NVIDIA to train it will become available in NVIDIA BioNeMo, a generative AI platform tailored for drug discovery.

Broad Industry Support

Since its inception in May 2018, MLPerf has garnered extensive support from both industry and academia. Leading organizations backing the benchmarks include Amazon, Arm, Baidu, Google, Harvard, HPE, Intel, Lenovo, Meta, Microsoft, NVIDIA, Stanford University, and the University of Toronto.

MLPerf tests are characterized by transparency and objectivity, ensuring that users can rely on the results to make informed decisions when selecting AI platforms and vendors. All the software utilized by NVIDIA is readily accessible from the MLPerf repository, ensuring that developers worldwide can achieve the same outstanding results. These software optimizations are continuously integrated into containers available on NGC, NVIDIA’s software hub for GPU applications.

Conclusion:

NVIDIA’s outstanding performance in MLPerf benchmarks, especially in generative AI, signifies a pivotal moment for the AI market. Their relentless innovation in accelerators, systems, and software, combined with impressive scaling and efficiency gains, sets a new standard for AI training. This achievement not only enhances cost-efficiency and accelerates time-to-market but also reinforces NVIDIA’s leadership position in the AI industry. As large language models and AI applications continue to grow, NVIDIA’s capabilities will undoubtedly shape the future of AI technologies and their accessibility to businesses.

Source