Neural Magic Unveils Fully Quantized FP8 Version of Meta’s Llama 3.1 405B Model

  • Neural Magic has launched a fully quantized FP8 version of Meta’s Llama 3.1 405B model.
  • The new model operates efficiently on 8xH100 or 8xA100 systems, resolving memory limitations and out-of-memory issues.
  • Two versions of the model are available: Meta-Llama-3.1-405B-Instruct-FP8-dynamic and Meta-Llama-3.1-405B-Instruct-FP8.
  • The FP8-dynamic version maintains the original architecture but is limited to English and to uses permitted by applicable laws and the license.
  • Quantization reduces weight and activation precision from 16 bits to 8, cutting disk space and GPU memory usage roughly in half.
  • It can be deployed using the vLLM backend with Python’s vllm and transformers libraries.
  • The model scored 86.55 on the OpenLLM benchmark, closely matching the unquantized model’s score and demonstrating a recovery rate of 99.91%.
  • Detailed reproduction commands are provided, highlighting the model’s high accuracy across benchmarks.

Main AI News:

Neural Magic has introduced a fully quantized FP8 version of Meta’s Llama 3.1 405B model, marking a significant advance in AI model compression. The release allows the 405-billion-parameter model to run efficiently on standard 8xH100 or 8xA100 systems, overcoming the memory limitations and out-of-memory errors encountered with earlier FP8 and FP16 releases. By easing memory constraints and improving inference speed, the new model eliminates the need for CPU offloading or multi-node distribution.
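A rough back-of-the-envelope calculation illustrates why FP8 makes single-node deployment feasible. The figures below assume 80 GB accelerators and count only the weights, ignoring KV cache and activation memory:

```python
# Approximate weight-memory footprint of a 405B-parameter model.
# Assumes 2 bytes/parameter at 16-bit precision and 1 byte/parameter at FP8;
# ignores KV cache, activations, and framework overhead.
params = 405e9

bf16_gb = params * 2 / 1e9   # ~810 GB of weights at 16 bits
fp8_gb = params * 1 / 1e9    # ~405 GB of weights at 8 bits
node_gb = 8 * 80             # 8xH100 or 8xA100 with 80 GB each = 640 GB

print(f"16-bit weights: ~{bf16_gb:.0f} GB -> exceeds a {node_gb} GB node")
print(f"FP8 weights:    ~{fp8_gb:.0f} GB -> fits on a single {node_gb} GB node")
```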

The release includes two distinct model versions:

  • Meta-Llama-3.1-405B-Instruct-FP8-dynamic
  • Meta-Llama-3.1-405B-Instruct-FP8

The FP8-dynamic variant, Meta-Llama-3.1-405B-Instruct-FP8-dynamic, retains the original architecture designed for multilingual assistant-like applications, but its intended use is currently limited to English and to applications that comply with applicable laws. The model, released as version 1.0 under the llama3.1 license, showcases Neural Magic’s commitment to optimizing large-scale AI models.

Quantization and Optimization

This model achieves high efficiency by compressing weights and activations to FP8, reducing their precision from 16 bits to 8. The reduction cuts both disk space and GPU memory usage roughly in half, allowing the model to be loaded and run on a single 8xH100 node instead of requiring multiple nodes. The quantization, performed with LLM Compressor using 512 sequences from UltraChat, applies symmetric per-channel scaling to the weights, while activations are quantized dynamically per token.
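For readers who want a sense of what such a recipe looks like in practice, the sketch below uses LLM Compressor’s FP8_DYNAMIC scheme. It is an illustrative outline, not Neural Magic’s published recipe; the exact settings, and the calibration data used for the static FP8 variant, may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3.1-405B-Instruct"
OUTPUT_DIR = "Meta-Llama-3.1-405B-Instruct-FP8-dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: symmetric per-channel FP8 weights, per-token dynamic FP8 activations.
# The lm_head is left in higher precision, a common choice for preserving quality.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation quantization needs no calibration set; the static FP8 variant
# would instead pass a calibration dataset (e.g. sequences from UltraChat) here.
oneshot(model=model, recipe=recipe)

model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
```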

Deployment and Evaluation

Deploying Neural Magic’s FP8 model is straightforward with the vLLM backend, using Python’s vllm and transformers libraries; the integration makes text generation with the optimized model a few lines of code. The model has been evaluated across benchmarks including MMLU, ARC-Challenge, GSM8K, HellaSwag, Winogrande, and TruthfulQA. Using Neural Magic’s fork of lm-evaluation-harness and the vLLM engine, the FP8-dynamic model scored an average of 86.55 on the OpenLLM benchmark, closely matching the unquantized model’s score of 86.63 for a recovery rate of 99.91%.
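A minimal deployment sketch follows. It assumes the published repository name neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic and an 8-GPU node; the prompt and sampling settings are illustrative rather than taken from the release:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Shard the FP8 weights across the 8 GPUs of a single H100/A100 node.
llm = LLM(model=MODEL_ID, tensor_parallel_size=8)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

messages = [{"role": "user", "content": "Summarize the benefits of FP8 quantization."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```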

Reproduction and Accuracy

Detailed commands for replicating the evaluation results are provided, demonstrating the model’s high accuracy across different benchmarks and few-shot scenarios. Notably, the model achieved a 99.91% recovery rate on MMLU (5-shot) and 100.2% on Winogrande (5-shot), affirming its precision and reliability.
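The exact commands are given in the model card. Purely as an illustration, an evaluation of this kind can be driven from the harness’s Python API roughly as follows; the model arguments and task list here are assumptions, and the published commands should be preferred when reproducing the reported numbers:

```python
import lm_eval

# Illustrative sketch of an OpenLLM-style evaluation through the vLLM backend.
# Few-shot counts fall back to each task's defaults; the published commands pin them explicitly.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic,"
        "tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.9,"
        "add_bos_token=True"
    ),
    tasks=["mmlu", "arc_challenge", "gsm8k", "hellaswag", "winogrande", "truthfulqa_mc2"],
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics)
```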

Conclusion:

Neural Magic’s release of the fully quantized FP8 Meta-Llama 3.1 405B model represents a significant advancement in AI model compression and efficiency. By addressing memory constraints and optimizing performance, this innovation enhances the feasibility of deploying large-scale AI models on standard hardware configurations. The near-identical performance of the quantized model compared to its unquantized counterpart underscores its effectiveness, offering a robust solution for high-performance AI applications. This development is likely to drive increased adoption of advanced AI models by reducing infrastructure costs and complexity, ultimately contributing to broader market accessibility and innovation.

Source