Researchers from Yandex and NeuralMagic unveil advanced compression techniques for large AI models on consumer devices

  • Yandex LLC and NeuralMagic Inc. have developed new compression methods for large language models (LLMs).
  • Two techniques, Additive Quantization for Language Models (AQLM) and PV-Tuning, can shrink LLMs to as little as one-eighth of their original size while retaining roughly 95% of response quality.
  • AQLM compresses model parameters to two or three bits, and PV-Tuning enhances fine-tuning and error correction.
  • The combined techniques enable “ultra-compact” LLMs with almost equivalent capabilities to full-sized models.
  • Compressed models run up to four times faster and cut hardware costs by a factor of two to six.
  • The advancements support deployment on consumer devices, enabling applications like text generation, voice assistance, and real-time translation without internet access.
  • The research is featured at the 41st International Conference on Machine Learning in Vienna.
  • The techniques are available on GitHub, with pre-compressed models on HuggingFace.

Main AI News:

Artificial intelligence researchers from Yandex LLC and NeuralMagic Inc. have announced a breakthrough in compressing large language models (LLMs) like Meta Platforms Inc.’s Llama 2 for deployment on everyday devices, such as smartphones and smart speakers. In collaboration with the Institute of Science and Technology Austria and King Abdullah University of Science and Technology, the team has developed two novel compression techniques—Additive Quantization for Language Models (AQLM) and PV-Tuning.

These techniques enable LLMs to be reduced in size by up to eightfold while maintaining an average response quality of 95%. AQLM uses “additive quantization” to compress groups of model parameters down to just two or three bits per weight while preserving accuracy. PV-Tuning, by contrast, is a representation-agnostic fine-tuning framework that improves on existing fine-tuning strategies and corrects the errors that quantization introduces.
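To make the core idea concrete, here is a minimal, self-contained sketch of additive quantization using greedy residual coding, a simplified special case. AQLM itself learns the codebooks and code assignments jointly per weight matrix (and PV-Tuning then fine-tunes the result), so the group size, codebook count, and random codebooks below are purely illustrative assumptions, not the authors' actual configuration.

```python
# Sketch: additive quantization via greedy residual coding (simplified).
# AQLM learns codebooks and codes jointly; everything here is illustrative.
import numpy as np

GROUP = 8   # weights quantized together as one vector
M = 2       # number of additive codebooks
K = 256     # entries per codebook -> 8 bits per code
# Storage cost: M * log2(K) = 16 bits per 8 weights = 2 bits per weight.

rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, GROUP)).astype(np.float32)

# Toy random codebooks; in AQLM these are learned, which is what
# makes the 2-3 bit representation accurate in practice.
codebooks = rng.standard_normal((M, K, GROUP)).astype(np.float32) * 0.5

def encode(w):
    """Greedily pick one entry per codebook so their sum approximates w."""
    residual, codes = w.copy(), []
    for m in range(M):
        # Squared distance from each residual vector to every codebook entry.
        dists = ((residual[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
        idx = dists.argmin(1)                # best entry per weight group
        codes.append(idx.astype(np.uint8))   # 8 bits per code
        residual -= codebooks[m][idx]
    return np.stack(codes, 1)                # shape (n_groups, M)

def decode(codes):
    """Reconstruct weights as the sum of the selected codebook entries."""
    return sum(codebooks[m][codes[:, m]] for m in range(M))

codes = encode(weights)
approx = decode(codes)
err = np.linalg.norm(weights - approx) / np.linalg.norm(weights)
print(f"stored at 2 bits/weight, relative error {err:.3f}")
```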

Designed to work in conjunction, these methods allow for the creation of “ultra-compact” LLMs that offer nearly the same capabilities as their full-sized versions. The researchers emphasize that these techniques address the challenge of balancing model size and computational efficiency, a problem that has previously limited LLM deployment on consumer hardware.

The new methods, which are open source and detailed in an academic paper on arxiv.org, show promising results. Compressed versions of popular LLMs, including Llama 2, Mistral, and Mixtral, retained about 95% of answer quality on benchmarks such as WikiText2 and C4 despite an eightfold reduction in size. These compressed models also run up to four times faster thanks to their reduced computational demands.
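For context on where such quality numbers come from, below is a minimal sketch of the standard sliding-window perplexity evaluation on WikiText2, using the Hugging Face transformers and datasets libraries. The model id is a placeholder, and the non-overlapping 2048-token chunking is a common convention rather than the paper's exact protocol.

```python
# Sketch: WikiText2 perplexity evaluation (a common convention, not
# necessarily the paper's exact setup). The model id is a placeholder.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-compressed-model"  # placeholder, not a real repo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Concatenate the WikiText2 test split into one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1",
                                split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # With labels=input_ids, transformers shifts targets internally
        # and returns the mean negative log-likelihood of the chunk.
        nlls.append(model(chunk, labels=chunk).loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```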

This advancement offers significant cost savings for companies developing proprietary and open-source LLMs. For instance, compressing the 13 billion-parameter Llama 2 model to run on a single GPU, rather than four, could reduce hardware costs by two to six times. More importantly, it enables the deployment of advanced LLMs on personal computers and smartphones, unlocking new applications such as text and image generation, voice assistance, and real-time translation without internet connectivity.

The research paper is being presented at the 41st International Conference on Machine Learning (ICML) in Vienna, Austria, held July 21-27. AQLM and PV-Tuning are available on GitHub, with pre-compressed versions of popular models accessible on HuggingFace.
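As a starting point, one of the pre-compressed models can be loaded through the standard transformers API once the aqlm package is installed (the project documents `pip install aqlm[gpu]`; a CUDA GPU is assumed). The repository id below is an assumption based on the project's published naming; check the AQLM GitHub page for the current model list.

```python
# Sketch: loading a pre-compressed AQLM model from HuggingFace.
# Assumes `pip install aqlm[gpu]` and a CUDA GPU are available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; consult the AQLM GitHub page for current names.
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # codebooks stay fp16; codes are ~2-bit
    device_map="auto",
)

prompt = "Compressing language models matters because"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(out[0], skip_special_tokens=True))
```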

Conclusion:

The development of AQLM and PV-Tuning represents a significant step forward in deploying large language models on consumer devices. By drastically reducing model size while preserving performance and improving speed, these techniques cut hardware costs substantially and open new opportunities for integrating advanced AI capabilities into everyday devices. This progress makes powerful AI more accessible, drives innovation in consumer applications, and could reshape the market by making sophisticated AI practical for a wide range of uses.

Source