TL;DR:
- Intel researchers propose a novel approach for deploying Large Language Models (LLMs) on CPUs efficiently.
- LLMs are renowned for their performance but require substantial memory and specialized hardware, posing deployment challenges.
- The solution leverages automatic INT4 weight-only quantization, keeping activations at high precision while reducing model weights to four-bit precision.
- A dedicated LLM runtime with optimized kernels accelerates CPU inference.
- Quantization is facilitated by the Intel Neural Compressor, offering flexibility in tuning quantization parameters.
- Experiments demonstrate that INT4 models maintain accuracy levels close to FP32 models.
- The LLM runtime outperforms ggml-based solutions, delivering up to 1.6x lower latency.
Main AI News:
The ascendancy of Large Language Models (LLMs) has been nothing short of a phenomenon, captivating the world with their exceptional prowess across a multitude of tasks. Renowned for their aptitude in text generation, language comprehension, and text summarization, among other domains, LLMs have emerged as transformative tools. However, their remarkable capabilities come at a cost: enormous parameter counts that demand extensive memory and specialized hardware for inference, making deployment a formidable challenge.
One potential avenue for mitigating the computational demands of inference lies in quantization, which reduces the precision of a neural network's weights and activations. Techniques such as INT8 quantization and weight-only quantization have shown promise in alleviating the inference burden. Nonetheless, these approaches have been designed primarily for CUDA-based GPUs and do not translate seamlessly to CPU architectures.
Enter the researchers from Intel, who have unveiled an innovative solution for the efficient deployment of LLMs on CPUs. Their approach hinges on automatic INT4 weight-only quantization, which keeps activations at high precision while applying low, four-bit precision exclusively to the model weights. Complementing this quantization methodology is a purpose-built LLM runtime with highly optimized kernels, engineered to accelerate inference on CPUs.
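To make the idea concrete, here is a minimal sketch of group-wise INT4 weight-only quantization using round-to-nearest in NumPy: weights are stored as 4-bit integers with one floating-point scale per group and are dequantized back to FP32 only when multiplied against FP32 activations. The group size of 32 and the symmetric round-to-nearest scheme are illustrative assumptions, not the exact recipe or the optimized kernels described by the Intel team.

```python
import numpy as np

def quantize_int4_weight_only(w: np.ndarray, group_size: int = 32):
    """Group-wise symmetric round-to-nearest INT4 quantization of a weight matrix.

    Only the weights are quantized; each group of `group_size` values along a
    row shares one FP32 scale. Activations are left untouched.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_groups = w.reshape(out_features, in_features // group_size, group_size)

    # Symmetric INT4 range is [-8, 7]; map each group's max-abs value onto 7.
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct an FP32 weight matrix from INT4 codes and per-group scales."""
    w_groups = q.astype(np.float32) * scales
    return w_groups.reshape(q.shape[0], -1)

# Toy linear layer: weights become INT4, the activation `x` stays FP32.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((1, 64)).astype(np.float32)

q, scales = quantize_int4_weight_only(w)
y = x @ dequantize(q, scales).T  # FP32 activations throughout
print("max abs error vs FP32 matmul:", np.abs(y - x @ w.T).max())
```

In practice, an optimized runtime would keep the packed INT4 weights in memory and fuse the dequantization into the matrix-multiply kernel rather than materializing an FP32 copy as this sketch does.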
At its core, the quantization workflow builds on the Intel Neural Compressor, which allows tuning across different quantization recipes, granularities, and group sizes to produce an INT4 model that meets the target accuracy threshold. The resulting model is then evaluated in the LLM runtime, an environment designed to deliver efficient LLM inference on CPU architectures.
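As a rough illustration of what such a workflow can look like, the sketch below uses Intel Neural Compressor's weight-only post-training quantization configuration. The configuration keys (bits, group_size, scheme, algorithm) follow the library's documented weight-only options, but the model name and the specific recipe values are assumptions chosen for illustration rather than the paper's exact settings.

```python
# Sketch of weight-only INT4 quantization with Intel Neural Compressor 2.x.
# The model name and recipe values below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all supported linear/matmul ops
            "weight": {
                "bits": 4,           # INT4 weights
                "group_size": 32,    # one scale per 32 weights
                "scheme": "sym",     # symmetric quantization
                "algorithm": "RTN",  # round-to-nearest; AWQ/GPTQ are alternatives
            },
        },
    },
)

q_model = quantization.fit(model, conf)
q_model.save("./opt-1.3b-int4")
```

Sweeping over group sizes and algorithms in this configuration is what allows the workflow to trade a little accuracy for memory and latency until the INT4 model meets the desired threshold.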
In a series of experiments, the researchers selected a range of popular LLMs with parameter counts from 7 billion to 20 billion. They compared the accuracy of FP32 (single-precision) models with their INT4 (four-bit integer) counterparts on open-source datasets, and the quantized models achieved accuracy nearly on par with their FP32 baselines. The researchers also conducted a latency analysis for next-token generation, finding that the LLM runtime delivered up to 1.6x lower latency than ggml-based solutions, a clear validation of their CPU deployment strategy.
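For readers who want to run a comparable measurement themselves, the sketch below shows one common way to time next-token generation with a Hugging Face model and a KV cache. The model name is a placeholder and this is not the authors' benchmarking harness; it simply illustrates what "next-token latency" refers to.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative next-token latency measurement: time each incremental
# decoding step with a KV cache, then average over the decode steps.
model_name = "facebook/opt-1.3b"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The benefits of CPU inference include", return_tensors="pt")
ids = inputs["input_ids"]
past = None
latencies = []

with torch.no_grad():
    for step in range(32):
        start = time.perf_counter()
        out = model(input_ids=ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        latencies.append(time.perf_counter() - start)

# Skip the first (prefill) step when reporting next-token latency.
decode = latencies[1:]
print(f"avg next-token latency: {1000 * sum(decode) / len(decode):.1f} ms")
```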
Conclusion:
Intel’s groundbreaking approach to deploying LLMs on CPUs opens up new possibilities for the market. By mitigating the hardware and memory requirements traditionally associated with LLMs, this innovation could democratize access to these powerful language models. Businesses can expect enhanced performance and efficiency, driving advancements across various industries and applications.