TL;DR:
- Apple researchers have devised a groundbreaking method for the efficient operation of large language models (LLMs) on low-memory devices.
- Their strategy involves storing LLM parameters in flash memory and dynamically transferring them to DRAM during inference.
- Key optimizations include minimizing data transfer from flash memory and enhancing data access through “windowing” and “row-column bundling” techniques.
- The approach leverages sparsity in FeedForward Network (FFN) layers, guided by a hardware-inspired cost model.
- Results show LLMs up to twice the size of available DRAM can be executed, with inference running 4-5x faster on CPU and 20-25x faster on GPU than conventional loading, significantly reducing I/O latency.
Main AI News:
Apple’s research team has unveiled a groundbreaking approach to running large language models (LLMs) efficiently on devices with limited DRAM capacity, addressing the intensive computational and memory demands of LLM inference. The method stores LLM parameters in flash memory and loads them into DRAM on demand during inference. Its optimization focuses on two pivotal areas:
- Minimizing Data Transfer Volume from Flash: Guided by an inference cost model aligned with flash memory behavior, Apple’s research team developed two principal techniques, “windowing” and “row-column bundling.” The “windowing” technique reduces data transfers by reusing previously activated neurons: a sliding window of recent input tokens is kept in memory, and only the neuron data that differs from what is already resident is loaded, significantly cutting the volume of data read from flash for each inference query (a minimal sketch follows this list).
- Reading Data in Larger, More Contiguous Chunks: The “row-column bundling” technique capitalizes on flash memory’s strength at sequential data access. By storing concatenated rows and columns of the up-projection and down-projection layers together in flash, it increases the size of the chunks read and thereby improves flash throughput (illustrated in a second sketch further below).
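The snippet below is a minimal sketch of how the windowing idea might look in code. The class name, the `load_from_flash` callback, and the eviction policy are illustrative assumptions for this article, not the implementation described in Apple’s paper.

```python
from collections import deque

class NeuronWindowCache:
    """Keeps the neurons activated by the last few tokens resident in DRAM."""

    def __init__(self, window_size, load_from_flash):
        self.window_size = window_size          # how many recent tokens to track
        self.load_from_flash = load_from_flash  # callback: neuron_id -> weights
        self.token_neurons = deque()            # active-neuron set per token
        self.resident = {}                      # neuron_id -> weights held in DRAM

    def step(self, active_neurons):
        """Advance the window by one token; return weights for its active neurons."""
        # Fetch from flash only the neurons not already resident in DRAM.
        for nid in active_neurons:
            if nid not in self.resident:
                self.resident[nid] = self.load_from_flash(nid)

        self.token_neurons.append(set(active_neurons))
        # Once the window slides past a token, drop neurons no other token needs.
        if len(self.token_neurons) > self.window_size:
            expired = self.token_neurons.popleft()
            still_needed = set().union(*self.token_neurons) if self.token_neurons else set()
            for nid in expired - still_needed:
                del self.resident[nid]

        return {nid: self.resident[nid] for nid in active_neurons}
```

In use, something like `NeuronWindowCache(window_size=5, load_from_flash=read_neuron)` would only touch flash for neurons that the previous few tokens had not already pulled into DRAM.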
Apple’s research team has further leveraged the sparsity observed in the FeedForward Network (FFN) layers of LLMs: only the parameters associated with neurons expected to produce non-zero activations are loaded from flash memory. A hardware-inspired cost model, spanning flash memory, DRAM, and compute cores, guides the optimization process.
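The second sketch below combines row-column bundling with this sparsity-aware loading. It assumes the bundled array stands in for a flash-resident file (in practice it might be memory-mapped) and that `active_ids` comes from some predictor of which FFN neurons will fire; the function names are illustrative, not the paper’s API.

```python
import numpy as np

def bundle_ffn_weights(w_up, w_down):
    """Store each neuron's up-projection row and down-projection column side by side.

    w_up:   (d_ffn, d_model) up-projection matrix
    w_down: (d_model, d_ffn) down-projection matrix
    Returns a (d_ffn, 2 * d_model) array: one contiguous bundle per neuron,
    so a single sequential read from flash fetches both pieces.
    """
    return np.ascontiguousarray(np.concatenate([w_up, w_down.T], axis=1))

def load_active_bundles(bundled, active_ids, d_model):
    """Read only the bundles for neurons predicted to have non-zero activations."""
    chunk = bundled[active_ids]            # one contiguous bundle per active neuron
    w_up_rows = chunk[:, :d_model]         # rows of the up-projection
    w_down_cols = chunk[:, d_model:].T     # columns of the down-projection
    return w_up_rows, w_down_cols
```

Because each neuron’s row and column live next to each other, selecting the active neurons turns into a few larger, contiguous reads rather than many small scattered ones.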
The outcomes of this study demonstrate the effectiveness of Apple’s proposed method: LLMs up to twice the size of available DRAM can be executed, with inference running 4-5x faster on CPU and 20-25x faster on GPU compared with conventional loading approaches. For the OPT 6.7B model, this translates into significantly reduced I/O latency, roughly 9-10 times lower than the baseline.
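To put the “twice the size of available DRAM” figure in context (an illustrative calculation, not from the study): a 6.7-billion-parameter model stored in 16-bit precision occupies roughly 13.4 GB of weights, well beyond the 8 GB of DRAM found on many consumer devices.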
This innovative approach integrates sparsity awareness, context-adaptive loading, and a hardware-oriented design to enable efficient LLM inference on devices with limited memory. Where the traditional practice of loading the entire model into DRAM is impractical due to memory constraints, Apple’s methodology offers a transformative alternative for loading and running large models on resource-constrained devices.
Conclusion:
Apple’s innovative strategy holds immense promise for the market, as it addresses the critical challenge of running large language models on memory-constrained devices. With the potential to significantly boost inference speed and reduce latency, this breakthrough could pave the way for broader adoption of LLMs in resource-restricted environments, unlocking new possibilities for applications across various industries.