TL;DR:
- The rise of large diffusion models (LDMs) for image generation has driven rapid growth in model size and inference workloads.
- Optimizing on-device ML inference in mobile contexts requires a delicate balance due to resource limitations.
- Google researchers have introduced a series of optimizations that achieve the fastest reported inference latency for LDMs on GPU-equipped mobile devices.
- Deploying LDMs locally on user devices offers advantages such as reduced server costs, offline capabilities, and enhanced privacy.
- Prior acceleration work spans Winograd convolution for convolutional layers and techniques such as FlashAttention for the attention mechanism and its softmax operation.
- The optimizations presented in the paper target text-to-image generation with large diffusion models.
- Attention modules in LDMs have been optimized through techniques like partially fused softmax and FlashAttention.
- Custom implementations have been devised to address limitations in fusion rules, particularly for the Gaussian Error Linear Unit (GELU) and group normalization layer.
- The custom implementations improve memory bandwidth utilization without sacrificing arithmetic (ALU) efficiency.
- These advancements enable the execution of LDMs with record-breaking latency values on a wide range of devices, enhancing user experience and expanding the scope of generative AI.
Main AI News:
The realm of image generation has witnessed a surge in the prominence of large diffusion models (LDMs), which has in turn led to rapid growth in model size and inference workloads. Optimizing the performance of on-device machine learning (ML) inference in mobile contexts, however, necessitates a delicate balancing act due to resource constraints. These memory-intensive and computationally demanding LDMs pose substantial hurdles, especially when cost-effectiveness and user privacy are taken into account.
The advent of foundation models has revolutionized the landscape of artificial intelligence, with large diffusion models garnering significant attention owing to their remarkable versatility and ability to generate photorealistic images. By deploying these models locally on users’ devices, organizations can benefit from reduced server costs, offline capabilities, and enhanced user privacy. Nevertheless, the computational and memory limitations of devices make it challenging to accommodate typical large diffusion models, which often comprise over 1 billion parameters. Recognizing this obstacle, Google researchers have introduced a series of modifications to the implementation of large diffusion models, resulting in the fastest inference latency on mobile devices with GPUs to date. These updates not only enhance the overall user experience across a multitude of devices but also expand the scope of generative AI applications.
The growing interest in accelerating on-device model inference stems from its numerous advantages over server-based approaches, including reduced latency, enhanced privacy, and better scalability. The computational cost of the softmax operation, ubiquitous in deep learning, has prompted extensive optimization work and a variety of acceleration strategies. Notably, Winograd convolution improves the efficiency of convolutional computation by reducing the number of required multiplications, which is particularly beneficial on graphics processing units (GPUs).
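To make the multiplication savings concrete, here is a minimal sketch of the standard Winograd minimal filtering algorithm F(2,3) in one dimension. It is illustrative only and not taken from the paper: two outputs of a 3-tap convolution are computed with 4 multiplications instead of 6, and the filter-side factors can be precomputed once per filter.

```python
import numpy as np

def conv1d_direct(d, g):
    """Direct 3-tap convolution producing 2 outputs: 6 multiplications."""
    return np.array([
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ])

def conv1d_winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): the same 2 outputs with 4 multiplications.

    The factors that involve only g can be precomputed once per filter and
    reused across the whole feature map, so the per-tile cost is 4 multiplies.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)   # 4-element input tile
g = np.random.rand(3)   # 3-tap filter
assert np.allclose(conv1d_direct(d, g), conv1d_winograd_f23(d, g))
```

The same idea generalizes to 2D tiles (e.g., F(2x2, 3x3)), where the multiplication count per output drops further.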
The resounding success and widespread adoption of the Transformer architecture have spurred research into accelerating the attention mechanism. One approach is Reformer, which uses sparse approximation to reduce computational cost; others rely on low-rank approximation or combinations of approximation techniques. FlashAttention, by contrast, is an exact, I/O-aware attention algorithm designed around the GPU memory hierarchy, and it has demonstrated strong performance in practice.
This research focuses on the challenge of generating images from text descriptions using large diffusion models. Although the proposed improvements target the Stable Diffusion architecture, the optimizations transfer readily to other large diffusion models. To enable text-to-image inference, the reverse diffusion process is additionally conditioned on the desired textual description, steering generation toward the prompt.
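As a rough illustration of that conditioning, the sketch below shows a text embedding being fed to the denoiser at every reverse-diffusion step. It is schematic only: the `text_encoder` and `denoiser` stubs and the scheduler update are placeholders, and classifier-free guidance is assumed here as in typical Stable Diffusion pipelines rather than taken from the paper.

```python
import numpy as np

# Hypothetical stand-ins for the real networks; in Stable Diffusion these would
# be a CLIP text encoder and a latent-space denoising UNet.
def text_encoder(prompt: str) -> np.ndarray:
    return np.random.rand(77, 768)          # dummy text embedding

def denoiser(latent, t, text_embedding) -> np.ndarray:
    return np.random.rand(*latent.shape)    # dummy noise prediction

def reverse_diffusion(prompt: str, steps: int = 50, guidance_scale: float = 7.5):
    """Schematic text-conditioned reverse diffusion with classifier-free guidance."""
    cond = text_encoder(prompt)
    uncond = text_encoder("")                # empty prompt for the unconditional branch
    latent = np.random.randn(4, 64, 64)      # start from pure noise in latent space

    for t in reversed(range(steps)):
        eps_cond = denoiser(latent, t, cond)
        eps_uncond = denoiser(latent, t, uncond)
        # Steer the noise estimate toward the text condition.
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        latent = latent - 0.01 * eps          # placeholder for the scheduler update
    return latent
```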
The denoiser model in the LDM makes heavy use of attention blocks, which present a prime target for optimization. By assigning greater weight to the relevant parts of its input, the attention mechanism lets the model focus on the information that matters for each output. Several optimization techniques are available for attention modules, and researchers typically employ one of the two detailed below, depending on which yields the best results.
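Both optimizations target the standard scaled dot-product attention computation. For reference, a minimal NumPy sketch of that baseline (illustrative only, not the team's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row-wise maximum before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Reference scaled dot-product attention: softmax(QK^T / sqrt(d)) V.

    The intermediate N x N score matrix is materialized in full, which is
    exactly the memory traffic the optimizations below try to avoid.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (N, N) attention logits
    return softmax(scores) @ V      # (N, d) output

Q, K, V = (np.random.rand(128, 64) for _ in range(3))
out = attention(Q, K, V)            # (128, 64)
```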
The first optimization, referred to as partially fused softmax, reduces memory reads and writes during the attention module's softmax computation by merging it with the preceding matrix multiplication. The second leverages FlashAttention, an I/O-aware, exact attention algorithm that minimizes the number of high-bandwidth memory accesses on the GPU, making it an excellent choice when memory bandwidth is the bottleneck. However, the method is highly register-intensive and only applies to attention matrices of certain sizes, so it is employed only on a subset of GPUs and for attention matrices of particular dimensions.
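The common idea behind both techniques is to avoid writing the full score matrix out to memory. The sketch below is a vastly simplified, single-head, unmasked illustration of an online (streaming) softmax over key/value tiles, the accumulation pattern that FlashAttention-style kernels exploit; it is not the actual GPU implementation described by the team.

```python
import numpy as np

def attention_online_softmax(Q, K, V, tile=32):
    """Streaming attention: process K/V in tiles without materializing N x N scores.

    Each query row keeps a running maximum, a running normalizer, and a running
    weighted sum of V, rescaling previous partial results whenever a larger
    maximum is encountered.
    """
    N, d = Q.shape
    out = np.zeros((N, d))
    running_max = np.full(N, -np.inf)
    running_sum = np.zeros(N)

    for start in range(0, N, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        scores = Q @ Kt.T / np.sqrt(d)              # (N, tile) partial logits
        tile_max = scores.max(axis=1)
        new_max = np.maximum(running_max, tile_max)
        # Rescale previously accumulated results to the new maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max[:, None])       # (N, tile) unnormalized weights
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vt
        running_max = new_max

    return out / running_sum[:, None]
```

The result matches the reference attention sketch above while touching each tile of K and V only once, which is why the access pattern suits GPUs with limited memory bandwidth.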
Furthermore, the research team found that commonly used layers and units in LDMs require significantly larger fusion windows on mobile GPUs than those offered by commercially available GPU-accelerated ML inference engines. Recognizing the limitations of standard fusion rules, the team devised custom implementations capable of handling a broader range of neural operators. Their attention was directed specifically toward two operations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.
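For context, these are the computations that such custom kernels implement. The NumPy reference versions below (using the common tanh approximation of GELU) are only a mathematical sketch, not the team's GPU shaders; in an optimized engine each would run as a single fused kernel rather than a chain of elementwise and reduction ops.

```python
import numpy as np

def gelu(x):
    """GELU (tanh approximation): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def group_norm(x, num_groups=32, eps=1e-5, gamma=None, beta=None):
    """Group normalization over an NCHW tensor: each group of channels is
    normalized with its own mean and variance, then a per-channel affine
    transform is applied."""
    n, c, h, w = x.shape
    gamma = np.ones(c) if gamma is None else gamma
    beta = np.zeros(c) if beta is None else beta
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    normed = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return normed * gamma[None, :, None, None] + beta[None, :, None, None]

x = np.random.randn(1, 64, 32, 32)
y = group_norm(gelu(x))
```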
Conclusion:
The optimizations presented in the paper pave the way for significant advancements in the market for large diffusion models. Record-low latency figures and the ability to execute LDMs on a diverse range of devices reshape the landscape of on-device ML inference. This opens new opportunities for businesses to leverage the benefits of LDMs, including reduced server costs, enhanced privacy, and improved user experience. The market is likely to see increased adoption of LDMs, particularly in applications involving image generation and generative AI. Organizations can leverage these optimizations to deliver cutting-edge solutions that capitalize on the advantages of deploying LDMs locally on user devices.