TL;DR:
- Peking University introduced FastServe, a distributed inference serving system for LLMs.
- FastServe leverages iteration-level scheduling and the autoregressive pattern of LLM inference.
- It enables preemption at the level of each output token, reducing job completion times (JCT) and head-of-line blocking.
- The system utilizes a skip-join Multi-Level Feedback Queue (MLFQ) scheduler for efficient task prioritization.
- FastServe incorporates a proactive GPU memory management mechanism that swaps key-value cache state between GPU and host memory to optimize memory usage.
- Parallelization techniques and a distributed key-value cache enhance scalability and performance.
- FastServe outperforms existing solutions, improving average and tail JCT by up to 5.1× and 6.4×, respectively, compared to Orca.
Main AI News:
Revolutionary advancements in large language models (LLMs) have opened up new possibilities across various industries, sparking a wave of interactive AI applications. Among these, ChatGPT has garnered substantial attention for its ability to facilitate informal communication between individuals and AI agents, enabling them to tackle a wide range of challenges, from software engineering conundrums to language translation dilemmas. With its remarkable capabilities, ChatGPT has swiftly become one of the fastest-growing programs in history. Recognizing the tremendous potential, several companies have joined the fray, introducing their own LLMs and ChatGPT-like products. Key players in this domain include Microsoft’s New Bing, Google’s Bard, Meta’s LLaMa, Stanford’s Alpaca, Databricks’ Dolly, and UC Berkeley’s Vicuna.
While LLM inference shares similarities with other deep neural network (DNN) model inferences, such as ResNet, it possesses distinct characteristics. Interactive AI applications built on LLMs rely heavily on inference capabilities to provide meaningful user experiences. Consequently, these applications necessitate swift job completion times (JCT) to ensure engaging interactions. For instance, when users input data into ChatGPT, they anticipate an immediate response. However, the infrastructure supporting inference serving faces significant strain due to the sheer size and complexity of LLMs. To handle LLM inference operations, businesses invest in costly clusters equipped with accelerators like GPUs and TPUs.
Unlike deterministic and highly predictable DNN inference jobs, whose execution times are largely determined by the model and the hardware employed, LLM inference follows a unique autoregressive pattern. Generation proceeds over multiple iterations, each producing one output token that is appended to the context for the subsequent iteration. Because the output length is unknown in advance, so are the total execution time and the final sequence length. Existing inference serving systems, such as Clockwork and Shepherd, cater to deterministic model inference tasks like ResNet by leveraging precise execution time profiling, and these methods prove ineffective for LLM inference with its variable execution times. Orca, the most advanced solution for LLM inference, adopts iteration-level scheduling, allowing jobs to be added to or removed from the processing batch after each iteration. Nevertheless, it relies on first-come, first-served (FCFS) scheduling, in which a scheduled job runs continuously until completion. Because GPU memory capacity is limited and inference jobs demand low JCT, the processing batch cannot be expanded arbitrarily, so run-to-completion processing suffers from the well-known problem of head-of-line blocking.
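To make the contrast concrete, the sketch below shows what iteration-level, autoregressive decoding looks like in Python. It is only an illustration, not FastServe's or Orca's code: `Job`, `decode_step`, and the token-count scheduling policy are hypothetical stand-ins, and the model call is replaced by a toy generator so the loop runs on its own.

```python
# Minimal sketch of iteration-level, autoregressive decoding (illustrative only).
import random
from dataclasses import dataclass, field

EOS = 0  # hypothetical end-of-sequence token id

@dataclass
class Job:
    prompt: list[int]                                # input length is known at arrival
    output: list[int] = field(default_factory=list)  # grows by one token per iteration
    done: bool = False

def decode_step(job: Job) -> int:
    """Stand-in for one forward pass; a real model attends over prompt + generated output."""
    return random.choice([EOS, 1, 2, 3])

def run(jobs: list[Job]) -> None:
    # Iteration-level scheduling: the running job can change after every token,
    # unlike run-to-completion FCFS, where a scheduled job holds its slot until it finishes.
    while any(not j.done for j in jobs):
        job = min((j for j in jobs if not j.done), key=lambda j: len(j.output))  # toy policy
        tok = decode_step(job)
        job.output.append(tok)   # this token becomes part of the next iteration's input
        if tok == EOS:
            job.done = True      # the total output length was unknown up front

if __name__ == "__main__":
    run([Job(prompt=[5, 6, 7]), Job(prompt=[8, 9])])
```

The point is that the scheduler gets a decision point after every single token, which is what makes preemption at the output-token level possible in the first place.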
The challenges surrounding LLM inference operations become particularly pronounced due to the immense size of LLMs and their lengthy execution times. Large LLM inference jobs, especially those generating extensive output lengths, significantly delay subsequent shorter jobs. To address this issue, researchers from Peking University have developed a distributed inference serving solution named FastServe. FastServe leverages iteration-level scheduling and the autoregressive pattern of LLM inference to introduce preemption at the level of each output token. This unique approach empowers FastServe to make decisions on whether to continue with a scheduled task after generating an output token or preempt it with another job from the queue. As a result, FastServe effectively reduces JCT and eliminates head-of-line blocking through preemptive scheduling.
At the core of FastServe lies a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler. MLFQ has long been recognized as a highly effective method for minimizing average JCT when job sizes are unknown in advance: each task initially enters the highest-priority queue and, if it fails to complete within that queue’s quantum, is demoted to the next one. LLM inference, however, is semi information-agnostic, a crucial distinction from the conventional setting: the input length is known on arrival even though the output length is not, a direct consequence of the autoregressive pattern. The time needed to generate the first output token grows with the input length and, for jobs with long inputs and short outputs, accounts for the majority of the total execution time. FastServe exploits this characteristic by integrating skip-join into the traditional MLFQ approach.
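A minimal sketch of how such a skip-join MLFQ could be organized is shown below, assuming a simple quantum-doubling queue layout; the class and method names are illustrative rather than FastServe's actual implementation, and the arrival rule it encodes is the one described in the next paragraph.

```python
# Illustrative skip-join MLFQ with quantum-doubling queues (not FastServe's code).
from collections import deque

class SkipJoinMLFQ:
    def __init__(self, num_queues: int = 4, base_quantum: float = 0.05):
        # Queue i has quantum base_quantum * 2**i; lower index = higher priority.
        self.quanta = [base_quantum * (2 ** i) for i in range(num_queues)]
        self.queues = [deque() for _ in range(num_queues)]

    def admit(self, job, first_token_time: float) -> None:
        # Skip-join: instead of always entering the top queue, a new job joins the
        # highest-priority queue whose quantum covers its first-token time, which
        # can be estimated from the known input length.
        level = next((i for i, q in enumerate(self.quanta) if first_token_time <= q),
                     len(self.quanta) - 1)
        self.queues[level].append(job)

    def pick(self):
        # Serve the highest-priority non-empty queue.
        for level, q in enumerate(self.queues):
            if q:
                return level, q.popleft()
        return None

    def demote(self, job, level: int) -> None:
        # A job that exhausts its quantum without finishing drops one level.
        self.queues[min(level + 1, len(self.queues) - 1)].append(job)
```

With this placement rule, a job whose long prompt already implies a lengthy first token joins a lower queue directly instead of being admitted to the top queue and demoted step by step.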
Upon arrival, each job joins an appropriate queue based on a comparison between the execution time of its first output token and the demotion thresholds of the queues, rather than automatically entering the highest-priority queue; by skipping the higher-priority queues it would immediately outgrow, the job avoids unnecessary demotions. Preemptive scheduling combined with MLFQ, however, introduces additional memory overhead, because preempted jobs must be kept in an intermediate state. LLMs maintain a key-value cache for each Transformer layer to store these intermediate states. Under FCFS scheduling, the cache only needs to hold the states of the jobs in the running batch, whose size is bounded; with MLFQ, additional jobs may have been started and then relegated to lower-priority queues, and their intermediate states must be retained as well. Given the size of LLMs and the limited memory available on GPUs, the cache can overflow. Instead of merely delaying the initiation of new jobs when the cache is full, which would reintroduce head-of-line blocking, the researchers devised an innovative GPU memory management system.
This system proactively uploads the state of jobs in low-priority queues shortly before they are scheduled and offloads state to host memory when the cache nears capacity, using pipelining and asynchronous memory operations to hide the cost of these transfers. In addition, FastServe employs parallelization techniques such as tensor parallelism and pipeline parallelism to serve models that exceed the capacity of a single GPU across multiple GPUs, and the scheduler keeps several batches of jobs in flight concurrently to mitigate pipeline bubbles. A dedicated key-value cache manager organizes the distributed key-value cache and coordinates the swapping of memory between GPU and host memory. The researchers implemented a FastServe prototype based on NVIDIA FasterTransformer, and their results show notable improvements in both average and tail JCT, surpassing the cutting-edge solution Orca by up to 5.1× and 6.4×, respectively.
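As a rough illustration of the swapping idea, the sketch below shows a hypothetical key-value cache manager written in PyTorch that parks the state of low-priority jobs in pinned host memory and brings it back on a side CUDA stream. FastServe's real implementation sits inside NVIDIA FasterTransformer, so none of these names or interfaces reflect the system's actual API.

```python
# Hypothetical sketch of proactive KV-cache swapping between GPU and pinned host
# memory, using asynchronous copies on a side CUDA stream (requires a CUDA device).
import torch

class KVCacheManager:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.gpu_cache = {}    # job_id -> KV tensor resident on the GPU
        self.host_cache = {}   # job_id -> KV tensor parked in pinned host memory
        self.copy_stream = torch.cuda.Stream()  # lets copies overlap with compute

    def add(self, job_id: str, kv: torch.Tensor) -> None:
        # Register the KV cache of a newly scheduled job.
        self.gpu_cache[job_id] = kv
        self.used += kv.numel() * kv.element_size()

    def offload(self, job_id: str) -> None:
        # Called for jobs in low-priority queues when the GPU cache nears capacity.
        kv = self.gpu_cache.pop(job_id)
        host_buf = torch.empty(kv.shape, dtype=kv.dtype, device="cpu", pin_memory=True)
        with torch.cuda.stream(self.copy_stream):
            host_buf.copy_(kv, non_blocking=True)   # async device-to-host copy
        self.host_cache[job_id] = host_buf
        self.used -= kv.numel() * kv.element_size()

    def upload(self, job_id: str) -> None:
        # Called shortly before a swapped-out job is scheduled again.
        host_buf = self.host_cache.pop(job_id)
        gpu_buf = torch.empty(host_buf.shape, dtype=host_buf.dtype, device="cuda")
        with torch.cuda.stream(self.copy_stream):
            gpu_buf.copy_(host_buf, non_blocking=True)  # async host-to-device copy
        # The compute stream must wait on copy_stream before reading gpu_buf.
        self.gpu_cache[job_id] = gpu_buf
        self.used += gpu_buf.numel() * gpu_buf.element_size()

    def has_room(self, nbytes: int) -> bool:
        return self.used + nbytes <= self.capacity
```

Running the copies on a separate stream is what allows them to overlap with ongoing decoding, which is the role the article attributes to pipelining and asynchronous memory operations.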
Conclusion:
Peking University’s FastServe represents a significant advancement in the field of distributed inference serving systems for LLMs. Its ability to preempt tasks at the output token level and optimize memory management enables faster job completion and minimizes bottlenecks. With its superior performance and scalability, FastServe has the potential to reshape the market by providing more efficient and engaging interactive AI applications. Businesses can expect improved user experiences and accelerated advancements in various domains, further fueling the adoption of LLMs and similar AI technologies.