‘Magnus’: Optimizing Large Language Model Serving Efficiency in LMaaS

  • Magnus is a serving system that improves the efficiency of transformer-based generative LLMs offered through LMaaS.
  • Magnus integrates application- and user-level semantic information to accurately predict each request's generation length.
  • Components include batch scheduler, adaptive batcher, serving time estimator, and generation length predictor.
  • Testing on NVIDIA V100 GPUs showed up to a 234% increase in request throughput and an 89.7% reduction in response times.
  • Magnus offers significant improvements in serving latency and efficiency over traditional methods.

Main AI News:

Transformer-based generative Large Language Models (LLMs) have proven their prowess across a spectrum of Natural Language Processing (NLP) tasks. Despite their versatility, the cost of training and deploying these models often deters developers. Leading AI firms such as OpenAI, Google, and Baidu address this with Language Model-as-a-Service (LMaaS), granting access to LLMs via APIs.

In an LMaaS environment, developers submit user inputs together with application-specific instructions, and the provider aims to enhance Quality of Service (QoS) while accommodating more clients. Current systems like TensorFlow Serving and Triton Inference Server handle queries in a first-come, first-served (FCFS) manner with fixed batch sizes, limiting GPU parallelism to prevent memory issues.

To tackle these inefficiencies, continuous batching dynamically manages requests, yet often underutilizes GPU potential. Moreover, methods like model quantization and pruning, while reducing memory usage, may compromise output quality.

Research reveals a significant correlation between user input length and generated output length in applications such as code translation and grammatical error correction. Leveraging this correlation, Magnus, developed by a team of AI researchers from China, integrates application- and user-level semantic information to predict generation lengths accurately.

Magnus comprises four key components: a batch scheduler, an adaptive batcher, a serving time estimator, and a generation length predictor. The generation length predictor, a random forest regressor, forecasts each request's generation length from application-level semantic features and the user input. The adaptive batcher then groups requests with similar predicted lengths into appropriately sized batches, minimizing wasted computation.
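To make this concrete, here is a minimal sketch of how a random-forest length predictor and a length-aware batcher could be wired together. The feature set (an application ID and the input token count), the toy training data, and the batching thresholds are illustrative assumptions; the paper's actual features, hyperparameters, and batching algorithm may differ.

```python
# A minimal sketch of a generation length predictor and a length-aware batcher.
# Features (application id, input token count), thresholds, and the tiny
# training set are illustrative assumptions, not the configuration from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical history: (application_id, input_token_count) -> observed output tokens.
X_train = np.array([[0, 120], [0, 340], [1, 60], [1, 200], [2, 80]])
y_train = np.array([130, 360, 15, 40, 95])

length_model = RandomForestRegressor(n_estimators=100, random_state=0)
length_model.fit(X_train, y_train)

def predict_generation_length(app_id: int, input_tokens: int) -> int:
    """Estimate how many tokens a request will generate."""
    return int(length_model.predict(np.array([[app_id, input_tokens]]))[0])

def batch_by_predicted_length(requests, length_tolerance=64, max_batch_size=8):
    """Group requests whose predicted lengths are close, so sequences in a
    batch finish at roughly the same time and little computation is wasted."""
    requests = sorted(requests, key=lambda r: r["predicted_len"])
    batches, current = [], []
    for req in requests:
        start_new = current and (
            req["predicted_len"] - current[0]["predicted_len"] > length_tolerance
            or len(current) >= max_batch_size
        )
        if start_new:
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches
```

In practice such a regressor would presumably be trained offline on logged (instruction, input, output length) records and refreshed periodically as traffic patterns change.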

The batch scheduler adopts a Highest Response Ratio Next (HRRN) policy, prioritizing batches by the ratio of waiting time plus expected serving time to expected serving time, which shortens queueing delays without starving short requests. Meanwhile, the serving time estimator uses KNN regression to predict batch serving times, supplying the scheduler with the estimates it needs and further refining QoS.
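The sketch below shows how a KNN serving-time estimator might feed an HRRN scheduler. The batch features (batch size and maximum predicted length), the recorded timings, and the neighbor count are illustrative assumptions rather than the exact setup described in the paper.

```python
# A minimal sketch of a KNN serving-time estimator feeding an HRRN scheduler.
# The batch features, recorded timings, and neighbor count are illustrative
# assumptions, not the exact configuration used in Magnus.
import time
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical history: (batch_size, max_predicted_len) -> measured serving time (s).
X_hist = np.array([[4, 128], [8, 128], [4, 512], [8, 512]])
y_hist = np.array([0.9, 1.4, 2.8, 4.6])

time_model = KNeighborsRegressor(n_neighbors=2)
time_model.fit(X_hist, y_hist)

def estimate_serving_time(batch) -> float:
    """Predict how long a batch will take to serve from its size and the
    longest predicted generation length it contains."""
    features = np.array([[len(batch), max(r["predicted_len"] for r in batch)]])
    return float(time_model.predict(features)[0])

def pick_next_batch(ready_batches, now=None):
    """Highest Response Ratio Next: serve the batch with the largest
    (waiting time + estimated serving time) / estimated serving time,
    which favors long-waiting requests without starving short ones."""
    now = time.time() if now is None else now
    def response_ratio(batch):
        wait = now - min(r["arrival_time"] for r in batch)
        service = estimate_serving_time(batch)
        return (wait + service) / service
    return max(ready_batches, key=response_ratio)
```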

In testing with ChatGLM-6B instances on NVIDIA V100 GPUs, Magnus demonstrated significant improvements in serving latency, request throughput, and overall efficiency compared to baseline approaches. Results indicated up to a 234% increase in request throughput and an 89.7% reduction in response times, underscoring Magnus’ effectiveness in optimizing batch serving within LMaaS frameworks.

These results point to more efficient and scalable LLM deployment across diverse NLP applications, with tangible benefits for both developers and end-users of AI-driven language services.

Conclusion:

This innovation with Magnus marks a pivotal advancement in the LMaaS market, emphasizing enhanced efficiency and scalability in deploying LLMs for various NLP applications. By significantly improving request throughput and reducing response times, Magnus sets a new standard for optimizing batch serving, promising substantial benefits for developers and stakeholders invested in AI-driven language services.

