SoT: Microsoft and Tsinghua’s Innovation to Accelerate Large Language Models

TL;DR:

  • Large Language Models (LLMs) like GPT-4 suffer from high generation latency because they decode output token by token.
  • Microsoft and Tsinghua’s Skeleton-of-Thought (SoT) approach aims to accelerate LLMs.
  • SoT leaves the model itself untouched, treating it as a black box and optimizing only how its output is organized.
  • SoT’s two-stage process first drafts a skeleton of the answer, then expands its points in parallel.
  • It’s versatile and applicable to open-source models like LLaMA and API-based models like GPT-4.
  • Extensive tests show SoT achieves speed-ups of 1.13x to 2.39x on eight of the twelve models tested, without compromising answer quality.

Main AI News:

In the realm of cutting-edge Artificial Intelligence, a notable innovation has emerged to improve the speed and efficiency of Large Language Models (LLMs) like GPT-4 and LLaMA. These formidable AI systems have reshaped the technological landscape, but their slow, token-by-token generation has remained a persistent challenge. This latency has hindered their adoption in latency-critical applications such as chatbots, copilots, and industrial controllers. Recognizing the need for a solution, researchers from Microsoft Research and Tsinghua University have unveiled an approach known as Skeleton-of-Thought (SoT).

Traditionally, efforts to speed up LLMs have centered on modifications to the models, systems, or hardware. SoT takes a different path: it leaves the LLM untouched, treating it as a black box, and instead optimizes how the output content is organized. To do so, it prompts the LLM through a two-stage process. In the first stage, the LLM is directed to construct a skeleton of the answer. In the second stage, the LLM expands multiple points of that skeleton in parallel. This yields faster response times without any change to the model architecture, as the sketch below illustrates.
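To make the flow concrete, here is a minimal Python sketch of the two-stage prompting. It assumes a hypothetical complete(prompt) helper wrapping whatever LLM API is in use, and the prompt wording is illustrative rather than the paper’s exact template.

```python
# Minimal sketch of the two-stage Skeleton-of-Thought protocol.
# `complete` is a hypothetical helper wrapping any LLM chat/completion
# API; the prompt wording is illustrative, not the paper's template.

def complete(prompt: str) -> str:
    """Placeholder for a single LLM call (API-based or local model)."""
    raise NotImplementedError

def skeleton_of_thought(question: str) -> str:
    # Stage 1: ask the model for a short, numbered skeleton of the answer.
    skeleton = complete(
        "Give a concise skeleton (3-10 numbered points, a few words each) "
        f"for answering this question.\nQuestion: {question}\nSkeleton:"
    )
    points = [p for p in skeleton.splitlines() if p.strip()]

    # Stage 2: expand each skeleton point into a short passage. The points
    # are independent of one another, which is what makes parallel
    # expansion possible (written sequentially here for clarity).
    expansions = [
        complete(
            f"Question: {question}\nSkeleton:\n{skeleton}\n"
            f"Expand point {i + 1} ({point}) in 1-2 sentences."
        )
        for i, point in enumerate(points)
    ]
    return "\n".join(expansions)
```

The second stage is written sequentially here for clarity; a parallel variant is sketched after the next paragraph.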

SoT’s methodology splits content generation into two distinct phases. First, the LLM is prompted to produce a skeleton of the response, mirroring how humans often outline a high-level framework before writing. Second, the skeleton’s points are expanded concurrently, allowing the LLM to address multiple facets of the answer at once. Remarkably, this approach applies to a range of models, from open-source ones like LLaMA to API-based ones like GPT-4, showcasing its versatility.
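How that parallelism is realized depends on the model type: for open-source models the point expansions can be batched into a single decoding pass, while for API-based models they can be issued as concurrent requests. Below is a hedged sketch of the API case, reusing the hypothetical complete helper from the earlier sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def expand_in_parallel(question: str, skeleton: str, points: list[str]) -> str:
    """Stage 2 with concurrent API requests: one call per skeleton point."""

    def expand(indexed_point: tuple[int, str]) -> str:
        i, point = indexed_point
        return complete(
            f"Question: {question}\nSkeleton:\n{skeleton}\n"
            f"Expand point {i + 1} ({point}) in 1-2 sentences."
        )

    # Every expansion is independent, so all requests can be in flight at
    # once; end-to-end latency drops toward skeleton time plus the time of
    # the single slowest expansion.
    with ThreadPoolExecutor(max_workers=len(points)) as pool:
        return "\n".join(pool.map(expand, enumerate(points)))
```

Threads suffice in this sketch because the work is I/O-bound network calls; a batched-decoding variant for a local model would replace the executor with a single padded forward pass.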

To assess the efficacy of SoT, the research team conducted extensive tests on a dozen recently released models, spanning both open-source and API-based categories. These tests utilized the Vicuna-80 dataset, featuring questions from diverse domains such as coding, mathematics, writing, and roleplay. The results were impressive, with SoT achieving speed-ups ranging from 1.13x to 2.39x across eight of the twelve models tested. Crucially, these speed improvements were achieved without any compromise in answer quality. The team employed metrics from FastChat and LLMZoo to evaluate the quality of SoT’s responses, demonstrating its ability to maintain or enhance response quality across a wide spectrum of question categories.
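The shape of those speed-ups follows from a simple latency model: a sequential answer pays for every token in turn, while SoT pays for the skeleton plus only the slowest expansion. A back-of-the-envelope illustration with invented timings (not measurements from the paper):

```python
# Illustrative timings only; these numbers are invented, not from the paper.
skeleton_s = 1.0             # time to generate the skeleton
point_s = [2.0, 3.0, 2.5]    # per-point expansion times

sequential = sum(point_s)        # a normal answer decodes all content in sequence
sot = skeleton_s + max(point_s)  # SoT: skeleton first, then all points in parallel
print(f"speed-up ~ {sequential / sot:.2f}x")  # prints 'speed-up ~ 1.88x'
```

In this toy model the gain lands inside the 1.13x to 2.39x band the team reported; the actual figure for a given model depends on how evenly the answer splits into independent points.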

Conclusion:

The Skeleton-of-Thought (SoT) approach introduced by Microsoft Research and Tsinghua University promises to significantly enhance the speed and efficiency of Large Language Models. This innovation opens up opportunities for broader applications in latency-critical fields, such as chatbots and industrial controllers, without sacrificing the quality of responses. As AI continues to play a pivotal role in various industries, SoT could lead to more seamless and efficient AI-driven solutions, potentially reshaping the market by enabling faster and more effective interactions with AI systems.

Source