Salesforce Unveils XGen-7B: Revolutionizing Long-Form Content Processing with Advanced Language Models

TL;DR:

  • Salesforce Research has developed XGen-7B, a series of 7B-parameter Large Language Models (LLMs) trained with up to 8K sequence length on 1.5 trillion tokens.
  • These LLMs excel in comprehending long-form content, offering enhanced capabilities in tasks like text summarization, question-answering, and long-form dialogue generation.
  • The XGen-7B models match or outperform other instruction-tuned and baseline LLMs of similar size on standard NLP benchmarks.
  • Salesforce utilized its proprietary library, JaxFormer, for efficient training of the XGen-7B models.
  • Researchers identified factors like “loss spikes” and sequence length as key considerations in the training process.
  • Evaluations demonstrated that the XGen-7B models achieved superior performance in understanding longer contexts and generating coherent responses.
  • Limitations of the XGen model include the potential for biases and toxic responses.
  • Salesforce Research has open-sourced the code, encouraging community exploration and collaboration.

Main AI News:

In today’s era of rapid technological advancement, Large Language Models (LLMs) have emerged as a driving force in artificial intelligence. By training these models on massive amounts of data, researchers and developers have enabled them to tackle complex language-related tasks, such as understanding intricate patterns and generating coherent responses. One area where LLMs have shown great promise is handling long-form content, which demands an understanding of much broader context. Whether the task is text summarization, code generation, protein structure prediction, or information retrieval, LLMs must process diverse forms of information such as paragraphs, tables, and images. By capturing long-distance structural dependencies, they can establish connections between different parts of a text and extract the most pertinent information. Models equipped with this broader context can therefore provide users with more accurate and contextually relevant answers to their queries.

Despite their potential, most open-source LLMs available today, including Meta’s LLaMA and MosaicML’s MPT models, have been trained on sequences of at most 2K tokens, which poses a significant challenge for modeling longer inputs. At the same time, research on model scaling has shown that, for a fixed computational budget, smaller models trained on more tokens can outperform larger ones. Motivated by this problem and by these recent findings, Salesforce Research has introduced XGen-7B, a series of 7B-parameter LLMs trained on 8K sequence length for 1.5 trillion tokens. The XGen-7B series comprises three models: XGen-7B-4K-Base, which supports a sequence length of 4K; XGen-7B-8K-Base, which supports a sequence length of 8K; and XGen-7B-8K-Inst, which is fine-tuned on public-domain instructional data and released exclusively for research purposes. Notably, the XGen models achieve comparable or even superior performance relative to other state-of-the-art LLMs of similar size, including MPT, Falcon, and LLaMA, as evidenced by standard NLP benchmarks.
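For readers who want to experiment with the released checkpoints, the sketch below shows one way to load and query the instruction-tuned model with the Hugging Face transformers library. The model ID and the trust_remote_code flag are assumptions about how the checkpoints are distributed; consult Salesforce’s repository for the authoritative usage instructions.

```python
# Minimal sketch: loading an XGen checkpoint with Hugging Face transformers.
# The model ID and trust_remote_code flag are assumptions about how the
# released checkpoints are packaged, not confirmed details from the article.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xgen-7b-8k-inst"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Summarize the following meeting transcript:\n..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```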

Salesforce’s XGen-7B models were trained with JaxFormer, the company’s proprietary library, which enables efficient training of LLMs using data and model parallelism optimized specifically for TPU-v4 hardware. The training recipe largely followed that of LLaMA, with two additional investigations. The first focused on “loss spikes”: sudden, temporary increases in the training loss with no clear underlying cause. Although the exact cause of these spikes remains unknown, the researchers identified potential contributing factors such as “sequential over parallel circuits,” “swish-GLU over GeLU,” and “RMS-Norm over Layer-norm.” The second investigation addressed sequence length. Because self-attention scales quadratically with sequence length, training on longer sequences is significantly more expensive, so a staged training approach was adopted: 800B tokens at a sequence length of 2K, followed by 400B tokens at 4K, and finally 300B tokens at 8K.
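The staged schedule can be read as a simple curriculum over sequence length: the same total token budget is split into progressively longer contexts. The short sketch below only illustrates that bookkeeping; the token budgets and context lengths come from the article, while the global batch size is a made-up illustrative value and not Salesforce’s actual training configuration.

```python
# Schematic sketch of the staged sequence-length schedule described above.
# Token budgets and context lengths are taken from the article; the global
# batch size is a hypothetical value chosen purely for illustration.

STAGES = [
    (800e9, 2048),  # 800B tokens at a 2K context length
    (400e9, 4096),  # 400B tokens at a 4K context length
    (300e9, 8192),  # 300B tokens at an 8K context length
]

GLOBAL_BATCH_TOKENS = 4_194_304  # hypothetical: ~4M tokens per optimizer step

for stage, (budget, seq_len) in enumerate(STAGES, start=1):
    sequences_per_step = GLOBAL_BATCH_TOKENS // seq_len
    steps = int(budget // GLOBAL_BATCH_TOKENS)
    print(f"Stage {stage}: seq_len={seq_len:>5}, "
          f"{sequences_per_step} sequences/step, ~{steps:,} steps")
```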

To evaluate the XGen-7B 8K model’s ability to comprehend longer contexts, the researchers conducted assessments across three primary tasks: long-form dialogue generation, text summarization, and question-answering. The instruction-tuned model was used for these evaluations, as it is best suited to tasks of this complexity. For long-form dialogue generation, the model was assessed on three tasks: AMI meeting summarization, ForeverDreaming, and TVMegaSite screenplay summarization. Across various metrics, the XGen-7B-Inst model consistently achieved the highest scores compared to other instruction-tuned models.
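The article does not name the specific metrics, but ROUGE scores are the standard yardstick for dialogue and screenplay summarization, so the sketch below shows how such scores are typically computed with the rouge-score package. Treat ROUGE as an assumed stand-in for the “various metrics” mentioned above; the reference and generated summaries are placeholders.

```python
# Illustrative ROUGE computation for a summarization benchmark.
# ROUGE is assumed here as a representative metric; the reference and
# generated summaries below are placeholder strings.
from rouge_score import rouge_scorer

reference = "The committee agreed to ship the remote-control prototype next quarter."
generated = "The team decided to release the remote control prototype in the next quarter."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.3f}, "
          f"recall={result.recall:.3f}, f1={result.fmeasure:.3f}")
```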

For long-form question-answering, the researchers used ChatGPT to generate questions from Wikipedia documents, together with their corresponding summaries, covering a wide array of topics including Physics, Engineering, History, and Entertainment. The LLM-generated answers, limited to 256 tokens, were then evaluated with GPT-4 on factors such as structure, organization, and relevance to the question and source document (a rough sketch of such a judging setup appears after this paragraph). In this assessment, the XGen-7B-8K-Inst model outperformed baseline models limited to 2K tokens. For text summarization, the researchers used two datasets from different domains, specifically meeting conversations and government reports; here, too, the XGen-7B model surpassed the other baselines, confirming its strength in text summarization.
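As a rough illustration of the GPT-4-based evaluation, the sketch below builds a judging prompt around the criteria mentioned above (structure, organization, relevance) using the openai Python client. The rubric wording and score scale are assumptions, since the article does not reproduce the actual evaluation prompt.

```python
# Hedged sketch of LLM-as-judge scoring for long-form QA answers.
# The rubric text and 1-5 scale are assumptions; the article only states that
# GPT-4 judged structure, organization, and relevance.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "Score the answer from 1 to 5 for structure, organization, and relevance "
    "to the question and the source document. Reply with only the number."
)

def judge_answer(question: str, source_doc: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question:\n{question}\n\nSource document:\n{source_doc}\n\n"
                f"Answer (max 256 tokens):\n{answer}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()
```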

Conclusion:

The introduction of Salesforce’s XGen-7B series marks a significant advance in long-form content processing. These language models, trained on vast amounts of data, demonstrate strong performance in tasks such as text summarization, question-answering, and long-form dialogue generation. The scalability and efficiency of the training process, along with the open-sourcing of the code, pave the way for further innovation and collaboration in the field. As businesses increasingly rely on advanced language models to understand and generate extensive textual contexts, the XGen-7B models offer a powerful option for delivering accurate and contextually relevant information to users.

Source