Sailing into Success: Introducing Sailor Language Models Tailored for Southeast Asian (SEA) Languages

  • Sailor project introduces language models tailored for Southeast Asian (SEA) languages, ranging from 0.5B to 7B parameters.
  • Models based on Qwen1.5 are continually pre-trained on a corpus of 200B to 400B tokens covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.
  • Innovative strategies like BPE dropout and rigorous data-cleaning processes enhance model adaptability and performance.
  • Optimization of training data through tiny proxy models enables fine-tuning of hyperparameters, improving training efficacy.
  • Empirical evaluations demonstrate Sailor models’ resilience and utility across various tasks, highlighting their potential to address language challenges in the SEA region.

Main AI News:

In recent years, Large Language Models (LLMs) have undergone a dramatic evolution, driven by the exponential growth of data on the internet and continuous advancements in pre-training techniques. Models like GPT, Gemini, and Llama have set new benchmarks in various domains, including logical reasoning, coding, and creative writing.

However, a significant challenge has been the overreliance on English-centric datasets, which limits the performance of LLMs in non-English languages. This phenomenon, known as the curse of multilinguality, arises from models' insufficient exposure to diverse linguistic contexts during pre-training.

To address this issue head-on, a collaboration between Sea AI Lab, Singapore, and SUTD, Singapore, has produced the Sailor project. Sailor comprises a family of open language models tailored specifically for Southeast Asian (SEA) languages, ranging from 0.5B to 7B parameters. These models are built upon Qwen1.5, a foundation model designed with multilingual applications in mind.

The Sailor models undergo continual pre-training on a vast corpus of 200B to 400B tokens covering English, Chinese, and major SEA languages such as Vietnamese, Thai, Indonesian, Malay, and Lao. This training data is subjected to careful processing, including BPE (Byte Pair Encoding) dropout, which randomly varies subword segmentation during tokenization so the models are exposed to more diverse token patterns and are less prone to overfitting.
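
As a rough illustration, the sketch below shows how BPE dropout can be enabled with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and dropout rate are illustrative assumptions, not Sailor's actual settings.

```python
# Minimal BPE-dropout sketch using the Hugging Face `tokenizers` library.
# The corpus file, vocab size, and dropout rate are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# BPE with dropout: during encoding, each learned merge is skipped with
# probability 0.1, so the same text can be segmented differently across passes.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
tokenizer.train(["sea_corpus.txt"], trainer)  # hypothetical multilingual text file

# Two encodings of the same sentence may yield different subword sequences,
# which acts as a regularizer during language-model pre-training.
print(tokenizer.encode("Selamat pagi, apa kabar?").tokens)
print(tokenizer.encode("Selamat pagi, apa kabar?").tokens)
```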

Moreover, rigorous deduplication and data-cleaning procedures are applied to the training set, removing duplicated, irrelevant, and noisy text and thereby improving the precision and reliability of the Sailor models' predictions.
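
As a simple illustration of the idea, the following sketch performs exact, hash-based document deduplication; the normalization and hashing choices are generic assumptions, not Sailor's actual pipeline, which may rely on more sophisticated near-duplicate detection.

```python
# Minimal exact-deduplication sketch: hash a normalized form of each document
# and keep only the first occurrence. This is a generic illustration.
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct (normalized) document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Xin chào thế giới!", "xin chào   thế giới!", "Sawasdee krub"]
print(deduplicate(corpus))  # the two near-identical Vietnamese lines collapse to one
```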

A noteworthy aspect of Sailor's development is the use of tiny proxy models to optimize the training data: small models are trained on candidate data mixtures so that hyperparameters such as the data mixture ratio can be tuned cheaply before the full-scale run, ultimately enhancing model performance.
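
The sketch below conveys the general idea of a proxy-model sweep over candidate mixture ratios; the sampling scheme and the train_proxy_and_eval placeholder are hypothetical stand-ins, not Sailor's actual procedure.

```python
# Minimal sketch of data-mixture tuning with tiny proxy models.
# `train_proxy_and_eval` is a hypothetical placeholder, not Sailor's code.
import random

LANGS = ["en", "zh", "vi", "th", "id", "ms", "lo"]

def sample_mixture() -> dict[str, float]:
    """Sample a random mixture ratio over the corpus languages (sums to 1.0)."""
    weights = [random.random() for _ in LANGS]
    total = sum(weights)
    return {lang: w / total for lang, w in zip(LANGS, weights)}

def train_proxy_and_eval(mixture: dict[str, float]) -> float:
    """Placeholder for training a tiny proxy model on `mixture` and returning
    its validation loss; a random score keeps this sketch runnable."""
    return random.random()

# Sweep candidate mixtures with cheap proxy runs, keep the best one, and only
# then spend the full compute budget training the large model on that mixture.
best_mixture, best_loss = None, float("inf")
for _ in range(20):
    mixture = sample_mixture()
    loss = train_proxy_and_eval(mixture)
    if loss < best_loss:
        best_mixture, best_loss = mixture, loss

print("selected mixture:", {k: round(v, 2) for k, v in best_mixture.items()})
```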

Experimental evaluations across various tasks, spanning examination, question answering, reading comprehension, and common-sense reasoning, have underscored the resilience and efficacy of Sailor models relative to comparable baselines on existing benchmarks. These findings highlight the potential of Sailor models to address language-related challenges prevalent in the SEA region across a wide spectrum of applications.

Conclusion:

The introduction of Sailor language models tailored for Southeast Asian languages marks a significant step towards addressing the language barriers prevalent in the region. With their adaptability, precision, and utility across diverse linguistic contexts, these models are poised to revolutionize language processing applications in the Southeast Asian market, unlocking new opportunities for businesses and organizations operating there.

Source