PolyLM: Unlocking Multilingual Potential with Open Source Large Language Models Trained on 640B Tokens, Offering Model Sizes of 1.7B and 13B

TL;DR:

  • Large Language Models (LLMs) have garnered significant interest in the AI sector for their text generation abilities.
  • The team behind PolyLM has developed a multilingual LLM, addressing the predominant bias towards English in existing models.
  • PolyLM-13B and PolyLM-1.7B models have been released, catering to a wide range of languages.
  • A dataset of 640B tokens was used, with a curriculum learning strategy employed to strengthen performance on low-resource languages.
  • MULTIALPACA, a multilingual instruction dataset, aids in fine-tuning the model for better comprehension.
  • The team has established a benchmark for evaluating multilingual capabilities, on which PolyLM outperforms comparable open-source models in non-English languages.
  • PolyLM’s contributions include proficiency in major non-English languages, a curriculum learning approach, and the MULTIALPACA dataset.

Main AI News:

The rapid advancements in Large Language Models (LLMs) have captured the attention of the Artificial Intelligence sector, owing to their remarkable versatility and proficiency in understanding, reasoning, and generating text based on natural language instructions. These models, trained on extensive data, have emerged as highly capable tools that closely mimic human-like abilities.

Research and development of LLMs has, however, focused predominantly on English and other resource-rich languages. Addressing this limitation, a team of researchers from DAMO Academy and Alibaba Group has introduced POLYLM (Polyglot Large Language Model), a multilingual LLM that aims to broaden the scope and inclusivity of language understanding.

Noting that existing open multilingual LLMs lack a model at the 13B scale, the team behind POLYLM has released two variants: POLYLM-13B and POLYLM-1.7B. This step makes the benefits of LLM technology accessible to a wider range of languages and users.

To construct POLYLM, the team leveraged an expansive dataset comprising 640B tokens from publicly available sources, including Wikipedia, mC4, and CC-100. However, they encountered the challenge of limited data for low-resource languages. To overcome this hurdle, they employed a curriculum learning strategy that placed initial emphasis on English and gradually increased the proportion of high-quality, low-resource language data as training progressed. Through this approach, the team transferred general knowledge acquired from English to other languages, enhancing the model’s overall linguistic capabilities.
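The curriculum idea can be illustrated with a small sketch. The per-language sampling schedule below is a hypothetical interpolation, not the actual mixture or schedule used to train POLYLM: English dominates early batches, and the share of other languages grows as training proceeds.

```python
# A minimal sketch of curriculum-style data mixing. The language list,
# base weights, and interpolation schedule are illustrative assumptions,
# not the values used to train POLYLM.
from typing import Dict

def language_sampling_weights(step: int, total_steps: int,
                              base_weights: Dict[str, float],
                              english_start: float = 0.9,
                              english_end: float = 0.6) -> Dict[str, float]:
    """Interpolate the English share from a high initial value to a lower
    final value, redistributing the remainder across the other languages
    in proportion to their base (corpus-size-derived) weights."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    english_share = english_start + (english_end - english_start) * progress

    non_en = {k: v for k, v in base_weights.items() if k != "en"}
    non_en_total = sum(non_en.values())
    weights = {"en": english_share}
    for lang, w in non_en.items():
        weights[lang] = (1.0 - english_share) * w / non_en_total
    return weights

# Hypothetical base weights; compare the mixture early vs. late in training.
base = {"en": 0.7, "zh": 0.1, "es": 0.05, "ru": 0.05, "ar": 0.03,
        "ja": 0.03, "ko": 0.02, "th": 0.01, "id": 0.01}
print(language_sampling_weights(step=0, total_steps=100_000, base_weights=base))
print(language_sampling_weights(step=100_000, total_steps=100_000, base_weights=base))
```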

In addition to the development of POLYLM, the team also devised MULTIALPACA, a multilingual instruction dataset, for the supervised fine-tuning (SFT) phase. Existing multilingual SFT datasets often rely on time-consuming manual annotation or on machine translation, which introduces errors and disregards cultural nuances. MULTIALPACA instead generates high-quality multilingual instruction data automatically, avoiding these limitations. By combining English seed tasks, translation into various languages, and an instruction production and filtering pipeline, MULTIALPACA provides a robust foundation for improved multilingual instruction comprehension.
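The general shape of such a pipeline (translate English seeds, expand them with an LLM, then filter) can be sketched roughly as follows. The helper functions here are hypothetical stubs standing in for a machine-translation step, an LLM prompt, and quality filters; they are not the authors’ actual code.

```python
# Rough sketch of a MULTIALPACA-style data construction pipeline.
# All helpers below are placeholder stubs for illustration only.
from typing import List

def translate(texts: List[str], target_lang: str) -> List[str]:
    """Placeholder for a machine-translation step (MT model or API)."""
    return [f"[{target_lang}] {t}" for t in texts]  # stub output

def generate_instructions(seed_tasks: List[str], n_new: int) -> List[str]:
    """Placeholder for self-instruct-style generation: prompt an LLM with
    the seeds and ask for new, diverse instructions in the same language."""
    return [f"{seed_tasks[i % len(seed_tasks)]} (variant {i})" for i in range(n_new)]

def keep(instruction: str) -> bool:
    """Placeholder quality filter: drop instructions that are too short."""
    return len(instruction.split()) >= 3

def build_multilingual_sft_data(english_seeds: List[str],
                                languages: List[str],
                                n_per_lang: int) -> dict:
    data = {}
    for lang in languages:
        seeds = translate(english_seeds, lang)                  # 1. translate seeds
        candidates = generate_instructions(seeds, n_per_lang)   # 2. expand with an LLM
        data[lang] = [c for c in candidates if keep(c)]         # 3. filter
    return data

if __name__ == "__main__":
    seeds = ["Summarize the following paragraph.", "Write a short poem about rain."]
    dataset = build_multilingual_sft_data(seeds, ["es", "ja", "ar"], n_per_lang=4)
    for lang, items in dataset.items():
        print(lang, len(items), "instructions")
```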

To evaluate the multilingual capabilities of LLMs, the team created a comprehensive benchmark encompassing question answering, language understanding, text generation, and cross-lingual machine translation. Built with carefully designed prompts, the benchmark covers fifteen languages and ten tasks. Through extensive experimentation, the team demonstrated that the pretrained POLYLM model outperforms existing open-source models of comparable size on non-English languages. The curriculum training strategy not only enhances multilingual performance but also maintains a high level of English proficiency. Furthermore, the integration of multilingual instruction data significantly improves POLYLM’s performance on multilingual zero-shot tasks.
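As a rough illustration of how such a multilingual zero-shot evaluation is typically run, the loop below iterates over (task, language) pairs, formats a prompt, queries a model, and records a score. The task names, prompt template, model call, and exact-match metric are placeholder assumptions, not the benchmark released with POLYLM.

```python
# Illustrative multilingual zero-shot evaluation loop. Everything here
# (tasks, prompt template, the model_generate stub, exact-match scoring)
# is a placeholder assumption, not POLYLM's actual benchmark.
from typing import Dict, List, Tuple

def model_generate(prompt: str) -> str:
    """Stub for a call to the model under evaluation."""
    return "placeholder answer"

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(examples: Dict[Tuple[str, str], List[Tuple[str, str]]]) -> Dict[Tuple[str, str], float]:
    """examples maps (task, language) -> list of (input, reference) pairs."""
    scores = {}
    for (task, lang), pairs in examples.items():
        total = 0.0
        for source, reference in pairs:
            prompt = f"[{task} | {lang}] {source}"  # zero-shot: no in-context examples
            prediction = model_generate(prompt)
            total += exact_match(prediction, reference)
        scores[(task, lang)] = total / max(len(pairs), 1)
    return scores

if __name__ == "__main__":
    toy = {("question_answering", "es"): [("¿Capital de Francia?", "París")],
           ("translation", "ja"): [("Hello", "こんにちは")]}
    print(evaluate(toy))
```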

To summarize the contributions of the research team:

  1. They have developed a proficient 13B-scale model that excels in major non-English languages such as Spanish, Russian, Arabic, Japanese, Korean, Thai, Indonesian, and Chinese. This model bridges the gap left by existing open-source models, which either lack proficiency in these languages or offer only smaller versions with limited capabilities.
  2. The team proposed an advanced curriculum learning approach that enables the transfer of general knowledge acquired in English to diverse non-English languages and specific natural language processing tasks, including machine translation.
  3. They introduced the MULTIALPACA dataset, a valuable addition to existing instruction datasets, which significantly enhances LLMs’ ability to comprehend multilingual instructions, particularly those from non-native English speakers.

Through the groundbreaking advancements brought forth by POLYLM, the field of multilingual language processing stands to benefit immensely, paving the way for more inclusive and effective language models in the future.

Conclusion:

The introduction of PolyLM, with its expanded multilingual capabilities, marks a significant development in the market for Large Language Models. By addressing the bias towards English and catering to a wide range of languages, PolyLM opens up new opportunities for businesses and organizations operating across language barriers. Its proficiency in major non-English languages, its curriculum learning approach, and the availability of the MULTIALPACA dataset ensure enhanced performance and better comprehension of multilingual instructions. This breakthrough will drive the market towards more inclusive and effective language models, facilitating seamless communication and understanding in a globalized world.

Source