Cerebras Systems and Barcelona Supercomputing Center Advance Multilingual AI with FLOR-6.3B Model

TL;DR:

  • Cerebras Systems and the Barcelona Supercomputing Center have collaborated on FLOR-6.3B, a multilingual large language model.
  • FLOR-6.3B covers English, Spanish, and Catalan, and was trained in just 2.5 days on Condor Galaxy.
  • Novel training techniques address data scarcity for Catalan and Spanish.
  • A reduced vocabulary gives FLOR-6.3B fewer parameters than BLOOM-7.1B and cuts inference cost by more than 10%.
  • The FLOR family of models is open source and designed for research and commercial applications.
  • Condor Galaxy 1, an AI supercomputer built from 64 Cerebras CS-2 systems, simplifies large-model training.
  • Cerebras extends its multilingual AI leadership, established with models such as Jais, BTLM-3B-8K, and Med42.

Main AI News:

In a groundbreaking collaboration, Cerebras Systems, a pioneer in accelerating generative AI, has partnered with the Barcelona Supercomputing Center (BSC) to reach a significant milestone in multilingual natural language processing. Together, they have trained FLOR-6.3B, a cutting-edge large language model (LLM) capable of understanding and generating content in English, Spanish, and Catalan. Training completed in just 2.5 days on the Condor Galaxy (CG-1) AI supercomputer, which comprises 64 Cerebras CS-2 systems, a demonstration of the platform's capabilities.

FLOR-6.3B represents a significant step forward in multilingual AI, following in the footsteps of Jais, the leading Arabic-English model introduced by Cerebras. What makes this achievement even more remarkable is the unique challenge posed by the scarcity of training data for Catalan, a low-resource language. Cerebras and BSC rose to the occasion by combining Spanish, Catalan, and English within a single model: they started from a fully trained LLM and adjusted its embedding layer, achieving results on par with models trained on much larger datasets (a sketch of the idea appears below).
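Conceptually, the embedding-layer adjustment might look like the following sketch. It assumes the public bigscience/bloom-7b1 checkpoint as the starting point; the local tokenizer path and the row-copy strategy are illustrative assumptions, not the exact FLOR recipe.

```python
# A minimal sketch of reusing a fully trained LLM with an adjusted embedding
# layer. The tokenizer path below is hypothetical; bigscience/bloom-7b1 is the
# public BLOOM checkpoint assumed as the starting point.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")
old_tok = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
new_tok = AutoTokenizer.from_pretrained("./catalan-spanish-tokenizer")  # hypothetical

# Keep a handle on the trained embedding rows before resizing.
old_embed = base.get_input_embeddings().weight.data

# Shrink the (tied) embedding matrix to the new, smaller vocabulary.
base.resize_token_embeddings(len(new_tok))
new_embed = base.get_input_embeddings().weight.data

# Carry over the trained vector for every subword the two vocabularies share;
# the remaining rows stay freshly initialized and are learned during the
# continued pre-training run.
old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None and old_id < old_embed.shape[0]:
        new_embed[new_id] = old_embed[old_id]
```

The appeal of this approach is that the transformer body keeps all of its multilingual knowledge; only the vocabulary-facing layer is remapped before continued training.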

Andrew Feldman, CEO and co-founder of Cerebras, explained the significance of this accomplishment, stating, “Even though Spanish is one of the most commonly spoken languages in the world, there is a shortage of data available on the Internet for training – and we’ve found this to be a common problem for many languages beyond English.” He continued, “In collaboration with our partners, we have been committed to developing new methodologies for creating models where training data is underrepresented. We are proud to work with BSC on FLOR 6.3B, which is multilingual at its core and performs significantly better than competing Spanish LLMs thanks to our novel training techniques.”

FLOR, a new family of open-source models, spans parameter counts from 760M to 6.3B and is based on publicly released checkpoints of BLOOM. Those checkpoints were pre-trained on 341B tokens of multilingual data covering 46 natural languages and 13 programming languages.

To tailor FLOR-6.3B for Catalan and Spanish, a new tokenizer with a reduced vocabulary of 50,257 subwords was created. The new vocabulary overlaps with the existing BLOOM vocabulary while adding subwords that are more prevalent in Catalan and Spanish. As a result, FLOR-6.3B has fewer parameters than the BLOOM-7.1B model, reducing the cost of inference by more than 10%; the quick calculation below shows where the savings come from.
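The headline numbers are consistent with simple arithmetic on the embedding matrix. Taking the hidden size (4,096) and padded vocabulary (250,880 entries) from the public bloom-7b1 configuration, figures that are assumptions here rather than part of the announcement, the vocabulary cut accounts for roughly 0.8B parameters:

```python
# Rough check of the parameter savings from the smaller vocabulary; the hidden
# size and old vocabulary size come from the public bloom-7b1 config and are
# assumptions, not figures from the announcement.
hidden_size = 4096
old_vocab_size, new_vocab_size = 250_880, 50_257

saved = (old_vocab_size - new_vocab_size) * hidden_size
print(f"embedding parameters removed: {saved / 1e9:.2f}B")      # ~0.82B
print(f"approximate model size: {(7.1e9 - saved) / 1e9:.2f}B")  # ~6.28B, i.e. ~6.3B
```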

The FLOR family of models was trained on the Condor Galaxy 1 AI supercomputer, one of the world's largest. The smaller models were trained on single Cerebras CS-2 systems, while FLOR-6.3B drew on the combined power of 16 CS-2s. Remarkably, the entire 140-billion-token training run was completed in just 2.5 days.
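Taken at face value, those figures imply a sustained throughput of roughly 650,000 tokens per second across the 16 systems, a back-of-the-envelope derivation rather than a published benchmark:

```python
# Implied training throughput from the reported numbers (140B tokens in
# 2.5 days on 16 CS-2 systems); a rough estimate, not a measured benchmark.
tokens = 140e9
seconds = 2.5 * 24 * 3600                                 # 216,000 seconds
print(f"{tokens / seconds:,.0f} tokens/s overall")        # ~648,148
print(f"{tokens / seconds / 16:,.0f} tokens/s per CS-2")  # ~40,509
```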

FLOR-6.3B is now open source and ready for deployment in both research and commercial applications, further cementing Cerebras Systems' position as a leader in multilingual AI models. Condor Galaxy 1, built by Cerebras and strategic partner G42, supports models with up to 600 billion parameters and simplifies the training process, accelerating the development of groundbreaking AI models.
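For readers who want to try the released model, a minimal inference snippet follows. The Hugging Face model ID shown (projecte-aina/FLOR-6.3B, under BSC's language-technology organization) is an assumption to verify before use:

```python
# Minimal inference sketch; the model ID below is assumed to be BSC's
# published FLOR-6.3B checkpoint. Verify it on Hugging Face before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/FLOR-6.3B"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "La intel·ligència artificial és"  # Catalan: "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```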

The FLOR family of models is the latest testament to Cerebras' leadership in multilingual AI. In 2023, its collaboration with Core42 produced Jais 13B and Jais 30B, regarded as the world's finest bilingual Arabic-English models and available on Azure Cloud. Condor Galaxy also played a crucial role in training BTLM-3B-8K, the top 3B model on Hugging Face, which delivers 7B-class performance in a model light enough for inexpensive inference. Additionally, Med42, developed in partnership with M42 and Core42, has emerged as a leading clinical LLM, surpassing MedPaLM in both performance and accuracy after being trained on Condor Galaxy 1 in a single weekend.

Conclusion:

The successful development of FLOR-6.3B marks a significant leap in multilingual AI capabilities, addressing data scarcity and offering cost-effective inference. With open-source availability and the power of Condor Galaxy 1, Cerebras Systems and BSC are poised to shape the AI market by enabling advanced models with broad language coverage, opening doors to new research and commercial possibilities.

Source