EleutherAI unveiled Hi-NOLIN, a pioneering open-source English-Hindi bilingual model

TL;DR:

  • EleutherAI collaborates with INCITE Initiative and AAI CERC to launch Hi-NOLIN.
  • Hi-NOLIN aims to be the first open-source English-Hindi bilingual model.
  • The model scales from a 7B to a 9B framework to boost performance.
  • Uses the 300B-token Pile text corpus, which covers English and programming languages.
  • Exhibits seamless bilingual transitions and code processing capabilities.
  • Training employs the Summit supercomputer’s six-GPU-per-node architecture.
  • Early stages show a promising reduction in training loss, indicating potential advances.
  • Incorporates technologies from GPT-NeoX, Megatron-LM, and DeepSpeed for enhanced efficiency.
  • Outperforms Pythia 12B and multilingual BLOOM models in early benchmarks.
  • Moves towards closing the performance gap with Llama 2 models.
  • Reflects EleutherAI’s commitment to open-source LLMs and transparent research.
  • Tech Mahindra’s Project Indus, catering to Hindi and its dialects, will launch soon.

Main AI News:

Hi-NOLIN was conceived as the first open-source English-Hindi bilingual model. The research team expanded the 7B Pythia model to 9B parameters to improve performance on their hardware, using the 300B-token Pile text corpus, which spans English and programming languages. Hi-NOLIN is notable for its bilingual proficiency: it switches between Hindi and English while also handling code.

Hi-NOLIN’s ongoing training runs on the Summit supercomputer, taking advantage of its six-GPU-per-node configuration. Although training is at an early stage and not yet fully optimized, the steady decline in training loss suggests further improvements to come.

Hi-NOLIN’s training stack draws on GPT-NeoX, Megatron-LM, and DeepSpeed, combining 3D parallelism (data, tensor, and pipeline parallelism) with the Zero Redundancy Optimizer (ZeRO) to make efficient use of the available compute.
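To make the idea of 3D parallelism with ZeRO more concrete, here is a minimal sketch of how GPUs might be divided among the three parallel dimensions, and roughly how much memory the model state would need per GPU. Only the six-GPUs-per-node figure and the 9B parameter count come from the article. The node count, the tensor- and pipeline-parallel sizes, and the choice of ZeRO stage 1 are hypothetical values picked for illustration, not the team's actual configuration. The per-parameter byte counts follow the usual mixed-precision Adam accounting (2 bytes for weights, 2 for gradients, 12 for optimizer state).

```python
from dataclasses import dataclass

GPUS_PER_NODE = 6  # Summit's per-node GPU count, as stated in the article


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical 3D-parallel layout: data x tensor x pipeline."""
    nodes: int
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def world_size(self) -> int:
        # Total number of GPUs taking part in training.
        return self.nodes * GPUS_PER_NODE

    @property
    def data_parallel(self) -> int:
        # Whatever is left after tensor and pipeline parallelism
        # becomes the number of data-parallel replicas.
        model_parallel = self.tensor_parallel * self.pipeline_parallel
        if self.world_size % model_parallel != 0:
            raise ValueError("world size must be divisible by TP * PP")
        return self.world_size // model_parallel


def zero1_bytes_per_gpu(total_params: float, layout: ParallelLayout) -> float:
    """Estimate model-state bytes per GPU under ZeRO stage 1 (assumed stage).

    Stage 1 shards only the optimizer state across data-parallel ranks;
    weights and gradients stay replicated within each model-parallel shard.
    Activations and buffers are ignored.
    """
    params_per_gpu = total_params / (layout.tensor_parallel * layout.pipeline_parallel)
    weights_and_grads = (2 + 2) * params_per_gpu
    optimizer_state = 12 * params_per_gpu / layout.data_parallel
    return weights_and_grads + optimizer_state


# Illustrative values only: 8 nodes, TP=2, PP=3 are assumptions.
layout = ParallelLayout(nodes=8, tensor_parallel=2, pipeline_parallel=3)
estimate_gb = zero1_bytes_per_gpu(9e9, layout) / 1e9
print(f"GPUs: {layout.world_size}, data-parallel replicas: {layout.data_parallel}")
print(f"Estimated model state per GPU: {estimate_gb:.2f} GB")
```

In this sketch, choosing tensor- and pipeline-parallel sizes whose product equals six would keep one full model replica inside a single node, which is one plausible reason the per-node GPU count matters for the layout.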

Hi-NOLIN performs well across several standard LLM benchmarks, including HellaSwag, TruthfulQA, ARC, and HumanEval. With a dataset of 600B tokens, preliminary assessments show Hi-NOLIN outperforming both Pythia 12B and the multilingual BLOOM models on most measures and approaching the performance of Llama 2 models.

Hi-NOLIN is a substantial step toward linguistic diversity in a field dominated by English-centric language models, helping to close the gap in advanced language technology for non-English languages.

EleutherAI, a nonprofit collective founded in 2020 by Connor Leahy, Sid Black, and Leo Gao, is at the forefront of open-source LLM research. Its mission is to democratize access to large language models and foster transparency as a counterbalance to proprietary alternatives.

Meanwhile, Indian tech conglomerate Tech Mahindra is preparing to introduce Project Indus, its own LLM tailored for Hindi and its 37 variants, with a launch slated for late December or early January.

Conclusion:

The entry of Hi-NOLIN into the language technology market marks a significant milestone in bilingual model development, particularly for English and Hindi. This step toward linguistic inclusivity could disrupt a market that is heavily skewed toward English-centric models. Hi-NOLIN’s performance in preliminary evaluations suggests that it could become a formidable competitor to established proprietary models. The concurrent development of Tech Mahindra’s Project Indus further indicates a growing industry trend toward serving diverse linguistic demographics. Together, these developments signal a broader shift in language technology, opening up new avenues for business communication, consumer services, and AI-driven applications in markets where languages other than English are spoken.
