Telugu LLM Labs: Empowering Telugu NLP with Native and Romanised Datasets

TL;DR:

  • Ravi Theja Desetty and Ramsri Goutham Golla launch Telugu LLM Labs for Telugu NLP.
  • Aims to enhance the AI experience for 100 million Telugu speakers worldwide.
  • Focus on open datasets in both native and romanized Telugu scripts.
  • Introduction of “uonlp_culturaX_telugu_romanized_100k” dataset for romanized Telugu.
  • Two datasets for supervised finetuning in Telugu, addressing Indic language instruction needs.
  • Filtering to remove “English Language Specific” or “Coding related” content.
  • Commitment to refining open-source models like Llama 2, Mistral, and TinyLlama using new datasets.

Main AI News:

Ravi Theja Desetty from LlamaIndex and Ramsri Goutham Golla have joined forces to launch Telugu LLM Labs, a venture poised to revolutionize the Telugu Natural Language Processing (NLP) landscape. With over 100 million Telugu speakers worldwide, this initiative holds immense promise for the advancement of linguistic technology.

Telugu LLM Labs’ primary mission revolves around enriching the AI experience for the Telugu-speaking community. Their strategic focus is on open datasets tailored to Telugu, encompassing both native script and romanized versions. This ambitious endeavor seeks to share valuable insights, models, and embeddings, with a specific emphasis on Large Language Models (LLMs) optimized for Telugu.

The spotlight shines on the “uonlp_culturaX_telugu_romanized_100k” dataset, a pretraining resource for romanized Telugu. Recognizing how widely romanized Telugu is used on online platforms such as WhatsApp and in YouTube comments, Telugu LLM Labs built this dataset from the romanized version of the initial 108,000 rows of the culturaX_telugu dataset. It addresses the pressing need for data suited to continued model pretraining and finetuning in romanized Telugu.
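The article does not say which tool Telugu LLM Labs used for romanization (libraries such as indic-transliteration are a common choice). As a purely illustrative sketch, the mapping-based approach can be shown with a toy romanizer covering only a handful of Telugu characters: consonants carry an inherent “a”, vowel signs replace it, and the virama suppresses it.

```python
# Toy Telugu-to-Roman transliterator. The character tables below are a tiny,
# hypothetical subset chosen for illustration; a real pipeline would use a
# complete scheme from a transliteration library.
CONSONANTS = {"త": "ta", "ల": "la", "గ": "ga", "క": "ka", "ర": "ra", "మ": "ma", "న": "na"}
VOWEL_SIGNS = {"ా": "aa", "ి": "i", "ు": "u", "ె": "e", "ే": "ee", "ొ": "o", "ో": "oo"}
INDEPENDENT_VOWELS = {"అ": "a", "ఆ": "aa", "ఇ": "i", "ఉ": "u", "ఎ": "e"}
VIRAMA = "్"  # suppresses the inherent vowel of the preceding consonant

def romanize(text: str) -> str:
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])          # consonant with inherent "a"
        elif ch in VOWEL_SIGNS and out:
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[ch]  # replace inherent "a"
        elif ch == VIRAMA and out:
            out[-1] = out[-1][:-1]              # drop inherent "a"
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            out.append(ch)                      # pass through unknown chars
    return "".join(out)

print(romanize("తెలుగు"))  # -> telugu
```

Applied row by row over a corpus such as culturaX_telugu, a full version of this step yields the romanized pretraining text described above.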

Additionally, Telugu LLM Labs addresses the scarcity of instruction datasets in Indic languages by releasing supervised finetuning datasets in Telugu, in both native and romanized variants. This release includes two datasets, “yahma_alpaca_cleaned_telugu_filtered_and_romanized” and “teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized.” Both datasets, accessible on the HuggingFace Hub, contain content translated and transliterated into Telugu, further refined by an NLP classification pass that removes content deemed “English Language Specific” or “Coding related.”
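The article does not detail the classifier used for this filtering. As a hedged illustration only, a crude heuristic pass along these lines could flag rows that are code-like or remain overwhelmingly English after translation; the marker list and threshold here are assumptions, not Telugu LLM Labs’ actual pipeline, and the ASCII check is only meaningful for native-script (not romanized) rows.

```python
import re

# Hypothetical code markers; the real pipeline reportedly used an NLP classifier.
CODE_MARKERS = re.compile(r"\b(def|class|import|return|function|print)\b|[{};=<>]")

def looks_like_code(text: str) -> bool:
    """Flag rows whose content appears to be programming-related."""
    return bool(CODE_MARKERS.search(text))

def is_mostly_ascii(text: str, threshold: float = 0.9) -> bool:
    """Crude proxy for 'English Language Specific' rows in native-script data:
    native-script Telugu contains almost no ASCII letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(c.isascii() for c in letters) / len(letters) >= threshold

def keep_row(text: str) -> bool:
    """Keep a row only if it is neither code-like nor mostly English."""
    return not looks_like_code(text) and not is_mostly_ascii(text)
```

Instructions like “fix this English grammar” or “debug this function” lose their meaning once translated, which is why such rows are dropped before release.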

Telugu LLM Labs is also committed to continuously finetuning and training open-source models such as Llama 2, Mistral, and TinyLlama. By leveraging the newly released Telugu translation and transliteration datasets, the initiative aims to elevate these models’ capabilities and unlock new horizons in Telugu NLP.
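Instruction data of this kind is typically flattened into a single prompt string before finetuning. Assuming the usual Alpaca-style schema of instruction/input/output fields (an assumption, since the article does not show the record format or the exact template Telugu LLM Labs uses), the formatting step can be sketched as:

```python
# Assumed Alpaca-style prompt templates; the exact wording used in
# Telugu LLM Labs' finetuning runs is not specified in the article.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def to_training_text(record: dict) -> str:
    """Flatten one instruction record into a single finetuning string."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(**record)
    return PROMPT_NO_INPUT.format(instruction=record["instruction"],
                                  output=record["output"])
```

Strings produced this way are then tokenized and fed to the base model (Llama 2, Mistral, or TinyLlama) during supervised finetuning.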

Conclusion:

The launch of Telugu LLM Labs and their comprehensive approach to enhancing Telugu Natural Language Processing with both native and romanized datasets is a significant development. With a vast community of 100 million Telugu speakers worldwide, this initiative holds the potential to foster innovation and drive growth in the Telugu NLP market. The availability of specialized datasets and the commitment to refining open-source models will likely lead to the emergence of new applications and services catering to this substantial linguistic demographic. Businesses in the NLP sector should keep a keen eye on Telugu LLM Labs’ progress as it could open up valuable opportunities in this niche market.
