- OpenAI’s claim on the necessity of copyrighted materials for training top AI models is questioned.
- Common Corpus initiative emerges as a vast public domain dataset for training LLMs.
- Fairly Trained certifies the KL3M model by 273 Ventures as trained without copyright infringement.
- Kelvin Legal DataPack, curated by 273 Ventures, offers valuable legal documents for AI training.
- Common Corpus and KL3M signify a shift towards fairer AI practices, challenging existing norms.
- Fairly Trained extends certifications beyond LLMs, showcasing a broader scope for AI certification.
- Limitations of the Kelvin Legal DataPack due to outdated public domain data noted.
Main AI News:
In the ever-evolving realm of Artificial Intelligence, the debate surrounding the necessity of copyrighted materials in training cutting-edge AI models has been longstanding. OpenAI’s bold declaration to the UK Parliament in 2023, asserting the impossibility of training such models without incorporating copyrighted content, reverberated across the industry, inciting legal disputes and ethical dilemmas. Nevertheless, recent advancements have cast doubt on this conventional wisdom, presenting compelling evidence that large language models can indeed be trained without the contentious use of copyrighted materials.
Enter the Common Corpus initiative, billed as the largest public domain dataset for training Large Language Models (LLMs). Spearheaded by Pleias and drawing together experts in LLM pretraining, AI ethics, and cultural heritage, this international endeavor challenges the status quo. The globally diverse and multilingual dataset demonstrates that LLMs can be trained without the encumbrance of copyright concerns, marking a notable shift in the landscape of AI development.
Fairly Trained, a nonprofit within the AI industry, has taken a decisive stride toward fostering fairer AI practices by awarding its first certification for an LLM trained without copyright infringement. The recipient is the KL3M model, built by the Chicago-based legal tech consultancy startup 273 Ventures. The stringent certification process, under the stewardship of Fairly Trained’s CEO, Ed Newton-Rex, instills confidence in the potential for fair AI, with Newton-Rex affirming that training an LLM fairly is indeed viable.
The Kelvin Legal DataPack, curated by 273 Ventures, encompasses thousands of legal documents vetted to comply with copyright law. At approximately 350 billion tokens, the dataset is smaller than those compiled by entities such as OpenAI, yet it serves as a testament to the potency of curation, and the performance of the model trained on it is reported to be strong. Jillian Bommarito, cofounder of 273 Ventures, attributes the success of the KL3M model to the rigorous vetting process applied to the data. Carefully curated datasets like the Kelvin Legal DataPack can help tailor AI models precisely to their intended applications. 273 Ventures has now opened a waitlist for clients seeking access to the dataset.
Researchers involved in the Common Corpus initiative assembled a text corpus comparable in size to the one used to train OpenAI’s GPT-3 model and made it accessible via the open-source AI platform Hugging Face. Although KL3M, developed by 273 Ventures, is so far the only LLM that Fairly Trained has certified, the emergence of initiatives like the Common Corpus and the KL3M model signals a shift in the AI arena. Advocates for ethical AI, particularly those speaking for artists adversely affected by data scraping, see these initiatives as pivotal in challenging prevailing norms. Fairly Trained’s recent certifications, which extend beyond LLMs to include the Spanish voice-modulation startup Voicemod and the heavy-metal AI band Frostbite Orckings, hint at a broader horizon for AI certification.
While the Kelvin Legal DataPack, a creation of 273 Ventures, has undeniable merits, it is not without limitations. Though it comprises thousands of legal documents vetted for copyright compliance, a significant portion of public domain data is antiquated, particularly in jurisdictions such as the US, where copyright protection generally lasts for 70 years after the author’s death. Consequently, a dataset of this kind may not be suitable for grounding an AI model in contemporary affairs.
Conclusion:
The emergence of initiatives like the Common Corpus and the certification of models like KL3M by Fairly Trained mark a significant shift towards fairer AI practices. This trend challenges existing norms and emphasizes the importance of ethical considerations in AI development. Companies in the AI market must adapt to these changes by prioritizing ethical AI practices, ensuring compliance with copyright laws, and considering the implications of utilizing outdated datasets. Failure to do so could result in reputational damage and legal challenges, while embracing ethical AI practices presents opportunities for innovation and positive societal impact.