TL;DR:
- Karya, a startup based in India, is revolutionizing data collection for AI in non-English languages.
- Preethi P., a Karya worker, earns significantly more than her previous job as a tailor by collecting text data in her native Kannada language.
- Karya offers fair compensation to a predominantly female workforce in rural areas, differentiating itself by paying up to 20 times the minimum wage.
- Microsoft, Google, and the Bill & Melinda Gates Foundation are partnering with Karya to source high-quality data for AI products.
- The linguistic diversity gap in AI is being addressed, with Google planning to develop a generative AI model for 125 Indian languages.
- Karya’s social impact extends to poverty reduction, with the startup focusing on broadening language representation and fair compensation for workers.
Main AI News:
In the tranquil village of Agara, nestled amidst rice paddies and groundnut fields, Preethi P. lives in a one-room dwelling. Normally, her days are spent mending clothes, earning less than a dollar for her labor. Today is different: she is recording sentences in her native Kannada language through a mobile app. Preethi, who goes by a single name, as is common in the region, is one of 70 individuals hired by Karya, a startup focused on amassing text, voice, and image data in India’s vernacular languages. She belongs to a largely unseen global workforce, operating across countries like India, Kenya, and the Philippines, that plays a pivotal role in collecting and labeling the data that fuels AI chatbots and virtual assistants. Unlike most of that workforce, Preethi receives fair compensation for her work.
After just three days with Karya, Preethi earned 4,500 rupees ($54), a sum four times her usual monthly income as a tailor. This newfound income allows her to make monthly payments on a loan taken to repair her home’s crumbling mud walls, adorned with vibrant saris. She attests, “All I need is a phone and the internet.”
Karya, founded in 2021, has found its services in high demand, particularly with the recent surge in generative AI technology. India alone is projected to have nearly one million data annotation workers by 2030, according to Nasscom, the nation’s tech industry trade body. What sets Karya apart from other data vendors is its commitment to pay its predominantly female contractors in rural communities up to 20 times the prevailing minimum wage, on the promise that it will deliver higher-quality Indian-language data that tech giants are willing to pay more for.
Manu Chopra, the 27-year-old Stanford-educated computer engineer behind Karya, emphasizes, “Every year, big tech companies spend billions of dollars collecting training data for their AI and machine learning models. Poor pay for such work is an industry failure.”
Silicon Valley, which has for years outsourced data labeling and content moderation to low-cost overseas contractors, is now turning to Karya to address one of the most significant challenges facing its AI products: securing high-quality data to serve billions of potential non-English-speaking users. These collaborations have the potential to reshape the data industry’s economics and Silicon Valley’s relationship with data providers.
Notably, Microsoft Corp. has engaged Karya to source local speech data for its AI products, while the Bill & Melinda Gates Foundation is working with Karya to reduce gender biases in the data underpinning AI chatbots. Alphabet Inc.’s Google is also relying on Karya and other local partners to gather speech data across 85 Indian districts, with plans to expand to every district and develop a generative AI model for 125 Indian languages.
Many AI services have been predominantly developed using English-language internet data, leading to inadequate representation of languages used by internet users in other countries, particularly in India, where nearly one billion potential users are seeking AI-powered solutions in various domains. Addressing this linguistic gap is crucial to ensuring that AI systems do not perpetuate harmful stereotypes or produce misinformation.
Karya, a grant-supported social impact startup headquartered in Bangalore, has broadened language representation by specifically targeting rural workers who might otherwise be overlooked for such tasks. Karya’s app functions offline and offers voice support for those with limited literacy. It has attracted over 32,000 crowdsourced workers in India, who have completed millions of paid digital tasks, including image recognition, contour alignments, video annotation, and speech annotation.
For Chopra, the mission extends beyond data supply to poverty alleviation. Growing up in an impoverished neighborhood in West Delhi, he has made it his life’s goal to leverage technology for poverty reduction. “It takes a mere $1,500 in savings to make an Indian eligible to enter the middle class,” says Chopra. “But the impoverished can take 200 years to reach that level of savings.”
Chopra founded Karya after recognizing that Microsoft was spending substantial sums on speech data of subpar quality. Karya has amassed 10,000 hours of Marathi speech data for Microsoft’s AI services. “Tech companies want the data, accent and all,” Chopra notes. “You cough, they want that in the speech – it represents natural language.”
Researchers, including Saikat Guha at Microsoft Research India, have acknowledged the superior quality of Karya’s data, emphasizing the positive impact of fair compensation on workers’ dedication and the resulting data quality.
Karya’s reach is not confined to India, as it explores opportunities to offer its platform as a service to organizations in Africa and South America, opening the door to similar transformative work.
For now, in the village of Yelandur, south of Bangalore, many young, school-educated women eagerly anticipate Karya’s next project: transcribing Kannada audio recordings. Among them is Shambhavi S., who, while caring for her family, has found a way to earn an income and educate her children, driven by a desire for a brighter future.
Conclusion:
Karya’s disruptive approach to data collection for AI, focusing on fair compensation and linguistic diversity, is reshaping the market dynamics. Tech giants’ partnerships with Karya signify a shift towards prioritizing high-quality data to cater to non-English speaking users, potentially leading to more inclusive and effective AI products in the global market.