
India's Language Diversity: A Major Hurdle for Its Ambitious AI Dreams
2025-05-26
Author: John Tan
NEW DELHI: India is on an ambitious quest to create its own large language model aimed at rivaling OpenAI's famed ChatGPT. However, the nation’s vast array of languages and dialects presents a unique set of challenges.
A Linguistic Labyrinth
With 22 officially recognized languages and over 10,000 local dialects, India's linguistic landscape is as rich as it is complex. Languages such as Marathi share roots with Hindi and Gujarati, while southern languages like Kannada, Telugu, Tamil, and Malayalam differ markedly from them, adding layers of complexity to the training of AI models.
Digital Divide in Language Content
A significant challenge facing BharatGen, a government-funded consortium, is the stark lack of available online content in Indian languages. While English dominates the internet, accounting for about half of online data, Indian languages make up a mere 1 percent. Many literary texts have yet to be digitized, and a wealth of cultural knowledge is still passed on by tradition and remains largely untapped in the digital realm.
A Silver Lining
On the bright side, experts believe this rich diversity might help develop AI models that are less biased. Ganesh Ramakrishnan, a professor at the Indian Institute of Technology Bombay, is actively collaborating with various organizations to gather data in local languages, aiming to digitize and incorporate it into the foundational AI models.
Small Business Struggles with Existing AI
Many small business owners have faced hurdles with existing AI solutions. Ghooran Yadav, a food cart vendor in New Delhi, recounted his experience using ChatGPT to get recipes. Although the model understood his query in Bhojpuri, it responded in Hindi, leaving him less than satisfied. Yadav believes that a locally developed AI tool would offer much more accurate and relevant information.
Transforming Agriculture Through Technology
BharatGen is also focused on leveraging generative AI for practical solutions, like aiding farmers through the Krishi Saathi app. Powered by BharatGen's Hindi language model, this app provides critical information on crop health and pest management and can even translate text into local languages. For those unable to read or write, voice communication capabilities make it accessible to all.
A Vision for Inclusive AI
"Making sure that even the most remote regions benefit from AI is central to our mission," states Ramakrishnan. The technology can mimic a speaker’s voice and tone, allowing for more natural interactions with users.
Pioneering Language AI in India
BharatGen is one of five major language-based AI initiatives driven by Indian Prime Minister Narendra Modi’s government, and it has already rolled out 19 language models since its inception last year. This is just the beginning of India’s exciting AI journey!