An epistemic case for 'low-resource' languages in AI

Despite having millions of speakers, languages like Hindi, Swahili, and Tamil remain underrepresented in AI models, resulting in subpar user experiences and limited access to culturally relevant knowledge. In this article, I explore the notion of “low-resource” languages, arguing that their marginalization is not due to a lack of linguistic value but rather a historical and epistemic failure of the digital ecosystem. Drawing from examples in internet history, agriculture, and education, I explore how the dominance of English and other colonial languages in digital content has entrenched linguistic and epistemic hierarchies. I argue that this underrepresentation risks not just usability issues but the erasure of centuries-old knowledge systems embedded in vernacular languages. As generative AI becomes a dominant mode of knowledge access, it may accelerate “knowledge collapse”—the narrowing of publicly available knowledge and perspectives. To counter this, I advocate for designing AI not merely as a knowledge oracle but as a participant and learner in diverse epistemic ecosystems. Preserving linguistic diversity in AI is not only a technical or inclusion imperative but a moral and intellectual one—critical to safeguarding the plurality of human understanding in an increasingly AI-mediated world.


Recently, I was watching a funny video on YouTube of the popular Punjabi musician Diljit Dosanjh interacting with Alexa (highly recommend it for some laughs). Speaking in Hinglish—a blend of Hindi and English—Diljit playfully tried to get Alexa to understand his accent and follow his requests. While he turned his frustration into a moment of entertainment that garnered millions of views, the underlying challenge he faced is far from unique. Millions of speakers of languages such as Hindi, Swahili, and Tamil encounter similar issues with AI tools, which often struggle with these languages: they produce lower-quality or even biased responses, have difficulty understanding vernacular input, and fail to reflect cultural nuances. This creates a notable gap in usability, especially for tasks like translating content, explaining context-sensitive ideas, or engaging meaningfully with localized knowledge systems. The underlying issue is that despite being spoken by large populations across the world, many languages remain underrepresented in AI models. For simplicity, I’ll use the term “AI models” to refer specifically to the family of generative models—such as language models and multimodal models—excluding predictive models.

In the computing world, these languages are often classified as "low-resource" languages. But “low-resource” doesn’t mean there’s a lack of speakers or of rich knowledge in these languages. Instead, it refers to the scarcity of digitized content available in them—the kind of material that is used as fodder for training AI models. While it's true that this digital gap results in poorer user experiences for speakers of these languages, the consequences run far deeper. The underrepresentation of low-resource languages in AI is not just a usability issue; it is an epistemic concern. When AI systems fail to incorporate knowledge encoded in these languages, we risk the gradual erasure of vast reservoirs of human understanding—knowledge, culture, and literature developed over centuries. In this article, I explore how even widely spoken languages can become “low-resource” in the digital age—and why that should concern anyone who values collective human knowledge.

Let's begin with the fundamental question: why are languages such as Hindi and Bahasa Indonesia considered low-resource despite having hundreds of millions of speakers? The answer lies in the history of the internet’s development. Beginning with ARPANET—a network connecting research institutions, funded by the US Department of Defense—in the 1960s, the internet emerged and expanded primarily in the West. A language’s digital prominence is closely linked to how early and how widely the internet was adopted by the communities that speak it. English became the de facto language of the internet because its early developers and users were predominantly English speakers. Globalization further entrenched this trend, driving demand for common languages in international communication. As a result, English dominates the web. At the time of writing, English accounts for 44% of the Common Crawl dataset[1]—the largest publicly available collection of content crawled from across the web, and a major source of training data for AI models. The second most common language in the dataset is Russian, at just 6%, highlighting the vast disparity even between the top two languages. Hindi has the third-largest speaker base in the world, with over 609 million speakers[2], yet it makes up just 0.2% of the Common Crawl dataset—significantly less than German, French, Italian, and Portuguese, despite Hindi’s much larger speaker base. This stark imbalance points to a deep digital divide in how languages are represented online and in AI training data, and it can be traced back to historical power and knowledge structures. English speakers dominated the early internet, and computing systems have long been better suited to English, making it easier to create, store, and distribute content in that language.
Colonial languages more broadly also had a head start in digital adoption, shaped by socio-economic factors that continue to influence today’s web. A recent study categorized 88% of the world’s languages as low-resource on the internet, while only 25 languages qualify as high-resource[3]. This skews the distribution of knowledge on the internet, since languages serve as vessels for knowledge. Knowledge encoded in English and other dominant languages (aka high-resource languages) has consistently flowed into the digital space more freely than knowledge in most other languages. Additionally, the lack of digital infrastructure and accessible input tools for vernacular languages has made it difficult for their speakers to participate actively online. As a result, a vast amount of knowledge encoded in these languages never made it onto the internet, rendering them 'low-resource'.
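To make the disparity concrete, here is a small sketch that compares a language’s share of Common Crawl text with its share of world speakers. The crawl shares are the figures cited above; the speaker counts and world population are rough, assumed estimates, so the ratios are illustrative only.

```python
# Illustration of the web-representation gap: a language's share of
# Common Crawl text divided by its share of world speakers.
# Speaker counts are approximate totals (an assumption, not from the
# crawl data); crawl shares are the percentages cited in the text.
WORLD_POP = 8_000_000_000

languages = {
    # name: (approx. total speakers, share of Common Crawl text)
    "English": (1_500_000_000, 0.44),
    "Russian": (255_000_000, 0.06),
    "Hindi":   (609_000_000, 0.002),
}

def representation_ratio(speakers: int, crawl_share: float) -> float:
    """Ratio > 1 means over-represented online relative to speaker base."""
    speaker_share = speakers / WORLD_POP
    return crawl_share / speaker_share

for name, (speakers, crawl) in languages.items():
    print(f"{name:8s} ratio = {representation_ratio(speakers, crawl):.2f}")
```

Under these assumptions English comes out several times over-represented relative to its speaker base, while Hindi's ratio falls well below one—two orders of magnitude apart.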

The web also has a clear participation asymmetry, which further deepens this divide. Participation asymmetry refers to a dynamic where a small fraction of users generates a disproportionately large share of the content. Wikipedia is a prime example—77% of its articles are written by just 1% of its editors, who are overwhelmingly male[4]. Similar trends appear across platforms like Reddit and Facebook, where the 90-9-1 rule of participation inequality holds: 90% of users mostly consume content, 9% contribute occasionally, and just 1% create the bulk of what’s online[5].

Because English speakers were the primary users during the internet’s formative years—when much of today’s foundational content was created—languages other than English were left trailing. This early imbalance, compounded by technological barriers faced by vernacular speakers, helps explain why many widely spoken languages are still considered “low-resource” in the digital world, despite their large populations.

But one might still ask: why is this actually a problem? Doesn’t the internet already contain enormous knowledge about the world? While the internet does hold a vast amount of information, it certainly doesn’t hold everything. Human beings have been accumulating knowledge for millennia, across languages and civilizations. Modern English, for instance, traces its roots only as far back as the 15th to 17th centuries. Yet long before that, humans were creating, refining, and transmitting knowledge in many other languages. Much of this knowledge stems from non-Western civilizations that thrived before the rise of the Eurocentric, industrial era. Scholars have argued that the global dominance of English is a legacy of colonialism[6], and view its spread as a form of linguistic imperialism[7]—a process that marginalized and delegitimized knowledge embedded in indigenous and vernacular languages. Social scientists have long called for dismantling these epistemological hierarchies[8]: hierarchies that position Western knowledge systems as inherently superior.

Relying solely on a narrow set of knowledge systems—especially in scientific decision-making—poses serious risks. Agriculture offers a stark example. The overreliance on high-yield seed varieties developed in distant, Western laboratories—often disconnected from local ecological contexts—has led to a sharp decline in crop genetic diversity. This, in turn, has made food systems increasingly vulnerable to disease and environmental stress. During India’s Green Revolution, traditional millets and pulses that had nourished communities for centuries were sidelined in favor of lab-bred wheat and rice. The outcome? Soil degradation, increased pest outbreaks, and a loss of dietary diversity. In the Philippines, the widespread adoption of IR8 “miracle rice” displaced indigenous rice varieties, weakening natural resistance to pests and diseases. In Latin America, the dominance of the commercially attractive Cavendish banana has left the crop highly susceptible to Panama disease due to its genetic uniformity.

These examples highlight how sidelining traditional knowledge systems in favor of dominant, centralized forms of "scientific" knowledge can backfire—often with long-term ecological consequences. They also point to a deeper issue: when traditional knowledge is unavailable, undervalued, or excluded from formal knowledge infrastructures, scientific decisions are made with critical blind spots. This challenge is particularly urgent today, in an era when global crises like climate change demand epistemic diversity and deeper ecological understanding. Indigenous languages, for instance, often encode centuries of environmental knowledge—insights that could inform more sustainable, context-sensitive practices. Activists and scholars have long called for integrating such knowledge into environmental policy and planning. Consider the concept of One Health, which is now gaining traction in global health discourse. It emphasizes the interconnectedness of human, animal, and environmental well-being[9]. Yet, for many so-called “primitive” communities—such as the Quechua in South America or the Mijikenda in East Africa—this interconnected worldview has always been part of everyday life. For them, it isn’t a theoretical framework; it’s a way of being.

Research continues to uncover the richness of knowledge embedded in non-dominant knowledge systems—wisdom that could benefit all of humanity. From the survival expertise of the Anasazi in harsh climates, to the bureaucratic ingenuity of the Incas, and the advanced maritime knowledge of Pacific Island navigators, countless examples show that critical scientific and cultural insights exist well beyond the bounds of dominant cultures[10]. Even today, researchers are trying to decipher the secrets of Roman concrete, which has lasted for centuries, and to recreate Damascus steel, the technique for which has been lost over time.

This is not to claim that traditional knowledge systems are categorically superior to contemporary scientific ones. Rather, it's a call to recognize the richness, relevance, and validity of knowledge systems that have long been forgotten. Knowledge, as scholars point out, is not static—it shifts over time and across individuals. I can attest to this firsthand. My dad, for example, knows the nooks and corners of Chennai, one of India’s major cities, without ever needing a map. In contrast, I can barely navigate the next street without Google Maps—but hey, I can use Google Maps better than he can.

Another important dimension to note is that knowledge is frequently linguistically unique—existing only in a single language and absent from others. One study on medicinal plants found that over 75% of the 12,495 distinct plant names were unique to just one language[11]. Another study highlights how vocabulary depth reflects cultural priorities: Mongolian contains an extensive lexicon related to horses, while Inuit languages have highly nuanced terms for snow[12]. These linguistic distinctions aren’t just trivia—they signal how languages serve as vessels for knowledge that can’t always be translated or transferred easily. Much of this knowledge remains undigitized, trapped in languages that are underrepresented online.

The bright side is that more people from diverse backgrounds are joining the digital world every day, and growing efforts are underway to support vernacular language use online. These emerging initiatives represent a vital step toward preserving and democratizing the full spectrum of human knowledge.

So, one might wonder: given that more people are coming online every day, won’t this lead to the eventual digitization of all information—including that found in low-resource languages? While it's true that the internet's user base is becoming more linguistically diverse, we’re also witnessing a major shift in how people access and share knowledge—driven by the rise of generative AI.

Setting aside the issue of low-resource languages for a moment, consider the idea of Knowledge Collapse, as proposed by Andrew Peterson[13]. Peterson defines it as the progressive narrowing of the information available to humans over time, along with a parallel decline in how people perceive the usefulness or even existence of different sets of information. Generative AI models may inadvertently accelerate this phenomenon due to how they are trained and how they generate responses. AI models tend to reproduce information from the "center" of their distribution—if we imagine knowledge in AI as a bell curve, the more commonly a piece of information appears in the training data, the more likely it is to be reproduced. This means that lesser-known or niche knowledge—information in the "long tail"—may gradually fade from view due to lack of retrieval and use. Over time, these underrepresented perspectives risk being forgotten or becoming even more marginalized.
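This dynamic can be illustrated with a toy simulation (all parameters hypothetical): start with a Zipf-like long-tailed distribution over "facts", let a model generate a finite sample, and train the next generation only on what was generated. Tail facts that happen not to be sampled disappear permanently, so the tail's share of the distribution shrinks generation after generation.

```python
import random
from collections import Counter

# Toy model of knowledge collapse: 1,000 "facts" with a Zipf-like
# long tail. Each generation, a model (1) emits a finite sample and
# (2) the next model is "trained" on those samples alone, so any fact
# that was never emitted vanishes for good.
random.seed(0)
FACTS = list(range(1000))
weights = [1 / (rank + 1) for rank in FACTS]  # long-tailed distribution

def tail_mass(w, head=50):
    """Probability mass held by everything outside the 50 most common facts."""
    total = sum(w)
    return sum(sorted(w, reverse=True)[head:]) / total

for generation in range(5):
    print(f"gen {generation}: tail mass = {tail_mass(weights):.3f}")
    sample = random.choices(FACTS, weights=weights, k=2000)
    counts = Counter(sample)
    # The next generation only knows what was actually generated.
    weights = [counts.get(fact, 0) for fact in FACTS]
```

Running this, the tail's share of the distribution drops steadily with each generation—a crude but telling analogue of niche knowledge fading from view through disuse.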

But Knowledge Collapse doesn’t just stem from how AI models generate information—it also reflects how humans are changing the way they seek and share information. Increasingly, people are turning to AI chatbots for answers instead of consulting other humans. For instance, platforms like Stack Overflow—once thriving hubs for peer-to-peer knowledge exchange—have experienced sharp declines in web traffic since the rise of public AI tools[14]. This decline isn’t just about page views; it signals a shift in the culture of knowledge sharing. Platforms like Stack Overflow are built on a social ethos of ex-post reciprocity—users contribute answers because they themselves once benefited from others' help[15]. This mutual exchange creates a rich, evolving public knowledge base. But when people consult AI instead of humans, this social loop breaks. There’s no one to thank, no incentive to share insights back with a broader community. This shift could have far-reaching implications. If fewer people contribute their experiences, corrections, or context-specific solutions in public forums, we risk losing the continuous, community-driven process that brings new knowledge into the digital sphere. Vernacular insights, local expertise, and real-time problem-solving—things AI systems can’t autonomously generate—may disappear from the public domain. In effect, the widespread reliance on AI could lead not to a democratization of knowledge, but to its gradual narrowing.

Moreover, AI is being increasingly embedded in education. Across the globe—not only in the West—companies are developing AI tutors to deliver personalized learning. While this has great potential, it also poses risks. There’s a Tamil saying: "Sattila irukathu thaan agappaila varum," which loosely means, “What’s in the pot is what comes out in the ladle.” It’s often used to call out people who speak without substance—and it applies to AI too. These models can only produce content based on the information they’ve been trained on. As noted earlier, AI models currently rely heavily on dominant and homogeneous knowledge sources, creating a danger of epistemic homogenization. However, this isn't a new concern pertaining only to AI. In Words to Play With, V. S. Naipaul highlighted how colonized communities were made to read English literature that bore little resemblance to their own lives. In the colonial education system, students were taught colonial knowledge systems to serve the colonizers rather than learning and building on the local knowledge that could serve their own communities better. The remnants of the colonial education system still persist. Although many postcolonial nations have achieved political independence, epistemic independence remains elusive. AI amplifies this dynamic. If AI, with its current homogeneity, becomes the default medium for learning and information access while drawing from a narrow set of linguistic and cultural sources, it could further marginalize local languages and knowledge systems.

As homogenized AI spreads, the risk of language attrition—where speakers gradually lose fluency and cultural context in their native tongue—increases. This is especially troubling given recent research[16] predicting that 1,500 languages may go extinct by the end of this century. Formal education, while beneficial in many ways, has also been linked to higher risks of language endangerment due to standardization. Standardization comes at the cost of diversity, leading to homogeneity. As noted earlier, just as a crop's resilience is tied to its genetic diversity, the strength of our intellectual future relies on the richness and variety of our knowledge systems. Therefore, we carry a pragmatic and moral responsibility to preserve the world’s linguistic and epistemic diversity to ensure a resilient future. If we fail to digitize the knowledge embedded in underrepresented languages, we risk losing an immense store of diverse human knowledge accrued over centuries.

Having grown up in a colonized country, I’ve come to recognize how the delegitimization of non-dominant knowledge systems plays out subtly in everyday life. Certain cultures are still perceived as superior, and many people—consciously or not—strive to align themselves with these dominant cultures, often at the expense of their own. This quiet erosion leads to the marginalization and eventual disappearance of local knowledge systems. Truly acknowledging that valuable knowledge exists beyond dominant languages begins with intellectual humility. Only by accepting that what we know is partial—and that other worldviews, expressed in other languages, hold equally valuable insights—can we start designing systems that genuinely aim to capture and preserve this richness.

In a recent conversation with a group of Tamil-speaking school students in my village, I asked them what skills would be needed to build an AI model in Tamil. Among several thoughtful answers, one student said that a key skill would be translating from English to Tamil—because, in their words, “all the knowledge is now present only in English.” That response struck me deeply. It reflected the broader epistemic hierarchies we’ve already internalized: that English is where knowledge lives, and other languages exist only to catch up.

As a pragmatic researcher, I won’t claim that Tamil today holds more—or even equal—digital knowledge compared to English. Over the past century alone, vast amounts of scientific, technical, and institutional knowledge have accumulated in English. But what I would argue is this: languages like Tamil contain forms of knowledge that have not yet been encoded in English. They embody insights, metaphors, philosophies, ecological practices, and everyday wisdom that remain digitally invisible. Therefore, the goal of developing AI technologies for low-resource languages shouldn’t be limited to improving user experience for their speakers. It must also include a conscious effort to import the knowledge encoded in these languages—knowledge that is currently inaccessible to the dominant digital sphere.

Today’s digital systems often cast AI as a producer or dispenser of knowledge. But we should instead design AI as a participant—and more importantly, as a learner. We need digital ecosystems that empower vernacular users to share what they know, in their own languages, in ways that both people and machines can learn from. In such a system, humans remain the primary holders and stewards of knowledge. We don’t need AI chatbots dispensing agricultural advice to rural Indian farmers. What we need are collaborative platforms where those farmers can exchange insights with one another—platforms where AI supports this exchange by capturing, contextualizing, and amplifying their knowledge. In this future, AI is not an oracle. It is a facilitator.

If we continue to overlook "low-resource" languages, we do more than exclude their speakers—we risk losing vast reservoirs of knowledge in our pursuit of so-called superintelligence.


  1. https://commoncrawl.github.io/cc-crawl-statistics/plots/languages ↩︎

  2. https://www.ethnologue.com/insights/ethnologue200/ ↩︎

  3. Language Technologies for Low Resource Languages - Sociolinguistic and Multilingual Insights ↩︎

  4. https://www.vice.com/en/article/wikipedia-editors-elite-diversity-foundation/ ↩︎

  5. https://www.nngroup.com/articles/participation-inequality/ ↩︎

  6. Guo, Yan, and Gulbahar H. Beckett. "The hegemony of English as a global language: A critical analysis." Education and Social Development. Brill, 2008. 57-69. ↩︎

  7. Phillipson, Robert. Linguistic imperialism continued. Routledge, 2013. ↩︎

  8. Numerous studies highlight how indigenous knowledge systems have been profoundly shaped by colonialism, wherein Western thought is often imposed and regarded as superior due to cultural dominance ↩︎

  9. https://www.who.int/health-topics/one-health#tab=tab_1 ↩︎

  10. Watson-Verran, Helen, David Turnbull, and Sheila Jasanoff. "Science and other indigenous knowledge systems." Knowledge: Critical concepts, Routledge, London (2005): 345-369. ↩︎

  11. Cámara-Leret, Rodrigo, and Jordi Bascompte. "Language extinction triggers the loss of unique medicinal knowledge." Proceedings of the National Academy of Sciences 118.24 (2021): e2103683118. ↩︎

  12. Khishigsuren, Temuulen, Terry Regier, Ekaterina Vylomova, and Charles Kemp. 2025. “A Computational Analysis of Lexical Elaboration across Languages.” Proceedings of the National Academy of Sciences 122 (15): e2417304122. https://doi.org/10.1073/pnas.2417304122. ↩︎

  13. AI and the problem of knowledge collapse ↩︎

  14. https://www.ericholscher.com/blog/2025/jan/21/stack-overflows-decline/ ↩︎

  15. https://dl.acm.org/doi/10.1145/3371388 ↩︎

  16. https://www.nature.com/articles/s41559-021-01604-y ↩︎