AI and the problem of knowledge collapse

Peterson, Andrew J. 2025. “AI and the Problem of Knowledge Collapse.” AI & SOCIETY, January. https://doi.org/10.1007/s00146-024-02173-x.

Notes

In-text annotations

"We identify conditions under which AI, by reducing the cost of access to certain modes of knowledge, can paradoxically harm public understanding." (Page 1)

"Researchers have noted that the recursive training of AI models on synthetic text may lead to degeneration, known as “model collapse”" (Page 1)

"With increasing integration of LLM-based systems, certain popular sources or beliefs which were common in the training data may come to be reinforced in the public mindset (and within the training data), while other “long-tail” ideas are neglected and eventually forgotten." (Page 2)

"To the extent AI can radically discount the cost of access to certain kinds of information, it may further generate harm through the “streetlight effect”, in which a disproportionate amount of search is done under the lighted area not because it is more likely to contain one’s keys but because it’s easier to look there." (Page 2)

"We identify a dynamic whereby AI, despite only reducing the cost of access to certain kinds of information, may lead to “knowledge collapse,” neglecting the long-tails of knowledge and creating an degenerately narrow perspective over generations." (Page 2)

"In a more recent twist, (Sharma et al. 2024) find that LLM-powered search may generate more selective exposure bias and polarization by reinforcing pre-existing opinions based on finer-grained clues in the user’s queries." (Page 3)

"The role of self-selection into communities and recommendation algorithms provides a explanation for why this might not be the case." (Page 3)

"Finally, while much of the focus is naturally on overt racial and gender biases, there may also be pervasive but less observable biases in the content and form of the output. For example, current LLMs trained on large amounts of English text may ‘rely on’ English in their latent representations, as if a kind of reference language (Wendler et al. 2024)." (Page 5)

"In other domains, however, it is less clear, especially within regions. Historically, knowledge has not progressed monotonically, as evidenced by the fall of the Western Roman empire, the destruction of the House of Wisdom in Baghdad and subsequent decline of the Abbasid Empire after 1258, or the collapse of the Mayan civilization in the 8th or 9th century. Or, to cite specific examples, the ancient Romans had a recipe for concrete that was subsequently lost, and despite progress we have not yet re-discovered the secrets of its durability (Seymour et al. 2023), and similarly for Damascus steel (Kürnsteiner et al. 2020). Culturally, there are many languages, cultural and artistic practices, and religious beliefs that were once held by communities of humans which are now lost in that they do not exist among any known sources" (Page 5)

"For example, traditional hunter-gatherers could identify thousands of different plants and knew their medicinal usages, whereas most humans today only know a few dozen plants and whether they can be purchased in a grocery store." (Page 5)

We don't know that either :P
"Informally,2 we define knowledge collapse as the progressive narrowing over time (or over technological representations) of the set of information available to humans, along with a concomitant narrowing in the perceived availability and utility of different sets of information. The latter is important because for many purposes it is not sufficient for their to exist a capability to, for example, go to an archive to look up some information. If all members deem it too costly or not worthwhile to seek out some information, that theoretically available information is neglected and useless." (Page 5)

"We model knowledge as a process of approximating a (Students t) probability distribution 3. This is simply a metaphor, although it has parallels for example in the analysis of model collapse (Shumailov et al. 2023), but we make no claim that “truth” is in some deep way distributed 1-D Gaussian. This is a modeling assumption in order to work with a process with well-known properties, where there is both a large central mass and long-tails, which we take to be in some general way reflective of the nature of knowledge (and of the distribution of training data for LLMs.)" (Page 6)

"Finally, and relatedly, we could consider the extent to which the responses are representative of different cultures or traditions. For example, in the corpus overall (described below) there are 392 mentions of “Martin Seligman”, an American psychologist who has written on happiness and well-being, while only 62 for “Ghazali” (Al-Ghazali), and 52 for “Farabi” (Al-Farabi) two of the most influential Islamic philosophers of all time. Again, whether that should be considered culturally-biased or not may depend on specific use-cases, as some users might prefer a temporal focus on currently-living authors, for example. We argue that at a minimum, however, it is worth trying to measure the representativeness among diverse cultural traditions but also the tendency to mention only a narrow range of individuals even when specifically asked for a diverse list." (Page 11)

"There are many diverse texts that could be included to expand the corpus, but practically, the approach of market-focused participants may be to focus on seeking texts with the lowest marginal cost (conditional on quality). This might exacerbate a reliance on texts that are not representative of the general public, such as if social media texts are easy to collect but not representative of the perspective of people who don’t have access to social media or self-select out of them." (Page 16)