
Unraveling the Mystery: Why Do Language Models Fabricate Information? Insights from New Research
2025-03-28
Author: Jessica Wong
Introduction
One of the most exasperating challenges faced by users of large language models (LLMs) is their tendency to fabricate answers—commonly referred to as "hallucination." This phenomenon occurs when these models generate responses that sound plausible yet lack support from their training data. From a human standpoint, it raises the question: why don't these models simply respond with a definitive "I don't know" instead of spinning tales or providing inaccurate information?
Recent research conducted by Anthropic delves into the inner workings of LLMs, uncovering some of the neural pathways that influence their decision-making processes. While our comprehension of how LLMs internalize information is still incomplete, these findings pave the way for potential advancements in addressing the confabulation issue prevalent in AI systems.
What Happens When a "Known Entity" Isn’t Known?
In May of last year, Anthropic published a pioneering paper that used sparse autoencoders to identify groups of artificial neurons, or "features," that activate when the Claude model encounters concepts ranging from the concrete, like the "Golden Gate Bridge," to the more abstract, like "programming errors." Building on that foundational work, Anthropic's latest research sheds light on how these activated features feed into the decision-making circuits the model uses when crafting responses.
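For readers unfamiliar with the technique, the snippet below is a minimal sketch of a sparse autoencoder in the general style used for interpretability work; it is not Anthropic's implementation, and the dimensions and hyperparameters are purely illustrative. The idea is to reconstruct a model's internal activations through a wide, sparsely activating bottleneck, so that each bottleneck unit tends to correspond to one interpretable feature.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations (illustrative sizes)."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> wide feature space
        self.decoder = nn.Linear(d_features, d_model)   # feature space -> reconstruction

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly-zero codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
activations = torch.randn(64, 512)   # stand-in for a batch of internal activations
recon, feats = sae(activations)

# Training objective: reconstruct the activations well while keeping the
# feature codes sparse (L1 penalty), so individual features stay interpretable.
sparsity_weight = 1e-3
loss = ((recon - activations) ** 2).mean() + sparsity_weight * feats.abs().mean()
loss.backward()
```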
Across two extensive papers, Anthropic offers an in-depth look at Claude's internal processes, including its multilingual capabilities, its susceptibility to certain jailbreak tactics, and the reliability of its so-called "chain of thought" reasoning. Among these insights, the section explaining Claude's "entity recognition and hallucination" process stands out for its clarity and detail.
At the heart of these LLMs lies their basic function: predicting the text most likely to follow a given input. This design has led some critics to dismiss them as "glorified auto-completers." Next-token prediction works well for material that appears frequently in the training data, but it falls short on obscure facts or topics, where the model often fills the gap with a plausible-sounding guess.
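To make the "auto-completer" framing concrete, here is a deliberately tiny, hypothetical next-token predictor built from bigram counts. It has nothing to do with Claude's architecture, but it shows the same failure mode: continuations that are common in the (here, toy) training data are predicted reliably, while an unseen context still forces the model to emit something.

```python
from collections import Counter, defaultdict

# Toy training corpus; real LLMs use trillions of tokens and a neural network,
# but the prediction objective is the same: guess the most likely next token.
corpus = "the golden gate bridge is in san francisco . the golden gate bridge is red".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    counts = bigrams[token]
    if not counts:
        return "<guess>"   # unseen context: a real LLM would still emit its best-scoring token
    return counts.most_common(1)[0][0]

print(predict_next("golden"))   # 'gate'    -- well supported by the data
print(predict_next("batkin"))   # '<guess>' -- no data, so any answer is a fabrication
```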
Fortunately, fine-tuning can alleviate this issue, shaping the model into a more effective assistant that, among other things, declines to answer when it recognizes that it lacks relevant training data. This tuning gives rise to distinct sets of artificial neurons (features) that activate when an entity is "known" (like "Michael Jordan") versus "unfamiliar" (like "Michael Batkin").
Recognition Versus Recall
When a prompt mentions an unfamiliar name, Claude engages an internal "can't answer" circuit that produces cautious refusals such as "I apologize, but I cannot..." When it recognizes a familiar name, however, "known entity" features inhibit that circuit, letting Claude draw on its stored knowledge about the well-known figure and give an informed answer.
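The paper describes this as a circuit in Claude's learned weights; the snippet below is only a schematic re-creation of the described behavior in plain Python. The function names, weights, bias, and feature values are invented for illustration. The point is the logic: the "can't answer" pathway is on by default and is suppressed when a "known entity" feature fires.

```python
def cant_answer_activation(known_entity: float, refusal_bias: float = 1.0,
                           inhibition_weight: float = 2.0) -> float:
    """Hypothetical 'can't answer' unit: active by default, inhibited by recognition."""
    return max(0.0, refusal_bias - inhibition_weight * known_entity)

def respond(name: str, known_entity: float) -> str:
    if cant_answer_activation(known_entity) > 0.5:   # refusal circuit wins
        return f"I apologize, but I cannot tell you about {name}."
    return f"(answers from stored knowledge about {name})"

print(respond("Michael Jordan", known_entity=1.0))   # recognition inhibits the refusal
print(respond("Michael Batkin", known_entity=0.0))   # refusal stays active
```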
Anthropic's findings also highlight a worrying fragility in these networks: by artificially boosting the activation of the features associated with "known answers," the researchers could induce Claude to confidently fabricate information about made-up entities such as "Michael Batkin." This suggests that at least some of Claude's hallucinations stem from a misfire of the very circuit meant to stop it from answering questions it has no real support for.
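Anthropic's intervention operates on Claude's actual internal activations; the sketch below is a generic, hypothetical illustration of the same idea using made-up vectors. A "known answer" direction is added to a hidden-state vector, pushing a simple refusal probe below its threshold so the toy "model" stops refusing. The direction, the probe, and the scaling factor are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Invented stand-ins: a learned "known answer" feature direction and a linear
# probe that signals refusal when that feature is absent.
known_answer_direction = rng.normal(size=d_model)
known_answer_direction /= np.linalg.norm(known_answer_direction)
refusal_probe = -known_answer_direction

# A hidden state for "Michael Batkin": noise with a slightly negative "known answer" component.
noise = rng.normal(size=d_model)
noise -= (noise @ known_answer_direction) * known_answer_direction
hidden_state = noise - 0.5 * known_answer_direction

def refuses(state, threshold=0.0):
    return bool(refusal_probe @ state > threshold)

print(refuses(hidden_state))    # True: no "known answer" signal, so the toy model refuses

# "Boosting" the feature: add a scaled copy of the feature direction to the state.
steered_state = hidden_state + 4.0 * known_answer_direction
print(refuses(steered_state))   # False: refusal suppressed, confident confabulation follows
```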
In one experiment, when asked to name a paper written by AI researcher Andrej Karpathy, Claude confidently offered "ImageNet Classification with Deep Convolutional Neural Networks," a plausible-sounding answer; that paper is real, but it was written by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, not Karpathy. Conversely, when asked about the work of Anthropic mathematician Josh Batson, Claude correctly declined to name a specific paper, saying it could not do so without being able to verify the information.
Researchers speculate that the Karpathy hallucination arises because Claude recognizes his name: the "known entity" features activate and inadvertently suppress the "can't answer" circuit, even though Claude has no reliable knowledge of his specific publications. Freed from that brake, the model generates a plausible answer rather than admitting it does not know.
Conclusion
The ongoing investigation into the intricate mechanics of LLMs is crucial for understanding how these models produce their outputs. However, Anthropic cautions that its current approach only scratches the surface, capturing just a fraction of the computation Claude actually performs and requiring hours of human analysis to decode the circuits involved.
These insights represent early steps towards refining research methods that should illuminate LLMs' confabulation tendencies and, ultimately, provide pathways to enhancing their reliability in generating accurate information. Future advancements could lead us closer to resolving these perplexing hallucinations and turning LLMs into more trustworthy tools. Stay tuned for what this breakthrough research may unveil!