Social reasoning in AI traced to an extremely small set of parameters

A new study reveals that the capacity for social reasoning in large language models, a trait similar to the human “theory of mind,” originates from an exceptionally small and specialized subset of the model’s internal parameters. Researchers found that these few parameters are deeply connected to the mechanisms that allow a model to understand word order and context. The work, published in npj Artificial Intelligence, provides a look into how complex cognitive-like abilities can emerge from the architecture of artificial intelligence.

Theory of mind is the ability to attribute mental states like beliefs, desires, and intentions to oneself and to others. It is what allows a person to understand that someone else might hold a false belief, for example, believing an object is in a box when it has been secretly moved to a drawer. This type of social reasoning is fundamental to human interaction.

In recent years, large language models have demonstrated an apparent ability to solve tasks designed to test this capacity, but the internal processes giving rise to this skill have remained largely opaque. Understanding these mechanics is a key goal for researchers working on making artificial intelligence more transparent and predictable.

This investigation was conducted by a team of researchers from Stanford University, Princeton University, the University of Minnesota, the University of Illinois Urbana-Champaign, and the Stevens Institute of Technology. Their work aimed to move beyond simply testing a model’s performance on social reasoning tasks.

Instead, they sought to identify the specific internal components responsible for this behavior, effectively looking under the hood to see how the machine performs its reasoning. The central questions were which of the billions of parameters in a model are most sensitive to theory-of-mind tasks, and how those parameters shape the model’s computations.

To identify the parameters responsible for theory of mind, the researchers developed a novel method based on a mathematical tool that measures how much the model’s performance changes when a specific parameter is slightly altered. They first calculated this sensitivity for parameters while the model performed theory-of-mind tasks, specifically “false-belief” scenarios.
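
The article does not spell out the exact formulation, but a common way to realize such a sensitivity measure is to accumulate squared gradients of the task loss with respect to each parameter. The sketch below illustrates that idea; the function names and the squared-gradient choice are assumptions for illustration, not necessarily the authors’ published method.

```python
# Minimal sketch of a gradient-based parameter-sensitivity score.
# Assumption (not from the paper): sensitivity is approximated by the
# squared gradient of the task loss with respect to each parameter,
# averaged over a batch of false-belief prompts.
import torch


def sensitivity_scores(model, batches, loss_fn):
    """Return a dict mapping parameter name -> per-element sensitivity."""
    scores = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    for inputs, targets in batches:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Squared gradient: how much a small change in this
                # parameter would move the loss on this task.
                scores[name] += p.grad.detach() ** 2
    return {name: s / len(batches) for name, s in scores.items()}
```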

These tasks test if a model can recognize that an agent’s belief about the world is different from reality. For instance, a model would be presented with a story where a character places an item in one location, and then another character moves it without the first one’s knowledge. The model must correctly predict that the first character will look for the item in its original location.
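
For concreteness, a false-belief item could be posed to a model as a simple completion prompt, as in the hypothetical example below; the wording and format are illustrative, not the study’s actual stimuli.

```python
# Illustrative false-belief item (the wording is an assumption, not the
# paper's exact stimulus). The model should complete with the believed
# location, not the true location.
prompt = (
    "Sally puts her ball in the basket and leaves the room. "
    "While she is away, Anne moves the ball to the drawer. "
    "When Sally returns, she will look for the ball in the"
)
expected_completion = " basket"  # belief location (basket), not reality (drawer)
```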

This initial process identified a set of parameters sensitive to these social reasoning puzzles. However, the team recognized that some of these parameters might also be essential for general language processing. To isolate the ones specifically related to theory of mind, they performed a second sensitivity analysis on a general language modeling task and created a map of parameters vital for basic language functions. By subtracting this general language map from the theory-of-mind map, they were left with a very small, specialized set of parameters primarily dedicated to social reasoning.
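
A minimal sketch of that subtraction step is shown below, assuming the two sensitivity maps have been flattened into same-shaped tensors and that the final selection keeps only the top fraction of the difference; the threshold rule and names are illustrative.

```python
# Sketch of isolating ToM-specific parameters by subtracting a general
# language-modeling sensitivity map (the top-k selection rule is an
# assumption for illustration).
import torch


def tom_specific_mask(tom_scores, lm_scores, top_fraction=1e-5):
    """Keep parameters far more sensitive to ToM tasks than to plain LM."""
    diff = tom_scores - lm_scores                 # ToM map minus language map
    k = max(1, int(top_fraction * diff.numel()))  # e.g. ~0.001% of parameters
    threshold = torch.topk(diff.flatten(), k).values.min()
    return diff >= threshold                      # boolean mask of ToM-sensitive entries
```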

With these “ToM-sensitive” parameters identified, the team conducted a perturbation experiment. They altered the values of this tiny group of parameters, which constituted as little as 0.001% of the model’s total parameters. The effect on the model’s performance was significant.

Across several different language models, this small change caused a substantial drop in their ability to correctly answer theory-of-mind questions. As a control, the researchers also perturbed a randomly selected group of parameters of the same size. This random alteration had almost no effect on performance, indicating that the identified ToM-sensitive parameters have a specialized function.
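
The sketch below shows how such a perturbation and its size-matched random control might be implemented; the noise-injection rule is an assumption, since the article does not state how the parameter values were altered.

```python
# Sketch of the perturbation experiment with a size-matched random control.
# Assumption: the perturbation adds small Gaussian noise to the selected
# entries; only the masked entries are touched.
import torch


@torch.no_grad()
def perturb(params, mask, noise_scale=0.01):
    """Add noise only to the parameter entries selected by `mask`."""
    noise = noise_scale * torch.randn_like(params)
    params[mask] += noise[mask]


@torch.no_grad()
def random_control_mask(mask):
    """Random mask selecting the same number of entries as `mask`."""
    flat = torch.zeros(mask.numel(), dtype=torch.bool)
    idx = torch.randperm(mask.numel())[: int(mask.sum())]
    flat[idx] = True
    return flat.view_as(mask)
```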

The researchers discovered that this performance degradation was not just limited to social reasoning. The models also became worse at tasks requiring contextual localization, which is the ability to understand where a piece of information is located within a longer text. This suggested a link between the model’s ability to reason about mental states and its more fundamental ability to track the position of words and concepts in a sequence. The findings pointed toward the model’s positional encoding system, the architectural component that gives it a sense of word order.

The investigation then turned to how these sensitive parameters interact with the model’s core architecture. Many modern language models use a technique called Rotary Position Embedding, or RoPE, to understand word order. This method encodes the position of a word by applying a rotation to its numerical representation, with different dimensions of the representation rotating at different frequencies.
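
Below is a compact sketch of the standard RoPE computation, in which each pair of dimensions is rotated by an angle proportional to the token’s position, with a different rotation frequency per pair. It follows the commonly published RoPE formulation rather than any model-specific variant.

```python
# Minimal sketch of Rotary Position Embedding (RoPE): each pair of
# dimensions is rotated by an angle that grows with token position,
# at a frequency that depends on the dimension pair.
import torch


def rope(x, base=10000.0):
    """x: (seq_len, dim) with even dim; returns position-rotated vectors."""
    seq_len, dim = x.shape
    # One frequency per dimension pair: later pairs rotate more slowly.
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # standard 2-D rotation, applied pairwise
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```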

The analysis showed that the identified ToM-sensitive parameters were not random; they were precisely aligned with what are known as dominant frequency activations. These are the specific frequencies that the model relies on most heavily to process positional information.
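
The paper’s precise definition of a dominant frequency activation is not given in this summary, but one simple proxy is to rank RoPE dimension pairs by the average magnitude of the activations passing through them, as in the hypothetical sketch below.

```python
# Sketch of ranking RoPE frequency bands by how strongly they are used.
# Assumption: "dominance" is proxied by mean activation magnitude per
# dimension pair of the query vectors; the paper's measure may differ.
import torch


def dominant_frequencies(q, top_k=8):
    """q: (seq_len, dim) query activations; returns indices of the
    top-k dimension pairs by average magnitude."""
    pairs = q.view(q.shape[0], -1, 2)          # group dims into (pair, 2)
    energy = pairs.norm(dim=-1).mean(dim=0)    # mean magnitude per pair
    return torch.topk(energy, top_k).indices   # most-used frequency bands
```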

When the ToM-sensitive parameters were perturbed, these dominant frequency patterns were disrupted. This effectively damaged the model’s internal map of the text, explaining why its ability for contextual localization diminished. The effect was specific to models that use the RoPE system.

In a model from a different family, which uses an alternative method for positional encoding, the same kind of sparse, sensitive parameter pattern was not found. This architectural contrast confirmed that the social reasoning ability in RoPE-based models is tightly coupled with this particular mechanism for handling word order.

The final piece of the puzzle was to trace how this disruption in positional encoding affects the model’s attention mechanism. The attention mechanism is what allows a model to weigh the importance of different words in a text when making a prediction. Many models exhibit a phenomenon known as an “attention sink,” where a significant amount of attention is consistently directed toward the very first token in a sequence. This first token acts as a stable anchor, helping the model organize its processing of the rest of the text.
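
Measuring an attention sink is straightforward once the post-softmax attention weights are available: one can read off how much attention each query position directs at token 0. The helper below is a minimal illustration under that assumption.

```python
# Sketch of measuring the "attention sink": the fraction of attention
# mass each query token places on the first token in the sequence.
import torch


def first_token_attention(attn):
    """attn: (heads, seq_len, seq_len) post-softmax attention weights.
    Returns per-head average attention directed at token 0."""
    return attn[:, :, 0].mean(dim=-1)  # average over query positions
```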

The researchers found that the ToM-sensitive parameters play a role in maintaining the geometric relationship between the vector for the current word being processed and the vector for the first, anchor token. Perturbing these parameters altered the angle between these two vectors, making them more orthogonal, or perpendicular.
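
This geometric shift can be tracked with a cosine-similarity measurement between the two vectors, as in the sketch below; the choice of which query and key vectors to compare is an assumption for illustration.

```python
# Sketch of tracking the angle between the current token's query vector
# and the first token's key vector; per the article, perturbation pushes
# this toward 90 degrees (orthogonality), weakening the attention sink.
import torch
import torch.nn.functional as F


def anchor_angle_degrees(q_current, k_first):
    """Angle, in degrees, between the current query and the anchor key."""
    cos = F.cosine_similarity(q_current, k_first, dim=-1)
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
```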

This change destabilized the attention sink. As a result, the model’s attention, no longer properly anchored, began to scatter to irrelevant parts of the text, such as punctuation. This breakdown in the model’s focus directly impaired its ability to form a coherent understanding of the language, leading to the observed failures in both social reasoning and general comprehension.

While this work provides a mechanistic explanation for theory-of-mind-like abilities in some models, the researchers note certain limitations. The analysis was primarily focused on specific types of false-belief tasks, and future work could explore whether similar parameter patterns govern more nuanced social skills like detecting irony or social faux pas. The findings also suggest that what appears to be a sophisticated cognitive skill may emerge from more fundamental mechanisms related to language structure and context.

The identification of such a localized set of parameters opens up new directions for research. It could lead to more efficient ways to align model behavior with human values or ethical norms. At the same time, it highlights potential vulnerabilities; if social reasoning is concentrated in such a small area, it could be a target for adversarial attacks designed to manipulate a model’s behavior. Understanding these structural underpinnings is a step toward developing artificial intelligence systems that are more transparent, reliable, and better aligned with human social cognition.

The study, “How large language models encode theory-of-mind: a study on sparse parameter patterns,” was authored by Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, and Denghui Zhang.
