Advanced AI models suffer a near-total collapse on a classic psychology test as cognitive demands increase

New research provides evidence that while advanced artificial intelligence models process language with remarkable skill, they struggle significantly with tasks requiring the kind of sustained focus and conflict resolution seen in human attention.

The study, published in PNAS Nexus, indicates that as cognitive demands increase, these programs experience a near-total collapse in their ability to override automatic responses. The findings suggest that artificial intelligence systems currently lack the executive control that the researchers consider necessary for developing true artificial general intelligence.

To understand these findings, it helps to look at how modern artificial intelligence works. Programs like ChatGPT rely on a framework called a transformer architecture. This system uses a specialized attention mechanism that allows the model to assign weight to different parts of a text, predicting which words should come next based on statistical patterns.
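For readers who want to see what "assigning weight" means in practice, here is a minimal sketch of scaled dot-product attention weights, the standard form used in transformers. It is a toy illustration rather than the code of any particular model; the function names and the example vectors are assumptions made for this sketch.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Weight each key by its scaled dot product with the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Illustrative vectors (assumed): the first key points the same way as the query.
example_query = [1.0, 0.0]
example_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
example_weights = attention_weights(example_query, example_keys)
```

Every item receives some positive weight, and the items most similar to the query receive the most.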

Suketu Patel is a doctoral candidate in comparative and cognitive psychology at the Graduate Center of the City University of New York. Patel and his colleagues conducted this research in the laboratory of Jin Fan at Queens College, CUNY. He noted that the initial public reception of modern language models inspired the research team to investigate the software’s true cognitive capabilities.

“When ChatGPT arrived, much of the excitement centered on its capacity for task completion, theory of mind, and emotional intelligence,” Patel said. “Yet it was also prone to hallucination and confabulation. LLM performance was strong on some tasks and surprisingly weak on others. We wanted a canonical attention task to rigorously probe these systems and compare them to biological attention.”

Human attention is a complex process supported by multiple interconnected brain networks. “The Stroop task is fitting because the success of LLMs rests on the transformer’s attention mechanism,” Patel said. “In humans, attention comprises three dissociable yet overlapping systems: alerting, orienting, and executive control. So we set out to test whether these models possess all three.”

The Stroop task, first introduced in the 1930s, measures how well a subject handles conflicting information. In a standard version, a participant might see the word “BLUE” printed in red ink, and they must name the ink color instead of reading the text. “It is worth emphasizing that the Stroop task is not a test of thinking or higher-order reasoning,” Patel said. “It specifically targets conflict resolution and inhibition.”

The automatic human response is to simply read the word itself, which requires active mental suppression to overcome. “The core idea is that word reading is essentially automatic in humans, a heavily trained prior that becomes what we call a prepotent response, the one that fires first and strongest,” Patel explained. “AI is in a similar position, since it is far more trained to read words than to name colors.”

The researchers examined two leading artificial intelligence models: OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet. The models received an image prompt and were asked to either read the text of the words presented or name the physical color of the text. The team tested the programs using five different conditions, including words printed in matching colors, non-matching colors, a mixed condition, neutral office words, and strings of the letter “X.”
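The five conditions can be made concrete with a small sketch of how such word lists might be generated and scored. This is not the researchers' code: the color palette, the neutral office words, the four-letter "XXXX" string, the reading of "mixed" as a blend of matching and non-matching items, and all function names are assumptions for illustration.

```python
import random

COLORS = ["red", "blue", "green", "yellow"]             # assumed palette
NEUTRAL_WORDS = ["desk", "chair", "stapler", "folder"]  # assumed office words

def make_item(condition, rng):
    """Build one stimulus: the printed word and the ink color it appears in."""
    if condition == "mixed":
        # Assumption: a mixed list blends matching and non-matching items.
        return make_item(rng.choice(["congruent", "incongruent"]), rng)
    ink = rng.choice(COLORS)
    if condition == "congruent":
        word = ink
    elif condition == "incongruent":
        word = rng.choice([c for c in COLORS if c != ink])
    elif condition == "neutral":
        word = rng.choice(NEUTRAL_WORDS)
    elif condition == "nonword":
        word = "XXXX"
    else:
        raise ValueError(f"unknown condition: {condition}")
    return {"word": word.upper(), "ink": ink}

def make_list(condition, n_words, seed=0):
    """Build a list of stimuli of the requested length."""
    rng = random.Random(seed)
    return [make_item(condition, rng) for _ in range(n_words)]

def color_naming_accuracy(items, responses):
    """Fraction of responses that match the ink color, not the printed word."""
    correct = sum(r.lower() == item["ink"] for item, r in zip(items, responses))
    return correct / len(items)
```

Under this scoring, a responder that simply reads the words aloud scores zero on an incongruent list, which is the failure pattern the task is designed to expose.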

To test how well the programs could sustain their attention, the scientists varied the number of words presented in each image, ranging from one to forty words. “Goal maintenance is the ability to hold onto an instruction and keep following it in any context while filtering out interfering information,” Patel said. “Humans develop this capacity over time. AI can certainly follow instructions and reach goals, but it does so in a fundamentally different way, and that difference becomes more visible as the context grows longer or contains conflicting information.”

When processing short lists of one or five words, the artificial intelligence models performed much like humans. They achieved high accuracy on the word-reading task and showed a slight dip in performance during the mismatched color-naming trials. However, as the lists grew longer, the performance of both models on the incongruent condition collapsed almost entirely.

GPT-4o accurately named the ink color on incongruent trials 91 percent of the time with five-word lists. This accuracy plummeted to just 1 percent on both the twenty-word and forty-word lists. Claude 3.5 Sonnet maintained stability slightly longer but eventually dropped to just 10 percent accuracy on the forty-word incongruent lists.

During these failures, the models entirely abandoned the instruction to name the color and defaulted back to reading the text. “We were surprised by how accuracy broke down at relatively small context sizes, with lists as short as 10 words,” Patel said. “What made this striking was the contrast with the nonword conditions, i.e., XXXX, where accuracy was nearly perfect. That gap highlights just how automatic reading behavior in LLMs, like in humans, also requires meaningful words.”

The researchers suggest that artificial models experience this breakdown because their programming lacks the forceful oversight found in the human brain. “Our central argument is that the limitation stems from the lack of an explicit mechanism for top-down modulation,” Patel told PsyPost. “This is when a rule or goal enforces priority among competing representations from the outset, proactively, and can sustain a constraint by inhibiting a prepotent prior rather than down-weighting.”

Without this mental override, the models are overwhelmed by their basic programming habits. “The study shows that, at the signal level, the ability to detect and resolve the conflict degrades because transformer attention can only impose a soft constraint on that automatic reading, rather than the hard one that an executive control mechanism would provide,” Patel added.
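The distinction between a soft and a hard constraint can be illustrated with a toy calculation. The sketch below is not the authors' model: the scores, the penalty, and the function names are hypothetical. It only shows that subtracting a penalty before a softmax shrinks a competing response without removing it, whereas masking it out removes it entirely.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores: reading the word starts out stronger than naming the ink.
READ_WORD_SCORE = 3.0
NAME_COLOR_SCORE = 1.0

def soft_constraint(penalty):
    """Down-weight the reading response; it keeps a nonzero share."""
    return softmax([READ_WORD_SCORE - penalty, NAME_COLOR_SCORE])

def hard_constraint():
    """Mask the reading response out entirely; its share is exactly zero."""
    return softmax([-math.inf, NAME_COLOR_SCORE])

soft = soft_constraint(penalty=4.0)  # [share for reading, share for color naming]
hard = hard_constraint()
```

However large the finite penalty, the reading response keeps some share under the soft constraint; only the mask drives it to exactly zero.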

Newer artificial intelligence systems sometimes attempt to bypass this problem using added programming layers. “Scaffolding methods we see in the latest AI systems have tool use, thinking, and code generation to stand in for that missing component, but each is bolted onto a base model that still propagates errors,” Patel said.

Relying on outside code to solve the test fundamentally misses the point of the cognitive assessment. “This is why any strategy that avoids suppressing prepotent word reading defeats the purpose of the Stroop task,” Patel explained. “A few of the models we studied are inconsistent about whether they reach for code, but when they do run code, they tend to solve the task perfectly.”

The scientists address this issue extensively in their report, noting that relying on code generation is not true cognitive control. “Shortcutting the task through chain-of-thought reasoning or code generation is really just avoiding it, papering over a deficiency at the signal level that becomes critical as goals grow more complex,” Patel said. “Humans can cheat in exactly the same way. We can verbalize the answer, blur our vision, or use a tool to keep from reading the word, and each of those moves invalidates the assessment.”

The study does carry certain methodological constraints, and the researchers note that models might eventually pass similar tests through brute-force pattern recognition. “We are not claiming that LLMs cannot do this task,” Patel said. “With more training data, they could likely handle even larger contexts reliably.”

“But that would be a task-specific kind of gating, achieved through sheer exposure, rather than the general form of control that does not depend on heavy training,” Patel added. “It is also worth noting that few tasks share the Stroop task’s particular dynamic, in which one response (reading) is so strongly pre-activated that it competes with the instructed response (naming the color).”

These findings present a challenge to current assumptions within the technology industry. “So the Stroop task is diagnostic of a structural constraint in LLMs, not simply a measure of task performance,” Patel said. “The bitter lesson, and the implicit wager behind scaling to larger models toward artificial superintelligence (ASI), is that this gating mechanism, what neuroscience calls executive control, will emerge from more scale and data without any dedicated architecture.”

Future development in artificial intelligence may need to move beyond simply scaling up models and training data. “We have begun exploring how executive control could be built directly into current AI architecture,” Patel said. “We see it as an essential ingredient for long-horizon instruction following, the ability to stay on task across extended and complex interactions.”

The study, “Deficient executive control in transformer attention,” was authored by Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan.
