A decades-old psychology test has exposed a surprising weakness in AI’s ability to remain focused.
A classic psychology test has revealed an unexpected weakness in some of today’s most advanced artificial intelligence systems, indicating that AI attention may work very differently from human attention.
Researchers led by Suketu Patel investigated how large language models (LLMs), the technology behind systems such as GPT-5, Claude, and Gemini, handle a well-known cognitive challenge known as the Stroop task. The findings suggest that while AI can perform remarkably well on many complex tasks, it may struggle to maintain focus when faced with competing information over prolonged periods.
What Is the Stroop Task?
The Stroop task is a classic psychology experiment that has been used for decades to study attention and mental control. In the test, participants see words that name colors, such as “red” or “blue,” displayed in colored ink.
Sometimes the word and the ink color match. For example, the word “red” may appear in red ink. Other times they conflict, such as the word “red” appearing in blue ink.
Participants are asked to name the color of the ink while ignoring the meaning of the word itself.
Although this sounds easy, it creates a mental conflict. Most people are highly practiced at reading words automatically, so suppressing that instinct requires what psychologists call executive control. This refers to the brain’s ability to focus on a goal, resist distractions, and override automatic responses.
Humans generally take a little longer to answer when the word and color do not match, a phenomenon known as the Stroop effect. However, even when the task becomes prolonged, people typically maintain high accuracy and stay focused on the instructions.
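For readers who want to see the structure of the task concretely, the following is a minimal Python sketch of how a Stroop-style word list could be built. It is an illustration only: the color set, the `make_stroop_list` and `correct_answers` names, and the representation of each item as a (word, ink) pair are assumptions made here, and this article does not describe how the researchers actually encoded ink color for the models.

```python
import random

# Hypothetical color set, chosen only for illustration.
COLORS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n_items, congruent_ratio=0.0, seed=0):
    """Build a list of (word, ink) pairs.

    congruent_ratio is the chance that the word matches its ink color.
    At 0.0 every item conflicts; at 1.0 every item matches.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < congruent_ratio:
            word = ink  # matching item, e.g. "red" in red ink
        else:
            # conflicting item, e.g. "red" in blue ink
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word, ink))
    return items


def correct_answers(items):
    """The right answer is always the ink color, never the word."""
    return [ink for _word, ink in items]


if __name__ == "__main__":
    stroop_list = make_stroop_list(5)
    for word, ink in stroop_list:
        print(f'word "{word}" shown in {ink} ink')
    print("expected answers:", correct_answers(stroop_list))
```

Raising `n_items` produces longer lists, and a `congruent_ratio` between 0 and 1 mixes matching and conflicting items in a single list.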
AI Performs Well at First
To see how modern AI systems would handle the same challenge, the researchers tested several leading language models using lists of color words.
When presented with short lists of five words whose meanings conflicted with their ink colors, the models performed quite well.
GPT-4o achieved 91% accuracy on these shorter tests. Claude 3.5 Sonnet also performed strongly.
At first glance, the results suggested that AI systems could successfully follow the task and ignore the distracting word meanings.
Performance Collapses as Lists Get Longer
The picture changed dramatically as the researchers increased the length of the word lists.
GPT-4o’s accuracy dropped from 91% with 5 words to 57% with 10 words. By the time the list reached 40 words, accuracy had dropped to just 15%.
Claude 3.5 Sonnet proved more resilient, maintaining stable performance through lists of 20 words. However, it too experienced a sharp decline, falling to 24% accuracy when confronted with 40 words.
The researchers found similar patterns in GPT-5, Claude Opus 4.1, and Gemini 2.5.
Performance became even worse when matching and mismatched color words appeared together within the same list. Under those conditions, accuracy on the mismatched items dropped to nearly zero.
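To make the idea of scoring matching and mismatched items separately more concrete, here is a small Python sketch. The `score_by_condition` function and the sample responses are hypothetical and invented for illustration. They are not the researchers' scoring code or data.

```python
def score_by_condition(items, responses):
    """Return accuracy separately for matching and mismatched items.

    items is a list of (word, ink) pairs; responses is the list of
    color names given as answers, in the same order.
    """
    totals = {"matching": 0, "mismatched": 0}
    hits = {"matching": 0, "mismatched": 0}
    for (word, ink), response in zip(items, responses):
        condition = "matching" if word == ink else "mismatched"
        totals[condition] += 1
        # An answer is correct only if it names the ink color.
        if response.strip().lower() == ink:
            hits[condition] += 1
    return {
        condition: (hits[condition] / totals[condition]
                    if totals[condition] else None)
        for condition in totals
    }


if __name__ == "__main__":
    # Made-up example: the responder reads every word instead of
    # naming the ink color.
    items = [("red", "red"), ("red", "blue"),
             ("green", "yellow"), ("blue", "blue")]
    responses = ["red", "red", "green", "blue"]
    print(score_by_condition(items, responses))
```

In this made-up case, a responder that simply reads the words scores perfectly on the matching items and zero on the mismatched ones, which is why the two conditions need to be scored separately in a mixed list.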
Why Humans and AI Respond Differently
The results point to an important difference between human cognition and the way large language models process information.
Like people, AI systems have effectively received far more training in recognizing and interpreting words than in identifying colors. This creates a natural tendency to focus on the written word.
However, humans are normally able to suppress that automatic response and remain focused on the task they have been instructed to perform, even across long sequences of items.
The language models, by contrast, increasingly reverted to reading the words rather than naming the colors as the tests continued. In other words, they appeared to lose track of the original goal.
According to the researchers, this breakdown suggests that the attention mechanisms used by transformer-based AI systems differ fundamentally from the biological attention systems found in the human brain.
A Window Into AI’s Limitations
Large language models have demonstrated remarkable abilities in writing, reasoning, coding, and communication. Yet studies like this one emphasize that impressive performance does not necessarily mean AI processes information the same way humans do.
The findings suggest that modern AI may have hidden weaknesses when tasks require sustained focus, inhibition of automatic responses, and long-term maintenance of specific instructions.
As AI systems become increasingly integrated into everyday life, understanding these limitations may be just as important as measuring their strengths.












