A new study suggests that advanced artificial intelligence models, increasingly used in medicine, can exhibit human-like errors in reasoning when making clinical recommendations. The research found these AI models were susceptible to cognitive biases, and in many cases, the magnitude of these biases was even greater than that observed in practicing human doctors. The findings were published in NEJM AI.
The use of generative artificial intelligence in healthcare has expanded rapidly. These AI models, often called large language models, can draft medical histories, suggest diagnoses, and even pass medical licensing exams. They develop these abilities by processing immense quantities of text from the internet, including everything from scientific articles to popular media. The sheer volume and variety of this training data, however, mean it is not always neutral or factual, and it can contain the same ingrained patterns of thought that affect human judgment.
Cognitive biases are systematic patterns of deviation from rational judgment. For example, the “framing effect” describes how the presentation of information can influence a decision. A surgical procedure described as having a 90 percent survival rate may seem more appealing than the same procedure described as having a 10 percent mortality rate, even though the outcomes are identical.
Researchers Jonathan Wang and Donald A. Redelmeier, from research institutes in Toronto, hypothesized that AI models, trained on data reflecting these human tendencies, might reproduce similar biases in their medical recommendations.
To test this idea, the researchers selected 10 well-documented cognitive biases relevant to medical decision-making. For each bias, they created a short, text-based clinical scenario, known as a vignette. Each vignette was written in two slightly different versions. While both versions contained the exact same clinical facts, one was phrased in a way designed to trigger a specific cognitive bias, while the other was phrased neutrally.
The researchers then tested two leading AI models: GPT-4, developed by OpenAI, and Gemini-1.0-Pro, from Google. They prompted the models to act as “synthetic respondents,” adopting the personas of 500 different clinicians. Each persona was assigned a unique combination of characteristics, including medical specialty, years of experience, gender, and practice location. Each of these 500 synthetic clinicians was presented with both versions of all 10 vignettes, and the AI’s open-ended recommendations were recorded.
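To make the setup concrete, here is a minimal sketch of how a single synthetic-respondent query might be issued using the OpenAI Python SDK. The persona fields, vignette wording, and prompt text below are illustrative assumptions for this article, not the authors' actual materials or pipeline.

```python
# Illustrative sketch only: the persona attributes, vignette text, and prompt
# wording are assumptions for demonstration, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

# One of 500 hypothetical clinician personas.
persona = {
    "specialty": "family medicine",
    "years_experience": 12,
    "gender": "female",
    "location": "rural community practice",
}

# One version of a vignette: same clinical facts, framed with survival statistics.
vignette_framed = (
    "A 63-year-old patient with early-stage lung cancer is considering surgery. "
    "The operation has a 90 percent one-month survival rate. "
    "Would you recommend the procedure?"
)

# Instruct the model to answer in the voice of the synthetic clinician.
system_prompt = (
    f"You are a {persona['specialty']} physician with "
    f"{persona['years_experience']} years of experience, {persona['gender']}, "
    f"working in a {persona['location']}. Answer as this clinician would."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": vignette_framed},
    ],
)

# The open-ended recommendation is recorded for later analysis.
print(response.choices[0].message.content)
```

In the study's design, a query like this would be repeated for all 500 personas and both versions of each of the 10 vignettes, with the open-ended answers recorded for analysis.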
The results for GPT-4 showed a strong susceptibility to bias in nine of the 10 scenarios tested. A particularly clear example was the framing effect. When a lung cancer surgery was described using survival statistics, 75 percent of the AI responses recommended the procedure. When the identical surgery was described using mortality statistics, only 12 percent of the responses recommended it. This 63-percentage-point swing was substantially larger than the 34-point difference observed in previous studies of human physicians presented with a similar scenario.
Another prominent bias was the “primacy effect,” in which information presented first has an outsized influence. When a patient vignette began with the symptom of coughing up blood, the AI included pulmonary embolism among its potential diagnoses 100 percent of the time. When the vignette instead began with the patient’s history of chronic obstructive pulmonary disease, pulmonary embolism was mentioned only 26 percent of the time.
Hindsight bias was also extremely pronounced: the same treatment was rated as inappropriate in 85 percent of cases when the patient outcome was negative, but in zero percent of cases when the outcome was positive.
In one notable exception, GPT-4 clearly outperformed typical human reasoning. The model showed almost no “base-rate neglect,” the common error of ignoring the overall prevalence of a disease when interpreting a screening test. The AI correctly calculated the probability of disease in both high-prevalence and low-prevalence scenarios with near-perfect accuracy (94 percent versus 93 percent). In contrast, prior studies show that human clinicians struggle with this type of statistical reasoning.
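To illustrate what base-rate neglect involves, the following worked example applies Bayes' theorem with hypothetical numbers (a test with 90 percent sensitivity and a 5 percent false-positive rate); these figures are illustrative and are not taken from the study's vignettes.

```latex
% Illustrative numbers only, not from the study's vignettes.
% Test characteristics: sensitivity 0.90, false-positive rate 0.05.
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}

% High prevalence, P(D) = 0.30:
P(D \mid +) = \frac{0.90 \times 0.30}{0.90 \times 0.30 + 0.05 \times 0.70} \approx 0.89

% Low prevalence, P(D) = 0.01:
P(D \mid +) = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.05 \times 0.99} \approx 0.15
```

Ignoring the prevalence term, as humans often do, makes a positive result look equally alarming in both settings, even though the true probability of disease differs several-fold.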
The researchers also explored whether the characteristics of the synthetic clinician personas affected the AI’s susceptibility to bias. While there were minor variations, with family physician personas showing slightly more bias and geriatrician personas slightly less, these differences were not statistically significant. No single characteristic, such as years of experience or practice location, appeared to protect the AI model from making biased recommendations.
A separate analysis used the Gemini-1.0-Pro model to see whether the findings would replicate. This model also displayed significant biases, but its patterns differed from both GPT-4 and human doctors. For example, Gemini did not exhibit the framing effect in the lung cancer scenario, and in other tests it showed biases in the opposite direction of what is typically seen in humans: when tested for a bias related to capitulating to pressure, Gemini was less likely, not more likely, to order a requested test. These results suggest that different AI models may have their own unique and unpredictable patterns of error.
The authors acknowledge certain limitations to their study. The AI models tested are constantly being updated, and future versions may incorporate safeguards against these biases. Detecting and correcting these ingrained reasoning flaws, however, is far more complex than filtering out obviously false or inappropriate content. The biases are often subtle and woven into the very medical literature used to train the models.
Another caveat is that the study used simulated clinical scenarios and personas, not real-world patient interactions. The research measured the frequency of biased recommendations but did not assess how these biases might translate into actual patient outcomes, costs, or other real-world impacts. The study was also limited to 10 specific cognitive pitfalls, and many other forms of bias may exist within these complex systems.
The findings suggest that simply deploying AI in medicine is not a guaranteed path to more rational decision-making. The models are not detached, purely logical agents; they are reflections of the vast and imperfect human data they were trained on. The authors propose that an awareness of these potential AI biases is a necessary first step. For these powerful tools to be used safely, clinicians will need to maintain their critical reasoning skills and learn to appraise AI-generated advice with a healthy degree of skepticism.
The study, “Cognitive Biases and Artificial Intelligence,” was authored by Jonathan Wang and Donald A. Redelmeier.