AI chatbots tend to overdiagnose mental health conditions when used without structured guidance

A new study published in Psychiatry Research suggests that while large language models are capable of identifying psychiatric diagnoses from clinical descriptions, they are prone to significant overdiagnosis when operating without structured guidance. By integrating expert-derived decision trees into the diagnostic process, researchers from the University of California, San Francisco found they could improve the precision of these artificial intelligence models and reduce the rate of false positives.

The rapid development of artificial intelligence has led to increased interest in its potential applications within healthcare. Large language models like OpenAI’s ChatGPT have shown an ability to process and generate complex text, which has raised the possibility of their use in mental health settings for tasks such as decision support or documentation.

Many patients are already accessing these public tools to interpret their own symptoms and seek medical advice. However, these models are trained on vast datasets from the internet rather than specific medical curricula. This training method means the models function based on statistical probability and linguistic patterns rather than genuine clinical understanding.

There is a concern that without specific medical training or guardrails, these general-purpose models might generate inaccurate or harmful advice. The ability of a computer program to produce coherent text does not necessarily equate to the ability to perform the complex reasoning required for a psychiatric diagnosis.

The authors of the new study sought to evaluate whether generic large language models could effectively reason about mental health cases. They also aimed to determine if feeding the models specific, expert-created rules could enhance their accuracy and safety.

“There has been considerable interest in using Large Language Model (LLM)-based technologies to build clinical and research tools for behavioral health. Additionally, individuals are increasingly using LLM-based chatbots (such as ChatGPT, Claude, Gemini, etc.) as health information tools and for emotional support,” explained study author Karthik V. Sarma, who founded the UCSF AI in Mental Health Research Group within the Department of Psychiatry and Behavioral Sciences at UCSF.

“We were interested in seeing how well these LLMs worked in our field, and chose vignette diagnosis as an example problem for evaluation. We also wanted to know if we could improve the performance of the models by constraining them to use reasoning pathways (decision trees) designed by psychiatric experts.”

To conduct this investigation, the researchers utilized a set of 93 clinical case vignettes drawn from the DSM-5-TR Clinical Cases book. These vignettes serve as standardized examples of patients with specific psychiatric conditions, such as depression, bipolar disorder, or schizophrenia. The team divided these cases into a training set, which was used to refine their prompting strategies, and a testing set, which was used to evaluate the final performance of the models. They tested three versions of the GPT family of models: GPT-3.5, GPT-4, and GPT-4o.

The researchers designed two distinct experimental approaches to test the models. The first was a “Base” approach, where the artificial intelligence was simply given the clinical story and asked to predict the most likely diagnoses. This method mimics how a casual user might interact with a chatbot by describing symptoms and asking for an opinion. The second method was a “Decision Tree” approach. This involved adapting the logic from the DSM-5-TR Handbook of Differential Diagnosis, a professional guide that uses branching logic to rule conditions in or out.

In the Decision Tree approach, the researchers did not ask the model for a diagnosis directly. Instead, they converted the expert logic into a series of “yes” or “no” questions. The model was prompted to answer these questions based on the case vignette.

For example, the model might be asked if a patient was experiencing a specific symptom for a certain duration. The answers to these sequential questions would then lead the system down a path toward a potential diagnosis. This method forced the model to follow a step-by-step reasoning process similar to that of a trained clinician.
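Conceptually, the guided approach resembles walking a branching flowchart in which the language model supplies only the yes/no answers at each fork. The sketch below is purely illustrative: the tree contents, the question wording, and the ask_yes_no stub are hypothetical stand-ins, not the study's prompts or code. In a real pipeline the stub would send the prompt to a model such as GPT-4o and parse its reply before descending the tree.

```python
# Illustrative sketch of decision-tree-guided prompting (hypothetical; not the study's code).

def ask_yes_no(question: str, vignette: str) -> bool:
    """Stand-in for an LLM call that answers a yes/no question about a case vignette."""
    prompt = f"Case: {vignette}\n\nAnswer yes or no: {question}"
    # A real implementation would send `prompt` to a model API and parse the reply.
    reply = "yes"  # fixed placeholder so this sketch runs standalone
    return reply.strip().lower().startswith("yes")

# Each node is either a terminal label (str) or a (question, yes_branch, no_branch) tuple.
TREE = (
    "Has the patient had a distinct period of abnormally elevated or irritable mood?",
    ("Did the episode last at least one week or require hospitalization?",
     "Consider bipolar I disorder",
     "Consider bipolar II or other specified bipolar disorder"),
    ("Have depressive symptoms been present nearly every day for at least two weeks?",
     "Consider major depressive disorder",
     "No diagnosis indicated by this branch"),
)

def traverse(node, vignette: str) -> str:
    """Walk the tree, asking the model one yes/no question at each branch point."""
    if isinstance(node, str):
        return node
    question, yes_branch, no_branch = node
    next_node = yes_branch if ask_yes_no(question, vignette) else no_branch
    return traverse(next_node, vignette)

print(traverse(TREE, "A 28-year-old with one week of decreased sleep and grandiosity..."))
```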

The results showed a clear distinction between the two methods. When the models were directly prompted to guess the diagnosis in the Base approach, they demonstrated high sensitivity. The most advanced model, GPT-4o, correctly identified the author-designated diagnosis in approximately 77 percent of the cases. This indicates that the models are quite good at picking up on the presence of a disorder based on the text.

However, this high sensitivity came at the cost of precision. The Base approach resulted in a low positive predictive value of roughly 40 percent. This metric reveals that the models were casting too wide a net. They frequently assigned diagnoses that were not present in the vignettes.

On average, the base models produced more than one incorrect diagnosis for every correct one. This tendency toward overdiagnosis represents a significant risk, as it could lead to patients believing they have conditions they do not actually possess.

“This suggests to everyone that diagnoses generated by generalist chatbots may not be accurate, and it is important to consult with a health professional,” Sarma told PsyPost.

The implementation of the Decision Tree approach yielded different results. By forcing the models to adhere to expert reasoning structures, the researchers increased the positive predictive value to approximately 65 percent. This improvement means that when the system suggested a diagnosis, it was much more likely to be correct. The rate of overdiagnosis dropped compared to the direct prompting method.

There was a trade-off associated with this increased precision. The sensitivity of the Decision Tree approach was slightly lower than that of the Base approach, coming in at around 71 percent. This suggests that the strict rules of the decision trees occasionally caused the model to miss a diagnosis that the more open-ended method might have caught. Despite this slight drop in sensitivity, the overall performance as measured by the F1 statistic—a metric that balances precision and recall—was generally higher for the Decision Tree approach.
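For readers unfamiliar with it, the F1 statistic is the harmonic mean of precision (positive predictive value) and recall (sensitivity). As a rough back-of-envelope illustration, pairing the rounded figures quoted above (approximations taken from this article, not the study's exact per-model numbers) shows why the Decision Tree approach scores higher overall despite its lower sensitivity:

```python
# Back-of-envelope F1 from the article's rounded figures (not the study's exact values).
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"Base approach:          F1 = {f1(0.40, 0.77):.2f}")  # roughly 0.53
print(f"Decision Tree approach: F1 = {f1(0.65, 0.71):.2f}")  # roughly 0.68
```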

The study also highlighted the importance of refining the prompts used to guide the artificial intelligence. During the training phase, the researchers found that the models sometimes misunderstood medical terminology or the structure of the decision trees. For instance, the models initially struggled to differentiate between “substance use” and medical side effects, or they would misinterpret clinical terms like “ego-dystonic.” The researchers had to iteratively refine their questions to ensure the models interpreted the clinical criteria correctly.

The findings provide evidence that generalist large language models possess an emergent capability for psychiatric reasoning. Performance improved with each successive generation of the model, with GPT-4 and GPT-4o outperforming the older GPT-3.5. This trajectory suggests that as these models continue to evolve, their capacity for handling complex medical tasks may increase.

“Practically speaking, the reduction in overdiagnosis using our decision trees was significant,” Sarma explained. “However, the task we used (vignette diagnosis) is a much easier task than real-world diagnosis. I would expect performance at this stage to be much worse in the real world, and we are still working on methods to address this problem. For now, I do not believe that these generalist models are ready for use as mental health support agents, though there may be other specialist models that are more capable.”

The tendency for overdiagnosis observed in the Base approach is particularly relevant for the general public. Individuals using chatbots for self-diagnosis should be aware that these systems may be biased toward finding pathology where none exists. The study suggests that while artificial intelligence can be a powerful tool for analyzing behavioral health data, it works best when constrained by expert medical knowledge and validated guidelines.

“It was not our goal to produce an actual clinical tool that is ready to use, and that was not the outcome of our work,” Sarma noted. “Instead, we focused on investigating how well current models work, and on whether or not our idea to integrate the current models with expert guidelines was helpful. We hope our findings can be used to develop better real-world tools in the future.”

Future research will need to focus on testing these systems with real-world patient data to see if the findings hold up in clinical practice. The authors also suggest that future work could explore using these models to identify new diagnostic patterns or language-based phenotypes that go beyond current classifications. For now, the integration of expert reasoning appears to be a necessary step in making these powerful tools safer and more accurate for potential psychiatric applications.

“We are now working on developing systems that can operate on real-world data, and measuring the impact of different methods in this setting,” Sarma explained. “We’re also working on better understanding how the use of chatbots by people with diagnosed mental illnesses impacts their health.”

The study, “Integrating expert knowledge into large language models improves performance for psychiatric reasoning and diagnosis,” was authored by Karthik V. Sarma, Kaitlin E. Hanss, Andrew J. M. Halls, Andrew Krystal, Daniel F. Becker, Anne L. Glowinski, and Atul J. Butte.
