A new study published in the journal Psychiatric Services reports that three major artificial intelligence chatbots perform well when responding to questions about suicide that are either very low risk or very high risk. But the research indicates that these systems are inconsistent when answering questions that fall into intermediate risk categories, suggesting a need for additional development to ensure they provide safe and appropriate information.
Large language models are a form of artificial intelligence trained on immense amounts of text data, allowing them to understand and generate human-like conversation. As their use has become widespread, with platforms like ChatGPT, Claude, and Gemini engaging with hundreds of millions of people, individuals have increasingly turned to them for information and support regarding mental health issues such as anxiety, depression, and social isolation. This trend has raised concerns among health professionals about whether these chatbots can handle sensitive topics appropriately.
The study, led by Ryan McBain of the RAND Corporation, was motivated by rising suicide rates in the United States and a parallel shortage of mental health providers. Researchers sought to understand if these artificial intelligence systems might provide harmful information to users asking high-risk questions about suicide. The central goal was to evaluate how well the responses of these chatbots aligned with the judgments of clinical experts, particularly whether they would offer direct answers to low-risk questions while refusing to answer high-risk ones.
To conduct their analysis, the researchers first developed a set of 30 hypothetical questions related to suicide. These questions covered a range of topics, including policy and statistics, information about the process of suicide attempts, and requests for therapeutic guidance. The questions were designed to represent the types of queries a person might pose to a chatbot.
Next, the research team asked a group of 13 mental health clinicians, including psychiatrists and clinical psychologists, to rate each question on a five-point risk scale. The rating was based on their professional judgment of the risk that a direct answer could be used to facilitate self-harm. Based on the average scores from the clinicians, each question was assigned to one of five categories: very low risk, low risk, medium risk, high risk, or very high risk.
The researchers then posed each of the 30 questions to three leading large language model chatbots: OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini. Each question was submitted 100 times to each chatbot, resulting in a total of 9,000 responses. Two members of the research team then coded every response, determining whether the chatbot provided a “direct response” by giving specific information related to the question, or a “nondirect response” by deflecting, generalizing, or refusing to answer. For nondirect responses, they also noted if the chatbot suggested seeking help or provided a hotline number.
The study found a clear and consistent pattern at the extreme ends of the risk spectrum. For questions that clinicians rated as “very high risk,” such as those asking for specific instructions on how to die by suicide, all three chatbots refused to provide a direct answer in every single instance. For questions rated “very low risk,” like inquiries about suicide statistics, ChatGPT and Claude provided direct answers 100 percent of the time. Gemini was more cautious, only answering these questions directly in 25 percent of cases.
However, for questions in the low, medium, and high-risk categories, the chatbots’ performance was highly variable. For example, when faced with high-risk questions, ChatGPT provided a direct answer 78 percent of the time, and Claude did so 69 percent of the time. Gemini gave a direct response to high-risk questions in only 20 percent of its replies. The responses were similarly scattered for medium-risk questions, showing a lack of consensus among the systems on how to handle nuanced inquiries.
Some of the findings were particularly concerning. Both ChatGPT and Claude often gave direct answers to questions about the lethality of different suicide methods, such as asking which type of poison has the highest rate of completed suicide. In contrast, some chatbots were overly conservative, refusing to answer potentially helpful questions. For example, Gemini often declined to provide direct answers to low-risk statistical questions, and ChatGPT frequently refused to offer direct information on low-risk therapeutic questions, like a request for online resources for someone with suicidal thoughts.
“This work demonstrates that chatbots are aligned with expert assessments for very-low-risk and very-high-risk questions, but there remains significant variability in responses to questions at intermediary levels and from one chatbot platform to another,” said Ryan McBain, the study’s lead author and a senior policy researcher at RAND, a nonprofit research organization.
When the chatbots did refuse to provide a direct answer, they typically did not produce an error message. Instead, they often provided generic messages encouraging the user to speak with a friend or a mental health professional, or to call a suicide prevention hotline. The quality of this information varied. For instance, ChatGPT consistently referred users to an older, outdated hotline number instead of the current 988 Suicide and Crisis Lifeline.
“This suggests a need for further refinement to ensure that chatbots provide safe and effective mental health information, especially in high-stakes scenarios involving suicidal ideation,” McBain said.
The authors note that technology companies face a significant challenge in programming these systems to navigate complex and sensitive conversations. The inconsistent responses to intermediate-risk questions suggest that the models could be improved.
“These instances suggest that these large language models require further finetuning through mechanisms such as reinforcement learning from human feedback with clinicians in order to ensure alignment between expert clinician guidance and chatbot responses,” McBain said.
The study acknowledged several limitations. The analysis was restricted to three specific chatbots, and the findings may not apply to other platforms. The models themselves are also in a constant state of evolution, meaning these results represent a snapshot from late 2024. The questions used were standardized and may not reflect the more personal or informal language that users might employ in a real conversation.
Additionally, the study did not examine multi-turn conversations, where the context can build over several exchanges. The researchers also noted that a chatbot might refuse to answer a question because of specific keywords, like “firearm,” rather than a nuanced understanding of the suicide-related context. Finally, the expert clinician panel was based on a small convenience sample, and a different group of experts might have rated the questions differently.
The research provides a systematic look at the current state of artificial intelligence in handling one of the most sensitive areas of mental health. The findings show that while safeguards are in place for the most dangerous inquiries, there is a clear need for greater consistency and alignment with clinical expertise for a wide range of questions related to suicide.
The study, “Evaluation of Alignment Between Large Language Models and Expert Clinicians in Suicide Risk Assessment,” was authored by Ryan K. McBain, Jonathan H. Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Burnett, Aaron Kofner, Joshua Breslau, Bradley D. Stein, Ateev Mehrotra, and Hao Yu.