A new investigation into the reliability of advanced artificial intelligence models highlights a significant risk for scientific research. The study, published in JMIR Mental Health, found that large language models like OpenAI’s GPT-4o frequently generate fabricated or inaccurate bibliographic citations, with these errors becoming more common when the AI is prompted on less familiar or highly specialized topics.
Researchers are increasingly turning to tools known as large language models, or LLMs, to help manage demanding workloads. These complex AI systems are trained on immense quantities of text from the internet and licensed databases, enabling them to produce human-like text for tasks like summarizing articles, drafting emails, or writing code.
One of the known limitations of these models is a tendency to produce “hallucinations,” which are confident-sounding statements that are factually incorrect or entirely made up. In academic writing, a particularly problematic form of this is the fabrication of scientific citations, which are the bedrock of scholarly communication.
While past studies have documented that LLMs can invent citations, it has been less clear how the nature of a given topic might influence the frequency of these errors. A team of researchers from the School of Psychology at Deakin University in Australia sought to explore this question within the field of mental health.
They designed an experiment to test whether GPT-4o’s citation fabrication and accuracy rates would vary systematically with a topic’s public visibility and the depth of its existing scientific literature.
To conduct their study, the researchers prompted GPT-4o, a recent model from OpenAI, to generate six different literature reviews. These reviews centered on three mental health conditions chosen for their varying levels of public recognition and research coverage: major depressive disorder (a widely known and heavily researched condition), binge eating disorder (moderately known), and body dysmorphic disorder (a less-known condition with a smaller body of research). This selection allowed for a direct comparison of the AI’s performance on topics with different amounts of available information in its training data.
For each of the three disorders, the team requested two types of reviews. One prompt asked for a general overview covering symptoms, societal impacts, and treatments. The other prompt requested a specialized review focused on a narrower subject: the evidence for digital health interventions. The researchers instructed the AI to produce reviews of about 2000 words and to include at least 20 citations from peer-reviewed academic sources.
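For readers who want a concrete sense of the setup, the prompting step is easy to approximate in a few lines of code. The sketch below uses the OpenAI Python SDK, but it is not the authors’ actual code: the exact prompt wording, model settings, and helper names are assumptions, with only the word-count and citation requirements taken from the study’s description.

```python
# Minimal sketch (not the authors' code) of requesting the six reviews from GPT-4o.
# Prompt wording, settings, and names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DISORDERS = [
    "major depressive disorder",
    "binge eating disorder",
    "body dysmorphic disorder",
]
REVIEW_TYPES = {
    "general": "an overview of the symptoms, societal impacts, and treatments for",
    "specialized": "a review of the evidence for digital health interventions for",
}

def request_review(disorder: str, review_type: str) -> str:
    """Ask GPT-4o for a roughly 2000-word review with at least 20 peer-reviewed citations."""
    prompt = (
        f"Write {REVIEW_TYPES[review_type]} {disorder}. "
        "The review should be approximately 2000 words and include at least "
        "20 citations to peer-reviewed academic sources, with a full reference list."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Generate all six reviews: 3 disorders x 2 prompt types.
reviews = {(d, t): request_review(d, t) for d in DISORDERS for t in REVIEW_TYPES}
```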
After generating the reviews, the researchers methodically extracted all 176 citations provided by the AI. Each reference was painstakingly verified using multiple academic databases, including Google Scholar, Scopus, and PubMed. Citations were sorted into one of three categories: fabricated (the source did not exist), real with errors (the source existed but had incorrect details like the wrong year, volume number, or author list), or fully accurate. The team then analyzed the rates of fabrication and accuracy across the different disorders and review types.
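The verification itself was manual, checked against Google Scholar, Scopus, and PubMed, but the bookkeeping behind the three-way classification is straightforward to illustrate. The sketch below assumes a hypothetical Citation record and simply tallies the manually assigned outcomes per disorder; the field and function names are not from the study.

```python
# Illustrative bookkeeping for the three-way citation classification.
# The Citation structure and names are assumptions; the status values mirror
# the study's categories (fabricated, real with errors, fully accurate).
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    FABRICATED = "fabricated"              # the cited source does not exist
    REAL_WITH_ERRORS = "real_with_errors"  # exists, but wrong year, volume, authors, or DOI
    ACCURATE = "accurate"                  # exists and every bibliographic detail matches

@dataclass
class Citation:
    disorder: str      # e.g. "binge eating disorder"
    review_type: str   # "general" or "specialized"
    status: Status     # outcome of the manual database check

def summarize(citations: list[Citation]) -> dict[str, Counter]:
    """Tally classification outcomes per disorder."""
    tallies: dict[str, Counter] = {}
    for c in citations:
        tallies.setdefault(c.disorder, Counter())[c.status] += 1
    return tallies

def fabrication_rate(counts: Counter) -> float:
    """Share of citations pointing to sources that do not exist."""
    total = sum(counts.values())
    return counts[Status.FABRICATED] / total if total else 0.0
```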
The analysis showed that across all six reviews, nearly one-fifth of the citations, 35 out of 176, were entirely fabricated. Of the 141 citations that corresponded to real publications, almost half contained at least one error, such as an incorrect digital object identifier, which is a unique code used to locate a specific article online. In total, more than half of the references generated by the model were either invented or contained bibliographic mistakes.
The rate of citation fabrication was strongly linked to the topic. For major depressive disorder, the most well-researched condition, only 6 percent of citations were fabricated. In contrast, the fabrication rate rose sharply to 28 percent for binge eating disorder and 29 percent for body dysmorphic disorder. This suggests the AI is less reliable when generating references for subjects that are less prominent in its training data.
The specificity of the prompt also had an effect, particularly for less common topics. When asked to write about binge eating disorder, the specialized review on digital interventions had a much higher fabrication rate (46 percent) compared to the general overview (17 percent).
A similar pattern appeared in the accuracy of real citations. For major depressive disorder, the general review was significantly more accurate than the specialized one. Accuracy rates were also lowest overall for body dysmorphic disorder, where only 29 percent of real citations were free of errors.
The study has some limitations that the authors acknowledge. The findings are specific to one AI model, GPT-4o, and may not be representative of others. The experiment was also confined to three specific mental health topics and used straightforward prompts that did not involve advanced techniques to guide the AI’s output. The model can also produce different results each time the same prompt is run, yet the team analyzed only a single output per prompt.
Future research could examine a wider range of topics and AI models to see if these patterns hold. Still, the study’s results have clear implications for the academic community. Researchers using these models are advised to exercise caution and perform rigorous human verification of every reference an AI generates. The findings also suggest that academic journals and institutions may need to develop new standards and tools to safeguard the integrity of published research in an era of AI-assisted writing.
The study, “Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study,” was authored by Jake Linardon, Hannah K Jarman, Zoe McClure, Cleo Anderson, Claudia Liu, and Mariel Messer.