Artificial intelligence struggles to consistently evaluate scientific facts

Generative artificial intelligence programs can write fluently, but they still struggle to accurately and consistently evaluate basic scientific statements. A recent study shows that when an artificial intelligence is asked the exact same question multiple times, it often gives completely different answers. These results, published in the Rutgers Business Review, highlight the limits of current automated reasoning and the ongoing need for human oversight.

Generative artificial intelligence is a type of technology trained on massive databases of text to produce human-like writing. Millions of people now use these applications daily for tasks ranging from marketing to software development. The software writes with an authoritative tone that often sounds correct even when it is entirely wrong. Some high-profile consulting firms have even faced public embarrassment after relying on automated reports that included fabricated data.

Despite these known flaws, many businesses have partnered with technology vendors to incorporate these tools into their daily operations. Professionals frequently rely on automated software to analyze data, answer customer queries, and summarize research. The researchers wanted to know whether the logical abilities of these programs actually matched their impressive vocabularies. They designed a test to see whether the technology could reliably evaluate formal hypotheses from business research.

Mesut Cicek, an associate professor in the Department of Marketing and International Business at Washington State University, led the investigation. His co-authors included Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The team designed an experiment to test the software’s ability to interpret academic literature.

The researchers collected 719 scientific hypotheses from nine open-access business journals published since 2021. A hypothesis is a formal, testable prediction about how two or more things interact in the real world. For example, a statement might predict that a specific type of advertising increases consumer spending.

The team presented these statements to ChatGPT, a highly popular automated text generator. The program was asked to determine whether each statement was ultimately supported or refuted by the actual research data. To test the stability of the program, the researchers submitted the exact same prompt ten separate times for each statement.
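
For readers who want to see the repeated-prompt design concretely, the sketch below expresses it in a few lines of Python. This is an illustration, not the authors' actual script: the `ask_model` stub, the prompt wording in its comment, and the tallying logic are all assumptions based on the description above.

```python
from collections import Counter

N_TRIALS = 10  # the study submitted each prompt ten separate times


def ask_model(hypothesis: str) -> str:
    """Hypothetical stand-in for a call to a chat model.

    A real implementation would send the hypothesis with an
    instruction such as "Answer TRUE or FALSE: was this hypothesis
    supported by the study's data?" and return the model's reply.
    """
    raise NotImplementedError("wire this up to your model provider")


def stability_test(hypothesis: str) -> Counter:
    """Ask the identical question N_TRIALS times and tally the verdicts."""
    votes = Counter()
    for _ in range(N_TRIALS):
        reply = ask_model(hypothesis).strip().upper()
        votes["TRUE" if reply.startswith("TRUE") else "FALSE"] += 1
    return votes

# A stable model returns Counter({'TRUE': 10}) or Counter({'FALSE': 10});
# a coin-flip result like Counter({'TRUE': 5, 'FALSE': 5}) signals the
# kind of instability the researchers describe below.
```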

The entire experiment was run twice to track technological progress over time. The first test occurred in mid-2024 using an older version of the software. The researchers repeated the entire process in mid-2025 with an updated version of the application.

The results revealed a modest improvement in overall correctness, but the raw numbers overstated the software's real skill. The software chose the correct answer 76.5 percent of the time in 2024 and 80 percent of the time in 2025. Because each question had only two possible answers, a completely blind guess would be right half the time.

Once the researchers mathematically adjusted the scores to account for random guessing, the true performance dropped substantially. The effective accuracy rate hovered around a mere 60 percent. The software essentially earned a barely passing grade when it came to anticipating actual scientific findings.
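
The size of that adjustment is easy to check. The snippet below applies the standard correction for guessing on a two-option task; the paper's exact formula is not quoted in this article, so treat this as an assumed reconstruction, one that is consistent with the roughly 60 percent effective accuracy described above.

```python
def chance_corrected(observed: float, chance: float = 0.5) -> float:
    """Fraction of the gap between chance and perfection that was closed.

    This is the standard correction for guessing on a forced-choice
    task; whether the paper used exactly this formula is an assumption.
    """
    return (observed - chance) / (1.0 - chance)


print(f"2024: {chance_corrected(0.765):.0%}")  # 2024: 53%
print(f"2025: {chance_corrected(0.80):.0%}")   # 2025: 60%
```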

The program performed exceptionally poorly when evaluating ideas that the original researchers had found to be false. The software correctly identified these unsupported statements only 16.4 percent of the time in 2025. The program displayed a strong bias toward agreeing with whatever statement it was fed, acting as a compliant assistant rather than an objective analyst. This tendency to blindly confirm existing ideas creates an echo chamber that can mislead decision-makers.

Consistency proved to be an even bigger problem for the automated system. When asked the same question ten times in a row, the software frequently contradicted itself. Sometimes the program would flip back and forth between true and false on consecutive attempts.

“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” Cicek said. In 2025, the program provided identical answers across all ten attempts for only 73 percent of the statements. For more than a quarter of the questions, the software gave at least one wrong answer during the ten trials.

The lack of a stable response pattern makes the software highly unreliable for one-off queries. Users who ask a question once might get a completely different answer simply by asking again. “There were several cases where there were five true, five false,” Cicek said.
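
Quantifying that instability takes only a few lines. Below is a hypothetical sketch of the two consistency checks the article describes: the share of statements answered identically across all ten trials, and the coin-flip splits Cicek mentions. The example hypotheses are invented purely for illustration.

```python
def consistency_report(answer_sets: dict[str, list[str]]) -> None:
    """answer_sets maps each hypothesis to its ten recorded verdicts.

    Prints the share of hypotheses answered identically in every trial
    (73 percent in the 2025 run) and flags any 50/50 coin-flip cases.
    """
    stable = sum(1 for votes in answer_sets.values() if len(set(votes)) == 1)
    print(f"stable across all trials: {stable / len(answer_sets):.0%}")
    for hypothesis, votes in answer_sets.items():
        if votes.count("TRUE") == len(votes) // 2:
            print(f"50/50 split: {hypothesis!r}")


# Invented example data, for illustration only:
consistency_report({
    "H1: ad spend raises sales": ["TRUE"] * 10,
    "H2: loyalty programs cut churn": ["TRUE"] * 5 + ["FALSE"] * 5,
})
# stable across all trials: 50%
# 50/50 split: 'H2: loyalty programs cut churn'
```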

The researchers also categorized the test questions by their logical difficulty. The software did best with direct cause-and-effect relationships, where one event leads straight to another. It struggled the most with conditional statements, ideas that hold true only under particular circumstances.

These outcomes suggest that the program relies on recognizing common word patterns rather than actually understanding the concepts. It can mimic the structure of a logical argument without grasping the underlying meaning or context. The system possesses a high degree of linguistic fluency, but it lacks genuine theoretical flexibility. When faced with complex scenarios, the technology fails to adapt its reasoning.

The software remains bound by pattern recognition rather than true comprehension. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about,” Cicek said. The apparent improvements over the past year seem to stem from better text processing rather than deeper cognitive abilities.

For managers and analysts, these limitations carry substantial risks. The findings reveal that automated systems are currently too shallow to handle high-stakes decision-making on their own. As the text generated by these programs becomes smoother, users might easily miss hidden conceptual flaws.

The researchers advise professionals to use artificial intelligence for speed, not as a substitute for judgment. A marketing team might use a text generator to brainstorm ideas or summarize long reports quickly. However, human experts must step in to verify whether the logic aligns with actual market evidence.

Professionals should also verify automated insights through repetition. Asking the same question multiple times can help expose underlying bias or instability in the software. Any conclusions generated by artificial intelligence should be treated as diagnostic clues rather than absolute facts.

The authors advocate for building organizational literacy regarding automated tools. Employees need to understand exactly where these programs excel and where they fail. Organizations should train their staff to audit the reasoning behind automated answers, rather than just trusting the numerical output.

The ultimate goal is to create a hybrid system that pairs human intelligence with automated speed. In this arrangement, software handles structural analysis while humans preserve interpretive judgment. This balanced approach ensures that technology supports human understanding rather than replacing it.

The authors noted a few minor limitations to their experiment. The study assumed that every published, peer-reviewed finding was entirely true or false, which leaves out some nuance in real-world science. Sometimes a scientific finding has mixed results that do not easily fit into a strict binary category.

The team also limited their consistency test to ten repetitions per question using a single software platform. Future investigations should involve a higher number of repetitions to confirm these patterns. Researchers should also test a wider variety of artificial intelligence programs to see if the flaws are universal.

Despite these limitations, the research suggests that users must remain vigilant. Human judgment remains a necessary check on these increasingly common digital systems. “Always be skeptical,” Cicek said. “I’m not against AI. I’m using it. But you need to be very careful.”

The study, “Unstable Intelligence: GenAI Struggles with Accuracy and Consistency,” was authored by Mesut Cicek, Sevincgul Ulu, Can Uslay, and Kate Karniouchina.
