AI is flooding peer review, and editors say it’s making science harder to judge

The pileup began quietly, then turned into something journal editors could no longer ignore.

At Organization Science, one of the leading journals in management research, submissions jumped 42% after ChatGPT arrived in late 2022. That alone might have looked like a burst of productivity. But the journal’s editors say the extra papers were not, on average, better papers. Many were harder to read, more stuffed with jargon, and less likely to survive the review process.

A new analysis from the journal’s AI Task Force lays out what the editors have been seeing across the peer review pipeline. Drawing on 6,957 initial submissions and 10,389 text-entry reviews handled between January 2021 and February 2026, the team found that heavier AI use tracked with weaker prose, higher rejection rates, and a growing burden on the unpaid academics who keep the system running.

The numbers suggest a problem bigger than awkward phrasing. The same tools that can speed up writing are also colliding with an academic culture that often rewards output more than care.

Monthly Submission Volume at Organization Science from January 2013 Until the End of 2025. (CREDIT: Organization Science)

“We didn’t come with a point to make,” said Claudine Gartenberg, a senior editor on the team and a professor at Wharton. “We just said, let’s put some facts to this feeling.”

She is not arguing from the sidelines. “I use Claude Code and Codex all day long,” she said. “Every aspect of my research program over the last year.”

More papers, worse prose

The editorial team used Pangram, an AI detection tool, to score submissions and reviews on a continuous scale from zero to one. Rather than trying to label any one paper as definitively human or machine-written, the analysis looked for large shifts across thousands of texts.
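
For readers who want a concrete picture of that kind of aggregate analysis, the sketch below bins per-submission detector scores into bands like those cited in the report and tracks each band's monthly share. The column names, example values, and pandas workflow are assumptions for illustration, not the journal's actual code.

```python
# A minimal sketch, not the journal's pipeline: bin continuous detector scores
# into AI-use bands like those the report discusses and track monthly shares.
# The column names and example values here are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "submitted": pd.to_datetime(["2022-06-01", "2023-03-15", "2024-01-10"]),
    "ai_score": [0.02, 0.41, 0.83],  # detector output on a 0-to-1 scale
})

# Bands roughly matching the categories cited later in the report
bands = [0.0, 0.15, 0.30, 0.70, 1.0]
labels = ["0-15%", "15-30%", "30-70%", "70%+"]
df["ai_band"] = pd.cut(df["ai_score"], bins=bands, labels=labels, include_lowest=True)

# Share of each band per submission month -- the aggregate shift the analysis tracks
counts = df.groupby([df["submitted"].dt.to_period("M"), "ai_band"], observed=False).size()
monthly_share = counts / counts.groupby(level=0).transform("sum")
print(monthly_share)
```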

What emerged was a sharp change after ChatGPT’s release.

Submissions judged to contain little or no AI writing declined, while AI-assisted and heavily AI-generated submissions climbed. By February 2026, the majority of papers sent to the journal showed at least some degree of AI involvement. The fastest-growing segment was the most machine-heavy one: manuscripts with AI scores above 70%.

The writing quality moved in the opposite direction. Flesch Reading Ease, one standard measure of readability, fell sharply after late 2022. The journal reports that by January 2026, submission writing quality sat 1.28 standard deviations below its January 2021 level.

Across the journal’s data, higher AI scores went with lower readability: the correlation between AI score and Flesch Reading Ease was negative (ρ = −0.4, p ≤ .001). More AI-laden writing also tended to demand a higher reading grade level, use more jargon, and rely more heavily on nominalizations, the abstract noun forms that can turn plain actions into bureaucratic fog.
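
For the curious, the snippet below illustrates the two statistics involved: the standard Flesch Reading Ease formula and a rank correlation computed with SciPy. The crude syllable counter and the sample scores are assumptions for illustration; the report gives only the aggregate figures quoted above.

```python
# An illustration of the metrics mentioned above, not the journal's code.
import re
from scipy.stats import spearmanr

def count_syllables(word: str) -> int:
    # Cheap heuristic: count runs of vowels as syllables (real tools do better)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch Reading Ease: higher values mean easier prose
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Hypothetical paired detector and readability scores for a handful of texts
ai_scores = [0.05, 0.20, 0.45, 0.80]
readability = [62.1, 55.4, 41.0, 30.7]

rho, p_value = spearmanr(ai_scores, readability)  # a negative rho matches the reported pattern
print(round(flesch_reading_ease("Short sentences help. Long nominalized ones do not."), 1))
print(round(rho, 2), round(p_value, 3))
```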

Monthly Submission Volume by AI Use Categories over Time. (CREDIT: Organization Science)

That does not mean every metric worsened. The analysis found that AI-heavy prose showed less hedging, less passive voice, and more specificity. Yet the broader effect was still prose that felt denser and harder to move through.

Gartenberg compared it to the language criticized in George Orwell’s famous essay on political writing: swollen, abstract, and oddly slippery.

The incentive problem behind the surge

The editors argue that AI alone is not the full story.

Their central claim is that generative tools are amplifying incentives already baked into academic life, especially the pressure to produce a high count of papers. In business schools, one of the strongest symbols of that pressure is the UT-Dallas journal ranking list, which tracks faculty publications across 24 designated journals.

The research team examined whether schools that historically responded most strongly to that ranking system also changed their behavior most after ChatGPT’s debut. They did.

Schools classified as stronger “UTD Responders” increased their submissions after ChatGPT, and the growth was concentrated in papers with AI writing scores above 15%. The pattern remained directionally similar even after excluding schools in Mainland China and Hong Kong from one version of the analysis.

That matters because it suggests heavy AI use is not simply random or spread evenly across the field. It appears tied to institutional reward systems.

“AI, as it’s being used today, is colliding with institutional incentives to create more rather than better research,” Gartenberg said. “It’s not AI on its own. It’s AI plus publish-or-perish incentives.”

Trends in AI Use Categories over Time. (CREDIT: Organization Science)

The journal also found that using AI this way did not seem to help authors much. Papers with more AI writing were more likely to be rejected at the desk stage and more likely to be rejected after review. The break point looked especially clear once manuscripts crossed about 30% AI use.

After ChatGPT’s launch, 11.9% of papers in the 0% to 15% AI category received a revise-and-resubmit decision. For papers in the 70% and above category, that figure dropped to 3.2%.

The same drift is showing up in peer review

The submission side is only half the story. The journal found the same pattern creeping into peer review itself.

More than 30% of text-entered reviews now show detectable AI use. Before ChatGPT, that figure was close to zero. And as with manuscripts, those reviews became harder to read as AI scores rose. They contained more jargon, more nominalization, and lower readability.

The content of the reviews shifted too.

Using word-frequency measures, the editors found that AI-heavy reviews put more emphasis on theory and less on data. In their regression analysis, AI score was positively associated with theory emphasis and negatively associated with data emphasis. The team also used principal component analysis to show that AI-written reviews occupied a narrower evaluative range than human ones.
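
A minimal sketch of that style of analysis, assuming reviews are first turned into word-frequency vectors and then projected onto principal components, might look like the following. The example reviews, labels, and scikit-learn choices are invented for illustration and are not the task force's code.

```python
# A hedged sketch of the kind of analysis described: word-frequency features for
# each review, then principal components to compare how widely human-written and
# AI-heavy reviews spread out. The example reviews and labels are invented.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

reviews = [
    "The theoretical contribution hinges on a novel construct and its boundary conditions.",
    "The data section omits how the sample was drawn and how missing values were handled.",
    "The theoretical framing is elegant, but the construct definitions overlap heavily.",
    "Robustness checks on the panel data and the validity of the instrument need attention.",
]
is_ai_heavy = np.array([True, False, True, False])  # hypothetical detector labels

X = CountVectorizer(stop_words="english").fit_transform(reviews).toarray()
components = PCA(n_components=2).fit_transform(X)

# A narrower evaluative range would show up as lower variance along the components
for label, mask in [("AI-heavy", is_ai_heavy), ("human", ~is_ai_heavy)]:
    print(label, components[mask].var(axis=0))
```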

That narrowing could matter. A review that leans too heavily toward abstract theory and away from methods or evidence may give editors and authors a thinner account of what is actually wrong, or right, in a paper.

AI Scores for Each Section of a Sample of Manuscripts Stratified by AI Scores Falling into Low (<30), Medium (between 30 and 70), and High (>70) Categories. (CREDIT: Organization Science)

Most striking of all, AI-heavy reviews did not appear to inform editorial decisions. Human reviews correlated with decisions. AI-heavy reviews did not.

“It’s not like the editors know that those are AI reviews and they’re throwing them out,” Gartenberg said. “They’re reading them and they’re not informing the editor’s ultimate recommendation.”

That leaves editors doing more of the judging themselves, which protects the journal’s standards but adds strain to a system built on volunteer labor.

The humans are still holding the line

For now, the gatekeeping still works.

The journal found that published articles remain overwhelmingly human-written, at least based on detectable signals in abstracts. Heavily AI-generated manuscripts rarely make it through the funnel. The editors are catching most of the weak work before it reaches print.

But catching it takes people.

To manage the growing load, Organization Science increased its number of deputy editors from six to eleven. Its number of active senior editors rose from about 30 in the earlier period to about 60 in the later one. Some deputy editors now handle more than 250 manuscripts a year.

The report’s conclusion is not that AI has no place in science. In fact, the authors say the opposite. They used AI themselves while preparing the editorial, including for coding, outlining, phrasing, and comparing their essay with earlier work. Even with that assistance, they note, the editorial scored 8.8% on Pangram and still fell within their human-first range.

Scatterplot of AI Use in Abstract (x-axis) vs. AI Use Within the Manuscript Body (Mean of AI Use in Introduction, Theory, Methods, Results, Discussion, Conclusion). (CREDIT: Organization Science)

The problem, they argue, is not tool use by itself. It is what happens when researchers offload too much of the thinking and writing process.

“People think as they write,” Gartenberg said, “and so if you don’t write, you’re not thinking as deeply about it.”

The study also comes with limits. Pangram is treated here as a strong detector, but the authors stress that no detection system is fully reliable for judging individual texts. Their claims apply to aggregate patterns, not to any one manuscript or review. The journal also focused on one field and one outlet, Organization Science, even though the authors suspect similar patterns extend more broadly. And much of the writing in the dataset likely came from older models such as GPT-3.5 and GPT-4, whose prose habits were easier to spot and often clumsier than those of newer systems.

So this is not a final verdict on AI in research. It is a snapshot of a fast-changing moment.

Where the technology may help next

The irony is that the same technology now swelling the pipeline may eventually help manage it.

The report argues that the true bottleneck in publishing is no longer producing papers but evaluating them. Journals struggle to find reviewers, and editors drown in submissions. In that setting, AI may be more useful as a screening and triage tool than as a ghostwriter.

The authors raise several possibilities. A journal could use AI to flag unreadable prose, high jargon density, or weak alignment between claims and methods before editors invest much time. It could help steer reviewers toward neglected questions about data and evidence rather than replace their judgment. It could act as scaffolding, not as a substitute for expertise.
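
A bare-bones version of that triage idea might look like the sketch below, which flags a submission on simple text signals before an editor invests time. The thresholds and the toy jargon list are placeholders, not values the report prescribes.

```python
# A bare-bones sketch of the triage idea: flag a submission for extra scrutiny
# before editors invest time. The thresholds and the toy jargon list are
# placeholders, not recommendations from the report.
import re

JARGON = {"operationalize", "problematize", "contextualization", "isomorphism"}

def triage_flags(text, readability_score):
    words = re.findall(r"[a-z']+", text.lower())
    jargon_density = sum(w in JARGON for w in words) / max(1, len(words))
    flags = []
    if readability_score < 20:       # hypothetical "very hard to read" cutoff
        flags.append("low readability")
    if jargon_density > 0.05:        # hypothetical jargon-density cutoff
        flags.append("high jargon density")
    return flags

print(triage_flags("We problematize and operationalize the focal construct.", readability_score=12.0))
```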

The team stops short of calling for automated gatekeeping. It warns that disclosure rules and outright bans will not solve the deeper issue, which lies in institutions that reward paper counts and journal placements more than sustained intellectual contribution.

That is where the longer fight may be headed, toward tenure decisions, hiring norms, journal lists, and the culture of academic productivity itself.

Practical implications of the research

The most immediate lesson is not that AI should be pushed out of science. It is that science may need to get much clearer about what it wants AI to do.

For journals, the findings point to a practical need for better triage, better reviewer support, and policies that reduce the burden of low-quality, high-volume submissions before they consume scarce editorial attention.

For universities, the work raises a harder question about whether publication counts and journal-list incentives are now actively encouraging lower-value output. For researchers, the message is blunt: using AI to save time may backfire if it replaces the thinking that strong writing reflects.

The journal’s own data suggest that heavy AI writing does not improve a paper’s chances, and may damage them. If AI is going to strengthen research rather than swamp it, the tool will have to be aimed at better work, not just more of it.

Research findings are available online in the journal Organization Science.

The original story “AI is flooding peer review, and editors say it’s making science harder to judge” is published in The Brighter Side of News.

