A new theoretical analysis published in The Journal of Creative Behavior challenges the prevailing narrative that artificial intelligence is on the verge of surpassing human artistic and intellectual capabilities. The study provides evidence that large language models, such as ChatGPT, are mathematically constrained to a level of creativity comparable to an amateur human.
The study was conducted by David H. Cropley, a professor of engineering innovation at the University of South Australia. Cropley initiated this research to bring objective measurement to the polarized debate surrounding generative AI. While some proponents argue that AI can already outperform humans in creative tasks, others maintain that these systems merely mimic existing data without genuine understanding.
Cropley sought to move beyond subjective opinions by applying the standard definition of creativity to the probabilistic mechanics of large language models. His goal was to determine if the way these models operate places an inherent limit on the quality of their output.
To evaluate the creative potential of artificial intelligence, the researcher first had to define what counts as a creative product. He used the standard definition of creativity, which holds that an output is creative only if it satisfies two criteria: effectiveness and originality.
Effectiveness refers to the product being useful, appropriate, or fit for its intended purpose. Originality refers to the product being novel, unusual, or surprising. In high-level human creativity, these two traits exist simultaneously; a masterpiece is both highly unique and perfectly executed.
Cropley focused his analysis on the “product” aspect of creativity rather than the psychological processes or environmental factors that influence humans, as AI does not possess personality traits or experience workplace culture. He examined the “next-token prediction” mechanism used by large language models.
These systems function by breaking text into smaller units called tokens and calculating, from patterns in their training data, the probability of each token that could follow the preceding ones. This process is transparent and mathematically well defined, allowing for a calculation of creativity that is not possible when studying the opaque cognitive processes of the human brain.
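The next-token step can be sketched in a few lines of Python. This is only an illustration, not code from the study or from any real model: the candidate words and their scores in `toy_logits` are invented for the example, and a production model ranks tens of thousands of tokens rather than four.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for completions of "The cat sat on the ..."
toy_logits = {"mat": 4.0, "sofa": 3.0, "roof": 2.0, "wrench": -1.0}

probs = softmax(toy_logits)

# Greedy decoding: always take the single most probable token.
greedy_choice = max(probs, key=probs.get)

for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:>7}: {p:.3f}")
print("greedy choice:", greedy_choice)
```

With these made-up scores, the greedy choice is the highest-scoring word, "mat".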
The investigation revealed a fundamental trade-off embedded in the architecture of large language models. For an AI response to be effective, the model must select words that have a high probability of fitting the context. For instance, if the prompt is “The cat sat on the…”, the word “mat” is a highly effective completion because it makes sense and is grammatically correct. However, because “mat” is the most statistically probable ending, it is also the least novel. It is entirely expected.
Conversely, if the model were to select a word with a very low probability to increase novelty, the effectiveness would drop. Completing the sentence with “red wrench” or “growling cloud” would be highly unexpected and therefore novel, but it would likely be nonsensical and ineffective. Cropley determined that within the closed system of a large language model, novelty and effectiveness function as inversely related variables. As the system strives to be more effective by choosing probable words, it automatically becomes less novel.
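A toy example makes the inverse relationship concrete. The probabilities below are invented, and the two scoring functions are assumed proxies chosen for illustration (effectiveness as the word's probability, novelty as one minus that probability), not definitions quoted from the paper.

```python
# Hypothetical next-token probabilities for "The cat sat on the ..."
toy_probs = {"mat": 0.70, "sofa": 0.20, "roof": 0.08, "wrench": 0.02}

def effectiveness(p):
    """Assumed proxy: a more probable word fits the context better."""
    return p

def novelty(p):
    """Assumed proxy: a less probable word is more surprising."""
    return 1.0 - p

by_effectiveness = sorted(
    toy_probs, key=lambda w: effectiveness(toy_probs[w]), reverse=True
)
by_novelty = sorted(
    toy_probs, key=lambda w: novelty(toy_probs[w]), reverse=True
)

print("most effective first:", by_effectiveness)
print("most novel first:    ", by_novelty)
```

Under these assumptions the two rankings are exact mirror images: whichever word scores best on one measure scores worst on the other.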
By expressing this relationship through a mathematical formula, the study identified a specific upper limit for AI creativity. Cropley modeled creativity as the product of effectiveness and novelty. Because these two factors work against each other in a probabilistic system, the maximum possible creativity score is mathematically capped at 0.25 on a scale of zero to one.
This peak occurs only when both effectiveness and novelty are balanced at moderate levels: if novelty is taken to be one minus effectiveness, the product is largest when both equal 0.5, giving 0.5 × 0.5 = 0.25. This finding indicates that large language models are structurally incapable of maximizing both variables simultaneously, preventing them from achieving the high scores possible for human creators who can combine extreme novelty with extreme effectiveness.
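The ceiling can be checked numerically. The sketch below assumes the simplest form consistent with the description above, creativity = effectiveness × novelty with novelty = 1 − effectiveness; the function name and the grid of test values are illustrative choices, not taken from the paper.

```python
def creativity(effectiveness):
    """Product of effectiveness and novelty, assuming novelty = 1 - effectiveness."""
    novelty = 1.0 - effectiveness
    return effectiveness * novelty

# Evaluate the score at effectiveness = 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
best_e = max(grid, key=creativity)
best_score = creativity(best_e)

print(f"peak at effectiveness = {best_e:.2f}, novelty = {1 - best_e:.2f}")
print(f"maximum creativity score = {best_score:.2f}")

# Moving away from the balance point in either direction lowers the score.
for e in (0.1, 0.5, 0.9):
    print(f"E = {e:.1f} -> C = {creativity(e):.2f}")
```

The search lands on an effectiveness of 0.5 and a score of 0.25, which matches what calculus gives for E × (1 − E): the derivative 1 − 2E is zero at E = 0.5.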
To contextualize this finding, the researcher compared the 0.25 limit against established data regarding human creative performance. He aligned this score with the “Four C” model of creativity, which categorizes creative expression into levels ranging from “mini-c” (interpretive) to “Big-C” (legendary).
The study found that the AI limit of 0.25 corresponds to the boundary between “little-c” creativity, which represents everyday amateur efforts, and “Pro-c” creativity, which represents professional-level expertise.
This comparison suggests that while generative AI can convincingly replicate the work of an average person, it is unable to reach the levels of expert writers, artists, or innovators. The study cites empirical evidence from other researchers showing that AI-generated stories and solutions consistently rank in the 40th to 50th percentile compared to human outputs. These real-world tests support the theoretical conclusion that AI cannot currently bridge the gap to elite performance.
“While AI can mimic creative behaviour – quite convincingly at times – its actual creative capacity is capped at the level of an average human and can never reach professional or expert standards under current design principles,” Cropley explained in a press release. “Many people think that because ChatGPT can generate stories, poems or images, that it must be creative. But generating something is not the same as being creative. LLMs are trained on a vast amount of existing content. They respond to prompts based on what they have learned, producing outputs that are expected and unsurprising.”
The study highlights that human creativity is not symmetrically distributed; most people perform at an average level, which explains why AI output often feels impressive to the general public. Since a large portion of the population produces “little-c” level work, an AI that matches this level appears competent.
However, highly creative professionals quickly recognize the formulaic nature of AI content. The mathematical ceiling ensures that while the software can be a helpful tool for routine tasks, it cannot autonomously generate the kind of transformative ideas that define professional creative work.
“A skilled writer, artist or designer can occasionally produce something truly original and effective,” Cropley noted. “An LLM never will. It will always produce something average, and if industries rely too heavily on it, they will end up with formulaic, repetitive work.”
There are limitations to the theory presented in the paper. The model uses a linear approximation in which novelty falls in direct proportion as effectiveness rises, which is a simplification of more complex concepts from information theory.
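To see what such a simplification can leave out, compare a linear novelty score with surprisal, a standard information-theory measure of how unexpected an event is. Surprisal is chosen here purely for illustration; the article does not say which information-theoretic quantity the paper has in mind, and both function names are hypothetical.

```python
import math

def linear_novelty(p):
    """Linear proxy: novelty as one minus the token's probability."""
    return 1.0 - p

def surprisal_bits(p):
    """Surprisal (self-information) of an event with probability p, in bits."""
    return -math.log2(p)

for p in (0.9, 0.5, 0.1, 0.01):
    print(
        f"p = {p:<4}: linear novelty = {linear_novelty(p):.2f}, "
        f"surprisal = {surprisal_bits(p):.2f} bits"
    )
```

The linear score can never exceed 1, while surprisal keeps growing as a token becomes rarer, so the two measures diverge most for exactly the low-probability choices that would count as highly novel.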
The study also assumes a standard mode of operation for these models, such as greedy decoding or simple sampling, and does not account for every possible variation in prompting strategies or for human-in-the-loop editing that might artificially enhance the final product. The analysis focuses on the autonomous output of the system rather than its potential as a collaborative tool.
Future research is likely to investigate how different temperature settings—parameters that control the randomness of AI responses—might allow for slight fluctuations in this creativity ceiling. Additionally, researchers may explore whether reinforcement learning techniques could be adjusted to weigh novelty more heavily without sacrificing coherence. Cross-lingual studies could also determine if this mathematical limit holds true across different languages and cultural contexts.
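Temperature works by rescaling a model's raw scores before they are converted to probabilities. The sketch below uses invented scores (`toy_logits`) and a hypothetical helper name; it shows the general mechanism only and makes no claim about how any particular model or the study implements it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then normalize into probabilities."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for completions of "The cat sat on the ..."
toy_logits = {"mat": 4.0, "sofa": 3.0, "roof": 2.0, "wrench": -1.0}

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    summary = ", ".join(f"{tok}={p:.3f}" for tok, p in probs.items())
    print(f"temperature {t}: {summary}")
```

A low temperature concentrates probability on the likeliest word, while a high temperature shifts some of it toward unlikely words such as "wrench", making surprising completions more frequent at the cost of more frequent misfires.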
“For AI to reach expert-level creativity, it would require new architecture capable of generating ideas not tied to past statistical patterns,” Cropley concluded. Until such a paradigm shift occurs in computer science, the evidence indicates that human beings remain the sole source of high-level creativity.
The study, “‘The Cat Sat on the …?’ Why Generative AI Has Limited Creativity,” was authored by David H. Cropley.