A new study published in Computers in Human Behavior: Artificial Humans provides evidence that ChatGPT’s judgments of facial traits such as attractiveness, dominance, and trustworthiness tend to align with those made by humans. Across multiple experiments, researcher Robin S.S. Kramer of the University of Lincoln found that the AI’s evaluations of faces generally reflected average human opinions. The chatbot was also prone to the same “attractiveness halo effect” seen in human judgments.
First impressions formed from faces are known to influence how people treat others. Traits such as trustworthiness, dominance, and attractiveness are rapidly inferred from facial features and can affect real-life decisions, including hiring and criminal sentencing. While these judgments are shaped by personal experience and culture, past research has shown that there is often a surprising degree of agreement across individuals.
ChatGPT, although not originally designed for visual analysis, now includes multimodal functionality that allows it to interpret uploaded images by encoding them into representations it can process alongside text. Because it was trained on large numbers of human-created image-text pairs, it is plausible that it has developed internal associations between facial features and social traits.
“Since the release of ChatGPT, I’ve been really interested in the capabilities of this new wave of AI tools. Once users were able to upload images to it and interrogate what ChatGPT could ‘see,’ I was fascinated to understand its perceptions of face photos,” explained Kramer, a senior lecturer at the University of Lincoln.
“Since the chatbot has been trained on a vast amount of images and text from the internet (presumably including lots of faces), it was logical to predict that its judgements would, at least to some extent, align with our own. Even so, this needed to be tested rather than assumed.”
To evaluate ChatGPT’s ability to interpret social traits from faces, Kramer conducted a series of studies using a well-established set of face photographs from the Chicago Face Database. This database contains images of people with neutral expressions and is accompanied by human ratings on traits like attractiveness, dominance, and trustworthiness. Importantly, the image files themselves are not publicly accessible online, making it unlikely that these specific faces were part of ChatGPT’s training data.
In the first study, the researcher paired faces that had been rated by humans as either very high or very low on one of the three social traits. ChatGPT was shown these pairs and asked to choose which person looked more attractive, dominant, or trustworthy. Across all 360 image pairs, the chatbot’s choices agreed with human ratings more than 85% of the time. Agreement was especially high for attractiveness judgments, where the AI selected the face humans had rated more highly nearly every time.
However, some variation existed across traits and demographic groups. Agreement was slightly lower for trustworthiness and dominance, especially when the human-rated difference between paired faces was small. This suggests that ChatGPT’s judgments may be more reliable when the contrast is clearer, which mirrors human perception.
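The paper does not reproduce its prompts here, but the forced-choice procedure it describes can be approximated with any vision-capable chat model. The sketch below uses the OpenAI Python SDK; the model string, prompt wording, and file names are illustrative assumptions rather than details taken from the study.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def to_data_url(path: str) -> str:
    """Read a local image file and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def forced_choice(image_a: str, image_b: str, trait: str) -> str:
    """Show two face photographs and ask which person looks higher on the given trait."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model string; the study does not specify one
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Here are two face photographs. Which person looks more {trait}? "
                         "Answer 'first' or 'second' only."},
                {"type": "image_url", "image_url": {"url": to_data_url(image_a)}},
                {"type": "image_url", "image_url": {"url": to_data_url(image_b)}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Illustrative file names only; the Chicago Face Database images cannot be redistributed.
print(forced_choice("high_attractive.jpg", "low_attractive.jpg", "attractive"))
```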
In a follow-up experiment, the researcher examined how ChatGPT’s ratings of individual faces compared with those of human participants. Both ChatGPT and 63 human participants rated the same 40 White faces for attractiveness on a 1–7 scale, and ChatGPT completed the task twice so that its consistency across repeated runs could be assessed.
ChatGPT’s ratings showed a moderate correlation with average human judgments (around 0.52) and were somewhat less aligned with individual raters (around 0.36 on average). Its internal consistency, measured by how similar its first and second ratings were, was fairly strong at 0.64, close to the human test-retest average of 0.74.
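As a rough illustration of how such alignment and consistency figures are computed, the snippet below runs Pearson correlations on simulated ratings; the values are stand-ins for illustration, not the study’s data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Stand-in data: 40 faces rated on a 1-7 scale (simulated, not the study's ratings).
human_mean = rng.uniform(2, 6, size=40)                           # mean rating per face across raters
chatgpt_run1 = np.clip(human_mean + rng.normal(0, 1, 40), 1, 7)   # ChatGPT, first pass
chatgpt_run2 = np.clip(human_mean + rng.normal(0, 1, 40), 1, 7)   # ChatGPT, second pass

# Alignment with group-level human judgment (the study reports roughly 0.52 here)
r_group, _ = pearsonr(chatgpt_run1, human_mean)

# Consistency between ChatGPT's two passes (the study reports roughly 0.64)
r_retest, _ = pearsonr(chatgpt_run1, chatgpt_run2)

print(f"correlation with mean human ratings: {r_group:.2f}")
print(f"consistency across repeated runs:    {r_retest:.2f}")
```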
These findings suggest that ChatGPT behaves similarly to an average human rater: not identical to any one person, but generally in line with group-level judgments. Moreover, the variation in its responses across sessions mirrors the variability seen in human behavior, although this is partially due to the way ChatGPT generates outputs with some randomness.
“ChatGPT’s perceptions of human faces align with our own judgements,” Kramer told PsyPost. “In other words, faces that we see as attractive or trustworthy, for instance, are also ‘seen’ that way by the chatbot.”
Given longstanding concerns about racial bias in AI systems, another goal of the research was to assess whether ChatGPT showed any preference for one racial group over another when evaluating faces. To do this, the researcher created pairs of images where the most highly rated faces from one racial group were matched against the least highly rated faces from another group. If ChatGPT had consistently favored a particular race, even when human ratings suggested otherwise, it would have indicated a potential bias.
However, ChatGPT chose the higher-rated face in 58 out of 60 such comparisons. This pattern was consistent across traits and genders, suggesting that the chatbot’s judgments were not strongly influenced by race, at least in these clearly defined cases.
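A minimal sketch of how such cross-group pairings can be constructed from a ratings table is shown below; the group labels, column names, and values are placeholders rather than the Chicago Face Database’s actual variables.

```python
import pandas as pd

# Toy ratings table standing in for the face database norms (values invented).
ratings = pd.DataFrame({
    "face_id": ["A1", "A2", "A3", "B1", "B2", "B3"],
    "group":   ["X", "X", "X", "Y", "Y", "Y"],
    "trustworthiness": [5.9, 5.4, 2.2, 5.7, 2.5, 2.0],
})

def extreme_cross_group_pairs(df, trait, high_group, low_group, n=2):
    """Pair the highest-rated faces of one group with the lowest-rated of another."""
    high = df[df["group"] == high_group].nlargest(n, trait)["face_id"].tolist()
    low = df[df["group"] == low_group].nsmallest(n, trait)["face_id"].tolist()
    return list(zip(high, low))

# Each pair would then go through the same forced-choice prompt as in the first study;
# a bias would show up as the lower-rated face winning simply because of its group.
print(extreme_cross_group_pairs(ratings, "trustworthiness", "X", "Y"))
print(extreme_cross_group_pairs(ratings, "trustworthiness", "Y", "X"))
```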
That said, this method could only detect overt bias. It could not capture more subtle forms, such as associations with skin tone or facial features that may influence perception in less obvious ways. As a result, the absence of explicit racial bias in this study does not rule out the possibility of more nuanced biases, which the author notes as a direction for future research.
The study also examined whether ChatGPT is prone to the same type of “halo effect” often seen in human judgments, where individuals seen as attractive are also assumed to possess other positive traits. Using image pairs where ChatGPT had already judged one face as more attractive, the researcher asked the AI to evaluate which person looked more intelligent, sociable, or confident. In 92.5% of these comparisons, ChatGPT selected the more attractive face for at least one of these additional traits.
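For illustration, the tally behind that percentage amounts to an “at least one of three traits” check per pair, as in this toy sketch with invented responses:

```python
# For each pair, record whether the face already judged more attractive was also
# picked for at least one of the three other traits (data invented for illustration).
pair_choices = [
    {"intelligent": True,  "sociable": False, "confident": True},
    {"intelligent": False, "sociable": False, "confident": False},
    {"intelligent": True,  "sociable": True,  "confident": True},
    {"intelligent": False, "sociable": True,  "confident": False},
]

halo_pairs = sum(any(choices.values()) for choices in pair_choices)
print(f"halo effect observed in {halo_pairs / len(pair_choices):.1%} of pairs")
```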
“While I suspected that humans and ChatGPT would agree in terms of which faces were more attractive, I was somewhat surprised by ChatGPT’s demonstrating a halo effect,” Kramer said. “The tool considered more attractive faces to also be more confident, intelligent, and sociable (as humans do), which was presumably due to information present but more implicit in the training data. I suppose that the text accompanying images online consistently labelled or described more attractive faces in these ways, resulting in the bias present here.”
The findings provide support for the idea that ChatGPT’s facial trait judgments resemble those of humans. But as with all research, there are some caveats.
“One caveat is that I utilized a constrained set of face images in my work,” Kramer noted. “All identities showed a neutral expression, were forward-facing, and wore a grey t-shirt in front of a white background. As such, I avoided the possible influence of variation that comes with unconstrained, real-world images. This was an initial investigation and I think it would be really interesting to explore how ChatGPT might be affected by things like facial expressions, clothing color, and so on, and whether these influences mirror how humans are affected by such changes.”
Another limitation is the challenge of measuring alignment between AI and human judgments. Because people often disagree with each other, it can be difficult to establish a clear benchmark for comparison. The researcher addressed this by using both average human ratings and analyses of consistency, but further work is needed to refine these methods.
“While agreement between humans and ChatGPT seemed to be fairly large, I focussed here on relatively clear cut comparisons (e.g., pairing two faces that were rated highest and lowest for a particular trait like attractiveness),” Kramer explained. “As such, further work might explore more nuanced judgements to better quantify this agreement. Problematically, of course, humans don’t even perfectly agree with each other when making such judgements, so any approach also needs to take this issue into account.”
Additionally, because ChatGPT’s responses are non-deterministic, identical prompts can lead to slightly different outputs. This randomness is intended to make interactions feel more natural, but it complicates efforts to measure reliability. Even so, the study found that ChatGPT’s repeated judgments were fairly stable and similar to patterns seen in human test-retest data.
In future research, Kramer plans to explore how ChatGPT’s internal models of facial traits might influence the images it generates. Since ChatGPT is now capable of producing synthetic faces, it will be important to see whether its concept of an “attractive” or “trustworthy” face shapes how those faces appear. This could have implications for how AI-generated content is interpreted and used in real-world applications.
“One thing I hadn’t realized until after completing this research was that uploading images to ChatGPT (and other such tools) can be considered sharing with a third party, which may go against the terms of use for some image sets,” Kramer added. “As such, I urge researchers, and indeed any users of AI tools, to make sure that they are aware of what can and cannot be uploaded when interacting with these chatbots.”
The study, “Comparing ChatGPT with human judgements of social traits from face photographs,” was authored by Robin S.S. Kramer.