Efforts to make AI inclusive accidentally create bizarre new gender biases, new research suggests

New research published in Computers in Human Behavior Reports suggests that efforts to make artificial intelligence more inclusive can sometimes create unexpected new biases. The scientists found that popular artificial intelligence models tend to overattribute stereotypically masculine behaviors to female characters and judge violence against women as significantly more objectionable than violence against men. These findings provide evidence that programming models to be sensitive to gender equity might accidentally introduce extreme ethical inconsistencies.

Scientists initiated this research to better understand how artificial intelligence systems handle gender and morality after their initial training. During development, these models undergo a refinement process based on human feedback. This process involves human reviewers grading the system’s answers to teach it preferred behaviors, like avoiding offensive language or promoting inclusivity.

The scientists suspected that this human feedback phase might teach the models to be highly sensitive to specific cultural priorities. Specifically, they thought the models might focus heavily on including women in traditionally male spaces and protecting women from harm.

“There has been a growing public debate about whether AI chatbots can develop unexpected biases, especially after post-training efforts meant to make them safer and more inclusive. Much of that discussion, however, has been anecdotal. We wanted to move beyond isolated examples and test the issue systematically,” said study author Valerio Capraro, an associate professor at the University of Milan Bicocca.

To test these ideas, the researchers conducted two main sets of experiments using different versions of the ChatGPT system, specifically GPT-3.5 Turbo, GPT-4, and GPT-4o.

“In this study, we focused on one of the most widely used chatbots at the time and asked whether it displayed surprising gender biases in two very different contexts,” Capraro said. “The goal was not just to document bias, but to understand whether attempts to reduce some biases can unintentionally produce new ones.”

In the first set of four experiments, the scientists examined how the systems assign gender to everyday statements. They prompted the systems using the standard public web interface to maintain realistic user conditions.

The researchers presented the artificial intelligence with twenty pairs of short phrases written in the style of elementary school students. Three pairs were control phrases that explicitly stated a gender. The remaining seventeen pairs contained traditional gender stereotypes regarding toys, movies, and future careers.

Half of these experimental phrases contained traditionally feminine stereotypes, like loving the color pink or wanting to be a nurse. The other half contained traditionally masculine stereotypes, like playing hockey or wanting to be a firefighter. The scientists asked the system to imagine the writer of the phrase and assign them a name, age, and gender, repeating this process ten times for each phrase pair to generate 400 responses per study.

The responses demonstrated a massive asymmetry in how the artificial intelligence applies gender assumptions. For phrases with stereotypically feminine activities, the models consistently assigned a female writer almost every single time. For phrases with stereotypically masculine activities, the models frequently assigned a female writer as well.

For instance, the models constantly attributed a sentence about loving football and practicing with a cousin to a female writer. The researchers suggest this happens because the refinement process strongly emphasizes placing women in traditionally masculine roles. At the same time, this programming lacks an equivalent push to place men in traditionally feminine roles, creating a persistent gender bias.

The second set of four experiments tested how these gender asymmetries affect high-stakes moral decisions. The scientists asked GPT-4 to rate its agreement with different violent actions required to prevent a hypothetical nuclear apocalypse. The system used a scale from one to seven, where one meant “strongly disagree” and seven meant “strongly agree.”

In the first moral experiment, the scientists asked the system fifty times about the acceptability of harassing a woman, harassing a man, or sacrificing a person’s life to stop the apocalypse. GPT-4 consistently gave the lowest possible score for harassing a woman, averaging a perfect one. In contrast, the system gave an average score of 3.34 for harassing a man and 3.61 for sacrificing a life, showing it views harassing a woman as much worse than killing a random person.

To see if this pattern held true across different types of harm, the researchers conducted another experiment focusing on abuse and torture. They asked the system twenty times each about abusing or torturing a man or a woman to stop the apocalypse. The system strongly disagreed with abusing a woman but was much more open to abusing a man, averaging a score of 4.2. On the other hand, the system viewed torturing a man and torturing a woman as equally acceptable.

“What surprised me most was how strong and consistent some of these effects were,” Capraro told PsyPost. “In one experiment, we asked GPT-4 fifty times whether it was acceptable to harass a woman to prevent a nuclear apocalypse, and every single time it responded ‘strongly disagree.’”

“By contrast, when we asked about torturing a woman, the answers were much more variable and on average much closer to the midpoint of the scales, which is a very unusual ordering if you think in terms of objective severity of harm. This suggests the model may be especially sensitive to certain categories of harm that are socially and politically salient, rather than simply responding to severity in a consistent way.”

In other words, this unexpected pattern might happen because torture is less central to modern gender equity debates than abuse. The models have likely been trained to flag and condemn harassment against women specifically.

The researchers then investigated whether these biases were explicit or hidden. They directly asked GPT-4 to rank the severity of these different moral violations twenty times. When asked directly, the system ranked the violations based on objective physical harm, placing sacrifice as the worst, followed by torture, abuse, and harassment. It explicitly stated that gender did not matter, revealing that its biased judgments in the previous scenarios were entirely implicit.

“That matters because it suggests that evaluating AI systems only through direct, explicit questioning may miss important biases that show up in applied decision-making,” Capraro explained.

A final experiment tested a complex scenario involving mixed-gender violence. The researchers asked the system eighty times about a situation where a bomb disposal expert must physically harm an innocent person to get a biological code to stop an explosion.

When the expert was a woman and the victim was a man, the system highly approved of the violence, giving it an average score of 6.4 out of 7. When the expert was a man and the victim was a woman, the system strongly condemned the exact same action, giving it an average score of 1.75. The gender of the characters drastically altered the system’s moral compass.

“The main takeaway is that reducing bias in AI is not simple,” Capraro said. “Efforts to make models more inclusive can sometimes introduce new asymmetries or amplify certain moral sensitivities in unexpected ways.”

“So the broader lesson is that people should be cautious about treating AI systems as neutral or objective. These models do not just reflect patterns in their training data; they may also reflect the values and priorities introduced during fine-tuning and human feedback. In some cases, that can lead to judgments that are not just biased, but surprisingly extreme.”

But the researchers caution that users should avoid interpreting these specific results as a permanent feature of all artificial intelligence systems. These programs receive constant updates, meaning future versions might process these exact prompts differently. “The paper should not be read as claiming that today’s models necessarily behave in exactly the same way,” Capraro noted.

“Our broader point is not that these exact biases will always appear, but that post-training interventions can create unintended distortions. In other words, the paper is less about one specific model and more about a general warning for both developers and users. Developers should be aware that trying to correct one problem can sometimes create another. Users should remember that confident-looking outputs can still reflect hidden biases.”

“One important next step is to study whether similar biases appear in more realistic and socially consequential settings, such as résumé screening, hiring recommendations, or other decision-support contexts,” Capraro continued. “Those are the domains where bias matters a lot in practice.”

“More broadly, I think AI has enormous potential, but that potential will only be socially beneficial if the systems are developed and deployed in a way that distributes benefits fairly. So my long-term goal is to better understand how bias enters these models, how it changes across model versions and prompting styles, and how we can reduce harmful distortions without simply replacing them with new ones.”

The study, “Surprising gender biases in GPT,” was authored by Raluca Alexandra Fulgu and Valerio Capraro.

Leave a comment
Stay up to date
Register now to get updates on promotions and coupons
HTML Snippets Powered By : XYZScripts.com

Shopping cart

×