Researchers are using Dungeons & Dragons to find the breaking points of major AI models

A new study presented at the NeurIPS 2025 conference suggests that the tabletop game Dungeons & Dragons can serve as a tool for testing the capabilities of artificial intelligence agents. Researchers found that while current models can handle simple questions, they struggle to manage the multiple steps, strict rules, and team coordination required by a full game session. The results suggest that certain models are much more reliable than others at following instructions over long periods, though all models eventually show a decline in accuracy as the game progresses.

The investigation into these digital adventurers was led by Ziyi Zeng, a researcher at the University of California, San Diego. Zeng worked alongside colleagues from the same institution and the University of Pennsylvania to build a bridge between human language and rigid game mechanics. This work addresses a specific gap in how researchers measure the abilities of Large Language Models, which are the engines behind modern chatbots.

Many existing tests for these models only look at how they answer a single question or solve a short task. However, these programs are increasingly used as autonomous agents that must operate independently to solve multi-step problems in the real world. A game like Dungeons & Dragons provides a controlled environment to see if an agent can remember past events, cooperate with allies, and obey a set of physical and magical rules.

The researchers focused on the combat portion of the game, where characters must move across a map and use their abilities to defeat monsters. In this setting, the stakes are high for the characters, and every decision is governed by the roll of a die and a thick rulebook. This requires the artificial intelligence to balance its creative storytelling with the mathematical reality of the game world.

To conduct the study, Zeng and the team developed a framework called D&D Agents. This system acts as a simulator where different models can play against each other or with humans. Instead of just letting the models talk freely, the researchers forced them to use specific digital tools to interact with the game.

These tools allowed the models to query the state of the world, such as checking how much health a monster had left or if a wall was blocking their view. When an agent wanted to take an action, it had to call a specific function that calculated the outcome based on the official rules. This prevented the models from simply making up results, which is a common problem in language generation.
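The article does not reproduce the framework's actual interface, but the general pattern of tool-grounded play can be illustrated with a brief sketch. The function names, fields, and numbers below (query_state, take_action, the Combatant record) are hypothetical stand-ins rather than the real D&D Agents code.

```python
# Hypothetical sketch of tool-grounded play: the model never narrates outcomes
# directly; it must call tools, and the engine resolves results from the rules.
import random
from dataclasses import dataclass

@dataclass
class Combatant:
    name: str
    hp: int
    armor_class: int

# Ground-truth game state lives in the engine, not in the model's text.
STATE = {"goblin": Combatant("goblin", hp=7, armor_class=15)}

def query_state(target: str) -> dict:
    """Read-only tool: lets the agent check facts instead of guessing them."""
    c = STATE[target]
    return {"name": c.name, "hp": c.hp, "armor_class": c.armor_class}

def take_action(actor: str, action: str, target: str, attack_bonus: int = 5) -> dict:
    """Action tool: the engine rolls the dice and applies the rules."""
    defender = STATE[target]
    if defender.hp <= 0:
        return {"error": f"{target} is already down"}  # invalid actions are rejected
    roll = random.randint(1, 20) + attack_bonus
    if roll >= defender.armor_class:
        damage = random.randint(1, 8)  # e.g., a longsword's d8
        defender.hp -= damage
        return {"hit": True, "roll": roll, "damage": damage, "target_hp": defender.hp}
    return {"hit": False, "roll": roll}

# A model's turn becomes a sequence of tool calls rather than free-form prose:
print(query_state("goblin"))
print(take_action("fighter", "attack", "goblin"))
```

Because every outcome comes back from the engine, a made-up result never enters the game record, which is what the authors rely on to keep the simulation honest.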

The team tested three specific models: Claude 3.5 Haiku, GPT-4o, and DeepSeek-V3. Each model was given 27 different combat scenarios to play through, ranging from simple skirmishes to difficult battles. The researchers measured performance across six different categories, including how well they used their tools and how effectively they planned their tactics.

The data showed that Claude 3.5 Haiku was the most reliable agent in these sessions. It was particularly good at using the provided software functions correctly and staying in its assigned role. GPT-4o followed closely behind, showing strong performance but slightly less consistency in its tool usage.

DeepSeek-V3 performed at a lower level than the other two models. The researchers also tried to test a large open-source model with 120 billion parameters, but it failed to complete basic tasks and could not produce valid game sessions. This suggests that the size of a model is not the only factor that determines its ability to act as a functional agent.

One of the most frequent problems the researchers observed was a loss of focus over time. As the game turns went on, the models began to make more mistakes about the state of the world. For instance, a model might try to attack an enemy that had already been defeated or ignore a status effect that was currently affecting its character.

The researchers categorized these mistakes as hallucinations of the game state. They found that errors regarding the health of a character or their position on the map became more common as the history of the conversation grew longer. This indicates that current technology still has difficulty maintaining an accurate mental map of a situation during extended interactions.

The study also looked at how well the models could act in character while playing. To do this, they used an automated judge to scan the dialogue for “persona density” and “trait diversity.” This measured whether a model sounded like a heroic knight or a cunning rogue while it was announcing its moves.

Claude 3.5 Haiku was particularly successful at varying its vocabulary based on the specific character it was playing. It could switch between the wit of a bard and the calm of a druid with more reliability than the other models. DeepSeek-V3 tended to use the same few voices repeatedly, even when the situation changed.
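The study's actual judge prompts are not described in the article, so the sketch below only illustrates the general idea behind the two role-play measures using crude lexical proxies. The marker lists, thresholds, and both function names are invented for the example.

```python
# Rough, hypothetical proxies for the two role-play measures described above.
# The study used an automated judge; these keyword lists are illustrative only.
PERSONA_MARKERS = {
    "bard":  {"song", "verse", "audience", "flourish"},
    "druid": {"grove", "roots", "balance", "wildshape"},
}

def persona_density(utterance: str, persona: str) -> float:
    """Fraction of words in an utterance that match the assigned persona's markers."""
    words = utterance.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in PERSONA_MARKERS[persona])
    return hits / len(words)

def trait_diversity(utterances: list[str]) -> float:
    """Share of unique word types across a character's lines; repetitive voices score low."""
    words = [w.strip(".,!?") for u in utterances for w in u.lower().split()]
    return len(set(words)) / len(words) if words else 0.0

lines = ["A verse for the grim goblin, sung with flourish!", "Another verse, another flourish."]
print(persona_density(lines[0], "bard"), trait_diversity(lines))
```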

The researchers also noted some unusual behaviors during the simulations. Monsters controlled by the models occasionally developed distinct personalities, taunting players with phrases like “shiny man’s gonna bleed” in the middle of a fight. Some heroic characters would stop to give speeches in the middle of dangerous situations, even when those actions were not tactically wise.

Tactical optimality was another major focus of the evaluation. The team checked if the models were choosing the best possible actions, such as attacking when an enemy was in range or moving to safety when injured. Claude 3.5 Haiku again led this category, showing a more aggressive and efficient use of its resources.

In easier scenarios, all the models managed to keep their players alive at a relatively high rate. However, when the difficulty increased, the differences in tactical planning became more apparent. The top models were better at using high-level spells and abilities to end fights quickly, while other models were more conservative.
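The article describes the tactical check only at a high level. A minimal rule-based version of that idea might look like the following, where the heuristics, thresholds, and helper names are assumptions made for illustration, not the study's actual scoring policy.

```python
# Hypothetical rule-of-thumb checker for tactical optimality: did the agent
# attack when it could, and retreat when badly hurt? Thresholds are invented.
from dataclasses import dataclass

@dataclass
class TurnContext:
    own_hp: int
    max_hp: int
    enemy_in_range: bool
    has_spell_slots: bool

def best_action(ctx: TurnContext) -> str:
    """Pick the expected action under simple heuristics (not the study's real policy)."""
    if ctx.own_hp <= ctx.max_hp * 0.25 and not ctx.enemy_in_range:
        return "retreat"
    if ctx.enemy_in_range:
        return "cast_spell" if ctx.has_spell_slots else "attack"
    return "move_toward_enemy"

def tactical_optimality(turns: list[tuple[TurnContext, str]]) -> float:
    """Fraction of turns where the agent's chosen action matched the heuristic."""
    matches = sum(1 for ctx, chosen in turns if chosen == best_action(ctx))
    return matches / len(turns) if turns else 0.0

log = [
    (TurnContext(own_hp=30, max_hp=30, enemy_in_range=True,  has_spell_slots=True),  "cast_spell"),
    (TurnContext(own_hp=5,  max_hp=30, enemy_in_range=False, has_spell_slots=False), "attack"),
]
print(tactical_optimality(log))  # 0.5 in this toy log
```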

The researchers used a combination of automated checks and human reviewers to verify these results. They found that their automated scoring system matched human judgment with very high accuracy. This means the framework could be used to test many more models in the future without needing a human to watch every game.

There are some limitations to the current work that the team plans to address in the future. The study primarily focused on combat, which is only one part of the Dungeons & Dragons experience. The researchers did not evaluate the social negotiation or the open-ended exploration that happens in a full campaign.

Future research will look at how fine-tuning a model on specific game data might improve its performance. The team also hopes to expand the simulator to include legal case simulations or business strategy games. These environments share the same need for strict rule adherence and long-term planning seen in tabletop role-playing.

The researchers suggest that improving how models track the state of a game will eventually lead to better AI assistants in the real world. If an agent can keep track of a complex battle on a digital map, it might also be better at managing a long-term project or coordinating a multi-party negotiation. For now, the game remains a difficult but productive hurdle for the next generation of artificial intelligence.

The study, “Setting the DC: Tool-Grounded D&D Simulations to Test LLM Agents,” was authored by Ziyi Zeng, Shengqi Li, Jiajun Xi, Andrew Zhu, and Prithviraj Ammanabrolu.
