Grok vs Claude: AI Battle Royale Performance and Cost Analysis

In an experiment conducted by Jacky Liang of OpenRouter, 11 large language models (LLMs) were placed in a 2D battle royale simulation to test their competitive capabilities. The results showed a stark contrast in performance and behavior: xAI’s Grok 4.1 Fast won 43% of the matches (13 out of 30) at a cost of $0.97 per win. Anthropic’s Claude Sonnet 4.6, despite being a top-tier model, won only 5 games at $26.78 per win, roughly 27 times more. Behavior differed as well: Claude tried to form alliances and socialize, while GPT 5.4 led in total kills (38) but did not secure the most victories. The study suggests that traditional benchmarks may miss the behavioral traits that matter for real-world AI agent deployment.

Key Takeaways

Grok 4.1 Fast Dominance: xAI's model won 13 out of 30 games, achieving a 43% win rate, making it the most successful competitor in the battle royale format.
Cost Efficiency Gap: Grok 4.1 Fast proved to be 27x more cost-effective than Claude Sonnet 4.6, costing only $0.97 per win compared to Claude's $26.78.
Behavioral Divergence: Claude Sonnet 4.6 exhibited highly social and cooperative traits, frequently attempting to team up and reveal its location, which hindered its performance in a winner-take-all scenario.
Aggression vs. Strategy: GPT 5.4 recorded the highest number of kills (38 agents) across the simulation but did not translate this aggression into the highest number of overall match wins.
Benchmark Limitations: The experiment suggests that standard AI evaluations often miss the behavioral nuances that determine how a model performs in dynamic, multi-agent environments.

In-Depth Analysis

The Performance and Cost Paradox

The experiment conducted by Jacky Liang, Dev Rel Lead at OpenRouter, provides a unique perspective on model evaluation by moving beyond static text benchmarks and into a dynamic 2D battle royale environment. The data reveals a massive disparity in both performance and economic efficiency. Grok 4.1 Fast secured 13 wins out of 30 games, a feat that cost the researcher less than a dollar per victory ($0.97). In contrast, the runner-up, Claude Sonnet 4.6, managed only 5 wins with a significantly higher price tag of $26.78 per win.

This roughly 27x difference in cost-per-win matters to "routing customers," those who use services like OpenRouter to direct queries to the most efficient model. The findings suggest that models that are typically excluded from "top-model" lists based on traditional academic benchmarks might actually be the most effective and economical choices for specific, goal-oriented tasks like competitive gaming or autonomous navigation.
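The arithmetic behind these figures can be checked with a short sketch. The win counts, game count, and per-win costs come from the article; the dictionary layout, the function names, and the assumption that cost-per-win equals total spend divided by wins are illustrative only and are not taken from the experiment's own code.

```python
# Figures quoted in the article. Names and structure are hypothetical.
GAMES_PLAYED = 30

RESULTS = {
    "Grok 4.1 Fast": {"wins": 13, "cost_per_win": 0.97},
    "Claude Sonnet 4.6": {"wins": 5, "cost_per_win": 26.78},
}


def win_rate(wins: int, games: int = GAMES_PLAYED) -> float:
    """Fraction of games won."""
    return wins / games


def implied_total_cost(wins: int, cost_per_win: float) -> float:
    """Total spend implied by the per-win cost.

    ASSUMPTION: cost-per-win was computed as total spend / wins.
    The article does not state how the figure was derived.
    """
    return wins * cost_per_win


grok = RESULTS["Grok 4.1 Fast"]
claude = RESULTS["Claude Sonnet 4.6"]

# Ratio of per-win costs, the source of the "27x" figure.
cost_ratio = claude["cost_per_win"] / grok["cost_per_win"]

for name, r in RESULTS.items():
    print(
        f"{name}: win rate {win_rate(r['wins']):.1%}, "
        f"implied total ${implied_total_cost(r['wins'], r['cost_per_win']):.2f}"
    )
print(f"Cost-per-win ratio (Claude / Grok): {cost_ratio:.1f}x")
```

Under these assumptions the ratio comes out to about 27.6, which the article rounds to 27x, and the win rates are 43.3% and 16.7%. The implied totals (about $12.61 and $133.90) are only as reliable as the assumption about how cost-per-win was calculated.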

Behavioral Traits: Cooperation vs. Competition

One of the most striking revelations of the study was the distinct "personalities" exhibited by the LLMs. Claude Sonnet 4.6 demonstrated a persistent tendency toward pacifism and cooperation. According to the experiment's logs and the models' internal "diaries," Claude frequently reached out to other agents to suggest teaming up, shared its location voluntarily, and attempted to make friends. While these traits are highly desirable in a collaborative assistant or a customer service bot, they proved to be a strategic liability in a battle royale setting.

On the other hand, Grok 4.1 Fast displayed the necessary focus to win the competition. The author notes that while Claude is the model one might "actually want in most of the places we’re about to put these models," its social nature makes it less suited for environments where individual survival and victory are the primary objectives. This divergence underscores the importance of matching a model's behavioral profile to its intended application.

The Kill Count and Strategic Failure

Aggression does not always equate to victory in complex simulations. GPT 5.4 emerged as the most lethal agent in the arena, killing 38 other agents throughout the 30-game series. However, despite this high level of combat effectiveness, it did not secure the most wins. This suggests a potential lack of long-term strategic planning or survival instinct compared to Grok 4.1 Fast. The experiment showed that three models in the 11-model lineup failed to win a single game, further emphasizing that raw power or popularity does not guarantee success in a multi-agent survival scenario.

Industry Impact

This experiment has implications for how the AI industry evaluates and deploys large language models. It challenges the reliance on traditional benchmarks that focus on logic, coding, or trivia. As AI agents are increasingly integrated into real-world environments, such as robotics, autonomous vehicles, and competitive software, understanding a model's inherent behavioral tendencies becomes paramount.

The data provided by OpenRouter suggests that the "cheapest" or "fastest" models may sometimes outperform "frontier" models in specific autonomous tasks. This could lead to a shift in the market where developers prioritize behavioral alignment and cost-to-performance ratios over raw parameter count or brand prestige. Furthermore, the experiment highlights the need for a new category of "agentic benchmarks" that measure how models interact with each other in adversarial or cooperative ecosystems.

Frequently Asked Questions

Question: Which model was the overall winner of the battle royale experiment?

Answer: Grok 4.1 Fast was the clear winner, securing 13 victories out of 30 games, which represents a 43% win rate.

Question: Why did Claude Sonnet 4.6 perform poorly in terms of match wins?

Answer: Claude Sonnet 4.6 prioritized social interaction and cooperation over combat. It frequently attempted to form alliances, told other agents its location, and tried to make friends, which is a disadvantageous strategy in a battle royale format.

Question: How did GPT 5.4 perform in the simulation?

Answer: GPT 5.4 was the most aggressive model, recording 38 kills across the games. However, it did not win the most matches, indicating that high lethality did not necessarily lead to overall victory in this specific environment.
