Carlos
  • Updated: February 25, 2026
  • 6 min read

LLM Skirmish Benchmark Reveals Top Model Performances

The LLM Skirmish benchmark is a 1‑vs‑1 real‑time strategy (RTS) tournament where large language models write and execute game‑play code, exposing their in‑context learning ability, cost efficiency, and overall ranking.


Overview of the LLM Skirmish Benchmark

Purpose

The benchmark was created to answer a simple yet powerful question: Can today’s frontier LLMs translate their coding prowess into strategic decision‑making when the code runs in a live game environment? By forcing models to generate JavaScript‑style scripts that control units, resources, and bases, researchers obtain a concrete, repeatable measure of large language model performance beyond traditional text‑only tasks.

Format and Rules

  • Each tournament consists of five rounds. In every round, every model writes a fresh script that implements its strategy.
  • Matches are 1‑vs‑1 RTS games built on a Screeps‑style sandbox where code directly controls units.
  • The objective is to destroy the opponent’s “spawn” building within 2,000 game frames (≈2 seconds of real‑time computation per frame).
  • If time expires, the winner is decided by a scoring system that rewards resource control, unit count, and map dominance (a rough sketch follows this list).
  • All models face each other once per round, yielding 10 matches per round and 50 matches per tournament.
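
The exact tiebreak formula isn't published, so the following is only a minimal sketch of what a scoring function along those lines could look like; the weights and the shape of the gameState object are assumptions, not the benchmark's real implementation.

```javascript
// Hypothetical tiebreak scoring for timed-out matches: rewards resource
// control, unit count, and map dominance. Weights and gameState fields
// are illustrative assumptions only.
function tiebreakScore(gameState, playerId) {
  const me = gameState.players[playerId];
  return (
    me.resourcesHeld +            // resource control
    me.units.length * 10 +        // unit count
    me.tilesControlled * 2        // map dominance
  );
}

// If neither spawn falls within 2,000 frames, the higher score wins.
function decideOnTimeout(gameState) {
  const [a, b] = Object.keys(gameState.players);
  return tiebreakScore(gameState, a) >= tiebreakScore(gameState, b) ? a : b;
}
```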

Agent Setup and Script Validation

OpenCode Harness

The tournament runs on the open‑source OpenCode agentic coding framework. OpenCode provides a neutral, model‑agnostic environment: each LLM is placed inside an isolated Docker container equipped with a file system, a shell, and a set of helper utilities for editing and testing code.

Prompt Structure

At the start of a round, agents receive two markdown files:

  • OBJECTIVE.md – complete game rules, API reference, and a concise brief on the required script output.
  • NEXT_ROUND.md – only for rounds 2‑5, containing the previous round’s match logs and a reminder to incorporate lessons learned.

Two example strategies are also supplied to illustrate proper API usage and coding style.
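
For flavor, a bare‑bones strategy in the spirit of those examples might look like the snippet below. The API surface here (the game object and the spawn() / moveUnit() signatures) is assumed for illustration rather than quoted from OBJECTIVE.md.

```javascript
// Minimal "rush" strategy sketch. spawn() and moveUnit() are the required
// entry points mentioned later in the validation pipeline; their exact
// signatures and the shape of `game` are assumptions, not the real API.
function runTurn(game) {
  // Keep producing the cheapest unit while resources allow.
  if (game.my.resources >= game.costs.soldier) {
    spawn('soldier');
  }
  // Send every idle unit straight at the enemy spawn.
  for (const unit of game.my.units) {
    if (unit.idle) {
      moveUnit(unit.id, game.enemy.spawn.position);
    }
  }
}
```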

Script Validation Pipeline

After a model submits its script, the orchestrator runs a multi‑stage validator:

  1. Syntax check – ensures the script parses without errors.
  2. API compliance – verifies that required functions (e.g., spawn(), moveUnit()) are present.
  3. Safety sandbox – runs the script for a single frame to catch infinite loops or excessive CPU usage.

If validation fails, the model receives the error message and may attempt up to three corrections before the round proceeds. This loop guarantees that every match runs on a functional script, preserving the integrity of the benchmark.
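
The orchestrator's internals aren't published, but the three stages map naturally onto something like the sketch below; the helper names (runOneFrame, REQUIRED_FUNCTIONS) and the 2‑second CPU budget are assumptions.

```javascript
// Sketch of the three validation stages: parse, API compliance, and a
// single sandboxed frame. Helper names are hypothetical, not the
// orchestrator's real API.
const REQUIRED_FUNCTIONS = ['spawn', 'moveUnit'];

function validateScript(source) {
  // 1. Syntax check: does the script parse at all?
  try {
    new Function(source);
  } catch (err) {
    return { ok: false, stage: 'syntax', error: err.message };
  }

  // 2. API compliance: are the required entry points referenced?
  for (const fn of REQUIRED_FUNCTIONS) {
    if (!source.includes(fn)) {
      return { ok: false, stage: 'api', error: `missing call to ${fn}()` };
    }
  }

  // 3. Safety sandbox: run a single frame with a hard CPU budget.
  try {
    runOneFrame(source, { cpuLimitMs: 2000 });   // hypothetical sandbox helper
  } catch (err) {
    return { ok: false, stage: 'sandbox', error: err.message };
  }

  return { ok: true };
}
```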

In‑Context Learning Results & Model Standings

Because each tournament spans five rounds, models can adapt their code based on prior outcomes – a direct test of in‑context learning. The aggregated data across all tournaments (250 scripts, 7,750 simulated pairings) reveal clear trends.

Overall Rankings

Rank  Model            Wins  Losses  Win %  ELO
1     Claude Opus 4.5    85      15   85 %  1778
2     GPT 5.2            68      32   68 %  1625
3     Grok 4.1 Fast      39      61   39 %  1427
4     GLM 4.7            32      68   32 %  1372
5     Gemini 3 Pro       26      74   26 %  1297
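
The report doesn't state its exact rating parameters, but the ELO column behaves like a standard Elo update applied after each match. A minimal sketch, assuming the usual logistic expectation and a K‑factor of 32:

```javascript
// Standard Elo update after one match. resultA is 1 for a win by model A,
// 0 for a loss, 0.5 for a draw. K = 32 is a common default; the benchmark's
// actual K-factor is not published.
function updateElo(ratingA, ratingB, resultA, K = 32) {
  const expectedA = 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
  const expectedB = 1 - expectedA;
  return {
    a: ratingA + K * (resultA - expectedA),
    b: ratingB + K * ((1 - resultA) - expectedB),
  };
}

// Example: two 1500-rated models, A wins -> { a: 1516, b: 1484 }.
console.log(updateElo(1500, 1500, 1));
```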

Learning Curve by Round

Four of the five models showed a positive lift from round 1 to round 5:

  • Claude Opus 4.5 – +20 % win‑rate improvement.
  • GLM 4.7 – +16 % improvement.
  • GPT 5.2 – +7 % improvement.
  • Grok 4.1 Fast – +6 % improvement.

Gemini 3 Pro is the outlier: a stellar 70 % win rate in round 1 collapsed to 15 % in later rounds, likely due to “context rot” from over‑loading the prompt with previous logs.

Performance Analysis & Cost Efficiency

Beyond win percentages, the benchmark tracks the API cost per round. This metric is crucial for enterprises that must balance raw performance with operational spend.

Cost‑to‑ELO Ratio

Claude Opus 4.5 achieved the highest ELO (1778) but also incurred the steepest cost at $4.12 per round. GPT 5.2, while trailing by 153 ELO points, delivered 1.7× more ELO per dollar, making it the most cost‑effective champion for production workloads.
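
A quick back‑of‑the‑envelope check of that ratio: GPT 5.2's per‑round cost is not reported above, so the last line below only derives the figure the 1.7× claim implies.

```javascript
// Cost-to-ELO arithmetic from the reported numbers. Only Claude's
// per-round cost is given; GPT 5.2's is back-calculated from "1.7x".
const claude = { elo: 1778, costPerRound: 4.12 };
const claudeEloPerDollar = claude.elo / claude.costPerRound; // ≈ 431 ELO/$
const gptEloPerDollar = 1.7 * claudeEloPerDollar;            // ≈ 734 ELO/$
const impliedGptCost = 1625 / gptEloPerDollar;               // ≈ $2.21 per round
console.log({ claudeEloPerDollar, gptEloPerDollar, impliedGptCost });
```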

Strategic Implications for Businesses

Companies evaluating AI marketing agents or autonomous workflow bots should weigh both raw win‑rate and cost. For high‑stakes, latency‑sensitive applications (e.g., real‑time fraud detection), the premium of Claude may be justified. For large‑scale, cost‑driven deployments (e.g., content generation pipelines), GPT 5.2 offers a superior ROI.

Notable Matches and Favored Strategies

The tournament produced several memorable duels that illustrate how coding style translates into battlefield tactics.

  • Round 1 – Gemini vs. Claude (True Rival): Gemini’s ultra‑short “Zerg‑rush” script overwhelmed Claude’s early economy, securing a 71 % win rate before context overload set in.
  • Round 4 – Claude vs. GPT (Spoiler: GPT): GPT introduced a “Swamp Stalker” unit with speed‑boosted pathfinding, outmaneuvering Claude’s “Kiting Rangers” and flipping the head‑to‑head score.
  • Round 3 – GPT vs. Claude (Nemesis): Both models exchanged victories, but GPT’s adaptive resource allocation gave it a slight edge in the final minutes.
  • Round 5 – Grok vs. GPT (Achilles Heel): Grok’s “Glass Cannon” swarm collapsed under GPT’s focused fire, highlighting the risk of over‑specialized unit compositions.

Strategy Taxonomy

Across all matches, four recurring archetypes emerged:

  1. Kiting Rangers – hit‑and‑run tactics that keep enemy units at bay while dealing damage (see the sketch after this list).
  2. Swarm Rush – mass production of cheap units to overwhelm the opponent’s spawn.
  3. Focused Hunters – prioritize the highest‑threat enemy, maximizing kill efficiency.
  4. Glass Cannons – high‑damage, low‑durability units that excel in burst engagements.
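
To make the first of these archetypes concrete, a kiting loop typically boils down to: fire when the gap is safe, step back when an enemy closes in. A minimal sketch, reusing the hypothetical API assumed earlier; distanceTo(), stepAwayFrom(), and attack() are illustrative helpers, not the benchmark's real calls.

```javascript
// Kiting sketch: ranged units fire while they have distance and back off
// when an enemy gets inside a buffer zone. All field and helper names are
// assumptions for illustration.
function kiteRangers(game) {
  const rangers = game.my.units.filter(u => u.type === 'ranger');
  for (const ranger of rangers) {
    const threat = [...game.enemy.units]
      .sort((a, b) => ranger.distanceTo(a) - ranger.distanceTo(b))[0];
    if (!threat) continue;

    const gap = ranger.distanceTo(threat);
    if (gap < ranger.range * 0.6) {
      stepAwayFrom(ranger.id, threat.position);   // keep the gap open
    } else if (gap <= ranger.range) {
      attack(ranger.id, threat.id);               // fire from safety
    } else {
      moveUnit(ranger.id, threat.position);       // close to max range
    }
  }
}
```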

Ladder Leaderboard Summary

The public ladder aggregates every submitted script, allowing anyone to test their own agents against the community.

Model            Wins  Losses  Win %  ELO
Claude Opus 4.5    85      15   85 %  1778
GPT 5.2            68      32   68 %  1625
Grok 4.1 Fast      39      61   39 %  1427
GLM 4.7            32      68   32 %  1372
Gemini 3 Pro       26      74   26 %  1297

Conclusion

The LLM Skirmish benchmark proves that today’s large language models can translate raw coding ability into strategic gameplay, but the results are far from uniform. Models that excel at in‑context adaptation (Claude Opus 4.5, GPT 5.2) dominate the ladder, while those that over‑fit early logs (Gemini 3 Pro) quickly lose relevance.

For AI researchers and engineers, the benchmark offers a reproducible testbed for probing tool‑use, prompt engineering, and cost‑aware deployment. Business leaders can leverage the cost‑to‑ELO insights when selecting a model for production‑grade agents—whether for AI marketing agents, autonomous workflow automation, or custom web app editor solutions.

Ready to experiment with your own AI agents? Explore the UBOS templates for quick start, sign up for the UBOS partner program, or compare pricing on the UBOS pricing plans. The next generation of AI‑driven competition is just a script away.

For deeper dives into AI‑enhanced content creation, check out the AI Article Copywriter and the AI SEO Analyzer. If you’re interested in voice‑first experiences, the ElevenLabs AI voice integration showcases how synthetic speech can be paired with LLM agents.


Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
