My fav coding benchmark for frontier models is to build a simple RTS game in one...

egeozcan · 2026-05-29T04:52:48 1780030368

I've been tasking LLMs to write a traditional AI for a full vibe-coded RTS. I remove the human players and let them battle. I don't know why but I enjoy watching AI players battle so much :)

In the repo, I even have a tournament script that calculates Elo ratings. So far, Codex has been unmatched. I'll try with Opus 4.8 too.

https://egeozcan.github.io/unnamed_rts/game/

https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
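For anyone curious what an Elo calculation over AI-vs-AI matches involves, here's a minimal sketch. It is not the repo's tournament script: the player names, the starting rating of 1000, and the K-factor of 32 are all assumptions picked for illustration, and only the standard Elo update formula is taken as given.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_tournament(results, start: float = 1000.0, k: float = 32.0):
    """Replay (player_a, player_b, score_a) tuples in order and return ratings."""
    ratings = {}
    for a, b, score_a in results:
        ra = ratings.get(a, start)
        rb = ratings.get(b, start)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical match log: the names and outcomes are made up.
matches = [
    ("ai_alpha", "ai_beta", 1.0),
    ("ai_beta", "ai_gamma", 0.5),
    ("ai_alpha", "ai_gamma", 1.0),
]
ratings = run_tournament(matches)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Each update is zero-sum, so the total rating pool stays constant; results from sequential updates like this depend on match order, which is one reason tournaments tend to play many rounds.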

plutokras · 2026-05-29T22:15:21 1780092921

I'm happy to report that this game is very fun for natural intelligence entities too. :)

egeozcan · 2026-05-30T06:46:09 1780123569

Glad that you liked it! Please fork or note the version you like because I keep breaking it in spectacular ways :)

mannanj · 2026-05-29T14:27:01 1780064821

This is fun! I look forward to trying this out. Thanks for sharing!

jclay · 2026-05-28T19:00:11 1779994811

It almost appears as if the code was minified. The variable names are short and the formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on its own?

senko · 2026-05-28T20:40:04 1780000804

Yeah, looks extremely compact. I didn't instruct it to use as few lines of code or characters as possible, or anything of the sort.

Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to a dense style if it has to write a file in a single go. There may be a kernel of truth somewhere in there.

bombcar · 2026-05-28T22:26:27 1780007187

And much code on the web “in production” is minified.

AdamN · 2026-05-29T07:21:44 1780039304

Minified is fewer tokens than the human-readable version that we would write. It only really makes sense to write in minified JS. It's also where a lot of the code in the wild is, since every production site minifies its JS, which is then consumed by training.

andai · 2026-05-28T20:19:01 1779999541

A friend sent me something he vibe coded which included a massive WebAssembly blob in the HTML file. My friend is not a programmer, so he was not able to explain to me how it did that.

unconscionable · 2026-05-28T22:38:47 1780007927

Claude Design export.

syspec · 2026-05-29T02:08:17 1780020497

I just had Opus 4.8 code up something and actually that's exactly how it coded it!

It looked gross and minified. The result was awesome, but the code looked pretty awful visually.

dilap · 2026-05-29T13:40:15 1780062015

Doesn't look minified, just very dense, almost like progcomp code. First time I've seen an LLM spit out that style of code, I'm impressed!

rphv · 2026-05-29T03:49:11 1780026551

"Readability by humans" may no longer be as important as it once was.

lionkor · 2026-05-29T06:29:50 1780036190

Maybe it would benefit Anthropic if AI generated code worked, but wasn't readable by humans. That's a nice moat.

seanw444 · 2026-05-29T17:29:30 1780075770

Proprietary stochastic compilers. Hooray.

selcuka · 2026-05-29T09:52:03 1780048323

Good variable names are still useful for LLMs to understand context when refactoring.

rafram · 2026-05-29T17:46:40 1780076800

LLMs are already bad at reusing existing logic/resources/components, even when they have obvious names. Unreadable code only makes it worse.

orphea · 2026-05-29T09:19:10 1780046350

Only if LLMs start to output object code, skipping the text representation.

calebgcc · 2026-05-29T06:46:14 1780037174

I wonder if your previous prompts were part of the new RL fine-tuning, and that’s why it’s now better at this specific question.

apitman · 2026-05-28T19:19:58 1779995998

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

brandly · 2026-05-28T20:23:16 1779999796

Yeah! Host on GitHub pages, so it's easy to click a link and play!

senko · 2026-05-28T21:00:10 1780002010

Great idea!

I have a static server of my own, so here's my list (of all the tests I published so far): https://senko.net/vibecode-bench/

apitman · 2026-05-29T01:40:04 1780018804

Forget GH pages. Indiehosted ftw.

paulirish · 2026-05-28T23:18:33 1780010313

Would love to see the prompts, too!

jmtame · 2026-05-29T01:00:53 1780016453

Same!

senko · 2026-05-29T09:00:08 1780045208

I've updated the page with the prompts, c/p-ing here:

Minesweeper: Create a beautiful and fully functional Minesweeper clone in HTML/JS/CSS (all in one file).

RTS: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

munksbeer · 2026-05-29T13:22:08 1780060928

Very nice! Do you have any CLAUDE.md or AGENT.md files that influence it? I'd like to try this same thing and wondering what else feeds into it to produce that output?

johndevor · 2026-05-29T16:48:45 1780073325

I put a version on Hallway: https://hallway.com/workspaces/4ddaa042-13b1-4fa5-bcf4-3d646...

Easy to edit and share.

RobinL · 2026-05-29T05:47:26 1780033646

Nice, I recently found something like this was possible too. GPT-5.5 one-shotted the basic game, but then I added some AI-generated graphics/sounds/music and asked it to wire them up.

It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/

It kind of blows my mind that I can go from 'I want a fun way to help him learn vocabulary, and I loved Total Annihilation as a kid' to 'here's a game that he finds genuinely fun and that helps him learn something' in a few prompts.

Madmallard · 2026-05-29T01:46:34 1780019194

Okay now have it implement an authoritative server with reliable netcode and reconnection/disconnection logic, lobbies, and finding games, in-game chat, synchronized state around starting and ending games, resignations and such

skolos · 2026-05-29T05:03:36 1780031016

How many times did you try? The same model run multiple times can produce both very good and very bad results. In my benchmark, even 10 runs are often not enough to tell for sure if one model is better than another.
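A rough way to see why 10 runs can be too few: treat each run as a pass/fail trial and put a confidence interval around each model's pass rate. The sketch below uses the Wilson score interval; the pass counts (8/10 vs 5/10, then 80/100 vs 50/100) are made-up numbers, not results from any benchmark in this thread, and "intervals overlap" is only a conservative heuristic, not a formal significance test.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96):
    """Wilson score interval for a binomial pass rate (z=1.96 is ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts for two models at two sample sizes.
for runs, passes_a, passes_b in [(10, 8, 5), (100, 80, 50)]:
    ci_a = wilson_interval(passes_a, runs)
    ci_b = wilson_interval(passes_b, runs)
    print(
        f"n={runs}: A in [{ci_a[0]:.2f}, {ci_a[1]:.2f}], "
        f"B in [{ci_b[0]:.2f}, {ci_b[1]:.2f}], "
        f"overlap={intervals_overlap(ci_a, ci_b)}"
    )
```

With these assumed numbers, the same 80% vs 50% gap leaves the intervals overlapping at 10 runs and separates them at 100, which is the general shape of the problem: small samples need large gaps before a ranking means much.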

senko · 2026-05-29T09:08:00 1780045680

Usually just once (and I did just one test for this particular one), but I've found the overall quality to be relatively consistent.

There are too many confounding variables here, randomness being just one of them. So I don't think of it as a definitive test (with a reliable ordering), just another data point (along with actual benchmarks, pelicans, etc.) to get a sense of the capabilities.

For example, I managed to get something out of DeepSeek 4 Flash quantized to 2-bit with Antirez' DwarfStar, used via Pi. Almost kinda worked! :) Which makes me optimistic for using local models for serious development soon - I'd say within a year.

elAhmo · 2026-05-28T19:47:01 1779997621

What is ultracode mode?

colechristensen · 2026-05-28T21:19:34 1780003174

Biases the model to solve problems with teams of agents

tcoff91 · 2026-05-28T20:04:44 1779998684

it's a brand new mode

senko · 2026-05-28T20:49:37 1780001377

It's a combination of reasoning effort (max) + enabling workflow that orchestrates multiple sub-agents.

After some interrogation, here's how it organized the work:

1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first, produced SPEC.md + DESIGN.md:

1.1 Proposals (3 parallel agents): each designed a complete RTS from a different philosophy

1.2 Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).

1.3 Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design

1.4 Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec

2. Code-review workflow (rts-code-review, 25 agents, ~5 min), ran after the main agent had written and tested the code:

2.1 Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.

2.2 Verify (19 agents): every finding got its own skeptic agent told to try to refute it. Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.

What the main agent did in the main loop:

- Wrote all ~2,400 lines of index.html by hand from the spec.

- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)

- Applied all 16 fixes from the review and re-verified them in the browser.
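The design workflow described above (fan out proposals, judge, fan out deep-dives, synthesize) can be sketched roughly as follows. This is only an illustration of the fan-out/fan-in shape, not how Claude Code implements workflows: `run_agent` is a hypothetical stub that returns a string instead of calling a model, and the philosophy and subsystem names are invented.

```python
import asyncio


async def run_agent(role, task, log):
    """Hypothetical stand-in for one sub-agent; a real harness would call a model."""
    log.append(role)
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{role}] {task[:60]}"


async def design_workflow(prompt, philosophies, subsystems, log):
    # 1.1 Proposals: one agent per design philosophy, run concurrently.
    proposals = await asyncio.gather(
        *(run_agent("proposal", f"{p}: {prompt}", log) for p in philosophies)
    )
    # 1.2 Judge: a single agent sees every proposal and commits to one design.
    design = await run_agent("judge", " | ".join(proposals), log)
    # 1.3 Deep-dives: one agent per subsystem, all given the chosen design.
    specs = await asyncio.gather(
        *(run_agent("deep-dive", f"{s} <- {design}", log) for s in subsystems)
    )
    # 1.4 Synthesis: merge the design and subsystem specs into one master spec.
    return await run_agent("synthesis", "\n".join([design, *specs]), log)


log = []
philosophies = ["macro-economy", "micro-combat", "exploration"]  # invented
subsystems = ["map", "units", "buildings", "resources", "fog", "ui"]  # invented
spec = asyncio.run(design_workflow("simple RTS", philosophies, subsystems, log))
print(len(log), "agent calls:", log)
```

With three philosophies and six subsystems the stub makes 3 + 1 + 6 + 1 = 11 calls, matching the agent count reported for the design phase. The review phase has the same shape, with one skeptic fanned out per finding.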

33MHz-i486 · 2026-05-28T23:00:19 1780009219

Seems like a Rube Goldberg-esque way to consume 10x tokens. Is this really where the industry is heading?

e12e · 2026-05-29T01:10:06 1780017006

I like to think of it like the difference between dropping a ball on a roulette wheel (get one random number, or a random sequence if repeated) vs dropping a ball on a carved topographic map, where valleys guide the ball to a particular outcome.

If you can stand a little AI expansion - here are a few points Gemini came up with - I think the idea has some merit:

https://g.co/gemini/share/b5b97867eeb1

(Maybe the better analogy is roulette vs pinball machine)

derac · 2026-05-29T00:03:22 1780013002

Why is it Rube Goldbergesque? The process doesn't seem arbitrary.

OJFord · 2026-05-29T06:43:37 1780037017

Rube Goldberg machines (or Heath Robinson contraptions) aren't arbitrary, they're complicated or contrived ways of achieving the process; often a very literal interpretation of how an automatic machine might imitate an otherwise manual action – a robotic hand movement for example. I think it's quite a good analogy, even if agentic Goldberg works well.

sdfsdssdfsdf · 2026-05-29T08:15:06 1780042506

Those machines are, to quote Wikipedia, "designed to perform a simple task in a comically overcomplicated way". This implies there is a much simpler way that works just as well.

I don't think the Rube Goldberg analogy works if the agentic meandering is essential complexity required to get at the results. Rube Goldberging it would be something like putting this loop inside some comically overengineered enterprise microservice web which is then found out to be running inside a Windows 98 emulator or what have you.

Orygin · 2026-05-29T09:33:15 1780047195

> This implies there is a much simpler way that works just as well

Yes there is: Write the code yourself

hk__2 · 2026-05-29T20:48:44 1780087724

This is not any simpler

ymolodtsov · 2026-05-30T12:30:12 1780144212

Seems to me the route that these agents took is sort of exactly how a group of people would collaborate on building an RTS?

jmtame · 2026-05-29T00:45:20 1780015520

Thanks for sharing this. Going to try it out on a game inspired by Rust. It's helpful re: the point on rodney - I've had a hard time getting the testing to work well in the browser.

chrisweekly · 2026-05-29T11:53:32 1780055612

Did you start with a clean slate or do you have global ~/.claude/CLAUDE.md and/or specific skills, plugins, etc?

senko · 2026-05-29T14:13:38 1780064018

I don't have a global CLAUDE.md, and the only non-default skill I have that was used here is the one to use the rodney[0] headless browser. I didn't expressly tell Claude to do browser testing; it decided to do it on its own.

So no extra guidance beyond the prompt.

[0] https://github.com/simonw/rodney/

chrisweekly · 2026-05-30T13:38:07 1780148287

Thanks!

artur_makly · 2026-05-29T12:48:47 1780058927

Just to confirm - you did not generate this plan/orchestration/harness - it did all that on its own?

senko · 2026-05-29T14:08:33 1780063713

Correct, that's the "workflows" part they introduced in Claude Code alongside the new model.

seidleroni · 2026-05-29T14:28:46 1780064926

I am absolutely gobsmacked by how good the game is! I didn't complete the level fully but I completed all but one of the tasks. This is both smooth and fun, and I'm surprised that a modern LLM can do something this well, let alone in a single file. It makes me realize how much the goalposts have been moved. A few years ago, a model (ChatGPT 2? 2.5?) wasn't even able to implement a small Python script I would expect a junior engineer to be capable of producing. Now we're getting the tools to do something like this. You should think about how to "rate" the outputs or at least provide your own rankings.

H3X_K1TT3N · 2026-05-28T22:23:43 1780007023

Thanks for also sharing the prompt. I've been testing claude by asking it to make similar things, so it's useful to see what other people are doing.

I do find it interesting that the visual style is pretty similar to things it's produced for me.

dash2 · 2026-05-29T04:52:39 1780030359

If you look on the page of games, the style of chatgpt 5.5 is almost identical to the Claude style.

digdugdirk · 2026-05-28T20:00:41 1779998441

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

senko · 2026-05-28T20:59:32 1780001972

I'm saving them all as gists here: https://gist.github.com/senko

But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/

jryan49 · 2026-05-28T20:06:18 1779998778

Kinda buggy, but impressive nonetheless. How long did it take?

senko · 2026-05-28T20:36:42 1780000602

It took 50 minutes, would be ~$20 in API costs (I'm on a Pro sub).

senko · 2026-05-29T09:27:06 1780046826

(Correction: I'm on a Max ($100/mo) sub. Realized the mistake too late, so can't edit my comment.)

ammar_x · 2026-05-28T22:47:44 1780008464

Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?

senko · 2026-05-28T22:55:16 1780008916

There isn't, as I wasn't going for strictness, more like a playful challenge in the vein of Simon's SVG pelican.

Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.

OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.

Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html

jmtame · 2026-05-29T00:34:09 1780014849

Wow, that's impressive. Had fun playing it for 10 minutes locally. Found myself wanting to discover an enemy base :)

fireant · 2026-05-29T03:25:43 1780025143

Wow, that looks really impressive. Both the UI and the content look good. The game is a bit buggy but still nice!

zuzululu · 2026-05-29T07:25:24 1780039524

For some reason that website is showing up as high risk and I cannot view it; I had to open it from my mobile phone.

It looks quite impressive. I don't use Claude currently, but I'm hearing good things about it... from Codex users, ironically.

senko · 2026-05-29T08:31:57 1780043517

Is that for bsky.app (BlueSky platform) or my personal site (senko.net) where I put up the list of tests? What browser/device was that?

shlewis · 2026-05-29T00:29:10 1780014550

How much did it cost?

senko · 2026-05-29T09:26:13 1780046773

Token equivalent of ~ $20 (I'm on a $100 Max sub).

l3x4ur1n · 2026-05-28T19:40:17 1779997217

Played it to the end. Pretty neat!

veqq · 2026-05-29T04:23:35 1780028615