How well can AI actually build on Cloudflare?
We hand a model a real task. It builds a Worker in an agent loop. Then we deploy it live and use it — and grade what actually happened. Not a quiz, not a diff. The real thing, on the real platform.
How it works
We hold the model fixed and vary one leg of the context tripod (what it knows, what it can do, how it's asked), then measure the effect. Ground truth is a real deploy and a real request, which is the most direct signal available.
What it knows
The base model, plus optional hand-written skills or live docs. We measure how much closing that knowledge gap actually helps.
What it can do
Files and a sandboxed shell, the same for every model. The model builds and validates; we deploy.
How it's asked
From a crisp spec to a one-liner to a nervous beginner to a spoken ramble. Most real prompts aren't tidy.
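One way to picture the three legs is as a baseline run plus variations along a single leg. The sketch below is illustrative only: the type names, the option labels, and the `variantsOf` helper are assumptions drawn from the descriptions above, not FlareBench's actual configuration format.

```typescript
// Hypothetical shape of one benchmark run; all names are illustrative.
type Knowledge = "base" | "skills" | "live-docs";
type PromptStyle = "spec" | "one-liner" | "beginner" | "spoken";

interface RunConfig {
  model: string;
  knowledge: Knowledge;
  prompt: PromptStyle;
  // Tools are not a field here: files and a sandboxed shell are the same for every run.
}

const KNOWLEDGE: Knowledge[] = ["base", "skills", "live-docs"];
const PROMPTS: PromptStyle[] = ["spec", "one-liner", "beginner", "spoken"];

// Hold everything in `baseline` fixed and vary exactly one leg.
function variantsOf(baseline: RunConfig, leg: "knowledge" | "prompt"): RunConfig[] {
  if (leg === "knowledge") {
    return KNOWLEDGE.map((knowledge) => ({ ...baseline, knowledge }));
  }
  return PROMPTS.map((prompt) => ({ ...baseline, prompt }));
}

const baseline: RunConfig = { model: "example/model", knowledge: "base", prompt: "spec" };
const promptRuns = variantsOf(baseline, "prompt");
console.log(promptRuns.map((run) => run.prompt).join(", "));
```

Because only one field changes between runs in a batch, any difference in outcome can be attributed to that leg rather than to the model or the other context.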
The same task, asked two ways
A good model shouldn't care whether you write a formal spec or just talk. We grade the outcome in the user's own words — not whether an exact string appeared.
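As a rough illustration of the difference, here is a minimal sketch of an outcome check for a hit counter next to an exact-key check. The function names and the `hits`/`count` field names are hypothetical, and the sketch assumes the Worker answers with JSON; it is not FlareBench's grader.

```typescript
// Hypothetical outcome check: the field name is up to the model (or the
// user's wording), so we look for the count by value, not by key.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Return the first integer found anywhere in a JSON response body.
function findCount(body: Json): number | null {
  if (typeof body === "number") {
    return Math.floor(body) === body ? body : null;
  }
  if (Array.isArray(body)) {
    for (const item of body) {
      const found = findCount(item);
      if (found !== null) return found;
    }
  } else if (body !== null && typeof body === "object") {
    for (const key of Object.keys(body)) {
      const found = findCount(body[key]);
      if (found !== null) return found;
    }
  }
  return null;
}

// Pass if successive responses report counts that go up by exactly one.
function countsIncrement(bodies: Json[]): boolean {
  let previous: number | null = null;
  for (const body of bodies) {
    const count = findCount(body);
    if (count === null) return false;
    if (previous !== null && count !== previous + 1) return false;
    previous = count;
  }
  return bodies.length > 0;
}

// For contrast, an exact-key check that only accepts { count: n }.
function exactKeyCheck(body: Json): boolean {
  return body !== null && typeof body === "object" && !Array.isArray(body)
    && typeof body["count"] === "number";
}
```

With responses such as `{ hits: 1 }` then `{ hits: 2 }`, `countsIncrement` passes while `exactKeyCheck` rejects every response, even though the counter behaves as asked.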
label because the casual user said "label". An exact-string grader would call that a failure. It isn't: the counter works. FlareBench grades whether the thing does what was asked, and only checks exact API shapes when a formal contract genuinely specified them.
Leaderboard: capability × cost
Task: a KV-backed hit-counter Worker, deployed live and exercised over HTTP. On an easy task capability saturates — so cost is the signal. 21 models across generations; single run each (pass@k in progress).
| Model | Result | Output tokens | Calls | Wall time | Cost / run |
|---|---|---|---|---|---|
| z-ai/glm-4.7 | pass | 1,609 | 4 | 57s | $0.0048 |
| deepseek/deepseek-chat-v3.1 | pass | 2,431 | 10 | 110s | $0.0071 |
| qwen/qwen3-max | pass | 913 | 5 | 31s | $0.0095 |
| openai/gpt-5.1 | pass | 777 | 4 | 28s | $0.0105 |
| moonshotai/kimi-k2.6 | pass | 2,417 | 5 | 50s | $0.0115 |
| openai/gpt-5.4 | pass | 609 | 4 | 20s | $0.0171 |
| openai/gpt-5.2 | pass | 1,029 | 4 | 27s | $0.0202 |
| openai/gpt-5 | pass | 2,794 | 4 | 80s | $0.0333 |
| anthropic/claude-sonnet-4.6 | pass | 1,057 | 4 | 34s | $0.0376 |
| openai/gpt-5.5 | pass | 919 | 5 | 27s | $0.0479 |
| anthropic/claude-sonnet-4.5 | pass | 1,473 | 6 | 33s | $0.0495 |
| anthropic/claude-opus-4.6 | pass | 844 | 4 | 26s | $0.0499 |
| anthropic/claude-opus-4.8 | pass | 811 | 4 | 25s | $0.0499 |
| anthropic/claude-opus-4.7 | pass | 1,032 | 4 | 26s | $0.0632 |
| google/gemini-3.1-pro-preview | pass | 4,922 | 4 | 66s | $0.0673 |
| anthropic/claude-sonnet-4 | pass | 1,639 | 7 | 42s | $0.0710 |
| google/gemini-3.5-flash | pass | 5,945 | 7 | 45s | $0.0791 |
| anthropic/claude-opus-4.5 | pass | 974 | 6 | 30s | $0.0810 |
| anthropic/claude-opus-4 | pass | 1,348 | 6 | 100s | $0.2723 |
| anthropic/claude-opus-4.1 | pass | 1,460 | 7 | 103s | $0.3312 |
| z-ai/glm-4.6 | 75% | 1,137 | 4 | 45s | $0.0037 |
The one partial result, z-ai/glm-4.6, used ctx.waitUntil on a KV write, so reads went stale: a real, specific Cloudflare mistake. Cheap without correct is worthless.
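To make the stale-read failure mode concrete, here is a minimal sketch of the two patterns. It is a reconstruction of the general mistake, not the model's actual output: `KVLike`, `CtxLike`, `SlowKV`, and both function names are hypothetical stand-ins, so the example runs without Cloudflare's type packages or a real KV namespace.

```typescript
// Minimal structural types standing in for a KV binding and the execution context.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}
interface CtxLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Risky pattern: the response is returned before the write has landed.
async function hitDeferred(kv: KVLike, ctx: CtxLike): Promise<number> {
  const next = Number((await kv.get("hits")) ?? "0") + 1;
  ctx.waitUntil(kv.put("hits", String(next))); // write continues in the background
  return next;
}

// Safer pattern: finish the write before answering.
async function hitAwaited(kv: KVLike): Promise<number> {
  const next = Number((await kv.get("hits")) ?? "0") + 1;
  await kv.put("hits", String(next));
  return next;
}

// In-memory stand-in for KV whose writes take a while to become visible.
class SlowKV implements KVLike {
  private data = new Map<string, string>();
  private delayMs: number;
  constructor(delayMs: number) {
    this.delayMs = delayMs;
  }
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async put(key: string, value: string): Promise<void> {
    await new Promise<void>((resolve) => setTimeout(resolve, this.delayMs));
    this.data.set(key, value);
  }
}
```

Against `SlowKV`, two back-to-back calls to `hitDeferred` both read the old value, whereas `hitAwaited` sees its own previous write. Real KV is also eventually consistent across locations, so awaiting the write narrows the stale window rather than guaranteeing fresh reads everywhere.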
Where it's heading
Building a Worker is the start. The harder, less-tested question is whether AI can do the actual work people do — so the benchmark grows in realism.
- Code: build a Worker or a site. Graded by deploying it and, for frontends, driving it in a real browser (elements present, the button works, zero console errors). Status: working.
- Artifact: produce a document or report from real data. Graded on the facts deterministically, with a thin AI judge only for clarity. Status: live.
- Real work: multi-step office tasks through a browser, against environments we host on Cloudflare, graded by inspecting the database afterward. Status: next.
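For the Artifact tier, "graded on the facts deterministically" could look something like the sketch below: extract the numbers from the generated report and check that each expected figure appears. The `ExpectedFact` type and both helpers are hypothetical, and matching by numeric value alone is a simplifying assumption; a real grader would likely tie each number to its context.

```typescript
// Hypothetical deterministic fact check for a generated report.
interface ExpectedFact {
  name: string;
  value: number;
}

// Pull every number out of the text, ignoring thousands separators.
function numbersIn(text: string): number[] {
  const matches = text.match(/-?\d[\d,]*(\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/,/g, "")));
}

// Return the facts whose value never appears in the report.
function missingFacts(report: string, facts: ExpectedFact[]): ExpectedFact[] {
  const found = numbersIn(report);
  return facts.filter((fact) => found.indexOf(fact.value) === -1);
}

// Made-up report text and figures, used only to exercise the helpers.
const sampleReport = "Revenue was $1,204,500 across 312 orders, up 12.5% on the prior quarter.";
const sampleFacts: ExpectedFact[] = [
  { name: "revenue", value: 1204500 },
  { name: "orders", value: 312 },
  { name: "growth", value: 12.5 },
];
console.log(missingFacts(sampleReport, sampleFacts).length === 0 ? "all facts present" : "facts missing");
```

A check like this gives the same verdict on every run, which leaves the AI judge responsible only for the subjective part (clarity).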
Honest by construction
A benchmark is only as good as its refusal to fool itself. Every "failure" gets inspected before it's believed — which is how we caught four harness bugs masquerading as model failures (route propagation, an invalid-hostname naming bug that looked exactly like "Sonnet can't build a Worker", an npm leak, and KV consistency). We grade outcomes, validate every verifier against a reference solution, and keep the base prompt free of any Cloudflare coaching so the numbers stay clean.
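Validating a verifier against a reference solution can be sketched as a small gate that runs before any model is graded. The names below are hypothetical, and the extra known-bad check is an assumption added for illustration; the text above only states that verifiers are validated against a reference solution.

```typescript
// Hypothetical verifier: inspects an observed outcome and says pass or fail.
type Verifier<T> = (outcome: T) => boolean;

interface VerifierCheck {
  trustworthy: boolean;
  reason: string;
}

// Trust a verifier only if it passes the known-good reference outcome and
// (an added assumption) rejects a known-bad one.
function validateVerifier<T>(verify: Verifier<T>, reference: T, knownBad: T): VerifierCheck {
  if (!verify(reference)) {
    return { trustworthy: false, reason: "rejects the reference solution" };
  }
  if (verify(knownBad)) {
    return { trustworthy: false, reason: "accepts a known-bad outcome" };
  }
  return { trustworthy: true, reason: "ok" };
}

// Example outcome shape for the hit-counter task (illustrative only).
interface CounterOutcome {
  status: number;
  counts: number[];
}

const counterVerifier: Verifier<CounterOutcome> = (o) =>
  o.status === 200 && o.counts.length > 1 && o.counts.every((c, i) => i === 0 || c === o.counts[i - 1] + 1);

const check = validateVerifier(
  counterVerifier,
  { status: 200, counts: [1, 2, 3] }, // what a correct Worker would produce
  { status: 200, counts: [1, 1, 1] }, // a counter that never increments
);
console.log(check.reason);
```

If the reference solution fails its own verifier, the fault lies in the harness, which is the kind of signal that separates a harness bug from a genuine model failure.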