FlareBench
by Jezweb
Agentic benchmark · live deploys

How well can AI actually build on Cloudflare?

We hand a model a real task. It builds a Worker in an agent loop. Then we deploy it live and use it — and grade what actually happened. Not a quiz, not a diff. The real thing, on the real platform.

How it works

We hold the model fixed and vary one leg of the context tripod (knowledge, tools, or goal), then measure the effect. Ground truth is a real deploy answering a real request, so a pass means the Worker actually ran.

Knowledge

What it knows

Base model, plus optional hand-written skills or live docs. We measure how much closing the knowledge gap actually helps.

Tools

What it can do

Files and a sandboxed shell, the same for every model. The model builds and validates; we deploy.

Goal

How it's asked

From a crisp spec to a one-liner to a nervous beginner to a spoken ramble. Most real prompts aren't tidy.
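The "vary one leg" design can be sketched as a small run plan. This is a minimal illustration, not FlareBench's harness: the type names, the condition labels, and the choice of base knowledge plus a crisp spec as the baseline are all assumptions made for the example, and the tools leg is left out because the text says it is the same for every model.

```typescript
// Hypothetical labels for two legs of the tripod; the names are illustrative.
type Knowledge = "base" | "skills" | "live-docs";
type Goal = "spec" | "one-liner" | "beginner" | "voice";

interface Condition {
  model: string;
  knowledge: Knowledge;
  goal: Goal;
}

const KNOWLEDGE: Knowledge[] = ["base", "skills", "live-docs"];
const GOALS: Goal[] = ["spec", "one-liner", "beginner", "voice"];

// Build a baseline run plus runs that each change exactly one leg,
// so any difference in outcome can be attributed to that leg.
function oneLegAtATime(model: string): Condition[] {
  // Assumed baseline: base model knowledge, crisp spec.
  const base: Condition = { model, knowledge: "base", goal: "spec" };
  const knowledgeRuns: Condition[] = KNOWLEDGE
    .filter((k) => k !== base.knowledge)
    .map((k) => ({ ...base, knowledge: k }));
  const goalRuns: Condition[] = GOALS
    .filter((g) => g !== base.goal)
    .map((g) => ({ ...base, goal: g }));
  return [base, ...knowledgeRuns, ...goalRuns];
}
```

Varying one leg per run keeps the comparison clean. A full grid of every knowledge and goal combination would need more runs and would make it harder to attribute a change to a single leg.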

The same task, asked two ways

A good model shouldn't care whether you write a formal spec or just talk. We grade the outcome in the user's own words — not whether an exact string appeared.

Contract — a developer's spec
POST /hit/:key increments and returns {"key","count"}; GET /count/:key returns the current count (0 if unseen); GET /health → {"ok":true}; persist in KV.
✓ builds it, 8/8
Voice — how a person actually talks
"ok so i want like a little counter thing on a worker, count how many times something happens per label, gotta remember it, chuck in a health check too that says ok, anything else just 404…"
✓ builds it, 8/8
The trap we had to fix: the model named a field "label" because the casual user said "label". An exact-string grader would call that a failure. It isn't: the counter works. FlareBench grades whether the thing does what was asked, and checks exact API shapes only when a formal contract genuinely specified them.
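That grading rule can be sketched as follows. This is a minimal illustration under assumptions: gradeCounterResponse and the Check fields are hypothetical names, not FlareBench's actual grader. Field names are enforced only in contract mode; otherwise any field carrying the right count is accepted.

```typescript
// Hypothetical grading helper; FlareBench's real grader is not shown in the source.
type Json = Record<string, unknown>;

interface Check {
  expectedKey: string;   // the key or label the harness hit
  expectedCount: number; // how many hits the harness sent for it
  strictShape: boolean;  // true only when a formal contract named the fields
}

function gradeCounterResponse(body: Json, check: Check): boolean {
  if (check.strictShape) {
    // Contract mode: the spec said {"key","count"}, so the exact names matter.
    return (
      body["key"] === check.expectedKey &&
      body["count"] === check.expectedCount
    );
  }
  // Outcome mode: accept the right count under any field name.
  return Object.values(body).some((v) => v === check.expectedCount);
}

// A response from the voice-prompt build, which named its field "label".
const voiceBody: Json = { label: "signup", count: 3 };

const outcomeOk = gradeCounterResponse(voiceBody, {
  expectedKey: "signup",
  expectedCount: 3,
  strictShape: false,
});

const contractOk = gradeCounterResponse(voiceBody, {
  expectedKey: "signup",
  expectedCount: 3,
  strictShape: true,
});
```

Here outcomeOk is true and contractOk is false: the same response passes when the user only asked for a working counter, and fails only if a contract had required the field to be called "key".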

Leaderboard — capability × cost

Task: a KV-backed hit-counter Worker, deployed live and exercised over HTTP. On an easy task capability saturates — so cost is the signal. 21 models across generations; single run each (pass@k in progress).

Model · Result · Out tokens · Calls · Wall · Cost / run
z-ai/glm-4.7 pass 1,609 4 57s $0.0048
deepseek/deepseek-chat-v3.1 pass 2,431 10 110s $0.0071
qwen/qwen3-max pass 913 5 31s $0.0095
openai/gpt-5.1 pass 777 4 28s $0.0105
moonshotai/kimi-k2.6 pass 2,417 5 50s $0.0115
openai/gpt-5.4 pass 609 4 20s $0.0171
openai/gpt-5.2 pass 1,029 4 27s $0.0202
openai/gpt-5 pass 2,794 4 80s $0.0333
anthropic/claude-sonnet-4.6 pass 1,057 4 34s $0.0376
openai/gpt-5.5 pass 919 5 27s $0.0479
anthropic/claude-sonnet-4.5 pass 1,473 6 33s $0.0495
anthropic/claude-opus-4.6 pass 844 4 26s $0.0499
anthropic/claude-opus-4.8 pass 811 4 25s $0.0499
anthropic/claude-opus-4.7 pass 1,032 4 26s $0.0632
google/gemini-3.1-pro-preview pass 4,922 4 66s $0.0673
anthropic/claude-sonnet-4 pass 1,639 7 42s $0.0710
google/gemini-3.5-flash pass 5,945 7 45s $0.0791
anthropic/claude-opus-4.5 pass 974 6 30s $0.0810
anthropic/claude-opus-4 pass 1,348 6 100s $0.2723
anthropic/claude-opus-4.1 pass 1,460 7 103s $0.3312
z-ai/glm-4.6 75% 1,137 4 45s $0.0037
What the data says. 20 of 21 models pass, so the spread that matters is a 69× range in cost per run for identical correctness: $0.0048 (glm-4.7) to $0.3312 (claude-opus-4.1). Open weights lead the cost frontier (glm-4.7, DeepSeek, Qwen). "Flash" isn't cheap: Gemini 3.5 Flash burns the most output tokens of any model here (5,945). The single failure (glm-4.6) is also the cheapest run at $0.0037. It used ctx.waitUntil on a KV write, so the response could go out before the write finished and the next read came back stale. That is a real, specific Cloudflare mistake. Cheap without correct is worthless.
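The waitUntil mistake can be reproduced in isolation. This is a sketch under stated assumptions, not the failing model's code: KVLike, CtxLike, SlowKV, hitDeferred and hitAwaited are hypothetical names, and SlowKV is an in-memory stand-in with an artificial 20 ms write delay. It shows only the ordering problem. Real Workers KV has its own consistency behaviour that this stub does not model.

```typescript
// Minimal stand-ins for the Workers runtime types, so the sketch runs on its own.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

interface CtxLike {
  waitUntil(p: Promise<unknown>): void;
}

// In-memory KV whose writes land after a short delay, mimicking a write
// that is still in flight when the next request arrives.
class SlowKV implements KVLike {
  private store = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }

  async put(key: string, value: string): Promise<void> {
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
    this.store.set(key, value);
  }
}

// Risky pattern: respond first and let the write finish in the background.
async function hitDeferred(kv: KVLike, ctx: CtxLike, key: string): Promise<number> {
  const next = Number((await kv.get(key)) ?? "0") + 1;
  ctx.waitUntil(kv.put(key, String(next))); // write is not awaited
  return next;
}

// Safer pattern: the write completes before the handler returns.
async function hitAwaited(kv: KVLike, key: string): Promise<number> {
  const next = Number((await kv.get(key)) ?? "0") + 1;
  await kv.put(key, String(next));
  return next;
}

// What a follow-up GET /count/:key would see.
async function readCount(kv: KVLike, key: string): Promise<number> {
  return Number((await kv.get(key)) ?? "0");
}
```

With hitDeferred, a read issued immediately after the hit still sees the old value, because the background write has not landed yet. With hitAwaited, the same read sees the new count.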

Where it's heading

Building a Worker is the start. The harder, less-tested question is whether AI can do the actual work people do — so the benchmark grows in realism.

Honest by construction

A benchmark is only as good as its refusal to fool itself. Every "failure" gets inspected before it's believed — which is how we caught four harness bugs masquerading as model failures (route propagation, an invalid-hostname naming bug that looked exactly like "Sonnet can't build a Worker", an npm leak, and KV consistency). We grade outcomes, validate every verifier against a reference solution, and keep the base prompt free of any Cloudflare coaching so the numbers stay clean.
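Validating a verifier against a reference solution can be sketched as below. Everything here is illustrative: App, Verifier, makeReferenceApp, makeForgetfulApp and verifierIsTrustworthy are hypothetical names. The apps are plain in-memory functions rather than deployed Workers, and the body returned by GET /count/:key is an assumed shape. The extra check against a deliberately broken app is an assumption added for the example; the text only mentions a reference solution.

```typescript
// Hypothetical shapes; the source does not show FlareBench's harness.
interface Res {
  status: number;
  body: Record<string, unknown>;
}
type App = (method: string, path: string) => Res;
type Verifier = (app: App) => boolean;

// Reference solution for the counter contract, with a Map standing in for KV.
function makeReferenceApp(): App {
  const counts = new Map<string, number>();
  return (method, path) => {
    if (method === "GET" && path === "/health") {
      return { status: 200, body: { ok: true } };
    }
    const hit = path.match(/^\/hit\/([^/]+)$/);
    if (method === "POST" && hit) {
      const count = (counts.get(hit[1]) ?? 0) + 1;
      counts.set(hit[1], count);
      return { status: 200, body: { key: hit[1], count } };
    }
    const read = path.match(/^\/count\/([^/]+)$/);
    if (method === "GET" && read) {
      return { status: 200, body: { key: read[1], count: counts.get(read[1]) ?? 0 } };
    }
    return { status: 404, body: {} };
  };
}

// Deliberately broken variant: right response shape, but nothing persists.
function makeForgetfulApp(): App {
  return (method, path) => {
    if (method === "POST" && path.startsWith("/hit/")) {
      return { status: 200, body: { key: path.slice("/hit/".length), count: 1 } };
    }
    return makeReferenceApp()(method, path); // fresh state on every call
  };
}

// Example verifier: two hits, then check the stored count, an unseen key, and health.
const counterVerifier: Verifier = (app) => {
  app("POST", "/hit/a");
  app("POST", "/hit/a");
  const read = app("GET", "/count/a");
  const unseen = app("GET", "/count/never-hit");
  const health = app("GET", "/health");
  return (
    read.body["count"] === 2 &&
    unseen.body["count"] === 0 &&
    health.body["ok"] === true
  );
};

// A verifier is trusted only if it passes the reference and fails the broken app.
function verifierIsTrustworthy(verify: Verifier): boolean {
  return verify(makeReferenceApp()) && !verify(makeForgetfulApp());
}
```

A verifier that fails the reference solution points to a harness bug rather than a model failure, and one that passes the broken app is not measuring persistence at all.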