CommunityMay 23, 2026 · 1 month ago

Which LLM Writes the Best Specifications? A Side-by-Side Across 13 Models

v0.9.6 let you point SPECLAN at any LLM you trust. The natural next question — which one should you actually use? — has no benchmark yet, because the literature stops at code. So we ran the same brief through 13 models on the same codebase (excalidraw), built a side-by-side viewer at speclan.net/compare, and let Claude Opus judge every other tree. The result is a 7-minute video and a gallery you can walk yourself: Anthropic's three sit at the top with Haiku 4.5 quietly out-counting Opus on requirements; Qwen 3.6 35B closes the cloud gap on a laptop; GPT-OSS 20B produces a coherent feature tree with zero goals; Gemma 4 26B leaves the literal string "Goal description goes here." in a Goal body; and Gemini Pro invents an account-billing system for a drawing tool. None of them are wrong — they're answering the same question with different products.

Which LLM Writes the Best Specifications? A Side-by-Side Across 13 Models

There are benchmarks for the code an LLM writes — HumanEval, MBPP, SWE-Bench, LiveCodeBench. There are no benchmarks for the specifications an LLM writes. The upstream half of agentic delivery, the half where the model decides what should be built, has been flying blind.

SPECLAN v0.9.6 opened the extension up to any AI provider you trust. The very next question that followed — which provider should you actually pick? — is the one this video tries to answer, not with a single number but with a method you can run yourself.

Watch the Video

About seven minutes, end to end. The first half is Infer Specs from Code running locally — Qwen 3.6 35B in LM Studio, on an M4 Max, GPU monitor pinned to the corner so you can watch the work actually happen on the machine. The second half is speclan.net/compare, a side-by-side viewer that holds the spec trees thirteen different models produced from the same codebase and the same brief, with Claude Opus's verdict on every other tree pinned to each candidate card.

What "Spec Creation" Means Here

"Spec creation" in this comparison is one concrete task: Infer Specs from Code — point SPECLAN at an existing codebase and let an LLM read it, decide what's a Goal, what's a Feature, what's a Requirement, and write the whole tree out as Markdown files with frontmatter. The codebase is excalidraw, the open-source diagram tool. The brief is identical for every model. What changes between cells in the gallery is only the model behind the wheel.

The reason that task is the interesting one is what a good output of it has to be. A spec is not source code. It compiles to nothing. It is reviewed by the customer, the PM, the QA lead — not by a compiler. So the right measure of its quality isn't "does it parse"; it's whether a downstream coding agent — Claude Code, Google Antigravity, or Codex — can implement it without drifting from what it says. We call that property driftless implementability, and it is the bar every tree in the gallery is judged against.

The 13 Models

The gallery on /compare covers thirteen candidates, grouped by family:

Anthropic (Claude)

Claude Opus 4.7 (1M context)
Claude Sonnet
Claude Haiku 4.5

OpenAI

GPT 5.4
GPT 5.4 mini
GPT-OSS 20B (open weights)

Google (Gemini and Gemma)

Gemini 3.1 Pro (preview)
Gemini 3.1 Flash (preview)
Gemma 4 31B
Gemma 4 26B A4B
Gemma 4 8B (local, Ollama)

Other

Qwen 3.6 35B A3B (local, LM Studio)
NVIDIA Nemotron 3 Nano (open weights)

Same codebase, same brief, thirteen trees. Pick any two on /compare, share the URL, and you and a teammate are looking at the same comparison.

Key Takeaways

For viewers who want the headline before pressing play:

The density spread between models is about 17 times. From 12 requirements (Nemotron 3 Nano) at the low end to 203 (Haiku 4.5) at the top. Same codebase. Same brief. Picking a model is picking a level of detail.
Anthropic's three sit at the top — and Haiku quietly out-counts Opus. Opus 4.7 lands at 197 requirements (the reference baseline), Sonnet at 196, and Haiku 4.5 at 203 — a smaller model producing a denser tree by splitting what Opus treats as one requirement into two or three smaller ones. The SDK scaffolding the Anthropic models run with does a lot of this work.
One local model closed the cloud gap. Qwen 3.6 35B A3B, running in LM Studio on a laptop through the OpenAI-compatible API, produced 174 requirements — within 12% of Opus. No tokens left the machine; the GPU monitor in the video is on-screen proof.
Some models invent products that don't exist. Gemini 3.1 Pro added User Identity, Account Creation, and Secure Sign-In features to a drawing tool. Gemini 3.1 Flash leaned the whole top level into a generic SaaS scaffold — "System Foundation", "Data Management Suite", "User Identity and Access" — with no trace of shape tools or collaboration. The Vision prose reads fine on both; the decomposition is where the drift shows.
Some models can't fill the slots. Gemma 4 26B A4B's Goal G-093 body contains the literal string "Goal description goes here." — the placeholder, untouched. GPT-OSS 20B produced a coherent feature tree (Whiteboard Drawing, Canvas Navigation, Undo/Redo, Collaboration, Export & Import) with zero goals — the whole Goal layer missing.
Different trees aren't better or worse — they're different carvings. The same brief decomposed into different products. Opus's Vision: "A World Where Ideas Flow Freely Through Visual Thinking." GPT 5.4's: "Sketching Shared Understanding for Everyone." Same input, two valid framings, two different products. The model is a product decision.

And Opus is the judge for every other candidate in the gallery — the verdict sits at the top of each card under METHODOLOGY NOTES → JUDGEMENT. Haiku's verdict, for example: "Higher requirement count than Opus from a noticeably smaller model… acceptance criteria read more mechanical than Opus's, but nothing major is missing."

Two Caveats Worth Naming

Two asterisks are on screen the whole way and worth repeating in print, because they shape how you should read the gallery.

SDK asymmetry. Anthropic models run through the Anthropic SDK with built-in Todo-List and planner scaffolding. OpenAI-path models — the GPT family and every local model, because LM Studio, Ollama, and vLLM all ship OpenAI-compatible APIs — go through the OpenAI SDK with the MCP tools we expose and nothing else. Most of the requirement-count gap between SDK families is scaffolding, not raw model. Read across families with that in mind.

Excalidraw familiarity. Every LLM in the gallery has excalidraw in its training data. Your private repository won't. The structural findings — density spread, missing layers, domain drift — generalize. The raw polish on a tree built from a codebase the model has never seen probably won't be quite this good.

Try It Yourself

The gallery is the artifact. Open /compare, pick any two candidates, walk through Vision → Goals → Features → Requirements on both sides, and judge the trees against your own bar for what an implementable spec looks like.

The comparison: speclan.net/compare
The local LLM how-to: We Gave SPECLAN a Local Brain
The long-form write-up: We Ran the Same Brief Through Thirteen LLMs
Install SPECLAN: VS Code Marketplace. It's free.

A spec is good when Claude Code, Google Antigravity, or Codex can implement it without drift. Until somebody builds the closed-loop benchmark, the gallery is the next-best instrument — read the trees, judge for yourself, and pick the model that writes the spec you'd want to hand off.

Back to News