[Chart: SWE-bench Verified scores of leading AI coding models. Horizontal bars compare the SWE-bench Verified score (% resolved) of each verified frontier coding model against the late-2024 frontier baseline of about 49 percent. Values: Claude Opus 4.8, 86%; Claude Sonnet 5, 85.2%; GPT-5.5, 82.6%; DeepSeek V4 Pro, 80.6%; Gemini 3.1 Pro, 80.6%; MiniMax M3, 80.5%; Qwen3.7 Max, 80.4%; Kimi K2.6, 80.2%; Gemini 3.5 Flash, 78.8%. Chart by genztech.blog.]
Fig 1 · Benchmark: SWE-bench Verified, the share of real, human-validated GitHub issues an AI resolves end-to-end. Only models whose scores we have confirmed against primary sources are charted. Source: swebench.com + maker reports.

today's standings

| # | Model | Maker | SWE-bench Verified | SWE-bench Pro | Input (per 1M tokens) | Best for | Score source and notes |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | ~86% | 69.2% | $5 | The hardest agentic refactors and long, autonomous multi-file tasks where every point of accuracy saves a human review cycle. | Anthropic-reported; independent evals (vals.ai) track within ~1 point. |
| 2 | Claude Sonnet 5 | Anthropic | 85.2% | 63.2% | $2 | The best closed-model value: near-Opus scores at ~2.5× less, and the default daily driver for most developers. | Anthropic-reported. Intro pricing $2/$10 per 1M through Aug 31, 2026, then $3/$15. |
| 3 | GPT-5.5 | OpenAI | 82.6% | 58.6% | | OpenAI's strongest agentic coder, with the deepest tooling and ecosystem breadth of the closed labs. | Verified score from vals.ai independent eval; Pro is OpenAI-reported (rivals flag possible memorization on Pro). |
| 4 | DeepSeek V4 Pro (open) | DeepSeek | 80.6% | 55.4% | $0.435 | The cheapest frontier-class coder: top open-weights score at ~11× less than Opus. Best pick when cost or self-hosting rules. | Independent tracker (llm-stats, June 2026); tied with Gemini 3.1 Pro on Verified, ahead on Pro. |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6% | 54.2% | | Google's strongest coding model today, with deep Workspace/Cloud integration. (A 3.5 Pro is expected but not shipped.) | DeepMind-reported pass rate; ties DeepSeek V4 on Verified, trails it on Pro. |
| 6 | MiniMax M3 (open) | MiniMax | 80.5% | 59.0% | $0.60 | Open weights with 1M context, multimodal input and computer use; beats GPT-5.5 on SWE-bench Pro at 5–10% of the cost. | Vendor-reported at launch (Jun 1, 2026); no independent eval published yet. |
| 7 | Qwen3.7 Max | Alibaba | 80.4% | 60.6% | | The best non-Claude score on the hardest benchmark (60.6% SWE-bench Pro), built for long-horizon coding agents. | Vendor-reported (May 20, 2026). Proprietary; the open-weights sibling is Qwen3.6-35B. |
| 8 | Kimi K2.6 (open) | Moonshot AI | 80.2% | 58.6% | | A top-three open coder whose 58.6% SWE-bench Pro beats several closed flagships. | Vendor-reported (10-run average on Moonshot's SWE-agent harness). |
| 9 | Gemini 3.5 Flash | Google DeepMind | 78.8% | | | Frontier-ish coding at Flash speed and price, with computer use built in as a native tool. | vals.ai independent eval. See our decode of its native computer-use tool. |

We rank by SWE-bench Verified (real, human-validated GitHub issues resolved end-to-end), with ties broken by the harder SWE-bench Pro. A score is printed only once it is confirmed against the maker's primary source, an independent evaluation (vals.ai, llm-stats), or the official leaderboard, and each row says which kind it is. Where independent and vendor numbers differ, we prefer the independent one. Models still being checked are marked “verifying” and shown without a number rather than estimated. Prices are per 1M input tokens on the standard API tier and can change; always confirm current pricing with the provider.
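As a rough illustration of the ranking rule described above, here is a minimal Python sketch. The scores are copied from the standings; the `rank` helper, the use of 86.0 for the "~86%" entry, and the choice to let a missing SWE-bench Pro score lose a tie are assumptions made for illustration, not the site's actual tooling.

```python
# Scores (% resolved) copied from the standings: (model, Verified, Pro).
# None marks a SWE-bench Pro score that the standings do not list.
scores = [
    ("Gemini 3.5 Flash", 78.8, None),
    ("Gemini 3.1 Pro", 80.6, 54.2),
    ("Kimi K2.6", 80.2, 58.6),
    ("Claude Sonnet 5", 85.2, 63.2),
    ("DeepSeek V4 Pro", 80.6, 55.4),
    ("Qwen3.7 Max", 80.4, 60.6),
    ("GPT-5.5", 82.6, 58.6),
    ("MiniMax M3", 80.5, 59.0),
    ("Claude Opus 4.8", 86.0, 69.2),  # listed as "~86%"; 86.0 is an assumption
]


def rank(rows):
    """Order by Verified (descending), breaking ties on Pro (descending)."""
    def key(row):
        _, verified, pro = row
        # Assumption: a missing Pro score loses any tie on Verified.
        return (verified, pro if pro is not None else float("-inf"))

    return sorted(rows, key=key, reverse=True)


ranked = rank(scores)
for position, (model, verified, pro) in enumerate(ranked, start=1):
    print(position, model, verified, pro)
```

With these inputs the only tie on Verified is between DeepSeek V4 Pro and Gemini 3.1 Pro (both 80.6%), and the Pro score puts DeepSeek ahead, matching the order in the standings.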

our picks

Best overall: Claude Opus 4.8

Highest verified scores on the hardest agentic benchmarks. Reach for it on the gnarliest, largest refactors.

Best value (closed): Claude Sonnet 5

85.2% SWE-bench Verified at $2/1M — near-frontier coding at a rounding-error price. The default for most work.

Best budget / open: DeepSeek V4 Pro

80.6% Verified at $0.435/1M with MIT open weights — frontier-class results at ~11× less than Opus.

Best for autonomous agents: Claude Opus 4.8

Top Terminal-Bench and long-horizon reliability for delegated, repo-wide tasks in Claude Code or CI.

Best free option: Claude Sonnet 5

It's the default model for free claude.ai users — frontier-class coding at no cost for everyday tasks.

Hardest-tasks dark horse: Qwen3.7 Max

Its 60.6% on SWE-bench Pro is the best non-Claude score on the benchmark that's hardest to game.
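The cost multiples quoted in the picks (~2.5× for Claude Sonnet 5, ~11× for DeepSeek V4 Pro) follow from the listed input prices. A small sketch of that arithmetic, assuming the per-1M-input-token prices quoted on this page still hold; `times_cheaper` is a hypothetical helper named here for illustration only.

```python
# Input prices in USD per 1M tokens, as quoted in the standings.
input_price = {
    "Claude Opus 4.8": 5.00,
    "Claude Sonnet 5": 2.00,               # intro price through Aug 31, 2026
    "Claude Sonnet 5 (post-intro)": 3.00,  # listed price after the intro period
    "DeepSeek V4 Pro": 0.435,
}


def times_cheaper(baseline, model, prices=input_price):
    """Return how many times cheaper `model` is than `baseline` on input tokens."""
    return prices[baseline] / prices[model]


print(round(times_cheaper("Claude Opus 4.8", "Claude Sonnet 5"), 1))               # 2.5
print(round(times_cheaper("Claude Opus 4.8", "DeepSeek V4 Pro"), 1))               # 11.5
print(round(times_cheaper("Claude Opus 4.8", "Claude Sonnet 5 (post-intro)"), 1))  # 1.7
```

This compares input prices only. Output-token pricing and real token usage per task would change the effective ratio, so treat these multiples as a first approximation.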

how the field got here

  1. 2021: GitHub Copilot preview. Autocomplete-in-the-editor goes mainstream.
  2. 2023: ChatGPT + GPT-4, then Cursor. Chat-based coding and the first AI-native editor arrive.
  3. Aug 2024: SWE-bench Verified launches. A human-validated benchmark of real GitHub issues sets an honest bar.
  4. Oct 2024: Claude 3.5 Sonnet hits ~49%. Agents begin resolving real issues, not just snippets.
  5. 2025: Terminal agents. Claude Code and Codex CLI move AI out of the editor into the whole repo.
  6. Apr–Jun 2026: Open weights close the gap. DeepSeek V4, Kimi K2.6 and peers cluster at ~80% Verified, for pennies.
  7. 2026: Verified saturates in the mid-80s. SWE-bench Pro and Terminal-Bench become the real differentiators.
Primary sources