today's standings
| # | Model | SWE-bench Verified | SWE-bench Pro | Input price | Best for | Source and notes |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 (Anthropic) | ~86% | 69.2% | $5/1M | The hardest agentic refactors and long, autonomous multi-file tasks where every point of accuracy saves a human review cycle. | Anthropic-reported; independent evals (vals.ai) track within ~1 point. |
| 2 | Claude Sonnet 5 (Anthropic) | 85.2% | 63.2% | $2/1M | The best closed-model value: near-Opus scores at ~2.5× less, and the default daily driver for most developers. | Anthropic-reported. Intro pricing $2/$10 per 1M through Aug 31, 2026, then $3/$15. |
| 3 | GPT-5.5 (OpenAI) | 82.6% | 58.6% | — | OpenAI's strongest agentic coder, with the deepest tooling and ecosystem breadth of the closed labs. | Verified score from vals.ai independent eval; Pro is OpenAI-reported (rivals flag possible memorization on Pro). |
| 4 | DeepSeek V4 Pro (DeepSeek, open weights) | 80.6% | 55.4% | $0.435/1M | The cheapest frontier-class coder: top open-weights score at ~11× less than Opus. Best pick when cost or self-hosting rules. | Independent tracker (llm-stats, June 2026); tied with Gemini 3.1 Pro on Verified, ahead on Pro. |
| 5 | Gemini 3.1 Pro (Google DeepMind) | 80.6% | 54.2% | — | Google's strongest coding model today, with deep Workspace/Cloud integration. (A 3.5 Pro is expected but not shipped.) | DeepMind-reported pass rate; ties DeepSeek V4 on Verified, trails it on Pro. |
| 6 | MiniMax M3 (MiniMax, open weights) | 80.5% | 59.0% | $0.60/1M | Open weights with 1M context, multimodal input and computer use; beats GPT-5.5 on SWE-bench Pro at 5–10% of the cost. | Vendor-reported at launch (Jun 1, 2026); no independent eval published yet. |
| 7 | Qwen3.7 Max (Alibaba) | 80.4% | 60.6% | — | The best non-Claude score on the hardest benchmark (60.6% SWE-bench Pro), built for long-horizon coding agents. | Vendor-reported (May 20, 2026). Proprietary; the open-weights sibling is Qwen3.6-35B. |
| 8 | Kimi K2.6 (Moonshot AI, open weights) | 80.2% | 58.6% | — | A top-three open coder whose 58.6% SWE-bench Pro beats several closed flagships. | Vendor-reported (10-run average on Moonshot's SWE-agent harness). |
| 9 | Gemini 3.5 Flash (Google DeepMind) | 78.8% | — | — | Frontier-ish coding at Flash speed and price, with computer use built in as a native tool. | vals.ai independent eval. See our decode of its native computer-use tool. |
We rank by SWE-bench Verified (real, human-validated GitHub issues resolved end-to-end), with ties broken by the harder SWE-bench Pro. A score is printed only once it is confirmed against the maker's primary source, an independent evaluation (vals.ai, llm-stats), or the official leaderboard, and each row says which kind it is. Where independent and vendor numbers differ, we prefer the independent one. Models still being checked are marked “verifying” and shown without a number rather than estimated. Prices are per 1M input tokens on the standard API tier and can change, so always confirm current pricing with the provider.
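As a rough illustration of the ordering rule described above, here is a minimal Python sketch. The names `Row`, `preferred_score` and `rank` are hypothetical and not part of any published tooling. The scores are a subset copied from the table. Sorting a missing SWE-bench Pro score last within a tie is our assumption, since the methodology does not say how that case is handled.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    verified: float        # SWE-bench Verified, percent
    pro: Optional[float]   # SWE-bench Pro, percent; None when not published


def preferred_score(vendor: Optional[float], independent: Optional[float]) -> Optional[float]:
    """Prefer the independent number when both exist (the stated rule)."""
    return independent if independent is not None else vendor


def rank(rows):
    """Sort by Verified descending, breaking ties by Pro descending.

    Assumption: a row with no Pro score sorts last among tied rows.
    """
    def key(r: Row):
        pro = r.pro if r.pro is not None else float("-inf")
        return (-r.verified, -pro)

    return sorted(rows, key=key)


# A subset of the table, deliberately shuffled.
ROWS = [
    Row("Gemini 3.1 Pro", 80.6, 54.2),
    Row("Claude Sonnet 5", 85.2, 63.2),
    Row("DeepSeek V4 Pro", 80.6, 55.4),
    Row("Gemini 3.5 Flash", 78.8, None),
    Row("GPT-5.5", 82.6, 58.6),
]

if __name__ == "__main__":
    for position, row in enumerate(rank(ROWS), start=1):
        print(position, row.model, row.verified, row.pro)
```

With these inputs, the two models tied at 80.6% Verified are separated by their Pro scores, so DeepSeek V4 Pro lands ahead of Gemini 3.1 Pro, matching the table.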
our picks
Claude Opus 4.8: highest verified scores on the hardest agentic benchmarks. Reach for it on the gnarliest, largest refactors.
Claude Sonnet 5: 85.2% SWE-bench Verified at $2/1M, near-frontier coding at a rounding-error price. The default for most work.
DeepSeek V4 Pro: 80.6% Verified at $0.435/1M with MIT open weights, frontier-class results at ~11× less than Opus.
Top Terminal-Bench and long-horizon reliability for delegated, repo-wide tasks in Claude Code or CI.
It's the default model for free claude.ai users — frontier-class coding at no cost for everyday tasks.
Qwen3.7 Max: its 60.6% on SWE-bench Pro is the best non-Claude score on the benchmark that's hardest to game.
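The price multiples quoted in the table and picks (~2.5× and ~11× less than Opus) can be checked with simple division on the listed input prices. The sketch below uses only prices stated on this page; the helper name `price_multiple` is hypothetical, and the comparison covers input tokens only, not output pricing.

```python
def price_multiple(reference_price: float, candidate_price: float) -> float:
    """How many times cheaper the candidate is than the reference (input price per 1M tokens)."""
    return reference_price / candidate_price


# Input prices in USD per 1M tokens, as listed in the table.
OPUS = 5.00
SONNET_INTRO = 2.00   # intro pricing through Aug 31, 2026
SONNET_LIST = 3.00    # stated price after the intro period
DEEPSEEK = 0.435

if __name__ == "__main__":
    print(round(price_multiple(OPUS, SONNET_INTRO), 1))  # 2.5
    print(round(price_multiple(OPUS, SONNET_LIST), 2))   # 1.67
    print(round(price_multiple(OPUS, DEEPSEEK), 1))      # 11.5
```

Note that the ~2.5× figure for Sonnet 5 depends on the intro price; at the stated post-intro input price of $3/1M, the multiple drops to roughly 1.7×.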
how the field got here
- 2021: GitHub Copilot preview. Autocomplete-in-the-editor goes mainstream.
- 2023: ChatGPT + GPT-4, then Cursor. Chat-based coding and the first AI-native editor arrive.
- Aug 2024: SWE-bench Verified launches. A human-validated benchmark of real GitHub issues sets an honest bar.
- Oct 2024: Claude 3.5 Sonnet hits ~49%. Agents begin resolving real issues, not just snippets.
- 2025: Terminal agents. Claude Code and Codex CLI move AI out of the editor into the whole repo.
- Apr–Jun 2026: Open weights close the gap. DeepSeek V4, Kimi K2.6 and peers cluster at ~80% Verified, for pennies.
- 2026: Verified saturates in the mid-80s. SWE-bench Pro and Terminal-Bench become the real differentiators.
sources
- Benchmark: SWE-bench. The real-GitHub-issue benchmark and leaderboard.
- Benchmark: vals.ai, SWE-bench Verified. Independent third-party evaluations.
- Benchmark: SWE-bench Pro (Scale). The harder long-horizon leaderboard.
- Paper: SWE-bench (arXiv). How the benchmark is constructed.
- Maker: Anthropic, news. Claude Opus 4.8 / Sonnet 5 model cards and pricing.
- Maker: OpenAI, GPT-5.5. GPT-5.5 release post.
- Maker: Moonshot AI, Kimi K2.6. Open weights and reported scores.
- Maker: Google DeepMind, Gemini. Gemini model pages.
- Press: VentureBeat, MiniMax-M3 debut. M3 launch scores and pricing.
- Benchmark: W&B ml-news, Qwen3.7-Max scores. Qwen3.7-Max benchmark table.