What is the best AI model for coding right now?

As of July 2026, Claude Opus 4.8 posts the highest verified coding scores (~86% SWE-bench Verified, 69.2% SWE-bench Pro). Claude Sonnet 5 is the best value at 85.2% Verified for $2 per million input tokens (introductory pricing through Aug 31, 2026), and GPT-5.5 is third at 82.6% on independent evaluation.

What is the best open-source AI coding model?

DeepSeek V4 Pro leads open weights at 80.6% SWE-bench Verified for $0.435 per million input tokens (MIT license), with MiniMax M3 at 80.5% (1M context, multimodal) and Kimi K2.6 at 80.2% right behind: a three-way open-weights photo finish at frontier-class scores.

What is SWE-bench Verified?

SWE-bench Verified measures the share of real, human-validated GitHub issues an AI resolves end-to-end. Scores climbed from ~49% in late 2024 to the mid-80s in 2026, so it's nearing saturation — the harder SWE-bench Pro and Terminal-Bench now separate the field.
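As a rough illustration of how a "share of issues resolved" score is computed, the sketch below turns a list of per-task pass/fail outcomes into a percentage. The function name and the 426-of-500 run are hypothetical and chosen only to make the arithmetic visible; they are not any model's actual results. The real benchmark harness decides whether a task counts as resolved by running the repository's tests, which this sketch does not attempt.

```python
def resolve_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks marked as resolved.

    `outcomes` holds one boolean per benchmark task:
    True if the model's patch resolved the issue, False otherwise.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: 426 of 500 tasks resolved (illustrative numbers only).
outcomes = [True] * 426 + [False] * 74
score = resolve_rate(outcomes)
print(f"{score:.1f}% resolved")
```

On a 500-task set, each additional resolved issue moves the score by 0.2 points, which is why gaps of a few tenths of a point between models amount to only one or two tasks.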

What is the cheapest good AI coding model?

DeepSeek V4 Pro at $0.435 per million input tokens is the cheapest frontier-class coder (80.6% SWE-bench Verified, open weights). Among closed models, Claude Sonnet 5 at $2 per million input tokens (85.2% Verified) is the value pick, and it costs nothing to try because it is the default model for free claude.ai users.

Is Claude better than GPT for coding?

On current verified numbers, yes: Claude Opus 4.8 (~86%) and Sonnet 5 (85.2%) lead GPT-5.5 (82.6%, vals.ai independent eval) on SWE-bench Verified, and the gap widens on SWE-bench Pro (69.2% vs 58.6%). GPT-5.5 counters with ecosystem breadth and strong terminal scores.

How often is this leaderboard updated?

We refresh it with every major coding-model release or newly confirmed score — it's checked on our twice-daily publishing runs. It was last updated on the date shown at the top of the page.

AI Coding Leaderboard — Best AI Models for Coding, Ranked (2026)

Fig 1. SWE-bench Verified: the share of real, human-validated GitHub issues an AI resolves end-to-end. Only models whose scores we have confirmed against primary sources are charted. Source: swebench.com and maker reports.

today's standings

#	Model	Maker	SWE-bench Verified	SWE-bench Pro	Input price (per 1M tokens)	Best for	Score source
1	Claude Opus 4.8	Anthropic	~86%	69.2%	$5	The hardest agentic refactors and long, autonomous multi-file tasks where every point of accuracy saves a human review cycle.	Anthropic-reported; independent evals (vals.ai) track within ~1 point.
2	Claude Sonnet 5	Anthropic	85.2%	63.2%	$2	The best closed-model value: near-Opus scores at 2.5× lower input price, and the default daily driver for most developers.	Anthropic-reported. Intro pricing $2/$10 per 1M through Aug 31, 2026, then $3/$15.
3	GPT-5.5	OpenAI	82.6%	58.6%	n/a	OpenAI's strongest agentic coder, with the deepest tooling and ecosystem breadth of the closed labs.	Verified score from vals.ai independent eval; Pro is OpenAI-reported (rivals flag possible memorization on Pro).
4	DeepSeek V4 Pro (open weights)	DeepSeek	80.6%	55.4%	$0.435	The cheapest frontier-class coder: top open-weights score at ~11× lower input price than Opus. Best pick when cost or self-hosting rules.	Independent tracker (llm-stats, June 2026); tied with Gemini 3.1 Pro on Verified, ahead on Pro.
5	Gemini 3.1 Pro	Google DeepMind	80.6%	54.2%	n/a	Google's strongest coding model today, with deep Workspace/Cloud integration. (A 3.5 Pro is expected but not shipped.)	DeepMind-reported pass rate; ties DeepSeek V4 on Verified, trails it on Pro.
6	MiniMax M3 (open weights)	MiniMax	80.5%	59.0%	$0.60	Open weights with 1M context, multimodal input and computer use; beats GPT-5.5 on SWE-bench Pro at 5–10% of the cost.	Vendor-reported at launch (Jun 1, 2026); no independent eval published yet.
7	Qwen3.7 Max	Alibaba	80.4%	60.6%	n/a	The best non-Claude score on the hardest benchmark (60.6% SWE-bench Pro), built for long-horizon coding agents.	Vendor-reported (May 20, 2026). Proprietary; the open-weights sibling is Qwen3.6-35B.
8	Kimi K2.6 (open weights)	Moonshot AI	80.2%	58.6%	n/a	A top-three open coder whose 58.6% SWE-bench Pro matches GPT-5.5 and beats Gemini 3.1 Pro.	Vendor-reported (10-run average on Moonshot's SWE-agent harness).
9	Gemini 3.5 Flash	Google DeepMind	78.8%	n/a	n/a	Frontier-ish coding at Flash speed and price, with computer use built in as a native tool.	vals.ai independent eval. See our decode of its native computer-use tool.

We rank by SWE-bench Verified (real, human-validated GitHub issues resolved end-to-end), with ties broken by the harder SWE-bench Pro. A score is printed only once it is confirmed against the maker's primary source, an independent evaluation (vals.ai, llm-stats), or the official leaderboard, and each row says which kind it is. Where independent and vendor numbers differ, we prefer the independent one. Models still being checked are marked “verifying” and shown without a number rather than estimated. Prices are per 1M input tokens on the standard API tier and can change, so always confirm current pricing with the provider.
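The ordering rule described above can be sketched as a two-key sort. This is a minimal illustration, not the site's actual pipeline: the `Entry` type and `rank` function are hypothetical names, the rows are a subset of the standings table, and placing a missing SWE-bench Pro score last among ties is an assumption the methodology does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    model: str
    verified: float        # SWE-bench Verified, in percent
    pro: Optional[float]   # SWE-bench Pro, in percent; None when no score is listed


def rank(entries: list[Entry]) -> list[Entry]:
    """Sort by Verified (descending), breaking ties by Pro (descending)."""
    def sort_key(e: Entry) -> tuple[float, float]:
        # Assumption: a missing Pro score sorts after any real Pro score.
        pro = e.pro if e.pro is not None else float("-inf")
        return (-e.verified, -pro)
    return sorted(entries, key=sort_key)


# A subset of rows from the standings table.
entries = [
    Entry("Gemini 3.1 Pro", 80.6, 54.2),
    Entry("MiniMax M3", 80.5, 59.0),
    Entry("Gemini 3.5 Flash", 78.8, None),
    Entry("DeepSeek V4 Pro", 80.6, 55.4),
    Entry("GPT-5.5", 82.6, 58.6),
]

ranked = rank(entries)
for position, entry in enumerate(ranked, start=1):
    print(position, entry.model, entry.verified, entry.pro)
```

DeepSeek V4 Pro and Gemini 3.1 Pro tie at 80.6% Verified, so the Pro score decides their order. MiniMax M3 stays below both despite its higher Pro score, because Pro is consulted only when Verified scores are equal.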

our picks

Best overall: Claude Opus 4.8

Highest verified scores on the hardest agentic benchmarks. Reach for it on the gnarliest, largest refactors.

Best value (closed): Claude Sonnet 5

85.2% SWE-bench Verified at $2/1M — near-frontier coding at a rounding-error price. The default for most work.

Best budget / open: DeepSeek V4 Pro

80.6% Verified at $0.435/1M with MIT open weights — frontier-class results at ~11× less than Opus.

Best for autonomous agents: Claude Opus 4.8

Top Terminal-Bench and long-horizon reliability for delegated, repo-wide tasks in Claude Code or CI.

Best free option: Claude Sonnet 5

It's the default model for free claude.ai users — frontier-class coding at no cost for everyday tasks.

Hardest-tasks dark horse: Qwen3.7 Max

Its 60.6% on SWE-bench Pro is the best non-Claude score on the benchmark that's hardest to game.
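The price multiples quoted on this page (roughly 2.5× for Sonnet 5 and roughly 11× for DeepSeek V4 Pro, both relative to Opus) follow directly from the listed input prices. The sketch below reproduces that arithmetic and estimates input-side spend for a hypothetical workload of 40 million input tokens. The helper names and the workload size are illustrative assumptions, and output-token charges, which are billed separately, are left out.

```python
# Input prices in USD per 1M tokens, as listed in the standings.
INPUT_PRICE_PER_1M = {
    "Claude Opus 4.8": 5.00,
    "Claude Sonnet 5": 2.00,    # introductory price
    "DeepSeek V4 Pro": 0.435,
}


def price_multiple(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per input token."""
    return INPUT_PRICE_PER_1M[expensive] / INPUT_PRICE_PER_1M[cheap]


def input_cost(model: str, input_tokens: int) -> float:
    """Input-side cost in USD; output tokens are not included."""
    return INPUT_PRICE_PER_1M[model] * input_tokens / 1_000_000


opus_vs_sonnet = price_multiple("Claude Opus 4.8", "Claude Sonnet 5")
opus_vs_deepseek = price_multiple("Claude Opus 4.8", "DeepSeek V4 Pro")
print(f"Opus vs Sonnet 5: {opus_vs_sonnet:.1f}x")
print(f"Opus vs DeepSeek V4 Pro: {opus_vs_deepseek:.1f}x")

# Hypothetical workload: 40 million input tokens.
WORKLOAD_TOKENS = 40_000_000
for model in INPUT_PRICE_PER_1M:
    print(f"{model}: ${input_cost(model, WORKLOAD_TOKENS):.2f}")
```

Because output tokens usually cost more than input tokens, the real gap on a generation-heavy workload may differ from these input-only multiples.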

how the field got here

2021: GitHub Copilot preview. Autocomplete-in-the-editor goes mainstream.
2023: ChatGPT + GPT-4, then Cursor. Chat-based coding and the first AI-native editor arrive.
Aug 2024: SWE-bench Verified launches. A human-validated benchmark of real GitHub issues sets an honest bar.
Oct 2024: Claude 3.5 Sonnet hits ~49%. Agents begin resolving real issues, not just snippets.
2025: Terminal agents. Claude Code and Codex CLI move AI out of the editor into the whole repo.
Apr–Jun 2026: Open weights close the gap. DeepSeek V4, Kimi K2.6 and peers cluster at ~80% Verified, for pennies.
2026: Verified saturates in the mid-80s. SWE-bench Pro and Terminal-Bench become the real differentiators.

Primary sources

Benchmark: SWE-bench, the real-GitHub-issue benchmark and leaderboard
Benchmark: vals.ai, SWE-bench Verified (independent third-party evaluations)
Benchmark: SWE-bench Pro (Scale), the harder long-horizon leaderboard
Paper: SWE-bench (arXiv), how the benchmark is constructed
Maker: Anthropic, news (Claude Opus 4.8 / Sonnet 5 model cards and pricing)
Maker: OpenAI, GPT-5.5 (release post)
Maker: Moonshot AI, Kimi K2.6 (open weights and reported scores)
Maker: Google DeepMind, Gemini (model pages)
Press: VentureBeat, MiniMax-M3 debut (launch scores and pricing)
Benchmark: W&B ml-news, Qwen3.7-Max scores (benchmark table)
