head-to-head
| Metric | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|
| SWE-bench Verified | ~86% | 82.6% |
| SWE-bench Pro | 69.2% | 58.6% |
| Terminal-Bench | ~82.7% (TB2.1) | 82.7% (TB2.0) |
| Input $ / 1M | $5 | — |
| Context | 1M | — |
| Open weights | No | No |
| Maker | Anthropic | OpenAI |
when to pick each
The hardest agentic refactors and long, autonomous multi-file tasks where every point of accuracy saves a human review cycle.
OpenAI's strongest agentic coder, with the deepest tooling and ecosystem breadth of the closed labs.
Ranked on our AI Coding Leaderboard, updated 2026-07-02. Scores are confirmed against primary sources; prices are per 1M input tokens and can change.
- AnthropicAnthropic — Claude Opus 4.8 — Anthropic-reported; independent evals (vals.ai) track within ~1 point.
- OpenAIvals.ai — SWE-bench Verified (independent) — Verified score from vals.ai independent eval; Pro is OpenAI-reported (rivals flag possible memorization on Pro).
- BenchmarkSWE-bench — the real-GitHub-issue benchmark