OpenAI's GeneBench-Pro, released on June 30, 2026, is a 129-problem benchmark built to measure something most AI evaluations ignore: whether an agent can look at a noisy biological dataset, decide which analysis it can actually support, and reach a decision-ready answer. The headline result is a reality check. Even GPT-5.6 Sol Pro, OpenAI's most capable model at maximum reasoning, solved fewer than one in three problems. This is not a knowledge test the models are failing. It is a judgment test.
- GeneBench-Pro contains 129 synthetic problems across genomics, quantitative biology and translational medicine, each pairing a deliberately noisy dataset with a target tied to a real downstream decision.
- Scores are low across the board: GPT-5.6 Sol Pro 31.5%, GPT-5.6 Sol 28.7%, Claude Opus 4.8 16.0%, Gemini 3.5 Flash 8.1%.
- Every problem is generated from a known causal structure, so grading is deterministic, sidestepping the rubric noise that weakens most long-horizon science benchmarks.
- OpenAI calls the missing skill research taste: knowing which questions a dataset can answer, when a diagnostic should change the model, and when a result is safe to act on.
What did OpenAI actually release?
GeneBench-Pro is a research-grade successor to the original GeneBench, and it is far harder. It presents an agent with 129 problems spanning 10 domains and 21 sub-domains, from statistical and population genetics to clinical pharmacogenomics and cancer genomics. Each task hands the model a realistic dataset, an experimental context and a research question, then asks it to analyze the data, choose a method and produce a conclusion. Because OpenAI controls the entire data-generation process, every problem has a known ground truth, and answers are graded deterministically while still accepting different valid analytical routes. To sanity-check realism, OpenAI sent 82 of the 129 problems to outside specialists, including graduate students, postdocs, industry scientists and professors, who estimated a typical problem would take a human expert 20 to 40 hours.
Why are the scores so low?
The scores are low because the benchmark targets the exact capability current models are weakest at. GPT-5.6 Sol Pro reached 31.5% at maximum reasoning, GPT-5.6 Sol scored 28.7%, Claude Opus 4.8 managed 16.0% and Gemini 3.5 Flash trailed at 8.1%. That still marks real progress: on the original GeneBench, GPT-5 scored below 5% during development. The improvement is genuine, but a top model missing roughly seven of every ten problems tells you the bottleneck for AI-for-science has moved. It is no longer recalling facts or running a fixed pipeline. It is the higher-order call of deciding which analysis a messy dataset can honestly support.
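To make the "seven of every ten" figure concrete, here is a small sketch that converts each reported score into problems missed per ten attempted. The scores are the ones quoted above; the function name `misses_per_ten` is ours, for illustration only.

```python
# Scores as reported in this article (percent of problems solved).
scores = {
    "GPT-5.6 Sol Pro": 31.5,
    "GPT-5.6 Sol": 28.7,
    "Claude Opus 4.8": 16.0,
    "Gemini 3.5 Flash": 8.1,
}


def misses_per_ten(score_pct: float) -> float:
    """Convert a percent-solved score into problems missed per ten attempted."""
    return round((100.0 - score_pct) / 10.0, 2)


# Build the miss-rate table once so it can be printed or inspected.
miss_table = {model: misses_per_ten(pct) for model, pct in scores.items()}

for model, missed in miss_table.items():
    print(f"{model}: misses about {missed} of every 10 problems")
```

For the top model this works out to a little under seven misses per ten problems, which is where the rounded "seven of every ten" comes from.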
How is this different from other AI science benchmarks?
Most science benchmarks either test recall or grade long, open-ended answers against a rubric, which introduces its own noise. GeneBench-Pro avoids both traps by generating each problem from a known causal structure, so a correct conclusion can be verified against ground truth even when the model takes a reasonable but different analytical path. That design is the whole point: it isolates judgment from lookup.
| Property | GeneBench-Pro | Recall benchmarks | Rubric-graded science tests |
|---|---|---|---|
| Tests | Multistage analytical judgment | Fact retrieval | Long-form reasoning |
| Data | Noisy, decision-linked datasets | Static questions | Prompts or papers |
| Grading | Deterministic vs known truth | Exact match | Human or model rubric |
| Top score | 31.5% | Often 80%+ | Varies widely |
| Failure exposed | Noticing-to-acting gap | Memorization limits | Rubric variance |
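The idea of grading a conclusion against known truth while ignoring the route taken can be sketched as follows. This is not OpenAI's grader: the field names, the 10% tolerance and the example values are all hypothetical, chosen only to show how a deterministic check can accept different valid methods.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    """Known answer baked in by the data generator (hypothetical fields)."""
    decision: str          # the decision the data truly supports
    effect_size: float     # the true effect used to simulate the data
    rel_tol: float = 0.10  # assumed tolerance, not a published value


@dataclass(frozen=True)
class Submission:
    """What the agent reports at the end of its analysis."""
    decision: str
    effect_size: float
    method: str  # recorded for inspection, deliberately never graded


def grade(truth: GroundTruth, sub: Submission) -> bool:
    """Pass only if the decision matches and the estimate is near the truth.

    The analytical method is ignored, so two different but valid routes
    that land on the same conclusion receive the same grade.
    """
    same_decision = sub.decision.strip().lower() == truth.decision.strip().lower()
    close_effect = math.isclose(
        sub.effect_size, truth.effect_size, rel_tol=truth.rel_tol
    )
    return same_decision and close_effect


truth = GroundTruth(decision="advance", effect_size=0.40)
route_a = Submission("advance", 0.42, method="mixed-effects model")
route_b = Submission("Advance", 0.39, method="permutation test")
wrong = Submission("advance", 0.60, method="naive regression")

print(grade(truth, route_a), grade(truth, route_b), grade(truth, wrong))
```

Because the check has no rubric and no human in the loop, the same submission always gets the same grade, which is the property the table's "Grading" row describes.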
Who does this affect?
Anyone building AI agents that touch real lab data. OpenAI flagged a specific and worrying failure mode: agents that identify a flaw in the data, then fail to act on it. For a team deploying an agent against genuine experiments, that is the difference between a system that assists a scientist and one that quietly produces a confidently wrong answer. The benchmark makes that risk measurable instead of anecdotal.
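One way a team might watch for this failure in its own agent logs is sketched below. The trace format is an assumption for illustration (a list of steps tagged `flag` or `revise`), not anything the benchmark defines, and real agent logs would need their own parsing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry in a hypothetical agent trace."""
    kind: str   # "flag" = agent notes a data problem, "revise" = agent changes the analysis
    issue: str  # label for the problem, e.g. "batch_effect"


def unaddressed_flags(trace: list[Step]) -> list[str]:
    """Return issues the agent flagged but never followed with a revision."""
    pending: list[str] = []
    for step in trace:
        if step.kind == "flag" and step.issue not in pending:
            pending.append(step.issue)
        elif step.kind == "revise" and step.issue in pending:
            pending.remove(step.issue)
    return pending


trace = [
    Step("flag", "batch_effect"),
    Step("flag", "sample_swap"),
    Step("revise", "sample_swap"),
]

print(unaddressed_flags(trace))
```

Any issue left in the returned list is a place where the agent noticed a problem and carried on anyway, which is the gap worth a human review before the answer is used.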
How the benchmark got here
- 2025: Original GeneBench in development. GPT-5 scored below 5%, exposing how hard research-grade biology judgment is.
- Jun 30, 2026: GeneBench-Pro released. 129 problems, deterministic grading, top score 31.5%.
- Jul 2026: Independent scoring. 10 questions go public on Hugging Face, 50 go to Artificial Analysis for a neutral read.
What to watch
- Does the leaderboard hold outside OpenAI's harness? The 50-question Artificial Analysis subset is the real test of whether 31.5% is a ceiling or an artifact.
- Does research taste become a training target? Expect labs to optimize for the noticing-to-acting gap now that it is measurable.
- Does agentic biology stay gated? Prediction: high-stakes lab deployments stay human-in-the-loop until scores clear 50%.
Our take
GeneBench-Pro is the most honest AI-for-science benchmark released this year, precisely because the numbers are embarrassing. A field drowning in saturated leaderboards needed a test that models fail, and this is it. The deterministic grading is the clever part: it separates real judgment from lucky pattern-matching in a way rubric scoring never could. The 31.5% top score is not a knock on GPT-5.6 Sol Pro so much as proof that the frontier has moved from what a model knows to whether it can be trusted to decide. For anyone selling AI scientists, that distinction is the entire ballgame, and GeneBench-Pro just put a number on it.
- Official: "Introducing GeneBench-Pro", OpenAI's announcement
- Paper: "GeneBench-Pro: Evaluating Multistage Statistical Reasoning", methodology and scores
- Benchmark: Artificial Analysis, independent scoring of the 50-question subset
Original analysis by GenZTech. Based on OpenAI's GeneBench-Pro release, June 30, 2026.
