OpenAI's GeneBench-Pro, released on June 30, 2026, is a 129-problem benchmark built to measure something most AI evaluations ignore: whether an agent can look at a noisy biological dataset, decide which analysis it can actually support, and reach a decision-ready answer. The headline result is a reality check. Even GPT-5.6 Sol Pro, OpenAI's most capable model at maximum reasoning, solved fewer than one in three problems. This is not a knowledge test the models are failing. It is a judgment test.

  • GeneBench-Pro contains 129 synthetic problems across genomics, quantitative biology and translational medicine, each pairing a deliberately noisy dataset with a target tied to a real downstream decision.
  • Scores are low across the board: GPT-5.6 Sol Pro 31.5%, GPT-5.6 Sol 28.7%, Claude Opus 4.8 16.0%, Gemini 3.5 Flash 8.1%.
  • Every problem is generated from a known causal structure, so grading is deterministic, sidestepping the rubric noise that weakens most long-horizon science benchmarks.
  • OpenAI calls the missing skill research taste: knowing which questions a dataset can answer, when a diagnostic should change the model, and when a result is safe to act on.

Fig 1. What GeneBench-Pro measures. A recall benchmark goes from a question to a fact lookup; GeneBench-Pro goes from a noisy dataset to picking a method, running the analysis and making a decision-ready call. Standard benchmarks reward recall. GeneBench-Pro scores the full chain from a messy dataset to a defensible decision, the judgment real computational biologists exercise every day.

What did OpenAI actually release?

GeneBench-Pro is a research-grade successor to the original GeneBench, and it is far harder. It presents an agent with 129 problems spanning 10 domains and 21 sub-domains, from statistical and population genetics to clinical pharmacogenomics and cancer genomics. Each task hands the model a realistic dataset, an experimental context and a research question, then asks it to analyze the data, choose a method and produce a conclusion. Because OpenAI controls the entire data-generation process, every problem has a known ground truth, and answers are graded deterministically while still accepting different valid analytical routes. To sanity-check realism, OpenAI sent 82 of the 129 problems to outside specialists, including graduate students, postdocs, industry scientists and professors, who estimated a typical problem would take a human expert 20 to 40 hours.

Why are the scores so low?

The scores are low because the benchmark targets the exact capability current models are weakest at. GPT-5.6 Sol Pro reached 31.5% at maximum reasoning, GPT-5.6 Sol scored 28.7%, Claude Opus 4.8 managed 16.0% and Gemini 3.5 Flash trailed at 8.1%. That still marks real progress: on the original GeneBench, GPT-5 scored below 5% during development. The improvement is genuine, but a top model missing roughly seven of every ten problems tells you the bottleneck for AI-for-science has moved. It is no longer recalling facts or running a fixed pipeline. It is the higher-order call of deciding which analysis a messy dataset can honestly support.
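
The "roughly seven of every ten" figure can be checked directly from the reported scores. The snippet below is a small sketch in Python that uses only the four percentages quoted above; the function and variable names are illustrative.

```python
# Reported GeneBench-Pro scores (percent solved), as quoted in this article.
scores = {
    "GPT-5.6 Sol Pro": 31.5,
    "GPT-5.6 Sol": 28.7,
    "Claude Opus 4.8": 16.0,
    "Gemini 3.5 Flash": 8.1,
}


def misses_per_ten(percent_solved: float) -> float:
    """Convert a percent-solved score into problems missed per ten attempted."""
    return round((100.0 - percent_solved) / 10.0, 2)


miss_rates = {model: misses_per_ten(s) for model, s in scores.items()}

for model, missed in miss_rates.items():
    print(f"{model}: misses about {missed} of every 10 problems")
```

The top model lands just under seven misses per ten, and the weakest of the four misses more than nine.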

Fig 2. GeneBench-Pro scores by model: GPT-5.6 Sol Pro 31.5%, GPT-5.6 Sol 28.7%, Claude Opus 4.8 16.0%, Gemini 3.5 Flash 8.1%. No model clears one in three. GPT-5.6 Sol Pro leads at 31.5% with maximum reasoning; the rest fall away sharply. Independent scoring on a 50-question subset is going to Artificial Analysis.

How is this different from other AI science benchmarks?

Most science benchmarks either test recall or grade long, open-ended answers against a rubric, which introduces its own noise. GeneBench-Pro avoids both traps by generating each problem from a known causal structure, so a correct conclusion can be verified against ground truth even when the model takes a reasonable but different analytical path. That design is the whole point: it isolates judgment from lookup.
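
To make the grading idea concrete, here is a minimal sketch in Python. It is not OpenAI's generator or grader, whose internals are not described here; the linear dose-response model, the names `make_problem`, `ols_slope` and `grade`, and the 25% tolerance are all assumptions chosen for illustration. The point is only that when data come from a known structure, a conclusion can be checked against the planted truth regardless of how the solver got there.

```python
import random


def make_problem(seed: int, true_effect: float = 0.8, n: int = 500):
    """Simulate noisy data from a known causal structure: dose -> response.

    Hypothetical generator: the effect is planted, so the truth is known exactly.
    """
    rng = random.Random(seed)
    dose = [rng.uniform(0.0, 10.0) for _ in range(n)]
    response = [true_effect * d + rng.gauss(0.0, 2.0) for d in dose]
    return {"dose": dose, "response": response}, true_effect


def ols_slope(x, y):
    """One possible analytical route: the ordinary least-squares slope."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def grade(answer: float, truth: float, rel_tol: float = 0.25) -> bool:
    """Deterministic check: the same answer and truth always give the same verdict."""
    return abs(answer - truth) <= rel_tol * abs(truth)


data, truth = make_problem(seed=7)
estimate = ols_slope(data["dose"], data["response"])
print(grade(estimate, truth))
```

Because `grade` looks only at the final answer, any other estimator that lands within tolerance of the planted effect would pass just as well, which is how a grader of this kind can accept different valid routes without a rubric.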

| Property | GeneBench-Pro | Recall benchmarks | Rubric-graded science tests |
|---|---|---|---|
| Tests | Multistage analytical judgment | Fact retrieval | Long-form reasoning |
| Data | Noisy, decision-linked datasets | Static questions | Prompts or papers |
| Grading | Deterministic vs known truth | Exact match | Human or model rubric |
| Top score | 31.5% | Often 80%+ | Varies widely |
| Failure exposed | Noticing-to-acting gap | Memorization limits | Rubric variance |

Who does this affect?

Anyone building AI agents that touch real lab data. OpenAI flagged a specific and worrying failure mode: agents that identify a flaw in the data, then fail to act on it. For a team deploying an agent against genuine experiments, that is the difference between a system that assists a scientist and one that quietly produces a confidently wrong answer. The benchmark makes that risk measurable instead of anecdotal.
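
One way to picture closing that gap is a pipeline in which a diagnostic is not merely logged but wired to the choice of method. The sketch below is an illustration in Python, not anything taken from OpenAI's harness: the function name `estimate_center`, the modified z-score diagnostic and the 3.5 cutoff are assumptions chosen for the example.

```python
import statistics


def estimate_center(values, cutoff=3.5):
    """Estimate a typical value, letting an outlier diagnostic change the method.

    Hypothetical helper: flags points by modified z-score (median and MAD based),
    then switches from the mean to the median if anything is flagged.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        flagged = []  # no spread to measure against, so the diagnostic cannot fire
    else:
        flagged = [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

    # The "acting" step: the diagnostic result decides which estimator runs.
    if flagged:
        return {"method": "median", "estimate": med, "flagged": flagged}
    return {"method": "mean", "estimate": statistics.fmean(values), "flagged": []}


clean = [1.0, 1.1, 0.9, 1.05, 0.95]
contaminated = clean + [25.0]

print(estimate_center(clean))
print(estimate_center(contaminated))
```

An agent that notices the 25.0 reading but still reports the plain mean would be showing the failure mode described above: the flaw was seen, and the answer did not change.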

How the benchmark got here

  1. 2025: Original GeneBench in development. GPT-5 scored below 5%, exposing how hard research-grade biology judgment is.
  2. Jun 30, 2026: GeneBench-Pro released. 129 problems, deterministic grading, top score 31.5%.
  3. Jul 2026: Independent scoring. 10 questions go public on Hugging Face, 50 to Artificial Analysis for a neutral read.

What to watch · 2026

  • Does the leaderboard hold outside OpenAI's harness? The 50-question Artificial Analysis subset is the real test of whether 31.5% is a ceiling or an artifact.
  • Does research taste become a training target? Expect labs to optimize for the noticing-to-acting gap now that it is measurable.
  • Does agentic biology stay gated? Prediction: high-stakes lab deployments stay human-in-the-loop until scores clear 50%.
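
How far could a 50-question subset move the headline number through sampling alone? The sketch below gives a rough answer in Python. It assumes questions are scored pass/fail, are independent, and that the subset is representative of the full 129, none of which the source confirms. It uses the normal approximation to the binomial, and the function name is illustrative.

```python
import math


def score_interval(p: float, n: int, z: float = 1.96):
    """Approximate 95% interval for a pass rate p measured on n questions.

    Normal approximation to the binomial; treating each question as an
    independent pass/fail trial is an assumption, not a reported fact.
    """
    se = math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se), se


low, high, se = score_interval(0.315, 50)
print(f"standard error: {se:.3f}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 19% to 44%, so a subset score several points away from 31.5% would not, on its own, show that the leaderboard result was an artifact.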

Our take

GeneBench-Pro is the most honest AI-for-science benchmark released this year, precisely because the numbers are embarrassing. A field drowning in saturated leaderboards needed a test that models fail, and this is it. The deterministic grading is the clever part: it separates real judgment from lucky pattern-matching in a way rubric scoring never could. The 31.5% ceiling is not a knock on GPT-5.6 Sol Pro so much as proof that the frontier has moved from what a model knows to whether it can be trusted to decide. For anyone selling AI scientists, that distinction is the entire ballgame, and GeneBench-Pro just put a number on it.

Primary sources

Original analysis by GenZTech. Based on OpenAI's GeneBench-Pro release, June 30, 2026.