Anthropic's biology fix: one tool beats a bigger AI, GENZ TECH

Anthropic's new biology research lands on a blunt, useful conclusion: when AI agents flub scientific data retrieval, the problem is usually not the model but the plumbing. In a paper published in June 2026, the company showed frontier models scoring as low as 16.9% on repeated viral-sequence lookups, then fixed almost all of it with one deterministic tool rather than a bigger model. The thesis worth remembering is that reliable science agents will be won on infrastructure, not on model size.

On a new benchmark called VirBench (120 viral-sequence retrieval queries across 40 pathogens), Claude Sonnet 4 scored just 16.9%, not because of weak reasoning but because the biology databases it queried are fragmented and inconsistent.
Anthropic built gget virus with NCBI, a deterministic tool that wraps NCBI's REST, Datasets and E-utilities APIs, handles batching and returns standardized, logged output.
With that tool, every tested model cleared 92%: Sonnet 4 rose to 92.8%, GPT-5.5 went from 91.3% to 99.7%, and run-to-run stability jumped to 0.92 to 1.00.
The punchline: "reliable dataset construction should not depend on access to the newest or most expensive model." A cheap model with the right tool beat expensive models without one.

Fig 1 · The bottleneck is not reasoning but retrieval. Routed straight at NCBI's fragmented APIs, a model scores 16.9%; routed through the deterministic gget virus tool, the same model scores 92.8%.

What did Anthropic actually find?

Anthropic's paper, "Paving the Way for Agents in Biology," built a benchmark it calls VirBench: 120 viral-sequence retrieval queries spanning 40 pathogens, the kind of grunt work that underpins real genomics pipelines. When frontier models tried to answer those queries by talking to public biology databases directly, the results were alarming. Claude Sonnet 4 landed at 16.9% accuracy, and worse, its answers were unstable from run to run. That is a catastrophic score for a task that a competent grad student does reliably, and it is exactly the kind of quiet failure that makes scientists distrust AI agents in the lab.
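Run-to-run instability can be made concrete with a small calculation. The sketch below is an assumption, not the paper's metric: this article does not say how Anthropic computed stability, so the example uses mean pairwise Jaccard similarity between the sets of accessions returned by repeated runs of one query. The accession strings are made up for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two result sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_to_run_stability(runs: list) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of one query.

    Assumed metric for illustration only; 1.0 means every run agreed exactly.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical accession sets from three repeats of the same query.
stable = [{"ACC1", "ACC2", "ACC3"}] * 3
unstable = [{"ACC1", "ACC2", "ACC3"}, {"ACC1"}, {"ACC2", "ACC4"}]

print(run_to_run_stability(stable))
print(run_to_run_stability(unstable))
```

Under this assumed metric, identical runs score 1.0 and runs that return different records each time score far lower, which is the pattern the article describes for models querying the databases directly.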

Why do capable models fail at simple retrieval?

The failure is not in the model's head; it is in the world the model is querying. Biological databases evolved over decades as separate systems, each with its own query language, pagination quirks, identifier scheme and silent truncation behavior. An agent improvising HTTP calls against that mess gets partial pages, mismatched accession numbers and inconsistent formats, and it has no way to know that an answer is wrong or incomplete. The model looks confident and is confidently wrong. This is the uncomfortable truth behind a lot of "AI cannot do science" takes: the intelligence is adequate, but the data infrastructure was never designed for a machine to use.
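The silent-truncation problem is easy to reproduce in miniature. The sketch below uses a made-up, in-memory stand-in for a paged database endpoint (no network, no real NCBI call); the `retstart` and `retmax` names borrow the paging parameters used by NCBI's E-utilities, while `fake_search`, the 45-record total and the 20-record page cap are all assumptions for illustration.

```python
def fake_search(term, retstart=0, retmax=20):
    """Hypothetical paged endpoint: reports the full count, caps each page."""
    all_ids = [f"ID{i:03d}" for i in range(45)]  # 45 matching records
    page_cap = 20                                 # server silently caps page size
    size = min(retmax, page_cap)
    return {"count": len(all_ids), "ids": all_ids[retstart:retstart + size]}


def naive_fetch(term):
    """What an improvising agent might do: one call, no completeness check."""
    return fake_search(term, retmax=10_000)["ids"]


def complete_fetch(term, page_size=20):
    """Page until the reported total is reached, then verify the count."""
    ids, start = [], 0
    while True:
        page = fake_search(term, retstart=start, retmax=page_size)
        ids.extend(page["ids"])
        start += len(page["ids"])
        if start >= page["count"] or not page["ids"]:
            break
    if len(ids) != page["count"]:
        raise RuntimeError(f"expected {page['count']} records, got {len(ids)}")
    return ids


print(len(naive_fetch("example virus")))     # truncated result, no error raised
print(len(complete_fetch("example virus")))  # full result
```

The naive call returns a plausible-looking but incomplete list and raises no error, which is why an agent cannot tell from the response alone that anything went wrong.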

How does gget virus fix it?

Anthropic worked with NCBI to build gget virus, a deterministic tool that sits between the agent and the databases. It coordinates NCBI's REST, Datasets and E-utilities APIs, handles large-result batching so nothing is silently dropped, and returns standardized, logged output the agent can trust. Deterministic is the operative word: given the same query it returns the same structured result every time, so the model is no longer guessing at the shape of the data. With the tool in place every model tested crossed 92%. Sonnet 4 jumped to 92.8%, GPT-5.5 climbed from 91.3% to 99.7%, and run-to-run stability rose to between 0.92 and 1.00 across the board.
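The three properties named above (batching, standardized output, logging) can be illustrated with a toy wrapper. This is a sketch under stated assumptions, not gget virus itself and not its API: `fetch_batch` is a hypothetical backend that returns records in arbitrary order, and the record fields, batch size and digest are invented for the example.

```python
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retrieval")


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fetch_batch(accessions):
    """Hypothetical backend call; returns records in arbitrary (reversed) order."""
    return [{"accession": a, "length": len(a) * 100} for a in reversed(accessions)]


def retrieve(accessions, batch_size=2):
    """Deterministic wrapper: same request set in, same structured result out."""
    wanted = sorted(set(accessions))            # normalize the request
    records = []
    for batch in batched(wanted, batch_size):
        got = fetch_batch(batch)
        log.info("requested %d, received %d", len(batch), len(got))
        records.extend(got)
    records.sort(key=lambda r: r["accession"])  # canonical output order
    missing = sorted(set(wanted) - {r["accession"] for r in records})
    payload = json.dumps(records, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return {"records": records, "missing": missing, "digest": digest}


first = retrieve(["B2", "A1", "C3"])
second = retrieve(["C3", "A1", "B2", "A1"])  # same set, different order, a duplicate
print(first == second)
```

Because the request is normalized and the output is sorted before it is returned, two calls for the same set of accessions produce identical results, and the `missing` list makes dropped records explicit instead of silent.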

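Fig 2 · benchmark The deterministic tool nearly erased the gap between a cheaper model and an expensive one: Sonnet 4 and GPT-5.5 sat about 74 points apart without it (16.9% versus 91.3%) and about 7 points apart with it (92.8% versus 99.7%). Sonnet 4's leap from 16.9% to 92.8% is the headline, but every model tested cleared 92%.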

Why "a cheaper model with the right tool" matters

The most quotable line in the research is a cost argument: "reliable dataset construction should not depend on access to the newest or most expensive model." That reframes a lot of enterprise AI strategy. If a mid-tier model wired to a good deterministic tool matches a frontier model working blind, the money is better spent building trustworthy tools and data access than chasing the top of the leaderboard. Anthropic frames gget virus as one instance of a broader need for "context engines," reliable agent-accessible infrastructure for biology, alongside efforts such as ToolUniverse, Edison Scientific's Robin and Biomni. The lesson generalizes past biology: agents get reliable when the world they act on is built for machines to read.

Approach	gget virus + model	Model querying APIs directly	Bigger model, no tool
Accuracy (VirBench)	92.8% to 99.7%	16.9% (Sonnet 4)	91.3% (GPT-5.5), below every tool-assisted result
Run-to-run stability	0.92 to 1.00	Unstable	Unstable
Cost lever	Cheaper model works	n/a	Pay for frontier model
What you build	Deterministic tool + data access	Nothing (agent improvises)	Nothing (buy compute)

What is Claude Science, and why hire AlphaFold's creator?

The paper arrived inside a larger push. At its AI for Science event on June 30, Anthropic pitched itself hard at researchers. The next day it launched Claude Science, a workbench that runs on its existing models, including Claude Opus 4.8, and ships with more than 60 scientific databases and toolkits for genomics, single-cell work, proteomics, structural biology and cheminformatics. The buildout also included a $400 million acquisition of Coefficient Bio and the hire of AlphaFold creator John Jumper. Put together, the message is that Anthropic wants to own the "context engine" layer for science, not just sell tokens to it.

Jun 2026 · VirBench and gget virus published. The 16.9% to 92.8% result, built with NCBI.
Jun 30 · AI for Science event. Anthropic's public pitch to the research community.
Jul 1 · Claude Science launches. Workbench with 60+ databases and toolkits, on Opus 4.8.
2026 to 2027 · Context engines expand. More deterministic tools, more databases designed with agents as users.

What to watch · 2026 to 2027

Do deterministic tools generalize? gget virus fixed viral retrieval. The test is whether the same pattern lifts proteomics, cheminformatics and clinical data.
The cost reframe. If cheap-model-plus-tool keeps matching frontier-model-alone, expect enterprise budgets to shift from compute to tooling.
Databases built for agents. The deeper implication is that biology databases will need to be redesigned with machine users in mind. Watch NCBI and peers.
Still no FDA-approved AI drug. Reliability benchmarks are a means, not the prize. Real validation is downstream discoveries.

Our take

This is one of the more grounded pieces of AI research in a while, because it resists the reflex to blame or credit the model. The finding that a 16.9% score becomes 92.8% by changing the tool and not the model is a quiet rebuke to the "just use a bigger model" school, and it points the field at the least glamorous, most valuable work: building deterministic, trustworthy access to the world's data. Anthropic clearly sees the strategic prize, hence Claude Science, the Coefficient Bio deal and the Jumper hire. The risk is that "context engines" become another moat that concentrates scientific tooling inside a few labs. But as a piece of engineering honesty, this research is the right lesson at the right time: reliable agents are built on reliable plumbing.

Primary sources

Research · Paving the Way for Agents in Biology: VirBench, gget virus, the 16.9% to 92.8% result
Official · Anthropic Science: Claude Science workbench and toolkits
Reference · NCBI: the databases gget virus coordinates
Leaderboard · GenZTech AI model leaderboard: where the models in this test rank

Original analysis by GenZTech. Figures as reported by Anthropic, July 2026.
