SubQ Claims the First Subquadratic Frontier LLM, GENZ TECH

SubQ 1M-Preview, from Miami startup Subquadratic, is billed as the first frontier-scale language model built on a fully subquadratic attention architecture: instead of compute exploding as the context window grows, it scales roughly linearly, which is how it reaches a 12-million-token window. If the benchmarks survive independent testing, this is the most serious attack on the transformer's core bottleneck since the architecture was introduced in 2017. The word "if" is doing a lot of work.

Subquadratic left stealth on May 5, 2026 with $29M in seed funding and one claim: a frontier LLM that does not pay the quadratic attention tax.
Its Subquadratic Sparse Attention (SSA) scales compute and memory roughly linearly with context length, versus the quadratic cost baked into every transformer since 2017.
SubQ reports 82.4% on SWE-Bench Verified and holds 92.1% needle-in-a-haystack accuracy at 12M tokens, a length where rival models do not operate at all.
The catch: model weights are private, the full model card is "coming soon," and only three benchmarks were released, so independent verification is the entire story.

Fig 1 · The whole pitch in one picture: standard attention quadruples its work when you double the input, so context windows hit a wall. SubQ's SSA bends that curve toward linear, which is what makes a 12M-token window affordable.

What did Subquadratic actually ship?

Two things: a model and a thesis. The model, SubQ 1M-Preview, is available through a private-beta API in two configurations, a 1-million-token production tier and a 12-million-token research tier gated to enterprise partners. Around it sit three products in private beta: the API itself, a command-line coding agent called SubQ Code, and a retrieval tool called SubQ Search. The weights are not public, which matters, because every headline number here comes from the company rather than an outside lab.

The thesis is bigger than any single release. For nearly a decade the ceiling on context length has been the same math: attention compares every token to every other token, so cost rises with the square of the input. Double the context and you quadruple the work. That is why "1 million tokens" became a marketing ceiling that models rarely use well. SSA replaces the all-pairs comparison with a sparse mechanism the company says grows roughly linearly, turning a hard wall into a gentle slope.
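To make "double the context, quadruple the work" concrete, here is a minimal sketch that simply counts query-key comparisons. It assumes a hypothetical sparse scheme in which each token scores a fixed budget of k keys; Subquadratic has not published how SSA picks its keys, so `SPARSE_BUDGET`, `dense_scores`, and `sparse_scores` are illustrative names, not SubQ's.

```python
# Count query-key score computations for one attention head in one layer.
# Dense attention compares every token with every token: n * n scores.
# The sparse variant is a hypothetical stand-in, NOT SSA: each query
# scores at most a fixed budget of k keys, so cost grows as n * k.

SPARSE_BUDGET = 2_048  # hypothetical keys per query, chosen for illustration


def dense_scores(n: int) -> int:
    """All-pairs attention: n queries times n keys."""
    return n * n


def sparse_scores(n: int, k: int = SPARSE_BUDGET) -> int:
    """Fixed-budget sparse attention: each query sees at most k keys."""
    return n * min(n, k)


for n in (128_000, 256_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens  dense={dense_scores(n):.2e}  sparse={sparse_scores(n):.2e}")

# Doubling the input: dense work goes x4, sparse work goes x2.
print(dense_scores(256_000) / dense_scores(128_000))    # 4.0
print(sparse_scores(256_000) / sparse_scores(128_000))  # 2.0
```

At 12M tokens the dense count is 1.44 × 10^14 scores per head per layer, which illustrates why quadratic models stop well short of that length, while the fixed-budget count stays proportional to n.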

Why does quadratic attention matter so much?

Because it is the tax every current model pays and none can avoid. A large enterprise codebase, with all source files, commit history, and documentation, typically fits in two to eight million tokens. Under quadratic scaling, feeding that whole thing to a model is not just slow, it is economically absurd, which is why the industry built an entire scaffolding of workarounds: chunking, retrieval-augmented generation, vector databases, and re-ranking pipelines that exist mostly to hide the fact that models cannot actually read everything at once. If a model can genuinely hold 12M tokens with usable recall, large parts of that scaffolding become optional. Subquadratic frames the distinction sharply: a hybrid design delivers what it calls a scalar benefit, a constant-factor speedup, while a pure subquadratic mechanism delivers a scaling-law advantage that compounds as inputs grow. That is the difference between a faster car and a shorter road.
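The "scalar benefit" versus "scaling-law advantage" distinction can be checked with a toy cost model. The sketch below assumes a hypothetical hybrid that keeps a fraction `dense_fraction` of its layers on full attention and runs the rest at a sparse n·k cost, and compares it with a design where every layer is sparse. The budget k and the 25% dense share are illustrative assumptions, not Subquadratic's numbers.

```python
# Toy cost model: total attention cost relative to an all-dense baseline.
# All parameters are illustrative assumptions, not SubQ measurements.


def dense_cost(n: int) -> float:
    """Per-layer cost of all-pairs attention."""
    return float(n * n)


def sparse_cost(n: int, k: int) -> float:
    """Per-layer cost of a hypothetical fixed-budget sparse attention."""
    return float(n * min(n, k))


def speedup_hybrid(n: int, k: int, dense_fraction: float) -> float:
    """Speedup over all-dense when dense_fraction of layers stay dense."""
    mixed = dense_fraction * dense_cost(n) + (1 - dense_fraction) * sparse_cost(n, k)
    return dense_cost(n) / mixed


def speedup_pure(n: int, k: int) -> float:
    """Speedup over all-dense when every layer is sparse."""
    return dense_cost(n) / sparse_cost(n, k)


for n in (128_000, 1_000_000, 12_000_000):
    print(f"n={n:>10,}  hybrid={speedup_hybrid(n, 2_048, 0.25):6.2f}x  "
          f"pure={speedup_pure(n, 2_048):8.1f}x")
```

With a quarter of the layers left dense, the hybrid's speedup creeps toward 4x (that is, 1 / `dense_fraction`) and never passes it, however long the input gets; the all-sparse ratio n / k keeps climbing. That is the faster car versus the shorter road, in arithmetic.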

How good are the benchmarks, really?

On the three tests Subquadratic chose to publish, SubQ trades blows with the frontier. It reports 82.4% on SWE-Bench Verified, edging Claude Opus 4.6 at 81.4% and Gemini 3.1 Pro at 80.6%. On RULER at 128K it scores 97.1 against Opus 4.6's 94.8. On MRCR v2, a multi-needle long-context retrieval test, it lands at 65.9%, well above Opus 4.7's 32.2% but behind GPT-5.5's 74%. And at 12 million tokens, where no rival model runs, it holds 92.1% needle-in-a-haystack accuracy. On efficiency the claims get louder: a 7.2x speedup at 128K, 52.2x at 1M, up to 52x more cost-efficient than FlashAttention at 1M tokens, and coding quality near Opus at roughly one-twentieth the cost. At 12M tokens the company says its attention compute is nearly 1,000x lower than other frontier models.
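One sanity check a reader can run on the efficiency figures: if SSA costs roughly n·k and the baseline costs n², the speedup should grow roughly in proportion to context length. The sketch below takes the reported 7.2x (128K) and 52.2x (1M) speedups at face value and assumes "128K" means 128,000 tokens. It tests whether the two claims are consistent with each other, not whether they are true.

```python
# Back-of-envelope consistency check on Subquadratic's reported speedups.
# The inputs are the company's own figures; the linear-vs-quadratic model is an assumption.

reported_speedup = {128_000: 7.2, 1_000_000: 52.2}  # context tokens -> reported speedup

shortest, longest = sorted(reported_speedup)
context_growth = longest / shortest  # how much longer the input got
speedup_growth = reported_speedup[longest] / reported_speedup[shortest]  # how much the speedup grew

# Under an n*k vs n*n cost model, speedup ~ n / k, so these two factors should be close.
print(f"context grew {context_growth:.2f}x, reported speedup grew {speedup_growth:.2f}x")
```

The factors come out at about 7.8x and 7.25x, so the 128K and 1M numbers are at least internally consistent with a linear-versus-quadratic gap. The 12M "1,000x" figure compares attention compute against other models, a different quantity, so it cannot be checked this way.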

Fig 2 · benchmark. The bars are almost level on purpose: on coding, SubQ is not blowing past the frontier, it is matching it while claiming to do so far more cheaply at long context. Scores as reported by Subquadratic.

Model	SubQ 1M-Preview	Claude Opus 4.7	GPT-5.5	Gemini 3.1 Pro
Attention design	Subquadratic (SSA)	Quadratic	Quadratic	Quadratic
Max context	12M (research) / 1M (prod)	~1M	~1M	~1M
SWE-Bench Verified	82.4%	~81%	high	80.6%
Weights available	No (API only)	No	No	No
Independently verified	Not yet	Widely used	Widely used	Widely used

Who should care, and who should wait?

Care if your problem is fundamentally long-context: whole-repository code reasoning, legal and financial document review, or agent workflows that drown in retrieval plumbing. For those, even a partly validated 12M window is worth a serious pilot. Wait if you need a general-purpose model today, because the benchmark selection is conspicuously narrow: three tests, all in the two areas SubQ is designed to win (coding and long-context retrieval), and nothing yet on general reasoning, math, multilingual quality, or safety. The full model card is listed as coming soon. Researchers have not dismissed the approach, and that distinction is important: the legitimacy of subquadratic attention as a research direction is not in dispute, only whether this specific implementation hits the scale and quality claimed. That is a verification problem, not a credibility problem, and it is solvable the moment outside labs get real access.

What to watch · 2026

Independent numbers. The claims live or die on third-party reproduction of the long-context and cost figures, not the company's own charts.
The full model card. General reasoning, math, and multilingual scores will show whether SSA costs quality elsewhere to win on context.
Real pricing. "One-twentieth the cost of Opus" is a headline until it is a published rate card with rate limits attached.
Does SSA generalize? Sparse attention that shines on retrieval can still stumble on dense reasoning. That is the quiet risk.
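Why retrieval and dense reasoning can diverge under sparsity is easy to see with generic top-k sparse attention, a common sparse technique swapped in here because SSA's internals are unpublished. The sketch measures how much of the full softmax attention weight survives if a query keeps only its k highest-scoring keys. The sizes `n` and `k` and the two score patterns are illustrative assumptions.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over one query's attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_mass(scores: list[float], k: int) -> float:
    """Share of full-attention weight kept if the query attends only to its top-k keys."""
    weights = sorted(softmax(scores), reverse=True)
    return sum(weights[:k])


n, k = 4_096, 64  # illustrative context length and sparse budget

# Retrieval-like query: one key (the "needle") scores far above the rest.
needle = [0.0] * n
needle[1_234] = 20.0

# Aggregation-like query: every key matters about equally.
diffuse = [1.0] * n

print(f"needle : top-{k} keeps {topk_mass(needle, k):.4f} of the attention weight")
print(f"diffuse: top-{k} keeps {topk_mass(diffuse, k):.4f} of the attention weight")
```

When one key dominates, keeping 64 of 4,096 keys loses almost nothing; when weight is spread evenly, the kept share falls to exactly k / n, about 1.6%. Production sparse-attention designs add mechanisms to limit that loss, so this illustrates the risk named above rather than measuring SubQ.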

Our take

SubQ is the most interesting model release of the year precisely because it is not trying to be a better transformer, it is trying to change the cost curve underneath all of them. That is the kind of swing that either reshapes the field or quietly evaporates under scrutiny, and both outcomes have happened before with long-context claims. The honest read today is that Subquadratic has shown enough to be taken seriously and withheld enough to stay unproven. A private API and three friendly benchmarks are a strong demo, not a settled result. The moment to get excited is not this announcement, it is the first independent lab that runs SubQ at 12M tokens and confirms the recall and the price. Until then, treat the 1,000x figure as a hypothesis worth testing, because if even a fraction of it holds, the retrieval-pipeline industry just got a very uncomfortable competitor.

Primary sources

Official · Introducing SubQ, the company's launch post and claims
Reporting · VentureBeat, on the efficiency claim and researcher skepticism
Analysis · The New Stack, on the 12M-token window
Reference · DataCamp: SubQ explained, architecture and benchmark breakdown

Original analysis by GenZTech. Figures as reported by Subquadratic, current as of July 2026.