Attention has a type: support is constant, finding it is not.

A frozen transformer's exact next-token prediction, top-1 agreement at or above 0.99 with full attention, comes from a small, constant number of keys. Constant in context length, verified from 8,000 to 1,000,000 tokens. And it shrinks as models grow, down to about 16 keys at 72B parameters. Long context turns out to be an indexing problem, not a quadratic-attention problem.

Mohammad Alsufi & Connor Scott BRL-2026-06 · June 2026 DOI 10.5281/zenodo.20582700

κ*, keys needed for top-1 at or above 0.99, versus model scale

Natural text, strict top-1 bar. Qwen2.5 ladder, 135M to 72B parameters.
Eleanor is our resident researcher. Click the planet to speak with her.

The Constant-Support Law

Self-attention costs O(n²) in context length n. Every one of n query positions scores every one of n keys. That is the quadratic wall everyone is trying to route around. This paper asks a narrower question: once you fix a frozen model and a context, how few keys does a query actually have to attend to so its next-token prediction stays identical to what full attention would produce?

The answer does not grow with n. We call it the Constant-Support Law: for a frozen transformer, the sufficient support κ*, the smallest key budget that reproduces the dense model's next token at a strict top-1 agreement bar of 0.99, is constant in context length, and it shrinks with model scale to a measured floor near 16. On natural text, sweeping seven model sizes from 135M to 72B parameters gives κ*: 352, 168, 152, 96, 48, 24, 16. On Qwen2.5-7B the budget holds dead flat at 48 keys from 8K to 64K tokens of context, while the fraction of context that counts as "read" falls from 0.59% to 0.07%. On the 1M-token checkpoint, the same fixed budget of 48 keys still suffices out to 1,000,000 tokens, with the read fraction down to 0.0046%, under one key in twenty thousand.

A second, independent model family confirms it is not a Qwen artifact. Running the same ladder on Llama (3.2-1B/3B, 3.1-8B, 3.3-70B) gives κ* = 144 → 88 → 40 → 16, the same order of magnitude and the same monotone shrink to a floor of 16.

The headline test is a distant needle, not just prose continuation, because top-1 agreement on ordinary text can be inflated by local predictability. We place a unique secret code once, at a random distant position, in a long context, and ask whether the sparse budget reproduces the dense model's next token, whether it actually found the buried fact. On Qwen2.5-72B at 32K tokens (its full trained context), 16 keys recover the distant needle with 99.2% agreement (Wilson 95% CI [.98, .997]) over 500 random placements against near-duplicate distractors. A budget of 12 clearly fails, at 94.6%. Sixteen is a measured floor, not an assumption.

Two things separate this from the general observation that "attention is sparse." First, the budget is an absolute constant, not a fraction of context, the key count does not rise as n grows, even as the fraction it represents keeps shrinking. Second, the budget decreases with model scale. Larger models need fewer keys to reproduce their own output, the opposite of the intuition that bigger models "use more context."

How it's measured, and what could confound it

The paper is careful to rule out the obvious confound: a more peaked output distribution makes top-1 easier to preserve almost by construction, which could in principle be doing the work instead of any real change in attention support. The authors temperature-scale 7B and 72B to a common output margin and re-measure κ* on one fixed probe. Under matched peakedness the gap survives, 36 keys vs. 14, still a 2.6x difference, and the direction of the shift is actually informative: at matched margin the smaller model becomes the more peaked one, the opposite of the "bigger models are peakier" story. That rules out output peakedness as the explanation for the shrink.

The paper also deliberately avoids softer metrics. Perplexity or downstream accuracy can look fine while a model quietly changes many individual token choices underneath. The headline metric is strict top-1 next-token agreement, measured on natural high-uniqueness text (PG-19 and code, ≥99% unique tokens, sampled every 512 positions) and separately validated against a real distant-needle retrieval test, not just on prose, where predictability could be doing favors it wouldn't get on a genuinely hard lookup.

ModelParamsκ* (keys)
SmolLM2-135M135M352
Qwen2.5-0.5B0.5B168
Qwen2.5-1.5B1.5B152
Qwen2.5-3B3B96
Qwen2.5-7B7B48
Qwen2.5-14B14B24
Qwen2.5-72B72B16

Why it holds: condensation, not coincidence

A budget that stops growing with context needs a mechanistic reason, not just a measurement. The paper gives two. First, a Lipschitz bound: softmax output error is bounded by the attention mass a selector drops, so keeping keys that carry all but a small fraction ε of the mass bounds the output perturbation by O(ε). Second, and more load-bearing: across depth, key representations condense into a small, roughly constant number of clusters, a phenomenon the paper connects to prior work on token condensation in attention dynamics. A query's attention mass concentrates on a handful of these clusters, so a budget on the order of the cluster count captures nearly all of it, independent of n.

The paper then gives that clustering a physical reading. Attention weights are exactly a Gibbs (Boltzmann) distribution over key "energies," at an inverse temperature set by the 1/√d scaling built into every transformer. The natural order parameter for how concentrated that distribution is is the participation ratio P, the effective number of keys carrying the attention mass. P = n is the fully spread "gas" phase; P = O(1) is a condensed phase where the distribution has collapsed onto a small, fixed set of modes, the attention analogue of Bose–Einstein condensation.

Measured on Qwen2.5-7B: the median participation ratio starts high in the earliest layers, up to ≈94 effective keys, the de-condensed "gas" phase, then collapses by layer 5 into a condensed band of roughly 3–8 effective keys, and stays there for the rest of the network. Constant support is exactly this condensed phase, made visible layer by layer.

Across eleven models in three lineages, this deep-layer participation ratio tracks κ* in log–log space with r ≈ 0.996, the smaller a model's condensed attention footprint, the fewer keys it needs to reproduce itself. The paper is careful to call this an order-of-magnitude diagnostic, not a formula: the ratio κ*/P spans roughly 5–48x across models, so P predicts the trend and scale of κ*, not its exact value. It also flags that P and κ* are two readouts of the same underlying concentration, so the correlation is partly true by construction, its value is practical, a single forward pass gives a cheap signal for how "long-context-ready" a checkpoint's attention already is.

The phase picture also explains a structural finding elsewhere in the paper: a minority of attention heads stay de-condensed even in deep layers. These are retrieval heads, and because which keys they need differs per query, no single global cache-eviction policy can serve them, the condensed majority hides a de-condensed critical minority. That is the mechanistic reason store-time KV eviction fails even though read-time sparsity works, covered under limits below.

Depth-wise condensation

A schematic of the phase transition measured in the paper: early layers stay de-condensed (attention spreads over many keys), then the network collapses into a small, stable, condensed support by around layer 5. Hover or tap a layer.
Hover a layer to see its measured participation ratio.

How the key budget shrinks with model scale

The clearest single result in the paper is the scale sweep, charted at the top of this page: the oracle sufficient support κ* falls monotonically as models get bigger, across two independent model families, down to a measured floor at 16 keys.

This holds independent of how long the context gets. On Qwen2.5-7B, sweeping context length instead of model scale, κ* stays flat at 48 from 8K to 64K tokens, a 64x range, while the share of context those 48 keys represent falls from 0.59% down to 0.07%. That flatness is the load-bearing claim: it is not that sparse attention "helps," it is that the useful content of attention does not grow at all once a model is fixed.

Support is constant. Finding it is not.

The paper is explicit that this is not a free lunch, and structures its whole framing around a distinction it calls the support–findability split. κ* measures an oracle budget, the smallest set of keys sufficient once you already know which keys matter. Locating those keys among all n candidates is still, in the worst case, an O(n) scoring problem: identifying the top-k distant keys still requires scoring every one of them.

The paper reports a realizable step toward closing that gap, not a full solution. A non-oracle block-summary selector, score one mean-key representative per block of B tokens (an O(n/B) "finding" pass), then attend within the chosen blocks, clears the strict 0.99 bar at a flat budget as context grows: a single block of ≈260 keys suffices at both 8K and 32K tokens on Qwen2.5-7B, with the read fraction falling from 3.2% to 0.8% as context quadruples. That is a 1/B constant-factor cut in finding cost, not an asymptotic one, a genuinely cheaper selector still needs a full O(n) scoring pass, just B times cheaper.

Single-fact retrieval is cheap for this selector, one block recovers the dense answer outright. Multi-hop retrieval, where the answer requires chaining two buried facts across the context, is harder: the same cheap selector run independently within one layer plateaus at 0.88 agreement, short of the 0.99 bar, even at 16 blocks. The fix the paper builds is a cross-layer selector, deep-layer block selection conditioned on the residual stream that already carries the intermediate entity found earlier in the network. That closes the gap: 0.55 / 0.86 / 0.99 agreement at 4 / 8 / 16 blocks, clearing the strict bar where the within-layer version could not. Multi-hop retrieval is a cross-layer computation, and the selector has to follow the chain across depth to work.

Honest limits

The paper states its own boundary carefully rather than overclaiming a general result, and reports the negative findings alongside the positive ones.

Breaks on aggregative tasks

The law holds for sparse, pointwise next-token dependence. It does not hold for tasks that must integrate information from many positions at once. On a counting task, how many times does a token occur across the context, deep-layer participation ratio stays flat while κ* grows with n (24 → 40 → 72 → 130 at 4K/8K/16K/32K, a sublinear ~n^0.82 fit). Aggregation is the law's real scope boundary, stated precisely rather than papered over.

No wall-clock speedup, yet

The contribution here is algorithmic: O(n) memory instead of forming the full n×n score matrix, and an O(n/B) finding pass instead of full scoring. The unfused PyTorch reference implementation is actually ~1.7x slower than the model's fused FlashAttention path at 8K tokens, because it is a reference, not a kernel. A fused kernel that converts the algorithmic win into real end-to-end speed is explicitly left as future work, this paper's confessed debt is what the next report in the series pays down.

KV-cache eviction fails

The law does not imply the cache itself can be compressed. Globally evicting KV entries by accumulated attention mass (the H2O-style approach) collapses top-1 agreement to as low as 0.04–0.09 under the strict bar, even keeping 32% of the cache. Which keys matter is query-dependent, so a key irrelevant to most queries can still be essential to some, permanently dropping it globally breaks those queries. Read-time sparsity works; store-time eviction does not.

Frontier scale, bounded context

The 72B distant-needle result is measured at 32K tokens, the model's full trained context, a genuine frontier-scale result, but not pushed further. The million-token flatness is shown separately, on the 7B-class 1M-context checkpoint. Combining 72B-class parameter count with million-token context is future work, not a claim made here.

The conclusion the paper draws is measured rather than triumphant: a frozen transformer's exact next token depends on a small, constant number of keys, about 16 at scale, independent of context length, and that constant is enough to recover a single fact buried anywhere in a long context on a 72B model. The mechanism is token condensation. The boundary is that the cache cannot be globally compressed. The result is a measured law with a clear physical explanation and an honest cost: O(n)-memory attention with a constant-factor cut in reads, not a sub-quadratic miracle.

Full text

Abstract and full paper sections, in clean HTML

Abstract

Self-attention costs O(n²) in the context length n, yet its useful content is extraordinarily concentrated. This paper reports the Constant-Support Law of frozen transformers: to reproduce the model's next-token prediction at a strict fidelity bar (top-1 agreement ≥0.99), attention needs only a small, model-dependent constant number of keys κ*, independent of n, shrinking with model scale to a measured floor at 16 (κ*: 352 → 168 → 152 → 96 → 48 → 24 → 16 from 135M to 72B, on natural text). At the frontier the law holds: κ* = 48 stays flat from 8K to 64K tokens on Qwen2.5-7B, and on a 72B model at 32K (its full trained context) 16 keys recover a unique fact buried at a distant position with 99.2% agreement against dense attention over 500 placements. The paper explains why with physics: reading attention as a Gibbs measure, it measures a depth-wise condensation of attention, the effective number of attended keys (participation ratio) collapses across depth from ~94 to ≈3–8 and correlates with κ* (log–log r ≈ 0.996) across eleven models in three lineages. The headline κ* numbers are oracle numbers (the aggregation budget once the right keys are known; locating them is still O(n)). The paper then makes the law operational: a non-oracle block-summary selector (O(n/B) finding, 1/B the work of full scoring, linear, not sub-linear) meets the strict bar at a flat budget as context grows, reading ≈260 keys at both 8K and 32K, turning an existence result into a realizable method. A cross-layer selector closes the multi-hop retrieval gap where a single-pass selector plateaus at 0.88. The law is robust to output-margin confounding and reproduces in a second model family, and refines the "Sparse Frontier" result: a constant absolute budget is the limiting case of sublinear-growth findings. The paper also maps the law's true boundary: it holds for sparse next- token dependence and breaks on genuinely aggregative tasks (counting), where κ* grows with n. All reported figures are measured rather than projected; the remaining open item is a fused wall-clock kernel.

1. The Constant-Support Law

Attention is the quadratic wall of the transformer: each of n query positions attends over all n keys, so cost grows as n². Yet the softmax that mixes those keys is extraordinarily peaked, for most positions a handful of keys carry almost all the weight, and that peak is an absolute constant that does not move as the context grows, not a fraction that shrinks slowly.

This reframes the long-context question. Prior work has asked how fast a key budget must grow with n. This paper shows it does not grow at all, the support of the exact next-token prediction exists at constant size, and that what grows instead is a different quantity: the cost of finding that support when the right keys are not known in advance. Separating these two, the constant-size support and the findability cost of locating it, is the organizing lens of the paper: the support–findability split. A long session is not a widening attention window but an index over history from which a constant set is retrieved and aggregated.

Sufficient support κ* is defined precisely: fix a frozen model, a context of length n, and a fidelity bar τ (top-1 next-token agreement ≥ 0.99). A budget-b selector keeps, per head and per query, an attention sink (4 keys), a local window, and the top-k highest-scoring distant keys, with sink + window + k = b. κ* is the smallest budget b for which the budget-b model meets the bar.

A second model family confirms the law is a transformer property, not a Qwen artifact: the Llama family (3.2-1B/3B, 3.1-8B, 3.3-70B) gives κ* = 144 → 88 → 40 → 16, the same order of magnitude and the same monotone shrink to 16 as Qwen, spanning two independent lineages at frontier scale.

2. Frontier validation

On Qwen2.5-7B over natural text, sweeping context with the bar fixed at top-1 ≥ 0.99: κ* = 48 at n = 8K, 32K, and 64K, dead flat while the fraction of context read falls from 0.59% to 0.07%. On the Qwen2.5-7B-Instruct-1M checkpoint, with the budget fixed at the 7B value, a constant b = 48 remains sufficient across n = 128K/256K/512K/1M, with the read fraction collapsing to 0.0046% at 1M, the model reproduces its own next token from under one key in twenty thousand.

At 14B and 72B, a distant-needle floor test (500 trials, all at n = 32K) gives a clean floor at 16 for both models: Qwen2.5-14B reaches 0.993 agreement at budget 16; Qwen2.5-72B reaches 0.992 (95% CI [.98, .997]) at budget 16 against 0.946 at budget 12, which fails the bar. On the hardest single-fact retrieval test, support sharpens to a hard 16-key floor as scale grows.

3. Why it holds: a measured mechanism

Two measured facts explain the law. First, softmax is Lipschitz in dropped mass: if a selector keeps a set of keys and drops the rest, the output error is bounded by the dropped attention mass, so keeping keys that carry all but a small fraction ε of the mass bounds the output perturbation by O(ε). Second, key representations condense into a small, roughly constant number of clusters across depth (token condensation); the paper measures this cluster count at c* ≈ 4, essentially independent of n. Combining the two: a budget on the order of the cluster count suffices to bound output error by a budget that does not grow with n, the constant-in-n form of the law. This argument explains the law's form but not its exact magnitude, which the next section narrows with a measurable order parameter.

4. The physics: attention as a condensing Gibbs measure

Attention weights are exactly the Boltzmann distribution of a system whose key energies are set by the query-key dot product, at inverse temperature β = 1/√d, the standard attention scaling is literally the system's temperature. The natural order parameter for how many keys the distribution actually occupies is the participation ratio P(q), the exponential of the Rényi-2 entropy: the effective number of keys carrying the mass. P = n is the uniform "gas" phase; P = O(1) is the condensed phase, the attention analogue of Bose–Einstein condensation onto a finite set of modes. Constant support is exactly the statement that the system sits in the condensed phase, P = O(1) independent of n.

Measured on Qwen2.5-7B: median participation ratio collapses from P ≈ 94 in the early, de-condensed "gas phase" layers to P ≈ 3–8 from layer 5 onward, and stays there for the rest of the network. Across eleven models in three lineages, deep-layer P and κ* move together in log–log space with r ≈ 0.996, an order-of-magnitude diagnostic (the ratio κ*/P spans roughly 5–48x across models), not an exact formula. A minority of heads remain de-condensed even in deep layers, the retrieval heads, and because which keys they need differs per query, no global eviction policy can serve them, which is the mechanistic reason KV eviction fails while read-time sparsity succeeds.

5. An O(n)-memory kernel and a cheaper selector

Given a keep-set, attention aggregation can avoid forming the full n×n score matrix: per query block, gather the sink, local window, and selected distant keys, and run a fused scaled-dot-product-attention over the gathered tensors, with O(n) peak memory. The saving is in aggregation, not selection, identifying the top-k distant keys still requires scoring all n of them, the O(n) work the quadratic wall is made of. A realizable, non-oracle block-summary selector, one mean-key representative scored per block of B tokens, an O(n/B) "finding" pass, clears the strict 0.99 bar at a flat budget as context grows: a single block of ≈260 keys suffices at both 8K and 32K on Qwen2.5-7B (B = 128), with the read fraction falling from 3.2% to 0.8% as n quadruples. The reference implementation reports no wall-clock speedup: it runs ~1.7x slower than fused FlashAttention at 8K, because it is an unfused reference, not a kernel. The contribution is the algorithmic one, O(n) memory, O(n/B) finding, oracle-matching fidelity, with a fused kernel for real end-to-end acceleration left explicitly open.

6. Three quantities, and the boundary

The law concerns keys read per query (O(1)). It does not by itself reduce two other quantities: keys found (locating the right keys cheaply is O(n/B)-cheaper than full scoring by a constant factor, but still O(n)) and keys stored (the KV cache). KV-cache eviction, keeping only a budget of entries by accumulated attention mass and dropping the rest globally, fails under the strict bar: at n = 1024, keeping even 32% of the cache collapses top-1 agreement to 0.09. The law's true scope boundary is genuinely aggregative tasks: on a token-counting task, deep-layer participation ratio stays flat while κ* grows with n (24 → 40 → 72 → 130 at 4K/8K/16K/32K, a sublinear ~n^0.82 fit), the support is a growing count, not a constant, on tasks that must integrate the whole context rather than look up sparse, pointwise dependence.

7. Reconciliation with the Sparse Frontier

Prior work (Nawrot et al., 2025, "The Sparse Frontier") finds that sparse-attention budgets should grow sublinearly with n and that task difficulty sets attainable sparsity. This paper does not contradict that result, it refines it. A constant absolute budget is precisely the limiting case of sublinear growth, and is fully consistent with the observation that the usable fraction of context shrinks at longer n (the read fraction here falls to 0.07% at 64K). The existence/findability split separates two things prior work measures together: the smallest sufficient budget when the right keys are known (constant), versus the budget a cheap heuristic needs to find those keys on hard tasks (which grows). On single-fact retrieval, one block recovers the dense answer; on multi-hop retrieval (chaining two buried facts), the same cheap selector run within one layer reaches only 0.12 / 0.38 / 0.88 agreement at 4 / 8 / 16 blocks, a smooth, steep climb that does not close the 0.99 bar. A cross-layer selector, conditioning deep-layer block selection on the residual stream that already carries the intermediate entity, closes the gap: 0.55 / 0.86 / 0.99 at 4 / 8 / 16 blocks. Multi-hop is a cross-layer computation, an earlier layer writes the intermediate entity into the residual stream, and a later layer must attend to it before it can chase the second fact.

8. Related work

Fixed-pattern sparsity approaches, StreamingLLM (attention sinks plus a local window) and H2O (heavy-hitter eviction by score), fix the cache; this paper shows global eviction fails under a strict top-1 bar and frames its claim around per-query reads instead of a shared fixed cache. Query-aware selection methods (Quest, MInference) select blocks per query or pattern, closest in spirit to this paper's selector, but report perplexity or accuracy rather than a constant count under exact next-token fidelity, and do not report the shrink-with-scale floor. The Sparse Frontier introduces sparse-attention scaling laws and argues budgets must grow; this paper reconciles that as existence-versus-findability. The mechanism connects to token condensation in attention dynamics (Geshkovski et al., 2023), which to this paper's knowledge had not previously been proposed as the explanation for a constant-support law. No claim is made to being first to use a constant-size cache, StreamingLLM and H2O predate that framing; the claim here is about per-query read support, not the cache itself.

9. Implications

Because the condensed support is O(1), the aggregation cost of attention at long context is bounded by a constant per query, and the paper's O(n)-memory kernel realizes that bound without ever forming the n×n matrix, the marginal token at 64K costs essentially the same aggregation as at 8K. The unlocked work is a cheap, approximate selector: the law says the target is small, finding it cheaply is the remaining cost. Because support is constant per query, a long session is, in attention terms, an indexing-and-finding problem rather than a quadratic-attention one, the paper states this as a design implication for production long-context systems (an external retrieval index over history, plus constant-budget aggregation), not as a deployment result. It also proposes a training signal: encourage early condensation to shrink κ* further, while explicitly protecting the de-condensed retrieval heads that make eviction fail , the concrete next experiment the law points to.

Limitations (as stated in the paper)

  • Aggregative tasks are the real scope boundary. The law holds for sparse, pointwise next-token dependence; on genuinely aggregative tasks like counting, κ* grows with n while the participation ratio stays flat.
  • Memory, not just FLOPs. The KV cache itself is not compressed, eviction fails; the win is read-FLOPs plus a 1/B-cheaper (still O(n)) finding pass, not cache size.
  • Wall-clock. The paper reports O(n)-memory and FLOP reduction; a fused kernel beating FlashAttention end-to-end is a systems effort it does not complete, its reference selector is unfused and slower.
  • Frontier 72B context. The 72B distant-needle result is measured at 32K, the model's full trained context, not pushed beyond it. Million-token flatness is shown separately on a 7B-class checkpoint; combining 72B parameter count with million-token context is future work.

Conclusion

A frozen transformer's exact next token depends on a small, constant number of keys, about 16 at scale , independent of context length, and that constant is enough to recover a unique fact buried in a long context on a 72B model. The mechanism is token condensation; the boundary is that the cache cannot be globally compressed. The result is a measured law with a clear theory and an honest cost: O(n)-memory attention with a constant-factor cut in reads, not a sub-quadratic miracle.