Attention finds its keys, but not far enough.

Past roughly half a million tokens, models stop finding the right memory. We pre-registered three fixes and refuted all three. The autopsy that followed found why: the needle's rank barely moves as context grows, but survival requires it to shrink. The collapse is arithmetic, not bad luck, and we can finally say exactly where it happens.

Mohammad Alsufi & Connor Scott· BRL-2026-08 · June 2026 · 9 pages· DOI 10.5281/zenodo.20673107
Needle rank percentile versus survival threshold, by context length
Eleanor is our resident researcher. Click the planet to speak with her.

The frontier the last paper left open

BRL-2026-07 built the engine that makes reading a long conversation cheap and moved its memory off the GPU. It also found the one thing that engine could not fix: findability. One-shot block-summary selection, score every block of the context, keep the top few, solves needle retrieval cleanly at 32K and 128K tokens, then collapses by 512K, even on a model trained for a million tokens of context.

This report goes after that collapse directly. We pre-registered three specific hypotheses for what was breaking and how to fix it, committed the predictions before running the measurements that would judge them, and let the results stand as written. All three hypotheses failed. The autopsy that the first failure obligated us to run is what actually explains the wall, a simple, measurable law that had been hiding underneath the whole time.

Hypothesis 1: hierarchy does not save one-shot selection

The first guess was rank dilution: ranking the top blocks out of thousands of candidates is a wide, noisy decision, so narrowing every individual decision, score coarse super-blocks first, then only rank the children of the ones that survive, should restore accuracy at the same total budget.

The pre-registered bar was strict: hierarchical recall at or above 0.9 at 512K tokens, refuted below 0.7 or if it failed to beat flat selection outright. It did neither.

30 paired trials per cell. Recall is the model answering the planted-needle question end to end. Selection-hit is whether the needle's block survived the selector's own cut, isolating the selector from the model.
PolicyRecall @ 128KRecall @ 512KSelection-hit @ 512K
Dense reference1.000.97n/a
Flat top-161.000.330.190
Hierarchical0.870.170.177

H1 is refuted on both registered conditions, recall of 0.17 is well below the 0.7 floor, and at the point estimate hierarchy actually trails flat selection at equal budget (the confidence intervals overlap, so we don't over-read the ordering, but there is certainly no improvement). The selection-hit numbers explain why, and this is the useful part of an otherwise negative result: flat and hierarchical selection find the needle's block at statistically indistinguishable rates. The failure isn't where the ranking decision happens, a decision over a few hundred candidates fails exactly as often as one decision over four thousand. The score itself stops ranking the needle's block highly as context grows, and no amount of restructuring the decision tree changes that. Dilution was the wrong mechanism.

The autopsy: a scale-invariant percentile

Per the pre-registration's own rule, a refuted hypothesis obligates a follow-up measurement. For every trial we captured the selector's actual queries, scored every valid block with the same rule the selector uses, and recorded the rank percentile of the needle's block, the fraction of competing blocks that outscore it.

Context length32K128K512K
Median needle rank percentile1.6%1.8%3.6%
Percentile required to survive top-166.3%1.6%0.39%
Fraction of layer/heads where it survives top-160.640.490.26

Across a sixteen-fold range of context length, the needle's rank percentile barely moves, it drifts from 1.6% to 3.6%. Meanwhile, surviving a fixed budget of "keep the top 16 blocks" gets steadily harder, because the cutoff percentile shrinks in direct proportion to how many blocks now exist: at 512K it takes beating 99.6% of competitors, not 93.7%.

The rank law

The needle block's score-rank percentile is approximately constant as context grows. Its survival threshold is not, it must shrink like 1/n for a fixed-size selection budget to keep catching it. Because the percentile stays put while the bar keeps rising, every fixed-budget one-shot selector, flat or hierarchical, inherits a collapse horizon. The 512K wall BRL-2026-07 measured is this law's left tail, not an engineering artifact.

We label this hypothesis-generating, not confirmed: it was found in a pre-registered autopsy but the autopsy itself was post-hoc, and it predicts per-head membership in the keep-set, not end-to-end recall directly, that link is real and monotone in every measurement we have, but not yet separately characterized. A pre-registered transport test, fit at short context and predicting long, is next.

One clarifying point worth keeping straight: within a single context length, "percentile below the cutoff" and "makes the keep-set" are close to the same statement by construction. What makes the law informative rather than circular is that the distribution of percentiles measured at one length reliably describes what happens at other lengths, that transport is the actual empirical claim, and it's what lets the law predict the 512K wall from the 32K and 128K measurements alone.

The law transports to real prose

BRL-2026-07 had already shown that swapping a synthetic, repeated-paragraph haystack for real prose at a fixed length hurts retrieval, recall at 32K dropped depending on budget. Re-running the autopsy on that same real-prose haystack finds the entire effect sitting in one number: the median rank percentile jumps from 1.6% to 13.5%, real prose is roughly 8.5× more competitive as a background than the synthetic haystack, and keep-set membership at a top-16 budget falls from 0.64 to 0.38.

That is the same law expressed on a different axis. Growing the context and swapping in a harder haystack are two ways of doing the same thing: adding competitors that outscore the needle. We call this a successful pilot rather than a confirmation, twelve trials, one fixed haystack, no pre-registration, but the direction is unambiguous, and the registered version of this test is the next program's first experiment.

Hypothesis 2: findability is trainable, but doesn't reach far enough

If the problem is the score's geometry, the fix should be geometric: train a small projection that reshapes the selector's queries so the needle's block ranks higher, without touching the model itself. We trained a rank-16 low-rank projection, 458,752 parameters, identity-initialized, applied only inside the selector, contrastively on 120 synthetic-needle prompts at 128K tokens. The model's own forward pass and output logits are provably untouched; the projection only changes which blocks get read.

Training took minutes. The pre-registered bar was strict: recall at or above 0.9 at 512K, four times the training length, refuted below 0.6.

Refuted

Against the pre-registered bar

0.47 recall sits far below the 0.9 target and below the 0.6 floor that would have counted as a partial win.

Real signal

Untrained → trained

512K recall nearly doubled, 0.27 → 0.47, at four times the length the projection ever saw in training.

Uncertified

30 trials per arm

Confidence intervals for the two arms overlap ([0.14, 0.44] vs. [0.30, 0.64]), a real-looking gain that isn't yet statistically locked down.

Read against the rank law, this is exactly what partial percentile compression should look like: training at 128K teaches the projection to push the needle's block past its 128K-scale competitors, but at 512K the survival bar is four times tighter, and a projection that never saw 512K-deep score distributions only closes part of that gap. Findability is trainable, the geometry isn't frozen, but one cheap training pass at a single length doesn't buy scale-invariant findability. Training across lengths, training the keys instead of the queries, or leaving one-shot scoring behind entirely are the open directions this hands to the next report.

Hypothesis 3: the score gap is not an uncertainty signal

The last idea was cheap by design: use the selector's own top-score shape as a free per-step uncertainty signal. When the scores near the cutoff are bunched close together, spend a bigger budget; when one block clearly dominates, a small budget should suffice. If it worked, it would mean adaptive spending without any extra machinery.

It doesn't work, on both registered clauses. The threshold that met the fidelity bar on a calibration window turned out to just be spending the full budget in disguise, roughly 31 blocks out of a possible 32. And measured against the actual signal it would need, whether the score gap predicts which steps are hard, the correlation across six evaluation windows averaged ρ ≈ 0.08, statistically indistinguishable from no relationship. The top of the score distribution carries essentially no information about whether the cheap budget will be enough at that step. Spending more when scores look close together is spending blind.

This closes a design door worth having closed: per-step budget adaptivity needs a signal from outside the selector's own scores, not a repackaging of them.

The findability curve

The chart at the top of this page is the rank law in one picture: the needle's median rank percentile against context length, plotted alongside the percentile a fixed top-16 budget actually requires for survival. The two lines almost touch at 32K and pull apart fast, that widening gap is the collapse.

Hover any point for the exact measured value. Source: find_autopsy.json, 12 trials per length, median across all layer / key-value-head pairs.

Honest limits

This is the one section where we'd rather undersell than oversell, because the open problem is the actual news.

  • Findability past 512K is still unsolved. That's not a caveat tacked onto a win, it's the frontier this report maps but does not close. Nothing here assures retrieval at 512K; the law explains why the wall exists and where it will keep appearing, not how to remove it.
  • H2's gain is real-looking but not statistically certified. 30 trials per arm on identical prompts, with overlapping 95% confidence intervals (0.27 [0.14, 0.44] untrained vs. 0.47 [0.30, 0.64] learned). We report it because the direction and the mechanism both make sense against the rank law, but we're not claiming it's locked down.
  • One model, one needle family, one block size. All measurements run on Qwen2.5-7B-Instruct-1M with verbatim-code retrieval in a controlled haystack, the easiest case for a selector. The wikitext transport pilot (12 trials, one fixed haystack) is suggestive, not a registered confirmation.
  • The rank law itself is hypothesis-generating, not proven. It was discovered in a post-hoc autopsy (mandated by H1's refutation, not planned from the start) and predicts per-head keep-set membership across a measured range, its account of how membership becomes end-to-end recall is a link we've observed to be stable but haven't separately modeled. The pre-registered test that would certify transport, fit short, predict long, is the next report's first experiment.
  • Total compute for every number in this report: $56.15 of a pre-registered $300 cap. Any individual claim can be independently re-measured for tens of dollars.

Frequently asked

Why does long-context retrieval collapse specifically around half a million tokens?

Because a fixed-size selection budget (say, "keep the top 16 blocks") requires the correct block to outscore an ever-larger fraction of competitors as context grows. We measured that the correct block's actual rank percentile stays roughly flat, around 1.6% to 3.6%, no matter how long the context is. The survival bar, meanwhile, shrinks in direct proportion to context length. Past a certain length the bar falls below where the correct block typically sits, and retrieval collapses. That crossover lands around 512K tokens in our measurements.

Does making the selection process hierarchical (coarse-to-fine) fix it?

No. We tested this directly (H1) and it does not improve on flat, single-pass selection at the same budget, if anything it trails slightly at the point estimate. The reason is that hierarchy only changes how narrow each individual ranking decision is; it doesn't change the underlying score, and the score is what's failing to rank the correct block highly enough.

Can you train your way out of the collapse?

Partially. A small trained projection (458K parameters, minutes to train, the base model untouched) nearly doubled recall at four times its training length, 0.27 to 0.47 at 512K. That's real signal that the underlying geometry is trainable, but it falls well short of the reliability bar we pre-registered, and the result isn't yet statistically certified at the sample sizes we ran. Scale-invariant findability remains unsolved.

What is the "rank law"?

A needle block's score-rank percentile among its competitors is approximately constant as context length grows, rather than shrinking the way it would need to for a fixed-size retrieval budget to keep finding it reliably. That single, simple fact, measured at 1.6% at 32K tokens, 1.8% at 128K, and 3.6% at 512K, explains the entire collapse curve without needing any other mechanism.

Full text

A clean rendering of the paper's core content, for reference and for indexing. For figures, tables, the full artifact manifest, and citations, see the PDF.

Abstract

BRL-2026-07 measured the wall where constant-support decoding stops working: one-shot block-summary selection certifies needle retrieval at 32K and 128K and collapses by 512K, even on a 1M-trained model. This report attacks the wall with three pre-registered hypotheses and refutes all three, and the autopsy mandated by the first refutation yields the law that explains every number.

H1 (hierarchical selection): refuted. Two-level bounds-of-bounds selection at constant budget shows no improvement over flat selection, worse at the point estimate (0.17 vs. 0.33 recall at 512K, intervals overlapping), and selection-hit rates show why: the failure is the bound score's ranking, not the width of any decision.

The rank law: the needle block's score-rank percentile is approximately scale-invariant (median 1.6% at 32K, 1.8% at 128K, 3.6% at 512K) while fixed-k survival demands it shrink like 1/n, the percentile distribution measured at one length carries to the others, making the 512K collapse arithmetic rather than accident. The real-prose haystack effect the predecessor measured turns out to be the same law: a single 8.5× rightward percentile shift.

H2 (learned findability, model untouched): a rank-16 query projection trained contrastively on 120 prompts at 128K raises 512K recall 0.27 → 0.47 at k=16 (uncertified at 30 trials/arm, intervals overlap) with the model's forward pass untouched, suggestive signal, refuted regardless against the pre-registered 0.9 bar and 0.6 extrapolation floor.

H3 (adaptive budget from the selector's score gap): refuted. The gap is uncorrelated with need (mean per-window Spearman ρ ≈ 0.08 across six evaluation windows), so the calibrated policy degenerates into always-spending. Every number is tied to a committed artifact; total compute $56.15 of a $300 cap, itemized. Finding remains the one cost of long context that scales, and it now has a law.

1. The frontier BRL-2026-07 left open

BRL-2026-07 made reading the constant support fast and freed its storage from the GPU, then measured the wall it could not pass: findability. One-shot block-summary selection, score n/128 block bounds, keep the top k, certifies needle retrieval at 32K (497/500 at k=8) and 128K (200/200 at k=16) on a long-context-trained model, and collapses by 512K (0.4–0.8 across k ≤ 64). That fixed-budget sparse attention degrades with context is documented across methods; what was missing was a mechanism.

Three pre-registered hypotheses, committed before any hypothesis-bearing measurement: H1, the collapse is rank dilution, so hierarchical two-level selection restores recall at constant budget. H2, findability is trainable without touching the model: a low-rank query projection used only by the selector, trained contrastively on synthetic needles, restores recall and extrapolates beyond its training length. H3, the selector's top-score gap is a per-step uncertainty signal enabling an adaptive budget.

2. Method

All machinery is BRL-2026-07's engine unchanged; selection policies plug in through a single extension point, so every comparison is policy-vs-policy inside one engine. All measurements use Qwen2.5-7B-Instruct-1M on single Modal GPUs, and the bound score under study is the predecessor's selector rule, adopted from Quest [Tang et al., 2024]: per block, score = relu(q)·kmax + (−relu(−q))·kmin, an upper bound on any token-level dot product inside the block. Readouts per paired trial: end-to-end needle recall (model in the loop) and selection-hit (is the needle's block in the keep-set?). The hierarchical selector takes bounds-of-bounds over 16-block supers; the learned selector applies a rank-16, per-layer, per-key-value-head identity-initialized projection to the selector's queries only, trained online on one synthetic-needle prompt per step.

3. H1: hierarchy does not save one-shot selection

H1's mechanism story was rank dilution: top-k over n/128 candidates degrades because the candidate set grows, so keeping every ranking decision narrow should restore recall at constant budget. The prediction was specific: hierarchical recall ≥ 0.9 at 512K (k=16), refuted below 0.7 or if hierarchy fails to beat flat at equal budget. H1 is refuted on both registered conditions. The selection-hit column says why: flat and hierarchical selection hit the needle's block at numerically indistinguishable rates (0.190 vs. 0.177 at 512K), both catastrophically below the registered 0.9. The failure is not where the decision is made, narrow decisions over a few hundred candidates fail exactly as often as one decision over 4,096. The bound score itself stops ranking the needle's block highly as n grows.

4. The rank autopsy: a scale-invariant percentile

For each trial the selector's queries were captured at the answer step, every valid block scored with the same bound rule the selector uses, and the rank percentile of the needle's block recorded per layer and key-value-head. The percentile does not shrink: across a 16× length range the median drifts from 1.6% to 3.6%, approximately scale-invariant, with mild degradation at the long end, while fixed-k survival demands it shrink in proportion to 1/n: top-16 of 4,096 blocks requires beating the 0.39th percentile. The needle's block reliably outscores roughly 96–98% of competitors no matter how many competitors exist, and that single fact generates the entire findability surface: Pr[block in top-k] = Pr[percentile < k/(n/128)], a nearly fixed distribution evaluated at a moving threshold.

At 32K the cut (6.3%) sits comfortably above the median, most layer-heads hit (0.64). At 128K the cut (1.6%) lands on the median, half hit, and end-to-end recall survives only through redundancy across layers. At 512K the cut (0.39%) is deep in the left tail, the per-layer hit rate falls to roughly 0.2, and recall collapses.

The law transports across haystacks. BRL-2026-07 measured that replacing the looped-paragraph haystack with real prose at a fixed length (32K) drops recall to 0.58/0.78/0.96 at k=8/16/32. Re-running the autopsy on the wikitext haystack finds the entire effect in one number: the median percentile shifts from 1.59% to 13.5%, real prose is an 8.5× more competitive background, with top-16 membership falling 0.64 → 0.38. In rank-law terms the conclusion stands either way: haystack competitiveness and context growth are the same failure axis, two ways of adding competitors that outscore the needle. The law also says what a repair must do: nothing that reorders decisions (hierarchy, H1) can help, because the percentile is a property of the score. A repair must compress the percentile, make the needle's block outscore not 98% but 99.99% of competitors. That is a statement about the query–summary geometry, which is exactly what H2 trains.

5. H2: findability is trainable, but does not extrapolate far enough

The rank law says a repair must compress the needle block's score percentile. H2 attempts this without touching the model: a residual low-rank projection (rank 16, per layer and key-value-head, identity-initialized, 458,752 parameters total) is applied to the selector's queries only, the model's forward pass never sees it, so logits computed over any given keep-set are bit-identical to the untrained selector's. Training is online and cheap: 120 synthetic-needle prompts at 128K, one prefill each, one cross-entropy step on the needle block among all valid distant candidates per layer/head; the contrastive loss falls from 59.7 to 0.52 in minutes of optimization. The pre-registered bar was strict: recall ≥ 0.9 at 512K (4× the training length), refuted if extrapolated recall < 0.6.

The verdict is refuted on the extrapolation clause, with real signal inside. At 512K, k=16, 30 trials per arm: flat untrained 8/30 = 0.27 (95% CI [0.14, 0.44]); flat learned 14/30 = 0.47 ([0.30, 0.64]). The trained projection nearly doubles recall at four times its training length, the intervals overlap, so the gain is consistent and substantial but not certified at this sample size, and either way it sits far below both the 0.9 prediction and the 0.6 refutation floor. Read against the rank law, the result is exactly what partial percentile compression looks like: training at 128K teaches the projection to push the needle block past competitors at the 128K percentile cut (1.6%), but at 512K the cut is 4× tighter (0.39%), and a projection that has never seen 512K-deep score distributions closes only part of that gap.

6. H3: the score gap is not an uncertainty signal

H3 claimed the selector's own top-score shape is free per-step uncertainty: when scores between rank 8 and rank 32 sit close to the top of the spread the selector is unsure and should spend k=32; otherwise k=8 suffices. Registered predictions: KL within 10% of always-32 at 32–128K with mean per-step k ≤ 14; refuted if the gap signal is uncorrelated with need. H3 is refuted on both registered clauses. The calibration rule could only meet the fidelity bar by giving the budget back: the frozen threshold yields mean-k ≈ 31 of a possible 32, the "adaptive" policy that satisfies the constraint on the calibration window is always-32 wearing a costume, and even so, the registered fidelity clause fails on three of six evaluation windows. The reason is the registered refutation condition itself: across six evaluation windows the gap–need rank correlation is ρ ∈ [−0.28, 0.27], mean per-window ≈ 0.08, the top of the score distribution carries essentially no information about whether the cheap budget will be sufficient at that step. Spending more when scores look close is spending blind.

7. Limitations

One model (Qwen2.5-7B-Instruct-1M), one needle family (verbatim-code retrieval in a looped paragraph haystack, the easiest haystack for a selector), one block size (128). Trials are 12–30 per cell: enough to refute the pre-registered bars decisively, not enough to certify small effects, H2's gain is reported with overlapping unpaired intervals, and per-trial pairing was not logged. The rank autopsy reads one decode step per trial on key-value-head 0 readouts for selection-hit; the percentile distribution pools all layer/head pairs. H3's gap signal was evaluated at one (klo, khi) pair (8, 32); richer uncertainty features were not tested. Several pre-registered deviations are disclosed in full in the PDF, including trial-count reductions under cost constraints and a dropped tiering-rig deliverable, since no selector in this report assures 512K retrieval.

8. Conclusion

Three pre-registered hypotheses; three refutations; one law. Hierarchy cannot save one-shot selection because the failure lives in the score, not the decision width (H1). The score's own top-end shape knows nothing about when it is wrong (H3). And the score's failure is now quantitative: the needle block's rank percentile is approximately constant in context length when survival demands it shrink like 1/n, so every fixed-budget one-shot selector inherits a collapse horizon. Training the selector's queries moves the percentile (H2's near-doubling at 4× extrapolation) but does not flatten it. Reading is constant, storing is tiered (BRL-2026-07), and finding is the one cost that still scales, now with a measured law saying exactly how. Collapsing the context window for good means breaking that law, percentile-compressing selection, trained across lengths, or selection that reads more than one shot, and that is the next report's problem statement.

Compute statement

All experiments ran on Modal serverless GPUs. Total GPU spend for every number in this report: $56.15 of a pre-registered $300 cap, itemized per run. Any single claim can be independently re-measured for tens of dollars.

Citations

M. Alsufi & C. Scott, Attention Pays Its Bill, BRL-2026-07 (2026). · M. Alsufi & C. Scott, Attention Has A Type, BRL-2026-06 (2026). · Nawrot et al., The Sparse Frontier, arXiv:2504.17768 (2025). · Tang et al., Quest, ICML (2024).