Attention pays its bill.

From the Constant-Support Law to measured wall-clock: a fused gather-decode kernel, and the regime where constant support pays.

The previous report proved that a transformer only needs a small, fixed number of memories to decode correctly, no matter how long the conversation. It also admitted the reference implementation was slower than dense attention. This report builds the real kernel and measures it: up to 41.9x faster at the attention operation at one million tokens, end-to-end decode that stays flat as context grows instead of the 2.4x slowdown dense attention pays over the same span, and a demonstration that the memory cache does not need GPU memory at all. A $200 graphics card can serve a million-token conversation from ordinary RAM. The report also prints where the method loses, and where its own retrieval guarantees turned out to be conditional on an easy test.

Mohammad Alsufi & Connor Scott BRL-2026-07 · June 2026 · 22 pages DOI 10.5281/zenodo.20673046

Attention-op speedup versus context length (batch 8, H100)

Read the full text Download the PDF ↓

The finding, plainly

BRL-2026-06 established the Constant-Support Law: a frozen transformer reproduces its dense next-token prediction from a small, constant number of keys, independent of context length. It proved the law but reported no wall-clock speedup, its reference implementation was an unfused Python loop, roughly 1.7x slower than the model's own fused attention path at 8K tokens. This report closes that gap.

We built a fused Triton gather-decode kernel: block-aligned reads, GQA-structured so one key/value tile is loaded once and shared across query heads, fp32 online softmax, and the whole decode step (embedding through greedy argmax, 28 layers) captured in a single CUDA graph. We measured it against the strongest dense baseline we could construct, not naive PyTorch, but the fastest probed SDPA backend at every shape, which sustains 86–96% of the H100's own memory roofline. Every prediction was written down before any measurement ran, and the paper reports which predictions held and which didn't.

41.9×

faster at the attention op, 1M tokens, batch 8

1.77×

end-to-end decode speedup at 1M tokens, batch 1

15×

a 24GB card decoding from host RAM vs. the only dense option that fits

R²=0.998

a 3-parameter cost model fits the entire measured grid

The core mechanism is a memory-traffic model, committed before any benchmark ran. Decoding one token moves bytes: the model's weights (a fixed 15.2GB, read once per step), and the KV cache reads, which for dense attention grow linearly with context, 7.5GB per sequence at 128K tokens, 60GB at 1M. The sparse kernel instead reads a constant keep-set: one sink block, a local window, and a small number of distant blocks selected by relevance, regardless of how long the conversation has gotten. Step time becomes a simple equation in bytes moved, fitted bandwidth, and a fixed per-step overhead. That equation, not intuition, is what tells you when sparsity pays and when it doesn't.

Why the reference kernel was slow, and what fixed it

The predecessor's selector was a prefill-shaped, per-head Python loop: it found the right keys but paid for the finding in eager-mode kernel launches and uncoalesced memory gathers. The lesson wasn't that gathering is inherently slow, it's that the gather has to be fused and block-aligned. Because every candidate key/value tile is a contiguous 128-token region of the cache, loads are fully coalesced and the indirection costs one integer load per block. No score vector the length of the context, let alone a full attention matrix, is ever materialized; per-step attention work is constant in context length by construction, not by approximation.

The first working selector still cost about 85μs per layer, small kernels whose launch overhead, not their math, set the floor. Hoisting shared work out of the per-layer path and fusing two bound-scoring matrix multiplies into one cut that to 56μs. A second variant, splitting the key-value scan across more parallel programs, raised occupancy 13x and cut sparse op time to 35–66μs across the whole sweep, pushing the batch-1 crossover (the context length where sparse starts winning) down from 128K to roughly 32–64K.

Measured results

At the attention-operation level, the picture is stark (charted at the top of this page). Dense latency roughly doubles with every doubling of context length: 10.7μs at 8K, growing to 5,324μs at 1M tokens (batch 8). The sparse kernel barely moves, 35–149μs across the same range, independent of n to first order, because it reads a constant number of blocks no matter how long the cache is. The gap between those two curves is the Constant-Support Law expressed as wall-clock: a 128K-token step and a 1M-token step cost about the same to serve.

End to end, the whole model, not just the attention op, the sparse engine's per-sequence decode throughput is close to flat in context length: about 89–96 tokens/second from 128K out to 1M tokens on the long-context model (an 8x longer context costs about 7% more). The dense engine decays 2.4x over that same span. At 1M tokens, batch 1, sparse decode is 1.77x faster even measured inside an unfused harness where both engines pay the same fixed per-step overhead; at 128K with batch 8 it reaches 1.81x. Below the crossover point, the report says plainly that sparsity is not yet paying, more on that below.

The regime map: a three-parameter bill

The organizing result of the report is a small equation, fitted once and reused everywhere: step time equals bytes moved divided by effective bandwidth, plus a fixed overhead, plus, for the sparse path, a "price of finding" that accounts for the selector's own work. Three numbers, measured on an H100: bandwidth β ≈ 3.05 TB/s, fixed step overhead c₀ ≈ 3.2ms, and price-of-finding c₁ ≈ 1.7ms at the cheaper (k=8) budget, 3.3ms at the budget that passes this paper's own accuracy bar (k=32).

That equation reproduces every measured H100 cell in the paper's speed grid at R² = 0.998 in-sample, with held-out error under 2.2% on configurations it wasn't fitted to. It also turns the Constant-Support Law into an operator's decision rule rather than a hopeful claim: plug in a hardware's bandwidth and a workload's context length and batch size, and the model tells you, in milliseconds, whether sparse attention is worth turning on for that regime, and by how much.

Two of the report's own pre-registered hypotheses about that model failed as stated, and it says so directly: a simpler two-term version of the equation (no price-of-finding term) fit the data far worse than its own bar required, and a related fidelity prediction also missed. The three-term model above is the amendment, not the original guess, and the paper is explicit that this is a correction, not a retroactive pass.

The cache doesn't need the GPU

The report's largest systems consequence follows directly from the law: if a decode step only ever reads a few dozen blocks per layer, then 99.5–99.9% of a long conversation's cache is sitting in memory doing nothing on any given step. So it doesn't need to sit in GPU memory at all.

The team placed a 60GB key-value cache, the full memory of a million-token conversation, in ordinary pinned host RAM, on a machine with a 24GB GPU card that could never fit that cache in its own memory. The GPU holds only the model weights and a small per-layer staging buffer; the selector doubles as a prefetcher, pulling just the constant keep-set over PCIe each step. Result: 6.1 tokens/second of million-token decode, on hardware where dense decoding at that length isn't possible at all. Compared to the honest floor, streaming the full dense cache over the same PCIe link, the only dense option that hardware has, constant-support reading is 15x faster. The same tiered path was validated end-to-end at 128K tokens, where it also solves the paper's needle-retrieval task at full accuracy, so the mechanism is checked for correctness, not just raw throughput.

The practical reading: HBM residency stops being the hard boundary of usable context length. Host RAM becomes the new boundary. At the node level, that boundary is terabytes, not tens of gigabytes.

Honest limits

The report prints its own failure modes in the same tables as its wins, which is the point of pre-registering predictions before measuring.

What the paper says doesn't work (yet)

Below roughly 64K tokens, sparse is slower, not faster. At 8K tokens, batch 1, the sparse engine runs at about 0.84x dense, the constant keep-set only saves a fraction of a gigabyte of reads, while the selector itself still costs real time. Nothing about the kernel fixes this; it's a property of the bill, not a bug to be optimized away, and the crossover point moves with hardware and batch size.
The cheaper budget fails the paper's own accuracy bar. Reading with only 8 distant blocks (k=8), the pre-registered, cheapest configuration, produces 0.985 next-token agreement with dense attention, against a pre-set 0.99 requirement. It doesn't clear the bar. The paper's headline accuracy tables lead instead with the more expensive "assured" budget (k=32), which reaches 0.9975.
Retrieval assurance turned out to be haystack-conditional. The needle-in-a-haystack test the field uses to certify long-context retrieval was run first on a repetitive, looped-paragraph filler corpus, the easiest case for a distant-block selector, because a synthetic needle stands out. Swapping that filler for real, natural prose at the same length dropped recall at the cheap budget from 0.99 to 0.58. The authors found this themselves, ran the harder test, and put the number in the abstract rather than the appendix.
Fixed-budget retrieval degrades with context length, and the paper measures exactly where. On the model trained for million-token context, recall at a fixed key budget holds above 0.9 out to 128K tokens, then collapses at 512K even at generous budgets. Retrieval "cost" is cheap in microseconds; retrieval reliability at a fixed budget is a property of what the model's own attention heads learned during training, not something the selector kernel can fix by itself.

None of this is hidden in an appendix. The report's position is that a systems result is only trustworthy if it's honest about where it stops holding, the same discipline BRL-2026-06 applied to the law itself.

Model & hardware

Qwen2.5-7B-Instruct (28 layers, GQA 28/4 heads, head-dim 128, bf16) and its 1M-context variant. Measured on H100 and H200 GPUs, plus a 24GB A10 for the host-RAM tiering demo. All experiments ran on serverless Modal GPUs.

Keep-set

Each decode step reads one sink block, a local window, and k distant 128-token blocks selected by a per-head bound score. Two budgets recur: k=8 (cheap, pre-registered) and k=32 (the "assured" budget that clears the fidelity bar).

Dense baseline

Not a strawman: the fastest probed SDPA backend (cuDNN attention won every shape tested) at 2.9–3.2 TB/s, 86–96% of the H100's own HBM3 roofline.

Predecessor

BRL-2026-06, "Attention Has A Type," established the Constant-Support Law this report converts into wall-clock. Read it here.

Reproducibility

Every number ties to a named, version-controlled JSON artifact. Re-running the full measurement program, including the internal revision cycle, cost $135.20 of serverless GPU time.

Read the full report text ↓

Full text

Attention Pays Its Bill

From the Constant-Support Law to Measured Wall-Clock: A Fused Gather-Decode Kernel and the Regime Where Constant Support Pays.
Mohammad Alsufi and Connor Scott, and the Brainsless Research Lab AI Systems Research Group. Brainsless Research Lab, Technical Report BRL-2026-07, June 2026.

Abstract

BRL-2026-06 established the Constant-Support Law, a frozen transformer reproduces its dense next-token prediction from a constant number of keys, independent of context length, but reported no wall-clock speedup: its reference implementation ran 1.7x slower than FlashAttention, and the fused kernel was left as an open item. This report closes it. We build a fused Triton gather-decode kernel (block-aligned gather, GQA-structured, fp32 online softmax, CUDA-graph captured end-to-end) and measure it under pre-registered predictions against the strongest dense baseline we can construct, the fastest probed SDPA backend. At the attention-op level the kernel delivers 2.3–10.2x (batch 1, 128K to 1M context, split-k variant) and 3.9–41.9x (batch 8, same variant). End to end, sparse decode throughput per sequence is flat from 8K to 128K on H100 and within 7% from 128K to 1M on H200 (dense decays 2.4x over the same span), and the measured price-of-finding, bandwidth, step overhead, and a three-parameter memory-traffic model, fits the measured H100 (n × batch) grid at in-sample R² = 0.998 with held-out error ≤ 2.2%, turning the law into an operator's regime map: where constant support pays, and where it honestly does not.

Fidelity is measured through the real engine: on natural text at 8–32K, per-step agreement at the k=8 budget is 0.906 unconditional, below the predecessor's strict 0.99 bar. Conditioned on confident steps (dense margin > 1), agreement pooled over every non-quarantined artifact is 0.985 at k=8 and 0.9975 at the assured k=32 budget, which clears the 0.99 bar. On the long-context-trained model (Qwen2.5-7B-Instruct-1M), distant-needle retrieval certifies one-sided 97.5% lower bounds of 0.982 at 32K (k=8) and 0.981 at 128K (k=32) on the looped-paragraph haystack those suites use; a real-prose probe at 32K reads 0.58/0.78/0.96 at k = 8/16/32 (100 trials), so retrieval assurance is haystack-conditional and budget buys it back.

A matched-budget baseline panel inside the same engine separates the policies: a sliding window is fastest and never retrieves; finer page-granular selection retrieves with better per-key fidelity but pays roughly 8x the selector operations at decode. We further measure findability, recall at fixed budget decays with context, depends on the key geometry training induced, and collapses by 512K even on the 1M-trained model, quantifying the budget-versus-length frontier observed across sparse-attention methods generally. Finally, the law's strongest systems consequence: the cache does not need the GPU. With the 60GB cache in host RAM and the keep-set fetched over PCIe, a 24GB A10 sustains 6.1 tok/s of million-token decode, 15x the dense-from-host floor, with the tiered path validated end-to-end at 128K. Every number is tied to a named artifact, and the full measurement program is independently re-runnable, documented, at a per-run itemized cost.

1. From law to clock

BRL-2026-06 established the Constant-Support Law: a frozen transformer reproduces its dense next-token prediction at a strict fidelity bar (top-1 agreement ≥ 0.99) from a small, model-dependent constant number of keys, independent of context length, verified from 8K to 1M tokens, with the oracle budget shrinking with model scale to a floor of 16 at 72B parameters. That report also stated plainly that its own reference implementation was roughly 1.7x slower than the model's fused FlashAttention path at 8K, because it was an unfused reference implementation rather than a real kernel, and left building that kernel as an explicit open item.

This report closes that item. We build a fused kernel, a Triton gather-decode kernel that reads only the constant-support keep-set from the key-value cache, integrate it into a real decoding engine for Qwen2.5-7B-Instruct (and its 1M-token long-context variant), and measure wall-clock against the strongest dense baseline we can construct, under the same honesty rules as the prior report: every number is measured and tied to a named artifact, comparisons are against the best dense configuration rather than a naive implementation, and pre-registered predictions are reported whether they held or not.

The organizing instrument is a memory-traffic model, committed before any benchmark ran. Decode is memory-bound: a step's latency is, to first order, the number of bytes it must move. The model prices every regime in bytes and converts the Constant-Support Law into falsifiable wall-clock predictions, including the prediction that the prior report's own target of "2–4x at 128K" must fail at batch 1, not because the kernel is weak, but because at batch 1 the 15.2GB of model weights dominate the bill. Where constant support pays out is a property of the bill, not of the kernel alone.

2. The memory-traffic model

Decoding one token with this model architecture moves, per step: the model weights (≈15.23GB, read once per step regardless of batch size); dense key-value reads, which scale linearly with context length (7.5GB per sequence at 128K tokens, 60.1GB at 1M); and, for the sparse path, a constant gathered keep-set plus a small block-summary scan (roughly 1/128th of the dense traffic). The pre-registered top-4-distant-block configuration keeps the gathered portion under 100MB even at 1M tokens; the production top-8 configuration used for the headline fidelity numbers adds under 0.5% to every speedup prediction.

The step-time model is bytes moved divided by effective bandwidth, plus a fixed per-step overhead (dominated by kernel launches, small under CUDA graphs). Both free parameters of that base model are fitted on the dense rows only; the sparse prediction then has no free parameters of its own, and its fit quality against measurement is itself one of the report's pre-registered hypotheses.

3. The kernel

The predecessor's reference selector was a prefill-shaped, per-head Python loop: it found the right keys but paid for the finding in eager-mode launches and uncoalesced gathers. The lesson isn't that gathering is slow, it's that the gather has to be fused and block-aligned. The kernel is built decode-first, since that is where the law's read-saving is purest (one query, n cached keys). One Triton program runs per batch-item and key-value head: it selects the keep-set (mean-pool keys per 128-token block once at prefill, score the summaries against the query, keep the sink block, the last local blocks, and the top-k distant blocks by score), then streams the selected blocks through a fused, block-aligned gather-attend step with an fp32 online softmax. Because selection is block-aligned, every key/value tile the kernel reads is a contiguous 128×128 region of the cache, so loads are fully coalesced and the indirection costs one integer load per block; no context-length-sized score vector, let alone a full attention matrix, is ever materialized. GQA is exploited structurally: one key-value head's tile is loaded once and shared by all of its query heads. The entire decode step, embedding, all 28 transformer layers with the kernel, the language-modeling head, greedy argmax, and position bookkeeping, is captured in a single CUDA graph, so per-step launch overhead is amortized identically for both the sparse and dense paths. Prefill itself stays exactly dense (the model's own fused attention for full sequences, with an exact chunked path for very long sequences), so the key-value cache is bit-identical between dense and sparse runs: sparsity exists only in decode-time reads, never in storage.

4. Measurement methodology

Dense and sparse configurations share everything, weights, key-value tensors, prefill, the decode step, CUDA-graph capture, except the attention read itself. At each configuration, every eligible dense attention backend is timed and the fastest is used as the baseline (cuDNN's fused attention won every shape probed in this study). Timing uses CUDA events around graph-replay loops, three repeats per mode interleaved between dense and sparse runs within one process, to control for thermal and clock drift on rented cloud GPUs; run-to-run session variance was 2–4%. Fidelity is measured through the real decoding engine, eager and exactly masked, three ways: per-step next-token argmax agreement with the dense engine on a true text continuation; free-running exact-prefix length under greedy decoding; and a distant-needle task, where a unique code is placed at a random distant position in the context and both engines must retrieve and answer it correctly, greedily, with no cascading shortcuts.

5. Op-level results

Three facts structure the attention-operation-level measurements. First, the dense baseline is not a strawman: the fastest dense backend sustains 2.9–3.2 TB/s at contexts of 128K tokens and above, which is 86–96% of the H100's 3.35 TB/s HBM3 bandwidth ceiling in the roofline sense. Second, the sparse path is flat: 35–149 microseconds per call, nearly independent of context length, while dense latency doubles with every doubling of context, growing from 10.7 microseconds at 8K to 5,324 microseconds at 1M tokens (batch 8), a 42x gap at that one cell alone. Third, the crossover point, where sparse first beats dense, is exactly where the memory-traffic bill says it should be: at batch 1 it sits between 32K and 128K tokens depending on the kernel variant; at batch 8 it drops below 16K, because batching multiplies the dense key-value bytes that the sparse kernel eliminates while the sparse floor grows only mildly. A parallel "split-k" kernel variant, which spreads the key-value scan across more concurrent programs, raises op-level speedup to 2.28x at 128K and 10.24x at 1M (batch 1), and to 41.94x at 1M (batch 8), the headline number in this report.

The price of finding the right keys was paid down twice during development. The first working selector cost roughly 85 microseconds per layer, execution time rather than launch overhead, which put the batch-1 crossover as far out as 128K. Hoisting work that is shared identically across all 28 layers out of the per-layer path, and fusing two small scoring matrix multiplies into one, cut that floor to 56 microseconds. The split-k variant then raised parallelism roughly 13x and cut sparse op time to 35–66 microseconds across the whole sweep.

6. End-to-end decode

Whole-model decode throughput, embedding to argmax, CUDA-graphed, dense and sparse interleaved for a fair comparison, shows the Constant-Support Law as a service-level property: the sparse engine's per-sequence throughput is approximately flat in context length (about 100 tokens/second per sequence at 8K, 32K, and 128K alike on H100), while the dense engine's throughput decays as the key-value read term grows. Serving a 128K-token request costs about the same, per token, as serving an 8K-token one. Below the crossover, the report says plainly that sparse is honestly slower: at 8K tokens, batch 1, the sparse engine runs at roughly 0.84x dense, because the keep-set only saves a fraction of a gigabyte of reads against a 15.7GB total bill while still paying for the selector's own work. Extending to a million tokens on the long-context-trained model (H200, k=32), per-sequence sparse throughput stays flat, 89 to 96 tokens/second from 128K to 1M, an eightfold longer context costing about 7% more, and the crossover point sits later on H200 than H100, because H200's faster memory bandwidth makes dense reads cheaper even as the sparse selector floor does not shrink.

Measured against the predecessor's own pre-registered target of 2–4x decode speedup at 128K and 5–10x at 1M: that target was not met in any measured cell as literally stated, but the shortfall is fully explained and priced by the fitted memory-traffic bill, the engine's own fixed per-step overhead, paid identically by both dense and sparse modes, caps the achievable ratio in a way the original target's reasoning did not account for. Measured maxima were 1.81x at 128K (batch 8, H200) and 1.77x at 1M (batch 1). Removing that shared fixed overhead as a projection (not a measurement) recovers ratios much closer to the original target and, at the operation level alone, the measured 11.5–41.9x figures sit entirely above the original gate's range. The gate was written about the attention operation; the full harness bill is what a real deployment actually pays, and this report reports both.

7. Matched-budget baselines

The closest published methods to this kernel are page-granular query-aware sparsity (Quest) and sink-plus-window streaming attention (StreamingLLM). Both were re-implemented inside the same engine, identical weights, cache, prefill, and measurement harness, with only the keep-set selection policy changed, so the comparison isolates policy and granularity rather than reproducing each method's own separately optimized kernel. A sliding window is the fastest policy and never retrieves anything outside its window, by construction. A finer, page-granular selection policy retrieves with better per-key fidelity than this report's 128-token block granularity, but at roughly 8x the selector operations at decode time, because sixteen-token pages require eight times as many relevance summaries as 128-token blocks for the same context. Block-128 selection, this report's granularity, is the coarser choice whose cheaper finding cost is exactly what the fused kernel makes small; that tradeoff, not a claim of universal superiority, is the finding of this section.

8. Fidelity under the kernel

An early, cheaper version of the block-relevance scorer, averaging keys within a block and scoring against a group-mean query, passed every synthetic correctness test and reproduced the model's free-running text continuation exactly, yet solved only 6 of 25 distant-needle retrieval trials against the dense engine's 25 of 25. The diagnosis: a short needle inside a 128-token block gets averaged away by mean pooling on the key side, and averaged away again by the group-mean query on the retrieval-head side. Switching to a per-head bounding score (an upper bound on any token's dot product within the block, maximized over the query group) recovered 25 of 25 at the same budget. This replicates, at decode time and through a real fused kernel, the same failure-and-fix the predecessor paper documented for its own selector.

Because raw next-token argmax agreement over a few dozen steps is a fragile metric, unable to distinguish a genuine retrieval failure from a near-tie any floating-point reordering would flip, the primary fidelity protocol is distributional: teacher-forced KL divergence and perplexity between dense and sparse predictions over long true-text windows. Per-step agreement at the cheap (k=8) budget is 0.906 unconditional across the pooled measurement set, below the predecessor's strict 0.99 bar. Conditioning on steps where the dense model is actually confident (top-1 versus top-2 logit margin greater than 1, since most disagreements concentrate at near-ties with margin under 0.5) raises pooled agreement to 0.985 at k=8 and 0.9975 at the assured k=32 budget, which clears the bar. The report is explicit that this conditioning is a post-hoc diagnosis of a missed gate, not a substitute for it, and prints the unconditional number in the same table.

9. Findability: the budget the law does not fix

The distant-needle task, a unique code at a random distant position, answered greedily through the real engine, is where retrieval reliability, as opposed to retrieval cost, is stressed. Recall at a fixed key budget decays with context length: on the model's native 32K-token training window, recall falls from 125/125 at 8K to 88% at 32K with the cheap budget, recovering to 98.3% and 100% as the budget grows. On the model trained for million-token context, recall holds above 90% out to 128K tokens at both tested budgets, then collapses by 512K, to as low as 40–80% depending on budget, even though the same budgets are ample at shorter lengths. The required key budget is roughly flat out to 128K and then turns sharply upward, a budget-versus-length frontier consistent with what has been documented across sparse-attention methods generally, quantified here for this specific engine and selector.

The certifications above were run on a looped-paragraph filler corpus, the easiest case for a block-relevance selector because a unique needle stands out sharply against repetitive background text. Replacing that filler with real natural prose at the same 32K length drops recall at the cheap budget from 0.99 to 0.58; the mid budget drops from a near-perfect rate to 0.78; the assured budget holds at 0.96. Real prose populates the cache with semantically competitive blocks, and the budget the selector needs grows with how competitive the background is, retrieval assurance is a property of the model, the budget, and the haystack together, not of the selector alone, and the certified numbers should be read as an upper envelope rather than a general guarantee.

10. The cache does not need the GPU

The Constant-Support Law's strongest systems implication is that if a decode step reads only a few dozen blocks per layer, the remaining 99.5–99.9% of the key-value cache has no business occupying expensive GPU memory on any given step. This report demonstrates that directly: the 60GB key-value cache of a million-token context is placed in pinned host RAM, with the GPU holding only the model's weights and a small per-layer staging buffer, and the selector doubling as a prefetcher that pulls the keep-set over PCIe each step. On a 24GB GPU, a card on which dense million-token decoding cannot exist at all, since the cache alone is 2.5x the card's entire memory, the tiered engine decodes at 6.1 tokens/second. The honest comparison is the dense-from-host floor: streaming the full cache over the measured PCIe link would cap dense throughput at 0.4 tokens/second, so constant-support reading wins by 15x, and the gap is set by memory-bandwidth arithmetic, not kernel tuning. The same tiered path is validated at the certified 128K length, where it also solves the retrieval task at full accuracy. What changes here is not a speedup but the memory economics of long context: HBM residency stops being the boundary of usable context length; host RAM becomes the boundary, and at the node level, that boundary is measured in terabytes.

11. The regime map

The pre-registered two-term memory-traffic model gains one term the original registration missed and the measurement found: an engine-level price of finding. Effective bandwidth and fixed step overhead are fitted on the dense rows only; the price-of-finding term is fitted once on the sparse rows. With those three numbers, β ≈ 3.05 TB/s, fixed step cost ≈ 3.2 milliseconds, and price of finding ≈ 1.7 milliseconds at the cheap budget (3.3 milliseconds at the assured budget), the model reproduces every measured H100 speedup cell at in-sample R² = 0.998, with held-out prediction error under 2.2% on configurations withheld from the fit. The same functional form, refitted per platform, prices the H200 long-context rows as well, with the price-of-finding term measuring meaningfully higher on the faster platform, a result the report flags rather than fully explains, and attributes tentatively to clock and occupancy differences between platforms. Two related pre-registered hypotheses about a simpler, two-term version of this model failed as stated and are reported as failures, not folded quietly into the successful three-term amendment.

12. Related work

This report builds on a large body of work making dense attention memory-efficient (FlashAttention-2/3, flash-decoding) and on the query-aware and streaming sparse-attention methods it benchmarks directly against (Quest, StreamingLLM), as well as adjacent work on KV-cache offloading and memory tiering across GPU, CPU, and disk. Distinct from all of these, this report's contribution is the accounting: a memory-traffic model whose pre-registered two-term form was amended by one measured term, refitted per platform, that reproduces measured throughput across the full H100 grid at R² = 0.998 and converts the Constant-Support Law into an operator's regime map.

13. Limitations

All wall-clock numbers are measured on one model family (Qwen2.5-7B and its 1M-token variant) at one parameter scale; the traffic model's bandwidth and overhead terms should generalize by substitution to other models and hardware, but the price-of-finding term does not transfer without being re-measured, and larger-scale runs were outside this report's budget. The engine built for this report is a research harness, not a production serving stack: its fixed per-step overhead is several times higher than a production inference server's, so the measured end-to-end ratios understate what a more optimized runtime would achieve. Key-value memory itself is not reduced anywhere in this report, only decode-time reads are sparse; the cache is stored and paid for in full by both the dense and sparse paths. The fidelity bar is measured on greedy decoding against wikitext and a synthetic needle task; sampling-based generation and multi-hop retrieval are not covered. Needle trials at 512K and 1M tokens are comparatively small in count. Rented cloud GPU clocks are not perfectly stable; run-to-run variance of 2–4% is measured and disclosed rather than eliminated. Pre-registration in this report is attested within the same version-controlled release rather than externally time-stamped; future reports will commit predictions as a separately timestamped step.

14. Conclusion

The predecessor report proved a law and confessed a debt: no wall-clock speedup, a fused kernel explicitly left as future work. This report pays it. The fused gather-decode kernel turns the Constant-Support Law into measured wall-clock: at the attention operation, up to 41.9x against a dense read sustaining 86–96% of the H100's memory roofline; end to end, decode whose per-sequence cost does not depend on context length, 1.77x faster at one million tokens even measured inside an unfused harness. Just as important is what the bill refuses to pay: below the crossover point sparsity buys nothing, and the measurement says so; the engine's own fixed costs cap end-to-end ratios, and the three-parameter model prices that cap; and finding the right keys, cheap in microseconds at the kernel level, demands a key budget that depends on what the model's own attention heads learned, with long-context training making keys findable at a realizable budget far above the theoretical minimum. The next open items are named directly: a fully fused selector kernel, integration with a paged-memory serving runtime, sparse prefill (this report prices decode only), frontier-scale validation, and the geometry of trained key-findability as an object of study in its own right, the last taken up immediately in BRL-2026-08.

Compute statement

All experiments ran on serverless GPUs (L40S for development and fidelity stages, H100 for op-level and end-to-end grids, H200 for long-context sessions, A10 for the host-RAM tiering demonstration). Every number in this report is reproducible from a named, version-controlled artifact. Re-running the complete measurement program, including the internal revision cycle, cost $135.20 of serverless GPU time.

See it in production: Cortex → ← Back to the research index Brainsless home