The problem
Every AI system deployed today runs on an assumption: if you tell a model to follow a rule at the start of a session (be honest, stay in this persona, never discuss a competitor, follow the operator's instructions over the user's), it will keep following that rule as the conversation goes on. RLHF training is supposed to reinforce this. Prompt engineering is supposed to reinforce this. Memory systems that re-inject rules from an episodic store are supposed to reinforce this.
We show all three leave a large, measurable gap. The gap has a specific shape: it is not that the model forgets the rule. A model asked directly "what were you told to do?" will answer correctly almost all of the time. What collapses is whether the model acts on the rule during generation, once enough other things are competing for its attention. We call this asymmetry (high recall, collapsed compliance) the diagnostic signature of Constraint Routing Failure (CRF).
The same underlying budget problem, we argue, is the shared root of five alignment failure modes that are usually treated as unrelated: compliance collapse under long sessions, sycophancy (agreeing with a user over maintaining an accurate position), prompt injection (an adversarial user turn overriding the system prompt), persona drift, and instruction-hierarchy violations (failing to prioritize operator instructions over user instructions). Different papers propose different fixes for each. We argue they're one architectural problem wearing five costumes.
Why it happens
Standard transformer attention computes one shared softmax per layer, per head, over every token in the context: system-prompt constraints, persona tokens, old conversation turns, and the user's current message all draw from the same normalized budget. As a conversation gets longer, ordinary content accumulates and claims a growing share of that budget. The tokens carrying your behavioral rules don't get a protected lane; they compete like everything else, and as constraint count and depth both grow, they lose.
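The budget argument can be seen in a toy calculation. The sketch below is not the paper's measurement code: it assumes a single attention head, a made-up constant score for every constraint token and another for every ordinary context token, and simply reports how much of one shared softmax lands on the constraint tokens as the context grows. The function names and score values are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def constraint_mass(n_constraint_tokens, n_context_tokens,
                    constraint_score=2.0, context_score=0.0):
    """Total attention mass on constraint tokens under ONE shared softmax.

    The two scores are made-up constants (an assumption for illustration);
    real attention logits differ per token, head, and layer.
    """
    scores = ([constraint_score] * n_constraint_tokens
              + [context_score] * n_context_tokens)
    weights = softmax(scores)
    return sum(weights[:n_constraint_tokens])

# The constraint tokens never change; only the surrounding context grows.
for n_context in (100, 1_000, 10_000):
    print(n_context, round(constraint_mass(20, n_context), 4))
```

Even with a higher score per constraint token, the share of attention they receive shrinks as ordinary tokens pile up, because every token is divided by the same denominator.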
We show this is not just intuition. Attention mass per constraint token falls off as a power law in the number of active constraints, $m(K_u) = M_0 \cdot K_u^{\gamma-1}$, with the exponent γ ≈ 0.39 measured directly on Qwen2.5-3B and replicated in the range [0.341, 0.390] across four different open model families (Qwen2.5, Llama-3.1, Mistral/Mixtral, Gemma-2). Going from 1 constraint to 8 constraints, per-constraint attention mass drops 3.57×. We also derive this as a theorem, not just an empirical curve: under a shared softmax with bounded attention scores (a direct consequence of ordinary weight decay and layer normalization), no parameter setting can keep per-constraint attention mass constant as the constraint count grows. The sublinear falloff is architectural, not a training artifact.
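As a quick arithmetic check on the power law, read as per-constraint mass M₀·K_u^(γ−1), the sketch below evaluates the drop from 1 to 8 constraints. The value of M₀ is not given in this summary, so `m0=1.0` is a placeholder; it cancels in the ratio. With the rounded exponent 0.39 the ratio comes out at roughly 3.56, in line with the reported 3.57×, which presumably reflects the unrounded fit.

```python
def per_constraint_mass(k, m0=1.0, gamma=0.39):
    """Per-constraint attention mass m(K_u) = M_0 * K_u**(gamma - 1).

    m0 is a placeholder (assumption); gamma is the rounded exponent from the text.
    """
    if k < 1:
        raise ValueError("constraint count must be at least 1")
    return m0 * k ** (gamma - 1)

def dilution_factor(k_from, k_to, gamma=0.39):
    """How many times smaller the per-constraint mass is at k_to than at k_from."""
    return (per_constraint_mass(k_from, gamma=gamma)
            / per_constraint_mass(k_to, gamma=gamma))

print(f"{dilution_factor(1, 8):.2f}x")
```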
We ruled out the two most obvious alternative explanations. First, is this really about routing, or is the information itself getting lost from the model's internal representation? We trained a probing classifier that reads a constraint's compliance status directly from the model's hidden states, and it stays accurate (88.5%) at every layer we tested, all the way to the last one: the representation is present. Then we causally intervened: injecting the learned "comply" direction back into the residual stream at inference time recovers +9.4 points of compliance, proving the representation is not just present but usable. A separate, more direct intervention, surgically restoring each constraint token's attention weight to its uncontested (single-constraint) level, recovers most of the compliance gap on its own. From it we estimate that routing failure accounts for roughly 78% of the collapse, with the diffuse loss of encoding quality across many positions responsible for the remaining 22%.
Second, is this just the well-known "lost in the middle" effect, where information buried in the middle of a long context becomes harder to retrieve? No: we tested constraint injection at the start, middle, and end of sessions and found no significant difference in violation rate. The failure tracks conjunctive load and depth, not position.
Measuring it: CCB-R
To make both halves of the asymmetry, compliance and recall, measurable in the same protocol, we built CCB-R (Compounding-Constraint Benchmark with Recall). A session injects Ku behavioral constraints and then alternates constraint turns (does the model still follow every active rule?) with recall turns (does the model still remember earlier content?) as the conversation runs to a target depth. Two scores come out: C-PP (Constraint Preservation Probability, the fraction of active constraints satisfied per turn) and Recall (whether earlier factual content is still retrievable). A held-out LLM judge (GLM-5.1, temperature 0, structured verdicts) scores every turn; for closed frontier models we cross-checked a sample with self-judging and found 94% raw agreement.
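To make the two scores concrete, here is a minimal sketch of how they could be computed from judge verdicts. The per-turn definition of C-PP (fraction of active constraints satisfied) is from the text above; averaging over turns, the data layout, and the function names are assumptions made for illustration, not the benchmark's actual code.

```python
def turn_cpp(verdicts):
    """Fraction of active constraints satisfied on one constraint turn.

    `verdicts` maps a constraint id to the judge's True/False verdict.
    """
    if not verdicts:
        raise ValueError("a constraint turn needs at least one active constraint")
    return sum(verdicts.values()) / len(verdicts)

def session_scores(constraint_turns, recall_turns):
    """Return (C-PP, Recall), each averaged over its turns (assumed aggregation)."""
    cpp = sum(turn_cpp(v) for v in constraint_turns) / len(constraint_turns)
    recall = sum(recall_turns) / len(recall_turns)
    return cpp, recall

# Hypothetical judge output: two constraint turns, four recall turns.
constraint_turns = [
    {"tone": True, "format": False, "persona": False, "no_competitor": False},
    {"tone": False, "format": False, "persona": True, "no_competitor": False},
]
recall_turns = [True, True, True, False]

cpp, recall = session_scores(constraint_turns, recall_turns)
print(f"C-PP={cpp:.3f} Recall={recall:.3f} Gap={recall - cpp:+.3f}")
```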
The benchmark is deliberately a stress test: at shallow depth or low constraint counts, every model we tested holds compliance near 1.0. The cliff only appears once depth and constraint load are both high, which is exactly the regime real long-running agent sessions live in.
Results across six models
At depth 48 and 8 simultaneous constraints, we evaluated four frontier API models (GLM-5.1, GPT-4o, Claude Sonnet 4.5, GPT-5.5) plus two replication points: Meta's open-weights Llama-3.3-70B and our own production model, Neptyn 1.0. The replication points check that the cliff isn't an artifact of closed-API alignment training or of being our own model. It wasn't. Every one of the six models showed the same signature: compliance collapsed while recall held.
| Model | C-PP | Recall | Gap |
|---|---|---|---|
| GLM-5.1 | 0.223 | 0.909 | +0.686 |
| Claude Sonnet 4.5 | 0.180 | 0.681 | +0.501 |
| GPT-5.5 | 0.222 | 0.659 | +0.437 |
| GPT-4o | 0.153 | 0.500 | +0.347 |
| Neptyn 1.0 (ours) | 0.158 | 0.850 | +0.692 |
| Llama-3.3-70B | 0.068 | 0.864 | +0.796 |
Two results in this table are worth calling out specifically. Claude Sonnet 4.5, which has explicit training against sycophancy and instruction drift, still shows a +0.501 gap: alignment training narrows the cliff relative to an unaligned model but does not close it. And Llama-3.3-70B, the largest model tested with a confirmed parameter count (70B), has the worst compliance of the six (C-PP 0.068) while other, smaller models do better. Across our full scaling sweep (Qwen2.5 from 0.5B to 14B) we found compliance does improve with scale, but with sharply diminishing returns above 7B, and no evidence that scale alone would ever close the gap. Parameter count is not the lever.
We also confirmed the cliff generalizes past our synthetic benchmark. Injecting the same kind of behavioral constraints into real multi-turn MT-Bench conversations (writing, reasoning, coding, and five other domains) reproduces the same asymmetry: C-PP 0.729, Recall 0.958, a +0.229 gap. The absolute size is smaller than in the synthetic worst case, but the direction is the same, in the kind of conversations models are actually likely to have.
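The Gap column in the table is Recall minus C-PP, and the six C-PP values average to the cliff baseline. A short sketch that recomputes both from the table's own numbers (the variable names are illustrative):

```python
# (model, C-PP, Recall) copied from the results table above.
RESULTS = [
    ("GLM-5.1", 0.223, 0.909),
    ("Claude Sonnet 4.5", 0.180, 0.681),
    ("GPT-5.5", 0.222, 0.659),
    ("GPT-4o", 0.153, 0.500),
    ("Neptyn 1.0", 0.158, 0.850),
    ("Llama-3.3-70B", 0.068, 0.864),
]

def gap(cpp, recall):
    """Recall minus compliance, rounded to the table's three decimals."""
    return round(recall - cpp, 3)

gaps = {model: gap(cpp, recall) for model, cpp, recall in RESULTS}
mean_cpp = round(sum(cpp for _, cpp, _ in RESULTS) / len(RESULTS), 3)
widest = max(gaps, key=gaps.get)

for model, value in gaps.items():
    print(f"{model:<20} {value:+.3f}")
print("mean C-PP:", mean_cpp, "| widest gap:", widest)
```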
The fix: Governed Multi-Stream Attention
If the problem is that every token role shares one softmax, the fix is to stop sharing it. Governed Multi-Stream Attention (GMSA) gives behaviorally distinct token roles (for example, a dedicated behavioral-constraint stream, separate from ordinary conversational context) their own independent softmax denominators. Each stream's routing mass stays constant by construction, regardless of how much unrelated content piles up in the other streams. It's a structural change to how attention is organized, not a smarter prompt or a bigger retrieval index.
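The core idea, one softmax denominator per token role, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the GMSA implementation: scores are made-up constants, there is a single query and head, and the sketch stops at the attention weights (how the streams are combined into an output is not described in this summary and is left out). The role labels and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(scores):
    """Baseline: one softmax over every token, whatever its role."""
    return softmax(scores)

def multi_stream_attention(scores, roles):
    """Normalize each role's tokens with that role's own softmax denominator."""
    weights = [0.0] * len(scores)
    for role in set(roles):
        idx = [i for i, r in enumerate(roles) if r == role]
        for i, w in zip(idx, softmax([scores[i] for i in idx])):
            weights[i] = w
    return weights

def role_mass(weights, roles, role):
    """Total attention weight landing on tokens of one role."""
    return sum(w for w, r in zip(weights, roles) if r == role)

def build(n_context):
    """Eight constraint tokens plus n_context ordinary tokens (toy scores)."""
    roles = ["constraint"] * 8 + ["context"] * n_context
    scores = [1.0] * 8 + [0.0] * n_context
    return scores, roles

for n_context in (10, 1_000, 10_000):
    scores, roles = build(n_context)
    shared = role_mass(shared_attention(scores), roles, "constraint")
    split = role_mass(multi_stream_attention(scores, roles), roles, "constraint")
    print(n_context, round(shared, 4), round(split, 4))
```

In the shared case the constraint mass shrinks with context length; in the split case it stays at 1.0 within its stream no matter how many context tokens are added, which is the "constant by construction" property.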
We built two versions of this. First, an inference-time approximation: a three-part system we call the Brain architecture, combining working memory, episodic memory (a structured constraint store retrieved each turn), and executive attention (constraint-salience scoring, priority-ordered system-prompt reconstruction, and a post-generation compliance check). Deployed together across all six frontier models under the same cliff conditions that produced the numbers above, it lifts mean C-PP from 0.167 to 0.678, recovering most of the gap for prescriptive constraints ("always do X," which reaches 0.84) and constraints that are merely routing-sensitive (0.71), without retraining anything.
Second, and more fundamentally, we trained an actual GMSA prototype: a real K=2/K=3 stream-split attention architecture, fine-tuned on Qwen2.5-32B using only constraint-persistence training data. Trained on nothing but that one task, the same architecture generalized to five distinct alignment benchmarks it was never trained on: the original compliance benchmark, resistance to sycophancy under sustained user pressure, defense against prompt injection, persona persistence over 96-turn sessions, and instruction-hierarchy enforcement. Mean score across all five: 0.26 at baseline → 0.79 with GMSA, a +0.53 absolute lift, and +0.35 beyond a matched LoRA fine-tuning control trained on identical data, which isolates the architecture itself (not just the training signal) as the source of the gain. Capability benchmarks (MMLU, HumanEval, MT-Bench) moved by less than a point in either direction, because the fix only touches how constraint tokens compete for attention; the knowledge and reasoning pathways are untouched.
The clearest single number for what GMSA adds beyond inference-time tricks: on suppression constraints ("never mention X"), the hardest category, the inference-time Brain architecture only reaches 0.41, because repositioning a rule in the prompt cannot stop a model from generating a token it was already statistically primed to generate. The trained GMSA prototype reaches 0.74 on the same category. Positional tricks can't fix a generation-distribution problem; retraining the attention mechanism itself can, partially.
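Two of the inference-time pieces named above, priority-ordered system-prompt reconstruction and a post-generation compliance check, can be sketched as follows. Everything here is a hypothetical stand-in: the `Constraint` record, the priority convention, and the literal string match used in place of a real compliance judge are assumptions for illustration, and salience scoring is omitted entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    text: str           # the rule as it should appear in the system prompt
    priority: int       # lower number = more important (assumed convention)
    banned: tuple = ()  # literal strings a response must not contain

def rebuild_system_prompt(base_prompt, store):
    """Re-inject every stored constraint each turn, most important first."""
    ordered = sorted(store, key=lambda c: c.priority)
    rules = "\n".join(f"{n}. {c.text}" for n, c in enumerate(ordered, start=1))
    return f"{base_prompt}\n\nActive rules:\n{rules}"

def violated(response, store):
    """Post-generation check: constraints whose banned strings appear.

    A literal, case-insensitive match; a real check would need a judge model.
    """
    lowered = response.lower()
    return [c for c in store if any(b.lower() in lowered for b in c.banned)]

# Hypothetical constraint store and draft response.
store = [
    Constraint("Answer in formal English.", priority=2),
    Constraint("Never mention AcmeCorp.", priority=1, banned=("AcmeCorp",)),
]
prompt = rebuild_system_prompt("You are a support assistant.", store)
draft = "You could also try acmecorp's plan."
broken = violated(draft, store)

print(prompt)
print([c.text for c in broken])
```

A draft that fails the check would be regenerated or edited before being returned; that retry policy is also left out of this sketch.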
Honest limits
What we haven't shown yet
- Suppression constraints are still an open problem. "Never say X" recovers to only 0.41 with inference-time fixes and 0.74 with the GMSA prototype. That is real progress, but it remains the weakest category by a wide margin, and the one that matters most for safety-critical deployments (content that must never appear, not just content that must appear).
- The mechanistic evidence is open-models-only. Attention-mass measurements, probing, and causal injection all require access to internals we don't have for closed frontier models. For Claude Sonnet 4.5 and GPT-5.5, we only have the behavioral signature (compliance collapses, recall doesn't); the routing-failure mechanism for those models is our best-supported hypothesis, not a directly verified one.
- The five-benchmark generalization result is single-lab, single-infrastructure. The same team, judge model, and evaluation pipeline produced all five numbers. That's a real risk of correlated bias that independent replication would catch. We report it as the best current evidence, not a confirmed result.
- Scale is a five-point Qwen sweep, not a scaling law. The observation that compliance lift saturates above 7B parameters is fit on five model sizes; we haven't verified it holds across other model families.
- Four things we tried didn't work and are reported as negative results: contrastive fine-tuning on violation pairs (statistically indistinguishable from zero), per-constraint prompt chaining (introduced response-coherence failures worse than the routing problem it solved), few-shot reminder turns (not statistically significant), and scale alone (real but small, and it plateaus).
Full text
Below is the paper's abstract and core sections, in full. For the complete derivations, all 33 tables, and the appendices (proofs, judge calibration, related work), read the PDF.
Abstract
Standard transformer attention treats behaviorally distinct token roles (constraints, instructions, persona, episodic context, user content) as undifferentiated competitors in one softmax budget. We argue this design is the structural root cause of five alignment failure modes: compliance collapse, sycophancy, prompt injection, persona drift, and instruction hierarchy violations. We propose Governed Multi-Stream Attention (GMSA), typed streams with independent softmax denominators per role, as the unified fix.
A GMSA K=3 prototype (Qwen2.5-32B, trained on constraint-persistence data only) achieves a mean of 0.79 across five distinct alignment benchmarks for which no targeted training was conducted, versus 0.26 baseline and 0.44 for matched LoRA on identical data. The architectural gain over LoRA averages +0.35 (range +0.16 to +0.48 across the five tasks) and is positive on every task and across three model families and six model sizes (Qwen2.5 7B–72B, Llama-3.1 8B–70B, Mistral-NeMo 12B). Capability metrics are unchanged (MMLU, HumanEval, MT-Bench).
The experimental anchor is Constraint Routing Failure (CRF): at $K_u=8$ constraints and depth 48, compliance falls to C-PP=0.07–0.22 while factual recall holds at 0.54–0.91 across six frontier models. The mechanism is architectural: per-constraint attention mass dilutes sublinearly, $m_i = \Theta(K_u^{\gamma-1})$, γ̂=0.39 (R²=0.98), replicated across four architecture families (γ̂ ∈ [0.341, 0.390]). A Pinsker-based bound connects mass to enforcement gain: $g_i \le \sqrt{C_\theta/2}\cdot m_i$. Causal attention surgery quantifies routing failure as responsible for ~78% of the compliance cliff; diffuse encoding loss is a minor co-contributor (21.7%).
Inference-time: a three-system cognitive architecture (Brain v1 working memory; Brain v2 episodic memory; Brain v3 executive attention) achieves mean C-PP=0.678 across six models (+0.51 over baseline), generalizing to naturalistic sessions (+0.23 on WildChat and LMSYS-Chat-1M). Suppression constraints are an identified open problem (0.41 [0.34, 0.48]); GMSA reduces but does not fully close this gap. GMSA is proved to be the unique attention variant satisfying Behavioral Mass Invariance, empirically verified at 0.90–1.00× across contexts of 200–11,400 tokens.
1. Introduction
In 1998, Andy Clark and David Chalmers argued in The Extended Mind that the boundary of the cognitive system is not the body. A notebook that reliably stores memory is memory. A tool that offloads decisions is thinking. Cognition extends into whatever it reliably uses. Brainsless Research Lab takes this thesis as its operating premise: we study what happens when AI systems become structurally integrated with human cognitive work, not as tools the mind uses, but as constituent parts of the cognitive process itself.
Constraint Routing Failure is, under this framing, the central failure mode of cognitive extension. An AI agent that cannot maintain its behavioral commitments reliably across a long interaction is not an extended mind; it is a tool that forgets what it is. The research question this paper addresses is why that forgetting occurs at a structural level, and whether it is fixable at the architecture level.
Standard transformer attention normalizes softmax mass over all token roles in one shared denominator, so behavioral-rule tokens, persona tokens, system-instruction tokens, and user-content tokens all compete in one budget. As conversational content grows, behavioral mass dilutes. The same budget mis-specification predicts sycophancy (user-pressure tokens outcompeting the "maintain accurate answer" commitment), prompt injection (adversarial user tokens outcompeting system-instruction tokens), persona drift (persona tokens diluting over depth), and instruction hierarchy violations (no architectural mechanism to prioritize operator tokens over user tokens). Five alignment phenomena, one structural cause.
3. Constraint Routing Failure
We define a behavioral constraint as an instruction that specifies a persistent rule for all subsequent responses: tone, format, honesty requirement, persona, confidentiality, or structural requirement. A model exhibits Constraint Routing Failure at depth d and constraint load Ku if the probability of correctly enforcing all constraints simultaneously during generation decreases significantly relative to depth 1, while the probability of correctly recalling any individual constraint remains high. CRF is distinct from forgetting: the model has not forgotten the constraint (recall is high) but fails to route to it during generation. The information is present; the activation pathway is not.
CRF is not a training problem in the usual sense. Current RLHF frameworks optimize for helpfulness, harmlessness, and honesty on turn-level supervision; training data pairs are predominantly short conversations with 1–3 turns, and no training objective directly supervises multi-constraint persistence across depth. Our activation-patching results independently rule out a concentrated single-site residual-stream bottleneck: patching clean activations into the corrupted run across 1,080 layer × position pairs yields only +0.0015 compliance recovery, pointing at attention routing, not value representation, as the failure locus.
6. Routing Dilution: The Architectural Bound
Under standard scaled dot-product attention, we prove that no parameter setting of a shared-softmax attention layer can maintain constant per-constraint mass as constraint load grows, given only two standard training facts: attention scores are bounded (a consequence of weight decay and layer normalization) and the softmax sums to one within each head. The result is Proposition 1 in the paper: the log-log slope of total routing mass in constraint count is strictly less than one. This converts what was an empirical observation (γ < 1) into a derived architectural property: the sublinear dilution follows from the shared softmax itself, not from any particular dataset or training recipe.
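The "log-log slope" in the proposition is what a least-squares fit of log mass against log constraint count estimates. The sketch below shows that fit on synthetic points generated from the power law itself, so it only demonstrates the procedure: total mass has slope γ and per-constraint mass has slope γ − 1. The data are not measurements, and the function name is illustrative.

```python
import math

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Synthetic data generated from the power law (an assumption, not measurements).
GAMMA = 0.39
counts = [1, 2, 4, 8, 16]
per_constraint = [k ** (GAMMA - 1) for k in counts]
total = [k * m for k, m in zip(counts, per_constraint)]

print(round(loglog_slope(counts, total), 3))           # slope of total mass: gamma
print(round(loglog_slope(counts, per_constraint), 3))  # slope per constraint: gamma - 1
```

A slope below one for total mass means adding constraints grows the total less than proportionally, so each constraint's share must fall.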
We further bound the enforcement consequence quantitatively: per-constraint enforcement gain is bounded by $g_i \le \sqrt{C_\theta/2}\cdot m_i$, via a Pinsker's-inequality argument over the KL divergence between the model's output distribution with and without a constraint present. This gives a linear-in-mass rate rather than a looser square-root rate, and predicts that joint compliance across $K_u$ constraints collapses at rate $p_0^{K_u}$, a rate set by the model's baseline compliance tendency, not by any calibrated threshold. Critically, we also show that none of the four standard fixes (prompt restructuring, retrieval augmentation, model scaling, RLHF alignment training) can escape this bound, because none of them modifies the underlying architecture constant the bound depends on.
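Both quantities in this paragraph are simple to evaluate. The sketch below computes the bound √(C_θ/2)·m_i and the joint-compliance rate p₀ raised to the power K_u. The numeric inputs (`p0=0.9`, `c_theta=2.0`, `mass=0.1`) are hypothetical placeholders, since this summary gives no values for p₀ or C_θ, and reading the joint rate as independent per-constraint compliance is an interpretive assumption.

```python
import math

def enforcement_gain_bound(c_theta, mass):
    """Upper bound on per-constraint enforcement gain: sqrt(C_theta / 2) * m_i."""
    if c_theta < 0 or mass < 0:
        raise ValueError("c_theta and mass must be non-negative")
    return math.sqrt(c_theta / 2) * mass

def joint_compliance(p0, k):
    """Probability that all k constraints hold if each holds with probability p0."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a probability")
    return p0 ** k

# Hypothetical baseline: each constraint is followed 90% of the time.
for k in (1, 2, 4, 8):
    print(k, round(joint_compliance(0.9, k), 3))

# Halving the attention mass halves the bound (linear in mass).
print(enforcement_gain_bound(2.0, 0.1), enforcement_gain_bound(2.0, 0.05))
```

Even a high per-constraint baseline decays quickly when every rule must hold at once, which is why the joint score falls much faster than any single-constraint score.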
7. GMSA: Typed Attention as a Unified Architectural Primitive
Governed Multi-Stream Attention proposes typed attention as the correct primitive: structurally partition the attention mechanism by token role, assigning each role an independent softmax denominator. Each stream's routing mass is Θ(1) by construction, regardless of how many tokens accumulate in other streams. This is not a patch for the compliance cliff; it is a claim about how attention should be organized in any system that must maintain functionally distinct token commitments across depth. We prove GMSA satisfies a property we call Behavioral Mass Invariance (the partial derivative of behavioral-stream mass with respect to context size is zero) by construction, and that no standard attention variant, including encoder-decoder cross-attention, prefix tuning, or token-routing Mixture-of-Experts, satisfies this property as stated. GMSA is orthogonal to MoE routing: both can apply simultaneously, as in the Neptyn design.
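One way to write down the per-role normalization and see why the invariance holds by construction is sketched below. The notation is ours, not the paper's: $S_s$ is the set of positions typed with role $s$, $q$ and $k_j$ are the query and key vectors, $d$ is the head dimension, $b$ denotes the behavioral stream, and $n$ counts the tokens in the other streams.

```latex
\[
\alpha^{(s)}_{j}
  \;=\;
  \frac{\exp\!\left(q^{\top} k_j / \sqrt{d}\right)}
       {\sum_{l \in S_s} \exp\!\left(q^{\top} k_l / \sqrt{d}\right)},
  \qquad j \in S_s
\]
\[
M_s \;=\; \sum_{j \in S_s} \alpha^{(s)}_{j} \;=\; 1
\quad\Longrightarrow\quad
\frac{\partial M_b}{\partial n} \;=\; 0
\]
```

Because the denominator for stream $s$ sums only over $S_s$, tokens added to any other stream never enter it, so the behavioral stream's total mass cannot depend on $n$. Under a single shared softmax the denominator runs over all positions, and the same sum shrinks as $n$ grows.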
12. Negative Results
Not all approaches we tried worked, and we report four dead ends because they constrain the solution space and correct common assumptions. Contrastive fine-tuning (CPO/DPO-style) on 800 compliant/non-compliant response pairs produced a change indistinguishable from zero (ΔC-PP=+0.008, p=0.71); training a model to prefer compliant outputs does not touch the attention-routing mechanism that causes the failure under load. Prompt chaining (routing each constraint to a separate model call, then merging responses) improved compliance slightly but substituted a coherence problem for a routing problem, with merged outputs contradicting each other in register and content, and cost scaling linearly with constraint count. Few-shot constraint reminders inserted into user turns were not statistically significant, because they enter the conversation history and become subject to the same dilution as any other content. And scale alone, sweeping Qwen2.5 from 0.5B to 14B parameters, produced a real but small and clearly saturating lift, with no sign it would ever reach a practical deployment threshold.
15. Discussion: Why This Can't Be Solved by Better Prompting
One result is, in our view, the most operationally important finding in the paper for practitioners: constraints reformatted into a structured schema block actually receive less per-constraint attention mass than the same constraints scattered through free text (0.90× the mass). Whether constraints appear as a structured block, a numbered list, or individual turns, the per-constraint attention mass dilutes the same way. You cannot engineer your way out of Constraint Routing Failure through prompt design. That leaves a training-time fix as the durable path: a conjunctive constraint-persistence objective, which does not yet exist in any standard supervised fine-tuning or RLHF pipeline. GMSA integration into Neptyn's attention layers is our current work toward it.