Brainsless Research Lab: attention, memory, alignment

Frontier-class AI you can own.

Brainsless is a research lab that makes owning a frontier-class model practical. We take the strongest open models and make them ours: our own training, our own memory system, our own serving stack, specialized to a job and run on hardware we control.

Today that stack serves a trillion-parameter model at over 510 tokens per second from four GPUs. That is above every provider median the public leaderboard publishes for this model, at the leaderboard's own workload, and it is measured, published, and reproducible down to the prompt seeds.

A faster step is not only lower latency. It is more tokens from the same GPUs, so one cluster carries more users before it saturates. That is the lever that turns owning a frontier model from a hyperscaler's budget line into something a focused team can run, and it is the difference between renting intelligence and holding it.

The lab exists to break the walls that make ownership impractical: serving speed, serving cost, and specialization. Planless, our AI co-founder, is where the proof runs. It is trained on the work of building a company, it remembers through Cortex, our memory system, and it answers from our own model on our own stack.

Four technical reports back the research. One lab, one thesis: a specialized model you own beats a general one you rent.

The research

BRL-2026-11 · NEW · July 2026

Over 510 tokens per second on Kimi-K2.6, from four GPUs.

511.6 tokens per second, single stream, lossless. The record was set at 505.9, the first GPU reading above 500 on this model, and raised by blind re-runs of the unmodified public release: replication moved the record up, not down. It stands above every provider median the leaderboard publishes for this model, with fourteen providers listed. The best median was 438.1 at the record's pin, from undisclosed hardware, and 451.0 at the raised cell's pin, which puts 511.6 at 13.4% above. Per GPU, that is about three times NVIDIA's disclosed trillion-parameter configuration. The report carries the decomposition that found it: the forward pass costs 6.6 milliseconds, verifying a guessed token costs 0.4 milliseconds, and most of the rest of the step was engine overhead we removed. Tool-calling traffic sets the headline; the math cells read 419.2–422.1 tokens per second, and every session and node draw is printed. Verification is a purchasable fact: one command, about $15, re-runs the record from the public release. Every part in the configuration is stock and public, so the number is a floor, and our trained drafter is already 5.8% ahead of the stock head in a same-session pair.
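The two headline ratios can be re-derived from the figures quoted on this page. The sketch below is only an arithmetic check, not part of the published artifacts. It assumes the per-GPU comparison is against the 340 tok/s, 8×B200 NVIDIA reference listed elsewhere on the page; the function names are illustrative.

```python
# Arithmetic check on figures quoted in the text; nothing here is newly measured.
OURS_TOK_S = 511.6           # 4xB200, single stream
LEADER_AT_PIN_TOK_S = 451.0  # best provider median at the raised cell's pin
NVIDIA_TOK_S = 340.0         # assumed comparison point: NVIDIA 671B reference, 8xB200


def percent_above(value: float, baseline: float) -> float:
    """Return how far `value` sits above `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


def per_gpu(tok_s: float, gpus: int) -> float:
    """Return single-stream tokens per second divided by GPU count."""
    return tok_s / gpus


margin = percent_above(OURS_TOK_S, LEADER_AT_PIN_TOK_S)
ratio = per_gpu(OURS_TOK_S, 4) / per_gpu(NVIDIA_TOK_S, 8)

print(f"{margin:.1f}% above the leader at its pin")       # 13.4%
print(f"{ratio:.2f}x per GPU versus the NVIDIA reference")  # 3.01x
```

Dividing single-stream throughput by GPU count is a coarse normalization: it says nothing about batch throughput, and the NVIDIA reference runs a different model.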

GPU serving of Kimi-K2.6, single stream, 10k-token workload, pinned 2026-07-05, n=16

Deployment	Hardware	tok/s, single stream
Brainsless (ours)	4×B200, count stated	511.6 (+13.4% over the leader at its pin)
Crusoe	Undisclosed	438.1
Fireworks	Undisclosed	381.2
NVIDIA record, 671B (reference)	8×B200, lossless	340
CoreWeave	GB300 NVL72 rack	261.8
Nebius	Undisclosed	222.4
Together (FP4)	Undisclosed	218.2
GMI (FP8)	Undisclosed	40.2

External bars: Artificial Analysis, with NVIDIA's published 671B single-user record added for reference. No other entry runs a configuration as small as four GPUs.

The attention series — how little of a million-token context models actually read

BRL-2026-06 · June 2026

Attention Has A Type

The law. To predict its exact next word, a model only ever needs a small, fixed number of memories, whether the conversation is eight thousand tokens or a million.

BRL-2026-07 · June 2026

Attention Pays Its Bill

The cost paper. A long conversation now costs the same per word as a short one, and a million-token cache runs from about $1,500 of ordinary RAM instead of $180,000 of GPUs.

BRL-2026-08 · June 2026

Attention Finds Its Keys

The frontier. Past half a million tokens, models stop finding the right memory, and the rank law explains why: the collapse is arithmetic, not bad luck.

Mohammad Alsufi & Connor Boone, with the Brainsless Research Lab AI Systems Research Group. Code public, noncommercial license.

GPU serving of Kimi-K2.6, single stream.

Our measurement against the field, pinned 2026-07-05. External numbers are Artificial Analysis live-endpoint medians at the 10k-token workload; ours is a server-side benchmark at the same workload, n=16 per cell. No other entry runs a configuration as small as four GPUs.

Deployment	Hardware	GPU count stated	tok/s, single stream	Measurement
Brainsless (ours)	4×B200, one node	Yes — 4	511.6	Server-side benchmark, n=16, tool cell; record set 505.9 (2026-07-05), raised by blind re-runs of the public release; math 419.2–422.1
Crusoe	Undisclosed	No	438.1	AA live-endpoint median (peak 449)
Fireworks	Undisclosed	No	381.2	AA live-endpoint median
CoreWeave	GB300 NVL72 rack (their blog)	Rack-scale — 72	261.8	AA live-endpoint median
Nebius	Undisclosed	No	222.4	AA live-endpoint median
Together (FP4)	Undisclosed	No	218.2	AA live-endpoint median
GMI (FP8)	Undisclosed	No	40.2	AA live-endpoint median

Two reference points outside the table's scope. Cerebras serves this model at 981 tok/s from wafer-scale hardware on a private endpoint; it is excluded from a GPU comparison by construction. The best documented lossless single-stream number in this model class is NVIDIA's 340 tok/s on DeepSeek-R1 671B from eight B200s under TRT-LLM; their 368 variant relaxes acceptance and pays 2.8 points of MMLU-Pro, so it is a different claim. Ours is lossless: the full model verifies every token, and the output distribution is the model's own.

On 2026-07-06 we replayed the same workload against one provider's production API from a standard paying account: it read 118.4–168.9 tok/s across its endpoint products against a 381.2 board median (fireworks_live_replication.json); the table above uses the board's numbers as published.

The record itself replicated: three blind re-runs from the public release landed every n=16 cell inside the stated node envelope, raising the record to 511.6 (depth 6), with first-eight medians to 538 and single requests to 568 (brl11_repl_r1.json, brl11_repl_cleanroom.json, brl11_repl_fovea.json).

Protocol, ours: 16 distinct novel ~10k-token docpack prompts per domain, 2048-token outputs, temperature 0.6, per-request seeds, no prefix-cache reuse. Per-request streaming decode rate (ctok−1)/(t_last−t_first), interpolated median of n=16, Kimi-native tokens (o200k parity 1.0035–1.0081). Engine: SGLang v0.5.14, EAGLE3 speculative decoding (public 3B MLA draft head, draft depth 7, chain top-k 1), fp8 KV cache. Node draws move throughput ±5–10% and every session and draw is disclosed. Acceptance is cross-validated on two engines (τ 4.825 SGLang, 4.815 vLLM, equal depth, same head). External medians fetched 2026-07-05 from artificialanalysis.ai. Artifacts: brl11_stage600a.json, brl11_stage600e1.json, brl11_stage600b.json, brl11_stage600sg.json, brl11_stage600r.json (the record session).
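The per-request metric in the protocol can be sketched in a few lines. This is a minimal illustration, not the lab's benchmark harness: `decode_rate`, `cell_median` and the sample timings are hypothetical. It also assumes "interpolated median" means the mean of the two middle values when n is even, which is what Python's `statistics.median` computes.

```python
from statistics import median


def decode_rate(completion_tokens: int, t_first: float, t_last: float) -> float:
    """Streaming decode rate (ctok - 1) / (t_last - t_first), in tokens per second.

    t_first and t_last are the arrival times, in seconds, of the first and last
    streamed tokens. The first token only starts the clock, so ctok tokens span
    ctok - 1 intervals; time spent before the first token is not counted.
    """
    if completion_tokens < 2:
        raise ValueError("need at least two tokens to measure a decode interval")
    if t_last <= t_first:
        raise ValueError("t_last must be later than t_first")
    return (completion_tokens - 1) / (t_last - t_first)


def cell_median(rates: list[float]) -> float:
    """Median of one cell's per-request rates (assumed: mean of the middle two for even n)."""
    return median(rates)


# Hypothetical timings for illustration only: (completion tokens, t_first, t_last).
samples = [(2048, 0.5, 4.5), (2048, 0.5, 4.6), (2048, 0.4, 4.3), (2048, 0.6, 4.7)]
rates = [decode_rate(ctok, t0, t1) for ctok, t0, t1 in samples]
print(f"cell median: {cell_median(rates):.1f} tok/s")
```

Because the clock starts at the first streamed token, this metric isolates decode speed from prompt processing. That matters at a ~10k-token prompt, where time to first token would otherwise dominate a short completion.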

