Giving a large language model 4,000 tokens of context is easy; giving it a million without blowing up memory or collapsing latency is one of the deepest engineering problems in modern inference. The culprit is twofold: attention is quadratic in sequence length, and the KV cache — the very trick that makes generation fast — sees its memory grow linearly with every token. This article explains why long context is hard, what the KV cache actually stores, and the arsenal of ideas (MQA/GQA, FlashAttention, PagedAttention, RoPE/YaRN, attention sinks, ring attention) that make giant context windows possible — along with their very real limits.

Why long context is hard: the O(n²) wall

The heart of a Transformer is self-attention: every token "looks at" every other one. For a sequence of length n, you compute a score matrix of size n × n : each token is compared against each other token. Both the compute and memory cost of that matrix are therefore O(n²).

Doubling the context quadruples the attention work. At 1,000 tokens the matrix has a million entries. At 100,000 tokens it has ten billion. This is the fundamental wall of long context, and no engineering trick erases it entirely.

That wall has two distinct faces:

  • At training (and at prefill) — the initial processing of the prompt — the cost is the raw quadratic computation of attention over the whole sequence at once.
  • During token-by-token generation, it's different: we don't recompute the whole matrix at every step, thanks precisely to the KV cache. But that cache carries its own memory cost, linear and relentless.

Understanding this duality is essential: prefill is limited by compute, while autoregressive decoding is limited by memory and bandwidth. Effective optimizations target one or the other of these phases — rarely both at once.

The KV cache: what it stores and why it swells

When a model generates text autoregressively, it produces one token at a time. Without optimization, each new token would replay attention over the entire already-generated prefix, recomputing the keys (K) and values (V) of all past tokens at every step. This is a colossal waste: the work grows as O(n²) over the whole generation.

The KV cache removes that redundancy with a simple idea:

  • each token's key and value are computed once ;
  • they are kept in memory (the "cache") ;
  • at the next step, only the K/V of the new token is computed, which then queries the existing cache.

This is what makes generation viable in practice. The price is memory: for every token, every layer, and every attention head, you must store a (key, value) pair. The cache size is therefore, in order of magnitude:

KV_memory ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × length × batch × bytes_per_value

The decisive factor is sequence_length: with a fixed architecture, the cache memory grows linearly with the context. Each extra token adds a constant slice.

Take a concrete order of magnitude. For a model with a few dozen layers, in 16-bit (2 bytes), with modest K/V head dimensions, the KV cache of a single long sequence can reach tens of gigabytes — often more than the active weights needed for one decoding step. Multiply by the batch size and memory explodes. At that point it's the KV cache, not compute, that caps:

  • the batch size (how many requests run in parallel) ;
  • the maximum context length you can actually serve ;
  • and therefore the overall server throughput.

Diagram showing the KV cache growing by one step per generated token; the trend is a straight linear line. Figure: with every token a new (K, V) pair is added per layer and per head; the occupied memory grows linearly with sequence length.

The practical upshot is central: serving long context is most often a memory problem, not a FLOPS problem. Every technique below attacks one of three levers:

  1. shrink what the cache stores (MQA, GQA) ;
  2. cut the memory transfers during computation (FlashAttention) ;
  3. manage the occupied memory space more cleverly (PagedAttention, sliding window, distribution).

Shrinking the cache at the source: MQA and GQA

The most direct lever is reducing the n_kv_heads factor. In classic multi-head attention (MHA), each query head (Q) has its own key/value head (K/V): 32 query heads means 32 K/V pairs to cache.

Multi-Query Attention (MQA, Shazeer, 2019) goes to the extreme:

  • all query heads share a single K/V pair ;
  • the cache shrinks by a factor equal to the number of heads (e.g. ÷32) ;
  • but that aggressive sharing costs quality and can make training unstable, because the model loses the diversity of "viewpoints."

Grouped-Query Attention (GQA, Ainslie et al., Google, 2023) is the compromise that became standard:

  • query heads are grouped into G groups; each group shares one K/V pair ;
  • GQA continuously interpolates between MHA (G = number of heads) and MQA (G = 1) ;
  • the cache is divided by heads / G while keeping most of the quality.

That is why GQA powers most recent large open models. Crucial point: it's an architectural choice, hence fixed at training time — you don't convert an MHA model to GQA after the fact without retraining or adapting it.

Comparison of MHA, GQA and MQA: number of shared key/value pairs and the resulting KV-cache size. Figure: from MHA (one K/V pair per head, cache = baseline) to MQA (a single shared pair, cache ÷4 here), GQA sits in between by sharing per group.

Note: MQA/GQA reduce only the size of the cache, independently of everything else. You can therefore combine them with FlashAttention (compute) and PagedAttention (management) — these techniques are orthogonal and stack in production.

Speeding up compute without truncating it: FlashAttention

MQA/GQA shrink the size of the cache; FlashAttention (Dao et al., 2022) attacks a different bottleneck: the memory transfers during the attention computation itself.

The key insight lies in a GPU's memory hierarchy:

  • on-chip SRAM is tiny but enormously fast in bandwidth ;
  • HBM (the GPU's main memory) is vast but about 10× slower ;
  • on a modern GPU, attention is therefore memory-bound, not compute-bound.

The problem with naive attention: it materializes the full n × n matrix in HBM, then re-reads it to apply the softmax. These round-trips saturate the slow transfers, and the GPU spends its time waiting on memory rather than computing.

FlashAttention is IO-aware: it never materializes the full matrix in HBM. Its recipe rests on three ingredients:

  • Tiling: split Q, K, V into blocks that fit in SRAM.
  • Online softmax: compute the softmax incrementally, rescaling partial results as it goes, without ever seeing a whole row at once.
  • Recomputation: on the backward pass, instead of storing the large matrix, recompute it on the fly — a bit more FLOPS in exchange for far less memory traffic.

The whole thing is fused into a single GPU kernel, which removes intermediate writes. The result: an exact computation (no approximation, unlike sparse attention), memory that is linear in sequence length, and measured speedups on the order of on GPT-2 (sequence length 1K) in the original paper. FlashAttention is now a foundational component of long-context training and inference, and its successors (v2, v3) have further improved parallelism and GPU occupancy.

Managing memory like an OS: PagedAttention and vLLM

Even with a shrunken cache, managing the memory space is a problem in itself. Naive servers reserve a contiguous block per request sized for the maximum possible length. Hence massive waste, of three kinds:

  • internal fragmentation: reserved but unused space (the request stops early) ;
  • external fragmentation: gaps between blocks, unusable because too small ;
  • excessive reservation: you lock down memory for a length you may never reach.

This waste sharply cuts the number of requests served in parallel, hence throughput.

PagedAttention (Kwon et al., SOSP 2023), at the heart of vLLM, imports an idea from operating systems: paging. The mechanism:

  • a request's KV cache is split into fixed-size blocks (pages) ;
  • these blocks are allocated dynamically and non-contiguously in physical memory ;
  • a mapping table links logical blocks to physical blocks, exactly like an OS's virtual memory.

Waste thus drops to near zero, which lets you greatly increase batch size.

Better still: paging unlocks memory sharing, in two forms:

  • Prefix caching: two requests sharing a common prefix (same system prompt, same document) can share the same physical pages instead of recomputing them.
  • Copy-on-write: during parallel sampling (several continuations of one prompt), common pages are shared and only divergences are duplicated.

The paper reports throughput 2 to 4× higher than state-of-the-art servers at equivalent latency, with gains all the more pronounced as sequences get longer, models bigger, and decoding algorithms more complex.

Extending the window: positional encoding and RoPE / PI / YaRN

Reducing memory is not enough: the model must also understand positions it never saw during training. Most large models use RoPE (Rotary Position Embedding):

  • position is encoded by rotating the Q and K vectors by an angle proportional to the position ;
  • because the angle depends on the position, the dot product between two tokens depends only on their relative distance ;
  • you thus get relative position "for free," with no extra trainable parameters.

The problem: trained on 4K tokens, a model extrapolates poorly beyond that. At unseen positions, the high-frequency components have rotated into angle ranges never learned, and quality collapses. Several methods fix this:

  • Position Interpolation (PI) (Chen et al., 2023): instead of extrapolating, compress positions into the known range (position p of a doubled context is treated as p / 2). A short fine-tune then adapts the model.
  • NTK-aware scaling: non-uniform interpolation by frequency, which better preserves the high frequencies (important for fine-grained position).
  • YaRN (Peng et al., 2023): refines the interpolation by frequency bands, with no extra compute or memory cost over earlier methods, and converges fastest during fine-tuning.

YaRN is one of the most widely used recipes to go, for example, from 4K to 128K tokens with a modest retraining budget. The general lesson: extending context is not only a memory matter — it's also a matter of interpreting positions.

Forgetting intelligently: sliding window, StreamingLLM and attention sinks

Another family of approaches accepts not keeping everything. Sliding-window attention caches only the last k tokens: memory becomes constant instead of linear. Alas, as soon as the text exceeds the window and the very first tokens are evicted from the cache, quality collapses abruptly.

StreamingLLM (Xiao et al., ICLR 2024) explains why and fixes it. The surprising observation:

  • models pour a huge share of their attention onto the very first tokens, even semantically meaningless ones ;
  • these tokens act as "attention sinks" where the softmax "dumps" its probability mass ;
  • evicting them destabilizes the whole attention distribution, hence the collapse.

The fix is minimal: keep a few initial tokens (typically 4) as sinks, plus a sliding window of recent tokens. With this trick, the model stays stable over sequences of millions of tokens, without any fine-tuning, with up to 22× speedup over the recompute baseline.

Beware not to confuse the promises, though: this is bounded-memory generation, not a true understanding of the whole past. Information evicted from the window really is lost — StreamingLLM solves streaming stability, not long-range recall.

Spreading the sequence out: Ring Attention

When a single sequence exceeds one GPU's memory, you distribute it. Ring Attention (Liu et al., 2023) proceeds as follows:

  • the sequence is split into blocks spread across several devices arranged in a ring ;
  • each device computes attention over its local block ;
  • then it passes its K/V blocks to the next neighbor in the ring ;
  • communication is fully overlapped by the block computation.

No approximation: attention stays exact. The addressable context grows with the number of devices, enabling sequences exceeding a hundred million tokens — at the cost of multi-GPU infrastructure and careful orchestration of transfers.

The "lost in the middle" trap

A context that is technically available is no guarantee that it's actually used. The "Lost in the Middle" study (Liu et al., 2023) revealed a U-shaped performance curve:

  • models retrieve information placed at the beginning or at the end of the context well ;
  • but their accuracy drops sharply when it is buried in the middle.

The canonical test is the needle in a haystack: a specific fact is hidden at various depths of a long document, and the model is asked to find it. The U-shaped curve shows that the mere presence of information in the window is not enough to guarantee its use.

The lesson: window size is an upper bound, not a usage guarantee. A few practical consequences:

  • put the essentials at the edges of the prompt, not the center ;
  • never assume a model "has read" all its context just because it fits ;
  • evaluate long context beyond the trivial needle, toward multi-document reasoning.

The newest models clearly improve on this front, but where information sits in the prompt remains a concrete quality lever.

Long context or RAG? The right trade-off

So should you cram everything into a huge window, or dynamically retrieve the right passages (RAG)? The trade-off is concrete.

Long context:

  • zero retrieval infrastructure to maintain ;
  • global reasoning over all the material at once, ideal when the relevant information is dense and intertwined ;
  • but every token costs memory and latency (the quadratic prefill) ;
  • "lost in the middle" lurks, and the bill climbs fast.

RAG:

  • you inject only the few relevant passages, keeping the prompt short and cheap ;
  • the knowledge base updates without retraining the model ;
  • but quality depends entirely on the retriever: a missed passage is a missed answer ;
  • cross-document reasoning is harder to orchestrate.

In practice the two are often combined: RAG to target the right documents, long context to reason over them comfortably. The sound rule is to not confuse capability with usage: a 1M-token window is only useful if the model actually exploits it and you're willing to pay its memory and latency.

A practical roadmap

To serve long context efficiently, these levers stack:

  1. Choose a GQA architecture (smaller cache by design).
  2. Use FlashAttention for prefill and training (memory-bound compute eliminated).
  3. Serve with a paged engine (vLLM / PagedAttention) for prefix caching and dense batching.
  4. Extend the window with YaRN/PI if the target context exceeds base training.
  5. For endless streaming, add attention sinks (StreamingLLM).
  6. For a single gigantic sequence, distribute via Ring Attention.
  7. Always measure real recall (needle, multi-document) before promising a capability.

In short

Long context hits two walls: O(n²) attention and a linear KV cache. We attack them on three complementary fronts.

  • At the source, MQA/GQA reduce the number of cached K/V heads.
  • At compute time, FlashAttention removes wasteful memory transfers via tiling and exact recomputation, while RoPE/PI/YaRN extend the positional window.
  • At management time, PagedAttention pages and shares the cache like an OS, StreamingLLM bounds memory via attention sinks, and Ring Attention distributes the sequence across multiple GPUs.

Above all of this hovers "lost in the middle": having a giant context is not the same as knowing how to use it — and it's often the reasoned choice between long context and RAG that makes the difference in production.