Giving a large language model 4,000 tokens of context is easy; giving it a million without blowing up memory or collapsing latency is one of the deepest engineering problems in modern inference. The culprit is twofold: attention is **quadratic** in sequence length, and the **KV cache** — the very trick that makes generation fast — sees its memory **grow linearly** with every token. This article explains why long context is hard, what the KV cache actually stores, and the arsenal of ideas (MQA/GQA, FlashAttention, PagedAttention, RoPE/YaRN, attention sinks, ring attention) that make giant context windows possible — along with their very real limits.

## Why long context is hard: the O(n²) wall

The heart of a Transformer is self-attention: every token "looks at" every other one.
For a sequence of length `n`, you compute a score matrix of size `n × n` :
each token is compared against each other token.
Both the compute **and** memory cost of that matrix are therefore **O(n²)**.

Doubling the context quadruples the attention work.
At 1,000 tokens the matrix has a million entries.
At 100,000 tokens it has ten **billion**.
This is the fundamental wall of long context, and no engineering trick erases it entirely.

That wall has two distinct faces:

- **At training (and at *prefill*)** — the initial processing of the prompt — the cost is the raw quadratic computation of attention over the whole sequence at once.
- **During token-by-token generation**, it's different: we don't recompute the whole matrix at every step, thanks precisely to the KV cache. But that cache carries its own memory cost, linear and relentless.

Understanding this duality is essential: *prefill* is limited by **compute**,
while autoregressive decoding is limited by **memory** and bandwidth.
Effective optimizations target one **or** the other of these phases — rarely both at once.

## The KV cache: what it stores and why it swells

When a model generates text autoregressively, it produces one token at a time.
Without optimization, each new token would replay attention over the **entire** already-generated
prefix, recomputing the keys (K) and values (V) of all past tokens at every step.
This is a colossal waste: the work grows as O(n²) over the whole generation.

The **KV cache** removes that redundancy with a simple idea:

- each token's key and value are computed **once** ;
- they are **kept in memory** (the "cache") ;
- at the next step, only the K/V of the **new** token is computed, which then queries the existing cache.

This is what makes generation viable in practice.
The price is memory: for **every token**, **every layer**, and **every attention head**,
you must store a (key, value) pair. The cache size is therefore, in order of magnitude:

```text
KV_memory ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × length × batch × bytes_per_value
```

The decisive factor is **sequence_length**: with a fixed architecture, the cache memory
grows **linearly** with the context. Each extra token adds a constant slice.

Take a concrete order of magnitude. For a model with a few dozen layers, in 16-bit (2 bytes),
with modest K/V head dimensions, the KV cache of **a single** long sequence can reach
**tens of gigabytes** — often more than the active weights needed for one decoding step.
Multiply by the batch size and memory explodes. At that point it's the KV cache,
not compute, that caps:

- the **batch size** (how many requests run in parallel) ;
- the maximum **context length** you can actually serve ;
- and therefore the overall server **throughput**.

![Diagram showing the KV cache growing by one step per generated token; the trend is a straight linear line.](/articles/contexte-long-et-gestion-du-kv-cache/kv-cache-growth.svg)
*Figure: with every token a new (K, V) pair is added per layer and per head; the occupied memory grows linearly with sequence length.*

The practical upshot is central: serving long context is most often a **memory** problem,
not a FLOPS problem. Every technique below attacks one of three levers:

1. **shrink what the cache stores** (MQA, GQA) ;
2. **cut the memory transfers** during computation (FlashAttention) ;
3. **manage the occupied memory space** more cleverly (PagedAttention, sliding window, distribution).

## Shrinking the cache at the source: MQA and GQA

The most direct lever is reducing the **n_kv_heads** factor.
In classic multi-head attention (**MHA**), each query head (Q) has its own key/value head (K/V):
32 query heads means 32 K/V pairs to cache.

**Multi-Query Attention** (MQA, Shazeer, 2019) goes to the extreme:

- **all** query heads share **a single** K/V pair ;
- the cache shrinks by a factor equal to the number of heads (e.g. ÷32) ;
- but that aggressive sharing costs quality and can make training unstable, because the model loses the diversity of "viewpoints."

**Grouped-Query Attention** (GQA, Ainslie et al., Google, 2023) is the compromise that became
standard:

- query heads are grouped into `G` groups; each group shares **one** K/V pair ;
- GQA continuously interpolates between MHA (`G` = number of heads) and MQA (`G` = 1) ;
- the cache is divided by `heads / G` while keeping most of the quality.

That is why GQA powers most recent large open models.
Crucial point: it's an **architectural choice**, hence fixed at training time —
you don't convert an MHA model to GQA after the fact without retraining or adapting it.

![Comparison of MHA, GQA and MQA: number of shared key/value pairs and the resulting KV-cache size.](/articles/contexte-long-et-gestion-du-kv-cache/mha-mqa-gqa.svg)
*Figure: from MHA (one K/V pair per head, cache = baseline) to MQA (a single shared pair, cache ÷4 here), GQA sits in between by sharing per group.*

Note: MQA/GQA reduce only the **size** of the cache, independently of everything else.
You can therefore combine them with FlashAttention (compute) and PagedAttention (management) —
these techniques are **orthogonal** and stack in production.

## Speeding up compute without truncating it: FlashAttention

MQA/GQA shrink the **size** of the cache; **FlashAttention** (Dao et al., 2022) attacks
a different bottleneck: the **memory transfers** during the attention computation itself.

The key insight lies in a GPU's memory hierarchy:

- on-chip **SRAM** is tiny but enormously fast in bandwidth ;
- **HBM** (the GPU's main memory) is vast but about 10× slower ;
- on a modern GPU, attention is therefore **memory-bound**, not compute-bound.

The problem with naive attention: it **materializes** the full `n × n` matrix in HBM,
then re-reads it to apply the softmax. These round-trips saturate the slow transfers,
and the GPU spends its time waiting on memory rather than computing.

FlashAttention is **IO-aware**: it **never** materializes the full matrix in HBM.
Its recipe rests on three ingredients:

- **Tiling**: split Q, K, V into blocks that fit in SRAM.
- **Online softmax**: compute the softmax **incrementally**, rescaling partial results as it goes, without ever seeing a whole row at once.
- **Recomputation**: on the backward pass, instead of storing the large matrix, recompute it on the fly — a bit more FLOPS in exchange for far less memory traffic.

The whole thing is **fused** into a single GPU kernel, which removes intermediate writes.
The result: an **exact** computation (no approximation, unlike sparse attention),
memory that is **linear** in sequence length, and measured speedups on the order of
**3×** on GPT-2 (sequence length 1K) in the original paper. FlashAttention is now
a foundational component of long-context training and inference,
and its successors (v2, v3) have further improved parallelism and GPU occupancy.

## Managing memory like an OS: PagedAttention and vLLM

Even with a shrunken cache, **managing** the memory space is a problem in itself.
Naive servers reserve a **contiguous block** per request sized for the maximum possible length.
Hence massive waste, of three kinds:

- **internal fragmentation**: reserved but unused space (the request stops early) ;
- **external fragmentation**: gaps between blocks, unusable because too small ;
- excessive **reservation**: you lock down memory for a length you may never reach.

This waste sharply cuts the number of requests served in parallel, hence throughput.

**PagedAttention** (Kwon et al., SOSP 2023), at the heart of **vLLM**, imports an idea from
operating systems: **paging**. The mechanism:

- a request's KV cache is split into fixed-size **blocks (pages)** ;
- these blocks are allocated **dynamically** and **non-contiguously** in physical memory ;
- a **mapping table** links logical blocks to physical blocks, exactly like an OS's virtual memory.

Waste thus drops to **near zero**, which lets you greatly increase batch size.

Better still: paging unlocks memory **sharing**, in two forms:

- **Prefix caching**: two requests sharing a common prefix (same *system prompt*, same document) can **share the same physical pages** instead of recomputing them.
- **Copy-on-write**: during parallel sampling (several continuations of one prompt), common pages are shared and only divergences are duplicated.

The paper reports throughput **2 to 4× higher** than state-of-the-art servers at equivalent
latency, with gains all the more pronounced as sequences get longer, models bigger,
and decoding algorithms more complex.

## Extending the window: positional encoding and RoPE / PI / YaRN

Reducing memory is not enough: the model must also **understand** positions it never saw
during training. Most large models use **RoPE** (Rotary Position Embedding):

- position is encoded by **rotating** the Q and K vectors by an angle proportional to the position ;
- because the angle depends on the position, the dot product between two tokens depends only on their **relative distance** ;
- you thus get relative position "for free," with no extra trainable parameters.

The problem: trained on 4K tokens, a model **extrapolates poorly** beyond that.
At unseen positions, the high-frequency components have rotated into angle ranges never learned,
and quality collapses. Several methods fix this:

- **Position Interpolation (PI)** (Chen et al., 2023): instead of extrapolating, **compress** positions into the known range (position `p` of a doubled context is treated as `p / 2`). A short *fine-tune* then adapts the model.
- **NTK-aware scaling**: **non-uniform** interpolation by frequency, which better preserves the high frequencies (important for fine-grained position).
- **YaRN** (Peng et al., 2023): refines the interpolation by frequency bands, with **no extra** compute or memory cost over earlier methods, and **converges fastest** during fine-tuning.

YaRN is one of the most widely used recipes to go, for example, from 4K to 128K tokens
with a modest retraining budget. The general lesson: extending context is not only a memory
matter — it's also a matter of **interpreting positions**.

## Forgetting intelligently: sliding window, StreamingLLM and attention sinks

Another family of approaches accepts **not keeping everything**.
**Sliding-window attention** caches only the last `k` tokens: memory becomes **constant**
instead of linear. Alas, as soon as the text exceeds the window and the very first tokens
are evicted from the cache, quality collapses abruptly.

**StreamingLLM** (Xiao et al., ICLR 2024) explains why and fixes it.
The surprising observation:

- models pour a huge share of their attention onto the **very first tokens**, even semantically meaningless ones ;
- these tokens act as **"attention sinks"** where the softmax "dumps" its probability mass ;
- evicting them destabilizes the whole attention distribution, hence the collapse.

The fix is minimal: keep **a few initial tokens** (typically **4**) as sinks,
**plus** a sliding window of recent tokens. With this trick, the model stays stable
over sequences of **millions of tokens**, without any *fine-tuning*, with up to **22×**
speedup over the recompute baseline.

Beware not to confuse the promises, though: this is **bounded-memory** generation,
not a true understanding of the whole past. Information evicted from the window really is
**lost** — StreamingLLM solves *streaming* **stability**, not long-range recall.

## Spreading the sequence out: Ring Attention

When a single sequence exceeds one GPU's memory, you **distribute** it.
**Ring Attention** (Liu et al., 2023) proceeds as follows:

- the sequence is split into blocks spread across several devices arranged in a **ring** ;
- each device computes attention over its local block ;
- then it **passes** its K/V blocks to the next neighbor in the ring ;
- communication is **fully overlapped** by the block computation.

No approximation: attention stays exact. The addressable context grows with the
**number of devices**, enabling sequences exceeding a hundred million tokens —
at the cost of multi-GPU infrastructure and careful orchestration of transfers.

## The "lost in the middle" trap

A context that is **technically** available is no guarantee that it's **actually** used.
The "Lost in the Middle" study (Liu et al., 2023) revealed a **U-shaped** performance curve:

- models retrieve information placed **at the beginning** or **at the end** of the context well ;
- but their accuracy **drops sharply** when it is buried **in the middle**.

The canonical test is the **needle in a haystack**: a specific fact is hidden at various depths
of a long document, and the model is asked to find it. The U-shaped curve shows that the mere
presence of information in the window is not enough to guarantee its use.

The lesson: window size is an **upper** bound, not a usage guarantee.
A few practical consequences:

- put the **essentials at the edges** of the prompt, not the center ;
- never assume a model "has read" all its context just because it fits ;
- **evaluate** long context beyond the trivial *needle*, toward multi-document reasoning.

The newest models clearly improve on this front, but where information sits in the prompt
remains a concrete quality lever.

## Long context or RAG? The right trade-off

So should you cram everything into a huge window, or dynamically **retrieve** the right
passages (RAG)? The trade-off is concrete.

**Long context:**

- zero retrieval infrastructure to maintain ;
- global reasoning over all the material at once, ideal when the relevant information is dense and intertwined ;
- but every token costs memory **and** latency (the quadratic *prefill*) ;
- "lost in the middle" lurks, and the bill climbs fast.

**RAG:**

- you inject only the few relevant passages, keeping the prompt short and cheap ;
- the knowledge base updates **without retraining** the model ;
- but quality depends entirely on the **retriever**: a missed passage is a missed answer ;
- cross-document reasoning is harder to orchestrate.

In practice the two are often **combined**: RAG to target the right documents,
long context to reason over them comfortably. The sound rule is to
**not confuse capability with usage**: a 1M-token window is only useful if the model
actually exploits it and you're willing to pay its memory and latency.

## A practical roadmap

To serve long context efficiently, these levers stack:

1. Choose a **GQA architecture** (smaller cache by design).
2. Use **FlashAttention** for *prefill* and training (memory-bound compute eliminated).
3. Serve with a **paged** engine (vLLM / PagedAttention) for *prefix caching* and dense batching.
4. Extend the window with **YaRN/PI** if the target context exceeds base training.
5. For endless *streaming*, add **attention sinks** (StreamingLLM).
6. For a single gigantic sequence, **distribute** via Ring Attention.
7. Always **measure real recall** (needle, multi-document) before promising a capability.

## In short

Long context hits two walls: **O(n²)** attention and a **linear** KV cache.
We attack them on three complementary fronts.

- **At the source**, MQA/GQA reduce the number of cached K/V heads.
- **At compute time**, FlashAttention removes wasteful memory transfers via tiling and exact recomputation, while RoPE/PI/YaRN extend the positional window.
- **At management time**, PagedAttention pages and shares the cache like an OS, StreamingLLM bounds memory via attention sinks, and Ring Attention distributes the sequence across multiple GPUs.

Above all of this hovers "lost in the middle": having a giant context is not the same as
knowing how to use it — and it's often the reasoned choice between long context and RAG
that makes the difference in production.