Why long context is costly: attention compute grows as O(n²) in the sequence length n, and the KV cache grows linearly in n. Related techniques: MQA/GQA, FlashAttention, PagedAttention, RoPE/YaRN, attention sinks.
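
To make the two growth rates concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 32 query heads, head dimension 128, 2-byte fp16 elements) and the function names are illustrative assumptions, not taken from any particular model or library. The sketch counts the entries of the full n × n score matrices and ignores causal masking, batching, and implementation details.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    Each layer stores two tensors (K and V), each holding
    n_tokens * n_kv_heads * head_dim elements, so the size is linear in n_tokens.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def attention_score_entries(n_tokens, n_layers, n_heads):
    """Number of query-key scores: one n x n matrix per head per layer.

    This count is quadratic in n_tokens (full matrix, no causal masking).
    """
    return n_layers * n_heads * n_tokens * n_tokens


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    layers, heads, head_dim = 32, 32, 128
    gib = 1024 ** 3

    for n in (4096, 8192, 16384):
        mha = kv_cache_bytes(n, layers, n_kv_heads=heads, head_dim=head_dim)
        # GQA with 8 KV heads shared across the 32 query heads (assumed ratio).
        gqa = kv_cache_bytes(n, layers, n_kv_heads=8, head_dim=head_dim)
        scores = attention_score_entries(n, layers, heads)
        print(f"n={n}: KV cache {mha / gib:.2f} GiB (MHA), "
              f"{gqa / gib:.2f} GiB (GQA-8), score entries {scores:.3e}")
```

Under these assumptions, doubling n doubles the KV cache but quadruples the number of attention scores, and cutting the KV heads from 32 to 8 (as in GQA) shrinks the cache by the same factor of four without changing the score count.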