Why do SSMs scale better than Transformers?

Attention costs O(n²) in time/memory because every token is compared to every other, and its KV cache grows at inference. An SSM propagates a fixed-size state step by step: linear O(n) training cost and constant-memory inference (no KV cache), hence up to ~5× higher throughput on long sequences.

What does Mamba's "selectivity" add over S4?

S4 is time-invariant (A, B, C, Δ fixed), so it treats every token identically and cannot filter by content. Mamba makes B, C, and the step Δ depend on the input token: the model can memorize a key token or forget a distractor, unlocking content-based reasoning — at the cost of losing the convolutional view, offset by a hardware-aware parallel scan.

Where does Mamba still lag behind Transformers?

On in-context recall and exact copying: a fixed-size compressed state is a bottleneck for verbatim retrieval of arbitrary earlier information, where attention keeps everything in its KV cache. Hence hybrids like Jamba, which slot a few attention layers among the Mamba layers to recover that precise recall.

What does Mamba-2's SSD duality change in practice?

SSD shows that an SSM (scalar-times-identity state matrix) equals a 1-semiseparable masked causal attention: the same transformation admits an O(n) recurrence or an O(n²) attention. In practice the algorithm splits the sequence into chunks (64 to 256 tokens), computes the intra-chunk part as quadratic attention on tensor cores, and passes a fixed-size state from chunk to chunk at a cost that does not depend on sequence length. The result is a core 2 to 8× faster than Mamba-1 and much larger states at comparable cost.

State Space Models and Mamba (beyond Transformers)

Transformers have ruled AI since 2017, but their attention mechanism pays a price: its cost grows with the square of the sequence length. Doubling the context quadruples the compute, and the memory cache swells linearly with every generated token. State Space Models (SSMs) — and their most accomplished incarnation, Mamba — offer another path: linear cost in length, constant-memory inference, and, for the first time, performance that rivals attention on language. This article explains where SSMs come from, how S4 unlocked long-range memory, why Mamba makes the state selective, and where these architectures win — or lose — against Transformers.

Why look for an alternative to attention

Self-attention compares every token to every other one. For a sequence of length n, it builds an n × n score matrix: training time and memory grow as O(n²). That is exactly what makes attention so expressive (each position "sees" every other directly), but also what makes it costly on long sequences: full documents, genomes, raw audio, long agent sessions.

At autoregressive inference time the problem takes a different shape: to avoid recomputing the past, you keep a KV cache (keys and values) that grows with every token. Memory usage therefore grows linearly with context, and each new token must wait on reading an ever-larger cache. That is the tax on expressivity.

A concrete order of magnitude: for a 32,768-token window, the attention matrix alone holds more than a billion entries per head per layer. With a few dozen heads and layers, the training memory peak explodes, and it is this quadratic wall — not quality — that bounds the context length in practice.
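The order of magnitude above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch; the 2-byte entry size (fp16) is an assumption made for illustration.

```python
# Size of one full attention score matrix for a 32,768-token window.
n = 32_768
entries = n * n                      # one score per (query, key) pair
bytes_fp16 = entries * 2             # assumption: 2 bytes per entry (fp16)

print(f"{entries:,} entries per head per layer")
print(f"{bytes_fp16 / 2**30:.0f} GiB per head per layer if stored in fp16")
```

Doubling n to 65,536 multiplies both numbers by four, which is the quadratic wall in its simplest form.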

Figure: quadratic attention O(n²) versus linear SSM recurrence O(n). Attention builds an n×n matrix (quadratic cost, growing KV cache), while an SSM propagates a fixed-size state from one step to the next (linear cost, constant memory).

Many subquadratic alternatives have been proposed — linear attention, gated convolutions, modernized RNNs, SSMs. Gu & Dao's verdict is blunt: until Mamba, none matched attention on important modalities such as language. Understanding why requires going back to the root: classical state space models.

What we want from a sequence block

Before diving in, let us fix the requirements. A good sequence-mixing block should: (1) train in parallel on GPUs to absorb terabytes of text; (2) infer fast, ideally at bounded memory cost; (3) route information by content, not just by position. Attention ticks (1) and (3) but misses (2); classical RNNs tick (2) but miss (1). SSMs aim for all three at once — that is the whole point of the story that follows.

Classical state space models

The SSM is decades old, born of control theory and signal processing (Kalman filters, control systems). In its continuous form, it relates an input signal u(t) to an output y(t) through a latent state x(t) of dimension N:

x'(t) = A x(t) + B u(t)      (state dynamics)
y(t)  = C x(t) + D u(t)      (output equation)

Four matrices govern the system: A describes how the state evolves on its own (the "memory"), B how the input is injected, C how the output is read from the state, and D a direct input→output connection (often seen as a residual skip). The central intuition: the state x(t) is a compressed summary of the signal's entire past, of fixed size N, regardless of sequence length.

To ground the idea, take A scalar and negative, say A = −1, with no input: the state follows x'(t) = −x(t), so x(t) = x(0)·e^{−t} — an exponential decay. This is a leaky integrator: the state forgets its past at a rate set by A. Stack N such filters with different rates and you get a bank of memories at varied time scales — the embryo of what HiPPO will formalize.

Discretization: from continuous to discrete

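Language and most data arrive in discrete steps, not continuous time. So we discretize the system with a step size Δ (delta), typically via a zero-order hold. This turns continuous A and B into discrete versions Ā and B̄ (from here on, following the deep-learning convention, the state is written hₜ and the input xₜ rather than x(t) and u(t)):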

Ā = exp(Δ A)
B̄ = (Δ A)⁻¹ (exp(Δ A) − I) · Δ B   ≈   Δ B
hₜ = Ā hₜ₋₁ + B̄ xₜ
yₜ = C hₜ + D xₜ

The step Δ acts as a time scale: a large Δ makes the state "forget" (favoring the current input), a small Δ makes it persist (long memory). Remember this parameter: it is the one Mamba will make content-dependent.

Intuition for the formula: Ā = exp(Δ A) is simply the exact solution of the linear ODE over an interval of duration Δ. On our example A = −1, we get Ā = e^{−Δ}: with Δ = 0.1, Ā ≈ 0.90 (the state keeps 90% of its value at each step, long memory); with Δ = 3, Ā ≈ 0.05 (it forgets almost everything, short memory). So this single number sets the model's temporal reach.
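The numbers in this example are easy to reproduce. The sketch below applies the zero-order-hold formulas to the scalar case A = −1; the function name `zoh_scalar` and the choice B = 1 are illustrative assumptions, not part of any library.

```python
import math

def zoh_scalar(a, b, delta):
    """Zero-order-hold discretization of x' = a*x + b*u for a scalar a != 0."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b     # scalar form of (ΔA)^-1 (exp(ΔA) - I) ΔB
    return a_bar, b_bar

for delta in (0.1, 3.0):
    a_bar, b_bar = zoh_scalar(a=-1.0, b=1.0, delta=delta)
    print(f"delta={delta}: A_bar={a_bar:.3f}, B_bar={b_bar:.3f}, delta*B={delta:.3f}")
```

Comparing the last two printed columns also shows that the shortcut B̄ ≈ Δ B is only accurate when Δ is small.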

Two faces: recurrent and convolutional

The strength of time-invariant linear SSMs (LTI, i.e. A, B, C, Δ fixed) is that they admit two mathematically equivalent representations:

Recurrent view: apply hₜ = Ā hₜ₋₁ + B̄ xₜ step by step. Cost O(n), constant memory (you keep only the current state). Ideal at inference — it is an RNN.
Convolutional view: because the parameters never change over time, unrolling the recurrence yields a convolution by a precomputed global kernel K. GPU-parallelizable, ideal for training — like a CNN.

Let us unroll explicitly to see where the kernel comes from. Starting from h₋₁ = 0:

h₀ = B̄ x₀
h₁ = Ā B̄ x₀ + B̄ x₁
h₂ = Ā² B̄ x₀ + Ā B̄ x₁ + B̄ x₂
yₜ = Σ_{k=0..t}  (C Āᵏ B̄) · xₜ₋ₖ        ← convolution by K
K  = (C B̄, C Ā B̄, C Ā² B̄, …, C Ā^{n−1} B̄)

Each output is thus a convolution of the input by the kernel K, whose coefficients are the powers C Āᵏ B̄. You train in convolution mode (parallel, fast, often via FFT) and infer in recurrent mode (linear, constant memory). The best of both worlds — provided the parameters stay fixed over time. That very condition is what Mamba will break.
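The equivalence of the two views can be checked numerically. The following is a minimal NumPy sketch on a single input channel with D = 0; the diagonal Ā and the random values are assumptions chosen only to keep the example small and stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 16, 4                                  # sequence length, state size
A_bar = np.diag(rng.uniform(0.5, 0.95, N))    # stable discrete state matrix (assumed diagonal)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=n)                        # single input channel

# Recurrent view: one step at a time, only the current state is kept.
h = np.zeros(N)
y_rec = np.empty(n)
for t in range(n):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = C @ h

# Convolutional view: precompute K_k = C A_bar^k B_bar, then convolve causally.
K = np.array([C @ np.linalg.matrix_power(A_bar, k) @ B_bar for k in range(n)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(n)])

print(np.allclose(y_rec, y_conv))
```

The explicit double loop for the convolution is deliberately naive; it mirrors the sum in the unrolled formula rather than an efficient implementation.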

How to read the state

For intuition, think of the state hₜ as a compressed register that accumulates the history seen so far.

At each step, two forces compete: Ā hₜ₋₁ keeps the old content alive (the memory), and B̄ xₜ injects the new information.

The discretized matrix Ā sets the forgetting rate: eigenvalues with magnitude near 1 make memory last; eigenvalues near 0 erase it fast.

That is why initializing A is no small matter — a bad A dooms the state to forget almost everything, however good the rest of the network is.

S4: structuring the state for long-range memory

Naively, computing an SSM's convolution kernel is numerically unstable and expensive, and a random A forgets the past almost immediately. S4 (Structured State Space, Gu, Goel & Ré, 2021) solved both problems.

First, HiPPO initialization. HiPPO (High-order Polynomial Projection Operators) provides a special A matrix that makes the state memorize an approximation of the entire history of the signal, by projecting it onto a basis of orthogonal polynomials (typically Legendre polynomials). Concretely, the state becomes a set of coefficients that best reconstruct the past function — a compressed but structured memory. The ablations are unequivocal: replacing the HiPPO matrix with a random one collapses performance. Long-range memory is not an accident; it is wired into A.
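To give a feel for what such a matrix looks like, here is a sketch that builds one, assuming the scaled-Legendre (LegS) variant commonly associated with S4. The function name `hippo_legs` is hypothetical, and the entry formula is stated in the comments so it can be checked against the original papers.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix (lower triangular, N x N), as assumed here."""
    n = np.arange(N)
    scale = np.sqrt(2 * n + 1)
    A = -np.tril(np.outer(scale, scale), k=-1)   # -(2n+1)^(1/2) (2k+1)^(1/2) for n > k
    A -= np.diag(n + 1.0)                        # -(n+1) on the diagonal
    return A

A = hippo_legs(4)
print(A.round(3))
print("eigenvalues:", np.linalg.eigvals(A))      # the diagonal entries, all negative
```

Because the matrix is lower triangular, its eigenvalues are its diagonal entries, so the continuous system is stable by construction.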

Second, a structured parameterization (diagonal plus low rank, called DPLR) makes kernel computation stable and efficient. Rather than raising a dense N × N matrix to successive powers (costly and unstable), S4 exploits this structure to compute the kernel by evaluating a rational function and applying an FFT, bringing training down to a near-linear cost in length. It is this algebraic trick that unlocked large-scale training.
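The FFT half of that recipe can be illustrated on its own. The sketch below does not implement DPLR; it assumes a purely diagonal Ā (in the spirit of the simpler diagonal variants) so that the kernel has a closed form, then applies it with an FFT-based causal convolution.

```python
import numpy as np

def causal_conv_fft(K, x):
    """Causal convolution y_t = sum_k K[k] * x[t-k] via FFT, in O(n log n)."""
    n = len(x)
    size = 2 * n                                  # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(K, size) * np.fft.rfft(x, size), size)
    return y[:n]

rng = np.random.default_rng(0)
n, N = 64, 8
lam = rng.uniform(0.5, 0.99, N)                   # assumption: diagonal A_bar, not DPLR
B_bar, C = rng.normal(size=N), rng.normal(size=N)
x = rng.normal(size=n)

# With a diagonal A_bar, K_k = sum_i C_i * lam_i^k * B_i (a Vandermonde product).
K = (C * B_bar) @ (lam[:, None] ** np.arange(n))
y_fft = causal_conv_fft(K, x)
y_direct = np.convolve(K, x)[:n]                  # O(n^2) reference
print(np.allclose(y_fft, y_direct))
```

The zero-padding to length 2n matters: without it the FFT computes a circular convolution and late inputs would leak into early outputs.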

The result made waves on the Long Range Arena (LRA), a benchmark suite designed to stress long-range dependency. S4 set the state of the art on every task and was the first model to solve Path-X — a task over sequences of 16,384 elements where Transformers simply failed. Later variants such as S5 simplified the architecture (a single multi-input SSM, parallel scan) while keeping the performance, and DSS / S4D showed that a purely diagonal parameterization often suffices, simplifying the implementation further.

This success also clarified a conceptual point: a deep SSM is neither really an RNN nor really a CNN, but an object that can borrow both algorithms as needed. It is this flexibility — recurrence for inference, convolution for training — that made SSMs a credible alternative, where classical RNNs (LSTM, GRU) hit walls of sequential training and vanishing gradients.

The hidden limitation: time invariance

The whole S4 edifice rests on the LTI property: A, B, C, Δ are the same for every token. That is what allows the convolutional view and therefore efficient training. But it is also its Achilles' heel.

An LTI system processes every input identically, regardless of content. It cannot decide "this token is important, I'll memorize it" nor "this one is noise, I'll ignore it." Yet that is exactly what attention does naturally: select. Gu & Dao illustrate this with two toy tasks — Selective Copying (copy while ignoring distractors) and Induction Heads (recover a pattern seen earlier) — that LTI SSMs fail to solve, lacking the ability to filter by content. The problem is not memory capacity; it is the absence of selection.

The diagnosis is subtle: a fixed convolution kernel applies the same filter wherever the relevant information sits. If the token to copy can appear at any position, no static filter can target it reliably. The dynamics would need to react to the content — exactly what an LTI convolution forbids by construction.

Mamba: making the state selective

Mamba's idea (Gu & Dao, 2023) fits in one sentence: let the SSM parameters depend on the input. Concretely, B, C, and above all the step Δ become functions of the current token xₜ, computed by linear projections. The system becomes time-varying: at each position it can modulate its dynamics.

This yields the selection mechanism that was missing. Through Δ(xₜ), the model can, depending on content: let the state persist (memorize a key token) or reset it (forget a distractor). That is content-based reasoning, exactly what attention provides — but here without an n × n matrix.

The analogy is illuminating: a large Δ pushes Ā = exp(Δ A) toward 0 and resets the state on that token ("input gate" wide open, the past is overwritten); a small Δ keeps Ā ≈ 1 and lets the state pass through almost intact (the current input is ignored). By making Δ token-dependent, Mamba recovers the gates of LSTMs (forget/input), but in an SSM framework that stays linear in the state and therefore parallelizable.
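The gate-like behavior of Δ is visible on a single scalar step. This is a toy sketch with A = −1 and B = 1 under zero-order hold; the function name `selective_step` and the numeric values are illustrative assumptions.

```python
import math

def selective_step(h_prev, x_t, delta, a=-1.0):
    """One scalar SSM step with zero-order hold and B = 1 (illustrative values)."""
    a_bar = math.exp(delta * a)            # fraction of the old state that is kept
    b_bar = (a_bar - 1.0) / a              # weight on the new input (ZOH, B = 1)
    return a_bar * h_prev + b_bar * x_t

h_prev, x_t = 5.0, -2.0
print(selective_step(h_prev, x_t, delta=0.01))   # stays close to h_prev (input mostly ignored)
print(selective_step(h_prev, x_t, delta=10.0))   # jumps close to x_t (state reset)
```

With A = −1 the two weights sum to 1, so the update is a convex blend of old state and new input, which is the same shape as a gated RNN update.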

Figure: Mamba block, with its selective SSM and hardware-aware parallel scan. Projections produce input-dependent B, C and Δ (selectivity); the parallel scan applies the recurrence while keeping parameters in fast SRAM; at the bottom, S4 (LTI, non-selective) is compared with Mamba (selective).

The price of selectivity — and the hardware workaround

Making parameters time-dependent has an immediate cost: you lose the convolutional view. The global kernel K no longer exists, since the dynamics change at every step. You fall back to the recurrence — sequential by nature, hence a priori hostile to GPUs.

Mamba's engineering contribution is a hardware-aware parallel algorithm. Composing the steps of a linear recurrence is an associative operation, which permits a parallel scan (Blelloch-style) of depth O(log n) instead of a sequential loop of n steps. Crucially, the implementation exploits the GPU memory hierarchy: the discretized parameters and the state are materialized only in fast SRAM (not slow HBM), and the intermediate states are never fully written to global memory; they are recomputed during backpropagation, in the same spirit as FlashAttention's kernel fusion. The Mamba block also drops the separate attention and MLP blocks: it is a homogeneous, stackable architecture.

Why SRAM changes everything: on a GPU, reading/writing HBM costs an order of magnitude more than SRAM. A naive recurrence that materialized the (length × N) state in HBM would be dominated by that memory traffic (memory-bound). By keeping the work in SRAM and recomputing the state only when needed, Mamba turns a memory-bound operation into a compute-bound one — exactly FlashAttention's lesson applied to the scan.
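The associativity argument can be made concrete in NumPy. The sketch below is not Mamba's fused CUDA kernel and not the work-efficient Blelloch scan: it uses the simpler recursive-doubling (Hillis-Steele-style) scheme, which needs about log2(n) vectorized rounds, and only illustrates why the recurrence parallelizes. The names `combine` and `scan_doubling` are hypothetical.

```python
import numpy as np

def combine(left, right):
    """Compose two steps of the form h -> a*h + b; 'left' is applied first."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def scan_doubling(a, b):
    """Inclusive scan by recursive doubling: about log2(n) vectorized rounds."""
    a, b = a.copy(), b.copy()
    n, shift = len(a), 1
    while shift < n:
        # Position t absorbs the partial result that ends at position t - shift.
        a_new, b_new = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = a_new, b_new
        shift *= 2
    return b                                   # b[t] equals h_t when h_{-1} = 0

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 37)                  # per-step decay (plays the role of A_bar)
b = rng.normal(size=37)                        # per-step input term (plays the role of B_bar * x)

h, h_seq = 0.0, []
for t in range(len(a)):                        # sequential reference
    h = a[t] * h + b[t]
    h_seq.append(h)

print(np.allclose(scan_doubling(a, b), h_seq))
```

Each round is a handful of elementwise array operations, so all positions advance together; the memory-hierarchy optimizations described above are a separate concern that this sketch does not address.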

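# Selective SSM (simplified, runnable NumPy sketch): B, C, Δ depend on the input.
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)              # numerically stable log(1 + e^z)

def selective_ssm(x, A, D, W_delta, W_B, W_C):
    # x: (length, dim); A: (dim, N) with negative entries; D: (dim,)
    # W_delta: (dim, dim), W_B and W_C: (dim, N) are illustrative projection weights
    delta = softplus(x @ W_delta)            # token-dependent step Δ, (length, dim)
    B = x @ W_B                              # input -> state, depends on x, (length, N)
    C = x @ W_C                              # state -> output, depends on x, (length, N)
    A_bar = np.exp(delta[..., None] * A)     # discretization: fixed A, modulated by Δ
    B_bar = delta[..., None] * B[:, None, :] # simplified B̄ ≈ Δ·B, (length, dim, N)

    h = np.zeros(A.shape)                    # initial state (fixed size: dim × N)
    ys = []
    for t in range(len(x)):                  # in practice: parallel scan, not a loop
        h = A_bar[t] * h + B_bar[t] * x[t][:, None]   # hₜ = Ā·hₜ₋₁ + B̄·xₜ
        ys.append(h @ C[t])                            # yₜ = C·hₜ, shape (dim,)
    return np.stack(ys) + D * x              # + residual skip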

Note that A stays a fixed learned parameter; it is Δ (and therefore Ā = exp(Δ A)), B, and C that vary with the token. Selectivity lives entirely in this dependence on the input, computed by simple, cheap linear projections.

Mamba's results

The reported figures are striking: an inference throughput 5× higher than Transformers, linear scaling in length, and gains sustained up to sequences of one million elements. On language modeling, Mamba-3B outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and in downstream evaluation. And the architecture is general-purpose: it also excels at audio and genomics. Mamba also solves the toy tasks that defeated S4 (Selective Copying, Induction Heads), confirming that selectivity, not mere memory capacity, was the missing piece.

The complexity table

To anchor the intuition, here is the heart of the trade-off, summarized by dominant order:

Attention: training O(n²·d), training memory O(n²), per-token inference O(n·d) with an O(n) KV cache.
LTI SSM (S4): training O(n log n · d) (convolution mode via FFT), per-token inference O(d·N) with a constant-size O(d·N) state.
Selective SSM (Mamba): training O(n·d·N) via parallel scan, per-token inference O(d·N), constant-size O(d·N) state, but without the convolutional view, hence the hardware-aware algorithm.

The entry that changes everything is inference memory: with a Transformer, it grows with context (the KV cache); with an SSM, it stays flat. Over a 100K-token window, that difference is measured in gigabytes of VRAM and in per-token latency. And since an SSM's per-token latency does not depend on how much has already been generated, long generation keeps a constant throughput, whereas a Transformer slows down as its cache swells.
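A rough calculation shows the size of that gap. The model shapes below are hypothetical, chosen only to show how each quantity scales; the helper names are illustrative, values are assumed to be stored in 2 bytes, and small auxiliary buffers are ignored.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values cached for every past token, in every layer."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, state_dim, bytes_per_value=2):
    """One (d_inner x state_dim) state per layer, whatever the context length."""
    return n_layers * d_inner * state_dim * bytes_per_value

# Hypothetical model shapes, chosen only to show the scaling.
for n in (1_000, 100_000):
    kv = kv_cache_bytes(n, n_layers=32, n_kv_heads=32, head_dim=128)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, state_dim=16)
    print(f"{n:>7} tokens: KV cache {kv / 2**30:6.2f} GiB, SSM state {ssm / 2**20:.1f} MiB")
```

The SSM figure does not take the token count as an argument at all, which is the whole point: only the KV cache term changes between the two printed lines.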

Mamba-2 and the SSM ↔ attention duality

In 2024, Dao & Gu drew an unexpected theoretical bridge with Structured State Space Duality (SSD). The idea: an SSM with a "scalar-times-identity" state matrix is equivalent to a form of masked causal attention (a so-called 1-semiseparable mask). In other words, the same sequence transformation admits two algorithms: an O(n) recurrence or an O(n²) attention — two sides of one coin.

This duality is not just elegant; it is practical. It lets you reformulate the SSM computation as structured matrix multiplications, thereby exploiting the GPU's highly optimized matrix units (the tensor cores) rather than a bespoke scan. The core of Mamba-2 is thus 2 to 8× faster than Mamba-1's, while staying competitive with Transformers. SSD partly reconciles the two worlds: attention and SSMs are not foreign rivals but special cases of one family.

Concretely, the SSD algorithm splits the sequence into chunks of size Q (typically 64 to 256 tokens). Within a chunk, it computes the transformation as a quadratic attention over Q × Q (fast thanks to tensor cores); between chunks, it passes a fixed-size recurrent state from one chunk to the next, at a cost per chunk boundary that does not depend on the sequence length. This yields the best of both forms: intra-chunk matrix parallelization and inter-chunk linearity. Mamba-2 also allows much larger state dimensions (on the order of 8× those of Mamba-1) at comparable cost, raising the model's memory capacity.
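The chunked scheme can be sketched in NumPy for a single channel with a scalar decay per step. This is an illustration of the idea under simplifying assumptions, not the Mamba-2 kernel; the function names are hypothetical, and a production implementation would mask before exponentiating to avoid overflow.

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """Reference: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_chunked(a, B, C, x, Q):
    """Same map: quadratic masked form inside chunks, state passing between them."""
    n, N = B.shape
    h = np.zeros(N)                                  # state entering the current chunk
    y = np.empty(n)
    for start in range(0, n, Q):
        sl = slice(start, min(start + Q, n))
        cum = np.cumsum(np.log(a[sl]))               # log of a_0 * ... * a_i within the chunk
        # L[i, j] = a_{j+1} * ... * a_i for j <= i, zero above the diagonal.
        L = np.tril(np.exp(cum[:, None] - cum[None, :]))
        M = L * (C[sl] @ B[sl].T)                    # masked "attention" matrix, at most Q x Q
        y[sl] = M @ x[sl] + np.exp(cum) * (C[sl] @ h)
        # Carry the state to the end of the chunk for the next one.
        h = np.exp(cum[-1]) * h + (np.exp(cum[-1] - cum) * x[sl]) @ B[sl]
    return y

rng = np.random.default_rng(0)
n, N, Q = 37, 4, 8
a = rng.uniform(0.5, 1.0, n)
B, C = rng.normal(size=(n, N)), rng.normal(size=(n, N))
x = rng.normal(size=n)
print(np.allclose(ssm_recurrent(a, B, C, x), ssm_chunked(a, B, C, x, Q)))
```

Setting Q = n recovers the fully quadratic attention form, and Q = 1 degenerates into the plain recurrence; intermediate values trade one against the other.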

Hybrids: the best of both worlds

If attention excels at precise recall and SSMs at efficiency, why choose? Hybrid architectures interleave the two. The most notable is Jamba (AI21 Labs, 2024), the first production-grade Mamba model. Its recipe:

alternating blocks mixing Mamba layers and attention layers, in a ratio of roughly one attention layer for every eight layers total (the famous 1:7 attention-to-Mamba ratio);
Mixture-of-Experts (MoE) applied every other layer, to inflate total capacity while keeping few active parameters per token (hence a controlled inference cost);
a 256K-token context, where a pure attention stack would become prohibitive.
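The recipe above can be sketched as a layer schedule. The depth of 32 layers, the position of the attention layer inside each group of eight, and which layers carry MoE are assumptions made for illustration, not the published configuration.

```python
def hybrid_layout(n_layers=32, attn_every=8, moe_every=2, attn_offset=4):
    """Return a list of (mixer, feed_forward) labels, one pair per layer."""
    layers = []
    for i in range(n_layers):
        # Assumption: one attention layer per group, at a fixed offset in the group.
        mixer = "attention" if i % attn_every == attn_offset else "mamba"
        # Assumption: MoE replaces the dense MLP on every other layer.
        ffn = "moe" if i % moe_every == 1 else "mlp"
        layers.append((mixer, ffn))
    return layers

layout = hybrid_layout()
n_attn = sum(mixer == "attention" for mixer, _ in layout)
n_moe = sum(ffn == "moe" for _, ffn in layout)
print(f"{n_attn} attention / {len(layout) - n_attn} Mamba layers, {n_moe} MoE layers")
```

Changing `attn_every` is the "dosage" knob: larger values push the model toward a pure SSM, smaller ones toward a Transformer.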

The original Jamba activates only 12B parameters out of 52B total thanks to MoE, and fits up to 140K tokens on a single 80 GB GPU — a memory profile unthinkable for a dense Transformer of comparable size. Jamba 1.5 then comes in two sizes: Mini (12B active parameters) and Large (94B active), both with a 256K effective context. The few attention layers restore the precise recall that pure Mamba struggles to provide, while the SSM backbone delivers throughput and constant memory over long contexts. The trade-off is now a proportion tuning, not a binary choice.

Other families explore the same path with different dosages. Some replace almost all attention layers with SSM layers, keeping only a handful well placed; others distill a pretrained Transformer into an SSM backbone to recover quadratic knowledge at subquadratic cost. The common principle: use attention sparingly, where exact recall matters, and hand the rest — the bulk of the flow — to a linear mechanism.

Where SSMs shine in practice

Beyond language, constant memory and linear cost open domains poorly served by attention:

Genomics: DNA sequences of hundreds of thousands of bases, where a Transformer's window would be prohibitive.
Raw audio and waveforms: very long, finely sampled signals.
Time series and sensors: continuous streams where a fixed-size recurrent state is natural.
Vision and sequential image models, where Mamba variants scan patches.

The common thread: as soon as length becomes the limiting factor and verbatim recall is not critical, the SSM cost profile becomes decisive.

Strengths and weaknesses versus Transformers

Let us summarize the trade-off honestly.

Strengths of SSMs / Mamba:

Linear scaling in sequence length at training time (vs quadratic).
Fast, constant-memory inference: the state is fixed-size, there is no KV cache that swells — hence the ~5× higher throughput.
Very long contexts within reach (documents, genomes, audio, time series).
A homogeneous, stackable architecture (no attention or MLP required in the base block).

Weaknesses versus Transformers:

In-context recall and copying: a fixed-size compressed state is, by construction, a bottleneck for exact retrieval of arbitrary information seen earlier (verbatim copying, finding a specific fact in a long window). Attention, which keeps everything in its KV cache, remains superior here — hence the appeal of hybrids.
Ecosystem maturity: tooling, tuning, recipes, and hardware are massively optimized for attention; SSMs are catching up but start far behind.
Interpretability: the dynamics of a recurrent state are less legible than attention maps.

A clean way to phrase the recall weakness: attention has a random-access memory (the entire past stays addressable through the KV cache), whereas an SSM has only a compressed, overwriting memory of size N. Past that capacity, retrieving an arbitrary token verbatim becomes a gamble — hence the persistence of a few attention heads in hybrids.

Pitfalls and practical tips

A few guardrails for anyone who wants to experiment with these architectures.

Don't confuse state size with context window. An SSM sees the whole sequence, but what it retains is bounded by the state dimension N; increasing N raises memory capacity, at a cost.
Exact recall remains a weak point. For a use case like "find this precise string across 50 pages," pure Mamba often disappoints; prefer a hybrid or keep a few attention heads.
The step Δ is sensitive. Poorly initialized (via the softplus and its bias), it can saturate the state or make it vanish; follow published initialization recipes rather than improvising.
Measure on YOUR real length. SSM gains show up mostly on long contexts; on short sequences the edge over a well-optimized Transformer can be marginal.
The ecosystem moves fast. CUDA kernels, variants (Mamba-2, hybrids), and inference support change month to month; check the state of support before building in production.
Pick Mamba-2 over Mamba-1 by default. For a new project, the SSD duality gives a faster kernel and larger states at comparable cost; reserve Mamba-1 for cases where an existing dependency forces it.
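For the tip on Δ above, one common pattern is to choose the desired initial step sizes first and then solve for the bias that produces them through the softplus. The sketch below assumes a log-uniform target range of [0.001, 0.1]; that range and the helper names are assumptions for illustration, so check them against the recipe you actually follow.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)               # stable log(1 + e^z)

def inverse_softplus(y):
    return y + np.log(-np.expm1(-y))          # solves softplus(z) = y for y > 0

rng = np.random.default_rng(0)
dt_min, dt_max, dim = 1e-3, 1e-1, 8           # assumed target range for Δ at initialization
# Sample Δ log-uniformly in [dt_min, dt_max], then set the bias so softplus(bias) = Δ.
dt = np.exp(rng.uniform(np.log(dt_min), np.log(dt_max), dim))
bias = inverse_softplus(dt)

print(np.allclose(softplus(bias), dt))
```

Initializing the bias this way keeps Δ away from both extremes at the start of training: neither so large that every token resets the state, nor so small that inputs never register.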

Current status and outlook

Selective SSMs went from lab to production in two years. Mamba and Mamba-2 are serious backbones; Jamba and other hybrids show you can combine SSM efficiency with attention precision in a single model, up to hundreds of thousands of context tokens. The SSD duality has, along the way, blurred the theoretical boundary between the two families.

The lesson is not "SSMs replace Transformers" but "the sequence block is no longer a monopoly." Depending on the task — throughput and long context, or exact recall — you dose attention and SSM. For a site that serves the same content to humans and agents, this is good news: future AI readers will swallow longer contexts, faster, at more predictable cost. Attention remains king of recall; SSMs now force it to share the throne.

In summary

State space models reframe sequence modeling around a fixed-size latent state rather than a comparison of all tokens against one another.

S4 showed that with the right structure (HiPPO, diagonal-plus-low-rank parameterization) you can capture dependencies over tens of thousands of steps.

Mamba added selectivity — input-dependent parameters — and a hardware-aware parallel scan that makes the thing practical on GPUs.

Mamba-2 linked SSMs and attention through the SSD duality, and hybrids like Jamba proved you can combine the two at production scale.

The net result: a spectrum of architectures where you pick your point on the efficiency ↔ exact recall axis, instead of defaulting to a Transformer.