A Mixture-of-Experts (MoE) model answers a simple question: can we multiply a network's parameter count without multiplying its per-token compute by the same factor? The answer is yes, and it rests on **sparsity**: instead of one large dense network activated on every pass, we hold several sub-networks — the **experts** — of which a **router** activates only a handful for each token. It is this decoupling of *total* parameters from *active* parameters that broke the trillion-parameter barrier. This article explains the mechanism, its pitfalls (load balancing, expert collapse, token dropping), and how it scales.

## The starting point: the dense FFN layer

In a standard Transformer, each block alternates an attention layer and a **feed-forward network** (FFN), typically two linear projections separated by a non-linearity. This FFN often holds about two-thirds of a block's parameters, and crucially, **all of its parameters are activated for every token**. Doubling the FFN size doubles the FFN compute (FLOPs) for every token processed. A model's capacity and its inference cost are therefore rigidly coupled.

The MoE idea is to **break that coupling**. We replace the single dense FFN with a set of `N` independent FFNs (the experts) plus a small routing network. For a given token, only `k` of the `N` experts are evaluated (with `k` very small, often 1 or 2). The model can then hold hundreds of billions of parameters while activating only a fraction per token.

Formally, a dense FFN layer computes `y = W_2 · σ(W_1 · x)` for **every** token `x`. An MoE layer computes `y = Σ_{i ∈ top-k} g_i · FFN_i(x)`, where `g_i` is expert `i`'s gating weight and the sum runs only over the `k` selected experts. The other `N − k` experts are never evaluated for that token: their compute is exactly zero, not merely small.
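
The formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "experts" are hypothetical scalar functions, the router logits are hard-coded rather than learned, and the gate weights are a softmax renormalized over the selected experts. The call counter shows that unselected experts are never evaluated.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_logits, k=2):
    """Compute y = sum over the top-k experts of g_i * expert_i(x)."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)           # renormalize over the k winners
    y = sum((probs[i] / norm) * experts[i](x) for i in top)
    return y, top

# Hypothetical scalar experts: expert i multiplies its input by (i + 1)
# and records that it was called.
calls = [0, 0, 0, 0]

def make_expert(i):
    def expert(x):
        calls[i] += 1
        return (i + 1) * x
    return expert

experts = [make_expert(i) for i in range(4)]
y, chosen = moe_layer(2.0, experts, router_logits=[0.1, 2.0, 0.3, 2.0], k=2)
print(chosen, y, calls)
```

Experts 1 and 3 tie for the highest score, so each gets a weight of 0.5, while experts 0 and 2 are never called.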

![A comparison between a dense FFN layer and a sparse MoE layer: the dense layer activates all its parameters, the MoE layer only activates the experts chosen by the router.](/articles/mixture-of-experts-moe-a-grande-echelle/dense-vs-sparse-ffn.svg)
*Figure: dense vs sparse. On the left, a single FFN processes each token with all its parameters; on the right, the router sends each token to the top-k experts, most of which stay inactive.*

## Total vs active parameters

This is the fundamental MoE metric. A model is described by two numbers:

- **total parameters**: everything that must be stored in memory (all experts);
- **active parameters**: what actually participates in computing a token (attention + selected experts).

Mixtral 8x7B illustrates the gap well: it has **8 experts per layer**, routes each token to **2** of them, and totals **~47 billion** parameters but activates only **~13 billion** per token. You get the quality of a large model for the compute of a much smaller one, at the cost of a heavy memory footprint (all 47 G parameters must be loaded).

> Mnemonic: *total* parameters govern **memory** (VRAM), *active* parameters govern **compute** (latency, FLOPs). An MoE keeps the latter small while letting the former grow.

### A back-of-the-envelope calculation

Take a block whose dense FFN would have 0.5 G parameters. With 8 experts you store 8 × 0.5 = 4 G of expert parameters per layer, but in top-2 you compute with only 2 × 0.5 = 1 G per token. Over 32 layers that is 128 G stored versus 32 G computed on the FFN side, an active/total ratio of 25% before the shared layers (attention, embeddings) are added to both totals. This **active/total ratio**, which drops below 10% in the sparsest models, is exactly what MoE seeks to minimize without losing quality.
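
The same arithmetic, written as a small helper so the numbers can be varied. The function name and signature are illustrative, and it counts only FFN-side parameters (attention and embeddings are left out, as in the text).

```python
def ffn_param_budget(ffn_params, n_experts, k, n_layers):
    """Return (stored, active, active/stored) for the expert FFNs only."""
    stored = ffn_params * n_experts * n_layers   # every expert lives in memory
    active = ffn_params * k * n_layers           # only k experts run per token
    return stored, active, active / stored

stored, active, ratio = ffn_param_budget(0.5e9, n_experts=8, k=2, n_layers=32)
print(f"stored={stored / 1e9:.0f} G, active={active / 1e9:.0f} G, ratio={ratio:.0%}")
```

With these inputs the FFN-side ratio is simply `k / n_experts`, here 2/8.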

## A bit of history

The idea is not new: "mixture of experts" dates back to the 1990s (Jacobs, Jordan, Hinton), where several specialized networks were trained together with a *gating* to weight them. Its modern resurrection rests on two key works:

- the **Sparsely-Gated MoE** (Shazeer et al., 2017), which introduced top-k routing for layers with thousands of experts in an LSTM;
- the **Switch Transformer** (Fedus, Zoph, Shazeer, 2021), which simplified to top-1, integrated it into the Transformer, and demonstrated trillion-scale.

Since then the lineage has grown long: GShard, GLaM, ST-MoE, Mixtral, DeepSeekMoE/V3, and adoption in mainstream production models. The historical thread is constant: at every step the goal is to make the sparse layer both more **stable** to train and cheaper to **communicate** across accelerators.

## The router (gating network)

The heart of an MoE is the **router**, also called the *gating network*. It is a simple linear layer: it projects the token's hidden vector onto `N` logits (one per expert), applies a softmax, and keeps the `k` highest.

```py
import torch
import torch.nn.functional as F

def top_k_router(x, W_gate, k=2):
    # x: (tokens, d_model) ; W_gate: (d_model, n_experts)
    logits = x @ W_gate                      # raw score per expert
    probs = F.softmax(logits, dim=-1)        # distribution over experts
    topk_w, topk_idx = probs.topk(k, dim=-1) # k best experts + weights
    topk_w = topk_w / topk_w.sum(-1, keepdim=True)  # renormalize
    return topk_idx, topk_w  # where to send the token, with what weight
```

The layer's output is the **weighted sum** of the selected experts' outputs, the weights coming from the router's softmax. This router is learned jointly with the rest of the model — nothing imposes a thematic specialization *a priori*; specialization, if it emerges, does so through optimization.

An important detail: the normalization can be taken **before** or **after** the top-k. Applying a softmax over all `N` logits and then renormalizing the `k` retained weights (as above) gives the same weights as applying the softmax to the `k` selected logits only; either way, as in GShard and Mixtral, the weights of the retained experts sum to 1. DeepSeek-V3 instead replaces the softmax with a per-expert **sigmoid**, then normalizes the scores of the selected experts, which decouples one expert's score from the others' and makes it easy to add a balancing bias (see below).
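
To make the difference between these gating variants concrete, here is a minimal sketch in plain Python. The function names and the example logits are illustrative assumptions, not taken from any of the cited implementations. It checks that softmax-then-renormalize and top-k-then-softmax give identical weights, while sigmoid gating selects the same experts but weights them differently.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def gates_softmax_then_topk(logits, k):
    probs = softmax(logits)                      # softmax over all experts
    idx = top_k_indices(probs, k)
    norm = sum(probs[i] for i in idx)            # renormalize the winners
    return {i: probs[i] / norm for i in idx}

def gates_topk_then_softmax(logits, k):
    idx = top_k_indices(logits, k)               # pick on raw logits
    weights = softmax([logits[i] for i in idx])  # softmax over the winners only
    return dict(zip(idx, weights))

def gates_sigmoid(logits, k):
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # independent per expert
    idx = top_k_indices(scores, k)
    norm = sum(scores[i] for i in idx)
    return {i: scores[i] / norm for i in idx}

logits = [1.2, -0.3, 2.5, 0.4]
a = gates_softmax_then_topk(logits, 2)
b = gates_topk_then_softmax(logits, 2)
c = gates_sigmoid(logits, 2)
print(a, b, c)
```

The first two agree because the softmax denominator cancels when the retained weights are renormalized.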

![Top-k routing: the router computes a softmax score per expert, keeps the two best, applies a capacity factor, drops the overflow, and uses an auxiliary loss to balance the load.](/articles/mixture-of-experts-moe-a-grande-echelle/router-top-k-routing.svg)
*Figure: top-2 routing. Scores rank the experts; the two best are kept, the capacity factor bounds the tokens per expert, and the auxiliary loss fights collapse.*

## Top-1, top-2: how many experts per token?

The number `k` of activated experts is a major lever.

- **Top-1** (Switch Transformer): a single expert per token. It is the sparsest choice, minimizing compute and communication. The Switch Transformer authors showed that beyond simplification, top-1 yields **better performance** at a fixed compute budget, and pushed the approach to **1.6 trillion parameters** spread over **2,048 experts** (the Switch-C model).
- **Top-2** (GLaM, Mixtral): two experts per token. The extra cost is moderate but the gradient is richer (the router gets more signal), which often stabilizes training. GLaM, with **1.2 trillion** total parameters over 64 experts per layer, activates only about **97 billion** parameters (8%) per prediction, and matched GPT-3 with **one-third the training energy**.

The larger `k`, the closer to dense behavior (and cost); the smaller `k`, the sparser (and harder to route). The classic trade-off is top-1 or top-2. Fine-grained-expert architectures (DeepSeek-V3) push `k` higher (8) but with much smaller experts, which combines combinatorial richness with computational sparsity.

## Routing variants

"Token chooses the expert" (top-k on the token side) is not the only option. Several variants exist:

- **Expert choice**: flip the decision — each expert picks its top `C` tokens. Balancing is then **guaranteed by construction** (each expert gets exactly `C` tokens), at the cost that a token may be taken by zero or several experts.
- **Block/hash routing**: tokens are assigned by a fixed (hash) function rather than a learned one — surprisingly competitive and trivially balanced.
- **Soft MoE**: combine *all* experts via continuous weights over token "slots," avoiding the hard decision (and dropping) at the cost of pure sparsity.

The choice depends on the goal: balance guarantee, quality, or maximum sparsity. In causal (decoder) pre-training, *expert choice* is tricky because an expert cannot "look at" the whole sequence without leaking information from the future; token-side routing therefore remains dominant for generative LLMs.
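
As an illustration of the *expert choice* variant, the sketch below lets each expert keep its `capacity` highest-scoring tokens. The affinity matrix is made up for the example, and the function is a simplified reading of the idea rather than a reference implementation.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.
    Each expert keeps its `capacity` best tokens."""
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment

# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.2, 0.6],
    [0.3, 0.1],
]
assignment = expert_choice(scores, capacity=2)
# How many experts picked each token.
coverage = [sum(t in toks for toks in assignment.values()) for t in range(len(scores))]
print(assignment, coverage)
```

Every expert receives exactly two tokens, but token 1 is processed twice and token 3 not at all, which is the trade-off described above.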

## The central problem: load balancing

Left unchecked, the router enters a vicious cycle. A few experts, slightly better at the start, receive more tokens, hence more gradient, hence improve more, hence attract even more tokens. This is **expert collapse**: a handful of experts does all the work while the rest are never trained. The model's "free" capacity is wasted.

The historical fix is an **auxiliary load-balancing loss**, added to the main loss. Its form in the Switch Transformer is:

```text
L_aux = α · N · Σ_i ( f_i · P_i )
```

where `N` is the number of experts, `f_i` the **fraction of tokens** actually routed to expert `i` in the batch, and `P_i` the **average probability** the router assigns to expert `i`. This loss is minimal when the load is uniform: with `f_i = P_i = 1/N`, the sum equals `1/N` and the loss equals `α`, whereas concentrating tokens and probability on the same few experts pushes it higher. The coefficient `α` (typically ~0.01) tunes the pressure: too low and balancing fails to kick in; too high and it degrades model quality by forcing artificial routing. A **router z-loss** (ST-MoE) is often added to penalize overly large logits and numerically stabilize the softmax.

Intuitively, `f_i` (a count, non-differentiable) acts as a weight, and `P_i` (differentiable) carries the gradient: minimizing the product pushes the router to *lower* the probability of already-overloaded experts. It is a soft signal, imposing nothing token by token but correcting the global tendency.
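
A small numerical check of this loss, assuming top-1 routing and hand-written router probabilities (both batches are invented for the example). It only evaluates the formula and does not compute gradients.

```python
def switch_aux_loss(assignments, router_probs, n_experts, alpha=0.01):
    """assignments[t]: expert chosen for token t (top-1).
    router_probs[t][i]: router probability of expert i for token t."""
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: one token per expert, uniform probabilities.
balanced = switch_aux_loss([0, 1, 2, 3], [[0.25] * 4] * 4, n_experts=4)

# Collapsed batch: every token goes to expert 0 with high confidence.
collapsed = switch_aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, n_experts=4)

print(balanced, collapsed)
```

The balanced batch gives a loss of `α` (0.01), while the collapsed batch gives nearly four times that, so the gradient through `P_i` pushes probability away from the overloaded expert.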

Concretely, MoE training instability comes from several sources:

- **router logits** that blow up and saturate the softmax (hence the z-loss);
- **discontinuous gradients**: a tiny change in score can flip a token from one expert to another, making the loss surface rough;
- sensitive **numerical precision**: the router is often computed in `float32` even when the rest runs in `bfloat16` (the Switch Transformer's *selective precision*).

These fixes (z-loss, computing the router in high precision, routing noise during training) are what separate an MoE that converges from one that diverges.
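
For reference, the router z-loss penalizes the squared log-sum-exp of each token's router logits, averaged over the batch. The sketch below is a plain Python rendering of that definition; the coefficient default is a placeholder assumption to be tuned, and the logits are invented.

```python
import math

def logsumexp(logits):
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logits))

def router_z_loss(batch_logits, coeff=1e-3):
    """Mean over tokens of (log-sum-exp of the router logits) squared."""
    penalty = sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)
    return coeff * penalty

small = router_z_loss([[0.1, -0.2, 0.3, 0.0]])
large = router_z_loss([[10.0, -20.0, 30.0, 0.0]])
print(small, large)
```

Large-magnitude logits produce a much larger penalty, which is what keeps the softmax out of its saturated regime.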

## The recent alternative: balancing without an auxiliary loss

DeepSeek-V3 popularized an **auxiliary-loss-free** approach. Instead of a loss that interferes with the main gradient, it adds to each expert a **dynamic bias** `b_i` applied **only** to the top-k selection, **not** to the final weighting `g_i`. At each step, the load is observed: an overloaded expert sees its bias decrease (becoming less attractive), an under-used expert sees it increase, by a step `γ`. Balance self-regulates without polluting the learning objective — avoiding the quality degradation that an overly strong auxiliary loss used to cause.

```text
# bias-adjustment pseudo-code (per step)
load_i  = number of tokens routed to expert i
average = total_tokens · k / N
for each expert i:
    if load_i > average:  b_i = b_i − γ   # make i less attractive
    if load_i < average:  b_i = b_i + γ   # make i more attractive
# selection uses (s_i + b_i) ; final weighting uses s_i alone
```

This trick cleanly separates the two roles: `s_i + b_i` decides **who** is activated (hence balancing), while `s_i` alone decides **the weight** of the contribution, preserving the unbiased learning signal. DeepSeek-V3 nonetheless keeps a very-low-coefficient *sequence-wise* balancing loss as a safeguard.
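
The separation of roles can be sketched in runnable form. This is a toy, single-step illustration with made-up scores and a deliberately large step `gamma`; it assumes every token in the batch has the same affinity scores, which is not realistic but makes the effect easy to follow.

```python
def select_top_k(scores, bias, k):
    """Select experts by (s_i + b_i), but weight them by s_i alone."""
    chosen = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    norm = sum(scores[i] for i in chosen)
    weights = {i: scores[i] / norm for i in chosen}
    return chosen, weights

def update_bias(bias, load, k, total_tokens, gamma):
    average = total_tokens * k / len(bias)
    new_bias = []
    for b, l in zip(bias, load):
        if l > average:
            b -= gamma      # overloaded: make it less attractive
        elif l < average:
            b += gamma      # under-used: make it more attractive
        new_bias.append(b)
    return new_bias

scores = [0.9, 0.8, 0.5, 0.3]   # hypothetical affinities, identical for all tokens
bias = [0.0, 0.0, 0.0, 0.0]
n_tokens, k, gamma = 8, 2, 0.25

chosen_before, _ = select_top_k(scores, bias, k)
load = [n_tokens if i in chosen_before else 0 for i in range(len(scores))]
new_bias = update_bias(bias, load, k, n_tokens, gamma)
chosen_after, weights_after = select_top_k(scores, new_bias, k)
print(chosen_before, new_bias, chosen_after, weights_after)
```

After one update, expert 2 enters the selection thanks to its bias, yet its gate weight is still computed from its raw score (0.5 out of 1.4), not from the biased one.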

## Capacity factor and token dropping

In practice (and especially under hardware parallelism), each expert has a bounded **capacity**: a maximum number of tokens it can process per batch. It is computed as:

```text
capacity = (tokens_per_batch / number_of_experts) × capacity_factor
```

The **capacity factor** (typically 1.0 to 1.25) gives headroom above a perfectly uniform split. If an expert receives more tokens than its capacity, the overflow is **dropped** (*token dropping*): those tokens skip the expert layer and pass through the residual connection untransformed. Too low a factor increases drops (information loss); too high a factor inflates the reserved compute and memory. The Switch Transformer shows that good balancing lets you run at a low factor (1.0–1.25) with a drop rate often below 1%, with no measurable impact on quality.

Why a **fixed capacity** rather than a dynamic one? Because hardware wants **statically shaped** tensors: you pre-allocate a buffer of size `capacity × d_model` per expert, which makes the all-to-all and the compute predictable. The downside is that imbalance is paid either in drops (buffer too small) or in waste (buffer too large, padded with zeros). This is why the capacity factor and the balancing mechanism have to be tuned together.
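
A minimal sketch of capacity-bounded dispatch, assuming top-1 routing, first-come-first-served filling, and rounding the capacity up to an integer (the rounding rule is an assumption; the formula above does not specify one).

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    return math.ceil(tokens_per_batch / n_experts * capacity_factor)

def dispatch_with_capacity(assignments, n_experts, capacity):
    """assignments[t]: expert chosen for token t. Tokens arriving after an
    expert's buffer is full are dropped and rely on the residual path."""
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for t, e in enumerate(assignments):
        if len(buffers[e]) < capacity:
            buffers[e].append(t)
        else:
            dropped.append(t)
    return buffers, dropped

assignments = [0, 0, 0, 0, 1, 1, 2, 3]          # expert 0 is oversubscribed
capacity = expert_capacity(len(assignments), n_experts=4, capacity_factor=1.25)
buffers, dropped = dispatch_with_capacity(assignments, 4, capacity)
drop_rate = len(dropped) / len(assignments)
print(capacity, buffers, dropped, drop_rate)
```

Here the capacity is 3, so the fourth token sent to expert 0 is dropped while experts 1 to 3 leave part of their buffers unused: both costs of imbalance appear in the same batch.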

## Expert parallelism and the memory cost

At trillion scale, the experts do not fit on a single accelerator. We use **expert parallelism**: experts are spread across multiple GPUs/TPUs, and each token is **dispatched** (all-to-all) to the GPU(s) hosting its experts, then its output is **sent back**. Non-MoE layers (attention, embeddings) are replicated as in data parallelism. Expert parallelism is generally combined with data, tensor, and pipeline parallelism.

The MoE Achilles' heel lives here: **all-to-all** communication is expensive and sensitive to imbalance (an overloaded expert becomes a bottleneck). This is precisely why capacity and balancing are so critical — they bound and smooth this network traffic. A particularly costly case is dispatching tokens **across nodes** (inter-node): inter-node bandwidth is far lower than intra-node, hence **node-limited routing** strategies that cap the number of distinct nodes a token can reach (DeepSeek-V3 bounds it to 4 nodes), drastically cutting inter-node traffic.

The different forms of parallelism combine and stack:

- **Data**: each replica sees a different slice of the batch; non-MoE layers replicated.
- **Tensor**: a single layer is split across several accelerators.
- **Pipeline**: successive layers are distributed across stages.
- **Expert**: the experts of an MoE layer live on distinct accelerators, linked by all-to-all.

The placement choice (how many experts per accelerator, what degree of each parallelism) directly governs communication volume and hardware utilization. Poorly calibrated, the network becomes the limiting factor well before compute does. High-performance implementations also **overlap** expert compute with the all-to-all communication (double buffering, micro-batches) to hide network latency behind useful compute.

## Fine-grained and shared experts (DeepSeekMoE)

DeepSeekMoE, the design that DeepSeek-V3 builds on, refines the architecture further with two ideas:

- **Fine-grained experts**: instead of `N` large experts, create `m·N` smaller ones (hidden dimension divided by `m`) and activate `m` times as many. At constant compute, this multiplies the possible **combinations** of experts, favoring a more precise specialization of knowledge.
- **Shared experts**: one or a few experts are **always** activated for all tokens. They absorb common knowledge (grammar, general structure), sparing each routed expert from relearning it. DeepSeek-V3 combines **1 shared expert** and **256 routed experts** per layer, with 8 routed experts active per token.

The benefit of these two ideas is cumulative: shared experts stabilize the knowledge base while fine-grained experts multiply specialization diversity at no extra compute. The number of possible expert combinations explodes: choosing 8 experts out of 256 offers about 4 × 10^14 distinct routings, where 2 out of 8 (Mixtral) offers only 28. DeepSeek-V3 thus reaches **671 billion** total parameters for only **~37 billion** activated per token.
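
The combinatorial gap is easy to verify with the standard library (`math.comb` requires Python 3.8 or later). The counts ignore the always-active shared expert, since it adds no routing choice.

```python
import math

mixtral_routings = math.comb(8, 2)       # 2 experts chosen out of 8
deepseek_routings = math.comb(256, 8)    # 8 experts chosen out of 256

print(mixtral_routings)
print(f"{deepseek_routings:.2e}")
```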

## What do experts actually "do"?

A misleading intuition holds that each expert specializes in a legible theme ("the law expert," "the code expert"). In practice, analysis of Mixtral shows the specialization is mostly **syntactic and positional** rather than high-level semantic: a given expert often handles tokens of the same grammatical nature, and consecutive tokens are frequently routed to the same experts. In other words, experts are not thematic modules but learned routines, and one should not over-interpret their role.

```text
# illustrative assignments (not measured data)
Token "def"  -> experts {3, 7}
Token "("    -> experts {3, 1}
Token "x"    -> experts {5, 7}
# routing follows syntax and position, with little obvious semantic alignment
```

A practical consequence: you **cannot** safely "prune" an expert by betting it is useless for a given domain, because its role is diffuse and distributed. Compressing an MoE usually goes through quantization or expert merging rather than selective removal.

## The inference serving challenge

An MoE is faster to *compute* than a quality-equivalent dense model, but far harder to *serve* efficiently:

- **Memory footprint**: **all** experts must be loaded into VRAM (Mixtral: ~47 G parameters, roughly 94 GB at 16-bit precision), even if only a few serve each token. Memory tracks total, not active, parameters.
- **Per-token, per-layer routing**: at each MoE layer, different tokens of the same sequence go to different experts. The activation pattern is **dynamic and unpredictable**, complicating batching and steady accelerator utilization.
- **Production imbalance**: the real request distribution can overload certain experts, creating hotspots training did not anticipate.

Several techniques mitigate this: dedicated *expert parallelism* for serving, *offloading* of rare experts to cheaper memory (CPU/NVMe), aggressive quantization of expert weights, and grouping tokens by expert before execution (*grouped GEMM*). The *prefill* phase (many tokens in parallel) tends to fill experts well, whereas *decode* (one token at a time) suffers most from imbalance — hence the value of dynamic batching that aggregates tokens from several requests.
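
The idea behind grouping tokens by expert can be shown without any GPU kernel: gather the token indices routed to each expert, run each expert once on its whole group, then scatter the weighted results back. This is a conceptual sketch only; the scalar experts, routing tables, and weights are invented, and a real grouped GEMM fuses these per-expert calls into one kernel launch.

```python
from collections import defaultdict

def group_by_expert(topk_idx):
    """topk_idx[t]: experts selected for token t. Returns {expert: [token ids]}."""
    groups = defaultdict(list)
    for t, experts_for_t in enumerate(topk_idx):
        for e in experts_for_t:
            groups[e].append(t)
    return dict(groups)

def run_grouped(tokens, topk_idx, topk_w, experts):
    out = [0.0] * len(tokens)
    for e, token_ids in group_by_expert(topk_idx).items():
        batch = [tokens[t] for t in token_ids]
        results = experts[e](batch)              # one call per expert, whole group
        for t, r in zip(token_ids, results):
            w = topk_w[t][topk_idx[t].index(e)]  # gate weight of expert e for token t
            out[t] += w * r
    return out

def make_expert(scale):
    # Hypothetical expert: multiplies every input in the batch by `scale`.
    return lambda batch: [scale * x for x in batch]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]
tokens = [1.0, 2.0, 3.0]
topk_idx = [[0, 1], [1, 2], [0, 2]]
topk_w = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]

groups = group_by_expert(topk_idx)
out = run_grouped(tokens, topk_idx, topk_w, experts)
print(groups, out)
```

With a single decoded token per request, each group holds at most one token, which is why aggregating tokens from several requests matters so much in decode.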

## Monitoring an MoE in production

A few signals to instrument to keep an MoE healthy:

- **Token drop rate** per expert and per layer: a spike reveals imbalance or too low a capacity factor.
- **Routing entropy**: collapsing entropy signals concentration on a few experts (collapse in progress).
- **Per-expert load**: the distribution should stay close to uniform; persistent hotspots warrant adjustment.
- **All-to-all latency**: monitor separately, since it can dominate total latency in distributed serving.

These metrics, easy to log, prevent the silent drifts that degrade quality without any visible error.
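
Two of these signals can be computed from logged routing decisions alone. The sketch below uses one possible definition of routing entropy, namely the entropy of the per-expert load distribution normalized to [0, 1]; other definitions (for example, the entropy of per-token router probabilities) are equally reasonable. The batches are invented.

```python
import math

def expert_load(assignments, n_experts):
    """Fraction of routed tokens received by each expert."""
    n = len(assignments)
    return [assignments.count(i) / n for i in range(n_experts)]

def normalized_entropy(load):
    """1.0 means perfectly uniform load, 0.0 means a single expert takes everything."""
    h = -sum(p * math.log(p) for p in load if p > 0)
    return h / math.log(len(load))

healthy = expert_load([0, 1, 2, 3, 0, 1, 2, 3], n_experts=4)
collapsing = expert_load([0, 0, 0, 0, 0, 0, 0, 1], n_experts=4)

print(healthy, normalized_entropy(healthy))
print(collapsing, normalized_entropy(collapsing))
```

A value near 1.0 indicates balanced routing; a sustained fall toward 0 is the early sign of collapse mentioned above.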

## Pros and cons versus dense

**For MoE:**

- better quality-per-FLOP: more capacity at equal compute;
- faster training to a target quality (up to ~7× pre-training speedup reported by the Switch Transformer);
- lower per-token inference compute than a quality-equivalent dense model.

**Against MoE:**

- massive memory footprint (all experts reside in VRAM);
- engineering complexity (routing, all-to-all communication, balancing);
- trickier inference serving (dynamic patterns, batching);
- a tendency to **overfit** and more finicky fine-tuning than a dense model.

## A few benchmark numbers

To anchor the orders of magnitude of the models cited:

- **Switch-C**: ~1.6 T total parameters, 2,048 experts, top-1 routing.
- **GLaM**: ~1.2 T total, 64 experts/layer, top-2, ~97 G active (8%).
- **Mixtral 8x7B**: ~47 G total, 8 experts, top-2, ~13 G active.
- **DeepSeek-V3**: ~671 G total, 256 routed experts + 1 shared, ~37 G active.

The common thread: active parameters are a small fraction of the total, from roughly 28% for Mixtral down to about 8% for GLaM and under 6% for DeepSeek-V3, which captures the entire economic promise of MoE.
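
The ratios follow directly from the figures listed, for the three models where both totals are given (values in billions of parameters, rounded as in the list).

```python
# (total, active) in billions of parameters, as quoted in the list above.
models = {
    "GLaM": (1200, 97),
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
}

ratios = {name: active / total for name, (total, active) in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1%} active")
```

Mixtral, with only 8 coarse experts, sits near 28%, while the models with many experts per layer fall well under 10%.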

## When to prefer dense, when to prefer MoE

MoE is not universally superior. A few decision pointers:

- **Memory constrained** (single GPU, edge): a dense model is often simpler and more efficient, since MoE forces loading all experts.
- **Constrained training compute, maximum quality targeted**: MoE excels, offering more capacity at equal FLOPs.
- **Heavily knowledge-oriented tasks** (factual questions, multilingual): MoE shines, its extra capacity storing more facts.
- **Pure reasoning tasks and light fine-tuning**: a dense model sometimes stays more robust and less prone to overfitting.
- **High-throughput serving with parallelism available**: MoE is viable if the infrastructure supports all-to-all and expert placement.

## A practical checklist for training and serving an MoE

Before launching a training run or going to production, verify point by point:

- **Choice of `k`**: top-1 for maximum sparsity, top-2 for a more stable gradient, fine-grained top-8 for combinatorial diversity.
- **Balancing strategy**: auxiliary loss (`α` ~0.01) *or* loss-free dynamic bias (DeepSeek-V3), with a low-coefficient safeguard.
- **Stabilization**: router z-loss enabled, router in `float32`, optional routing noise during training.
- **Capacity factor**: start at 1.0–1.25; watch the drop rate and adjust.
- **Placement**: degree of expert parallelism vs data/tensor/pipeline; enable node-limited routing if inter-node dominates.
- **Overlap**: expert compute hidden behind the all-to-all communication.
- **Serving**: VRAM sized to total parameters, dynamic batching for decode, expert quantization/offloading if needed.
- **Observability**: drop rate, routing entropy, per-expert load, and all-to-all latency logged from day one.

## In summary

MoE decouples capacity from compute: a router sends each token to only a few experts, so the *total* parameter count can explode while *active* parameters stay modest. The whole craft lies in **balancing the load** — via auxiliary loss or dynamic bias — to avoid expert collapse, tuning the **capacity factor** and token dropping, and **distributing** experts across hardware without choking the network. From Switch Transformer to DeepSeek-V3, it is this machinery that makes trillion-parameter scale economically viable.