Quantization means representing a neural network's weights (and sometimes activations) with fewer bits: going from 32 bits per parameter down to 8, 4, or even 2 bits. For a large language model the stakes are real — a 7B model in FP32 takes **28 GB** for its weights alone, making it impossible to load on most consumer GPUs. This article explains the numeric formats involved, why LLM inference is **memory-bandwidth bound**, the major method families (GPTQ, AWQ, SmoothQuant, GGUF, NF4), and the serving techniques (continuous batching, PagedAttention, speculative decoding) that turn a quantized model into a high-throughput service.

## Why quantize: memory and bandwidth

A weight stored as 32-bit floating point (FP32) costs 4 bytes. A model with **N** parameters therefore weighs `4 × N` bytes in FP32; half that in half precision (16 bits), a quarter in INT8, an eighth in INT4. The handy mental rule is simple: **weight memory ≈ N × (bits / 8)**.

| Format | Bytes/weight | Memory of a 7B |
| --- | --- | --- |
| FP32 | 4 | 28 GB |
| FP16 / BF16 | 2 | 14 GB |
| FP8 / INT8 | 1 | 7 GB |
| INT4 | 0.5 | 3.5 GB |
| INT2 | 0.25 | 1.75 GB |

![Bit layout (sign, exponent, mantissa) and memory footprint of a 7B model, from FP32 to INT4.](/articles/quantization-et-inference-efficace/formats-numeriques.svg)
*Figure: numeric formats and memory. Each row shows the bit split and the corresponding memory for a 7-billion-parameter model.*

But memory is only half the story. The real inference win comes from **bandwidth**. To produce **a single** output token, the GPU must reread the model's entire set of weights from VRAM into the compute units. At every decode step you therefore transfer `model size` bytes — halving or quartering that size cuts the memory traffic proportionally, and that traffic is the dominant bottleneck. Above all, quantizing means **reading fewer bytes per token**.

### Arithmetic intensity and the roofline

To formalize this bottleneck, reason in terms of **arithmetic intensity**: the number of floating-point operations (FLOPs) performed per byte read from memory. The **roofline** model says a workload is compute-bound if its intensity exceeds the GPU's `peak FLOPs / peak bandwidth` ratio, and memory-bound otherwise. On an H100 that "pivot" ratio sits around **300 FLOPs/byte**.

Yet decoding an LLM with a batch of size 1 has an arithmetic intensity close to **2 FLOPs/byte** (each weight read feeds one multiply-add, i.e. 2 FLOPs, for ~2 bytes in FP16). That is far to the left of the pivot: the GPU is massively underused on the compute side and spends its time waiting on VRAM. Two levers shift the cursor rightward: **quantizing** (fewer bytes per weight) and **batching** (reusing each weight read across several requests, which directly raises intensity).

## The numeric formats in detail

A floating-point number splits into three fields: **sign** (1 bit), **exponent** (the range), and **mantissa** (the precision). The trade-offs differ by format:

- **FP32** — 1 sign + 8 exponent + 23 mantissa. The "full precision" reference, used in historical training.
- **FP16** — 1 + 5 + 10. IEEE half precision. Good relative precision but a **narrow exponent range** (5 bits): risk of overflow/underflow on large activations.
- **BF16** (bfloat16) — 1 + 8 + 7. Same exponent range as FP32 (8 bits) but reduced mantissa. It has become the default training and inference format for LLMs: it "blows up" far less than FP16 on extreme values.
- **FP8** (E4M3 or E5M2) — 8-bit floats, natively supported on recent GPUs (Hopper/Blackwell). Two variants coexist: **E4M3** (4 exponent bits, 3 mantissa) favors precision and serves weights and activations; **E5M2** (5 exponent, 2 mantissa) favors dynamic range and often serves gradients. A good compromise when the hardware has dedicated units: you keep "float" behavior (no zero-point) while doubling tensor-core throughput.
- **INT8 / INT4** — integers. No exponent: a real `x` is encoded as `x ≈ scale × q` where `q` is an integer and `scale` a floating-point scaling factor. Choosing the `scale` (and an optional `zero-point`) is the whole art of integer quantization.
- **INT2 / ternary** — the extreme of compression. At 2 bits (or even 1.58 bits for ternary {−1, 0, +1} schemes), memory collapses but so does quality, barring dedicated training (QAT). Reserved for research and extreme hardware constraints.

The basic affine integer-quantization equation is:

```text
q = round(x / scale) + zero_point      (quantize)
x ≈ scale × (q − zero_point)           (dequantize)
```

Two variants coexist:

- **Symmetric**: `zero_point = 0`, the range is centered on 0 (`[−s·2^{b−1}, s·(2^{b−1}−1)]`). Simpler and slightly faster.
- **Asymmetric (affine)**: a nonzero `zero_point` aligns the real zero onto an integer. Essential for skewed distributions (ReLU, GELU outputs…).

The **granularity** of the `scale` is crucial:

- **Per tensor**: a single scale for the whole matrix. Compact, but coarse.
- **Per channel**: one scale per row (per output). Much better, little overhead.
- **Per group**: one scale per block of 32, 64, or 128 contiguous weights. This is the setting of serious 4-bit methods; `group_size = 128` is the de facto standard.

The finer the granularity, the better local amplitude variation is captured — at the cost of a little extra metadata (the scales and zero-points). A small calculation: for 4-bit weights with `group_size = 128` and one FP16 scale per group, the scale overhead is `16 bits / 128 weights ≈ 0.125 bit/weight`, i.e. ~3% on top of the nominal 4 bits. That is why these formats are sometimes called **"4.5 effective bits"**: the real size always includes the metadata.

### Quantization error and clipping

Two error sources coexist. **Rounding error** (each real is rounded to the nearest level) is at most `scale / 2`; it shrinks as you refine the scale. **Clipping error** (saturation) appears when you bound the range to better resolve the center of the distribution, at the cost of crushing the extremes. Finding the optimal scale means trading off the two: too large a scale increases rounding, too small a scale increases clipping. Serious methods seek this point by minimizing a reconstruction error (often the squared error on the layer's output, not on the weights themselves).

## Weight-only vs weight + activation

Two big families differ by **what** gets quantized:

- **Weight-only**: weights are compressed to INT4/INT8, but activations stay in FP16/BF16 and the math runs in float after on-the-fly dequantization. This is the royal road for **decode** (bandwidth-bound): fewer weight bytes read, full activation precision kept. GPTQ and AWQ belong here (W4A16: 4-bit weights, 16-bit activations).
- **Weight + activation**: activations are quantized **too** (e.g. W8A8: INT8 weights and activations). You then exploit integer compute units (INT8 tensor cores), which speeds up **compute-bound** phases like prefill and high-throughput serving. But it's much harder, because of **activation outliers**.

The **WxAy** notation captures all of this: `x` bits for weights, `y` for activations. W4A16 (4-bit weights, 16-bit activations) is weight-only; W8A8 and W4A8 are weight + activation schemes. The choice depends on the phase you optimize and the batch regime.

### Mixed kernels: the hidden cost of dequantization

Weight-only only wins if on-the-fly dequantization doesn't eat the bandwidth gain. That is the job of **mixed GEMM kernels** (Marlin for W4A16/W8A16, Machete on Hopper): they multiply 16-bit activations by 4- or 8-bit weights while **overlapping** the load and dequantization latency with the float compute, so that the compressed read truly translates into a speedup. At low batch, a GPTQ packed for Marlin can beat AWQ by ~25%, and by nearly 50% at larger batch; but at very large batch you become compute-bound again and FP16 eventually catches up. The moral: a format is only worth as much as the kernel that serves it.

## The activation outlier problem

This is the central obstacle to activation quantization.

In large transformers, **a handful of channels** (on the order of 0.1 to 1% of hidden dimensions) carry activations 100 to 1000× larger in magnitude than the rest. The dilemma is hopeless with a naive scale:

- If you pick a scale **big enough** to cover those extremes, all the "normal" channels collapse onto a few quantization levels and accuracy crumbles.
- If you pick a **fine** scale, the outliers saturate (clipping) and information is lost.

Weights, by contrast, are far better behaved: their distribution is close to a Gaussian, with no massive outliers. That is why **weight-only** quantization is markedly easier than activation quantization. This outlier dilemma is exactly what modern methods (AWQ, SmoothQuant) resolve, each in its own way.

## PTQ vs QAT

Two regimes for producing a quantized model:

- **PTQ** (Post-Training Quantization): quantize **after** training, with no backpropagation. Fast (minutes to hours), it's the default. Most methods use a small **calibration set** (a few hundred sentences) to estimate activation ranges and minimize reconstruction error.
- **QAT** (Quantization-Aware Training): **simulate** quantization during training (or a fine-tune) so the model learns to be robust to the precision loss. Better quality at very low precision (2-3 bits), but expensive: it needs data, compute, and time.

In practice, at 8 and 4 bits **PTQ is enough** in the overwhelming majority of cases; QAT is reserved for very aggressive regimes or strict hardware constraints.

## The PTQ methods that matter

**GPTQ** (2022) — weight-only, INT3/INT4. A **one-shot**, layer-wise method: it quantizes weights column by column using an approximation of the **inverse Hessian** to minimize reconstruction error, adjusting the remaining weights at each step. Concretely, once a column is frozen to its quantized value, the error introduced is **propagated** onto the still-float columns, which "compensate" for it — hence far less degradation than a naive round. Result: you can fit a 175B model on a single A100 and get ~3.25× speedup vs FP16, with near-unchanged perplexity.

**AWQ** (Activation-aware Weight Quantization, 2023) — weight-only, INT4. Key insight: not all weights are equal. A small fraction of "**salient**" weights (identified via **activation** magnitude, not weight magnitude) dominates quality. Rather than keeping those weights in full precision (which breaks the hardware efficiency of mixed formats), AWQ **scales per channel**: it multiplies the important channels by a factor `s > 1` before quantization and divides the corresponding activation by `s`, a mathematically neutral transformation that reduces the relative error on salient weights. No backprop and no Hessian, just a grid search for the optimal `s` on the calibration set. It has become a de facto standard for INT4 deployment, especially on the edge.

**SmoothQuant** (2022) — weight + activation, W8A8. Rather than suffering activation outliers, it **"smooths"** them: through a mathematically equivalent transformation, it shifts part of the difficulty from activations to weights (which are easier to quantize). Reported result: up to 2× memory reduction and ~1.56× speedup, training-free.

**GGUF / k-quants** (llama.cpp) — the CPU/Apple Silicon ecosystem. The **k-quants** (Q2_K to Q6_K) organize weights into **super-blocks of 256 values**, subdivided, with **double quantization** of the scales (the scales themselves are quantized) to cut metadata overhead. Practical benchmarks on Llama-3-8B: **Q4_K_M** (~4.5 GB, +0.18 perplexity) is the "default" quality/size pick, **Q5_K_M** (~5.3 GB, +0.06) when quality matters more.

**bitsandbytes NF4** — the QLoRA format. **NF4** (4-bit NormalFloat) is a data type that is **information-theoretically optimal for normally distributed weights**: its 16 levels are placed on the quantiles of a normal distribution (each "bin" receives equal probability mass) rather than uniformly. Combined with **double quantization** — the quantization constants (one FP32 scale per block of 64) are themselves quantized to 8 bits, saving ~0.4 bit/weight — NF4 matches BF16 performance while enabling **QLoRA**: fine-tuning a frozen 4-bit model via LoRA adapters, with a memory footprint 20× to 60× smaller than full fine-tuning. The gradient only flows through the 4-bit weights (dequantized on the fly) to reach the small trainable adapters.

### A worked memory gain at fine-tuning

On a 7B, full fine-tuning in BF16 demands not only the **weights** (~14 GB) but also the **gradients** (~14 GB) and Adam's **optimizer states** (two moments, ~28 GB in FP32), i.e. ~56 GB before even counting activations — out of reach of a single 24 GB GPU. QLoRA flips the equation: weights frozen in NF4 (~3.5 GB), no gradient or optimizer state on the model weights, and only the LoRA adapters (often < 1% of parameters) carry gradients and moments. The total fits comfortably on a 24 GB GPU, sometimes less.

In summary, the methods sort roughly as follows:

- **Weight-only INT4 on GPU**: GPTQ, AWQ.
- **Weight + activation on GPU**: SmoothQuant (W8A8), FP8.
- **CPU / Apple Silicon / edge**: GGUF (k-quants).
- **Budget fine-tuning**: NF4 (QLoRA).

## The role of the calibration set

Most PTQ methods need a few hundred examples to estimate activation ranges. This **calibration set** shapes the result:

- It must be **representative** of the usage domain (code if you serve code, multilingual text if you serve multiple languages).
- A reasonable size is on the order of **128 to 512 sequences**; beyond that, the gain is marginal.
- An ill-suited set can silently **degrade** quality on your real task.

Note: recent work shows that on modern LLMs the influence of outliers and of the calibration-set choice **tends to diminish** — but caution is still warranted.

## Quantizing the KV cache too

On long contexts, the **KV cache** (the attention keys/values of every token) can weigh as much as the model itself, or more. Its size follows a simple formula:

```text
kv_bytes = 2 × n_layers × n_kv_heads × head_dim × length × bytes_per_element
```

The factor `2` covers keys **and** values. On a typical 7B model (32 layers, hidden dim 4096) in FP16, that is roughly **0.5 MB per token**, i.e. ~2 GB for a 4096-token context — and that's per request. This is why multi-query attention (MQA) and grouped-query attention (GQA), which reduce `n_kv_heads`, have become standard: they directly cut the KV-cache weight.

Quantizing the cache (KV cache in FP8 or INT8):

- frees memory for **larger batches** or **longer contexts**;
- reduces the bandwidth to reread at every decode step.

This is a lever distinct from weight quantization, often combined with it in recent serving engines.

## From compression to throughput: serving inference

Compressing the model isn't enough: you need a **serving engine** that saturates the GPU. First key: understand that inference has **two phases** with opposite profiles.

- **Prefill**: process **all** of the prompt at once. Many FLOPs in parallel → **compute-bound**.
- **Decode**: generate tokens **one at a time**, rereading all weights (and the KV cache) at each step just to produce a single token → **memory-bandwidth bound**.

This asymmetry explains everything that follows. During decode the GPU spends its time **waiting on memory**, not computing. The fix: **batch several requests together** to amortize each weight read across many tokens at once.

## Continuous batching and PagedAttention

**Continuous batching** (a.k.a. in-flight batching) dynamically interleaves requests: as soon as a generation finishes, a new request joins the batch without waiting for the whole batch to complete. The scheduler mixes prefill (waiting requests) and decode (running requests) in the same step. Typical gain: **3 to 10×** throughput on the same hardware.

To batch efficiently you must manage the **KV cache** (the attention keys/values for every token seen so far) without wasting it. **PagedAttention** borrows from OS **virtual memory**: the KV cache is split into fixed-size **blocks** (16 tokens by default), allocated on demand and non-contiguous. This eliminates fragmentation (~19-27% memory saved), allows larger batches, and makes it easy to share common prefixes across requests.

**Chunked prefill** rounds it out: a very long prompt is split into chunks across several engine steps so it doesn't monopolize the GPU at the expense of requests being decoded. Without it, a 100,000-token prompt would block every other generation for the duration of its prefill; with it, the engine interleaves prefill chunks and decode steps, smoothing inter-token latency.

### Prefix caching

When several requests share the same beginning — a long *system prompt*, a common few-shot, a RAG document — recomputing their KV cache every time is pure waste. **Prefix caching** keeps and **reuses** the common prefix's KV cache: only the variable part is recomputed. PagedAttention makes this sharing trivial (the prefix blocks are simply referenced by multiple sequences), and SGLang turns it into an art with **RadixAttention**, which organizes prefixes in a radix tree to maximize overlap. On highly shared workloads the time-to-first-token latency gain is dramatic.

## Speculative decoding: several tokens per pass

Decode is sequential **by nature**: token `t+1` depends on `t`. **Speculative decoding** breaks that barrier without changing the output. The idea: a **small draft model** (fast) proposes `k` tokens ahead; the **large target model** **verifies them all in a single** parallel pass, accepts the longest correct prefix, and corrects the first rejected token. Because verification is exact, **the output is strictly identical** to plain decoding — you pay no quality loss, you just gain several tokens per pass of the large model.

![Speculative decoding: the draft model proposes k tokens, the large model verifies them in parallel, accepts the correct prefix, and corrects the first rejection.](/articles/quantization-et-inference-efficace/speculative-decoding.svg)
*Figure: the draft + verify flow of speculative decoding. Throughput depends on the acceptance rate of proposed tokens.*

Several variants even avoid needing a separate draft model:

- **n-gram / prompt lookup**: search the already-generated text for an earlier occurrence of the last n-gram and propose the tokens that followed it. Nearly free, very effective on repetitive text (code, JSON, RAG).
- **Medusa**: graft several lightweight **heads** onto the large model that predict tokens `+1, +2, +3…` in parallel, verified via a candidate tree.
- **EAGLE**: predict at the **feature** (hidden-state) level rather than the token level, for a higher acceptance rate. **EAGLE-2** adds a **dynamic draft tree**: instead of a fixed structure, it grows the tree based on confidence scores (acceptance depends on context, not only position), gaining 20-40% over EAGLE-1. **EAGLE-3** taps multiple intermediate hidden layers and relaxes the feature-prediction constraint, pushing acceptance rates further (reported speedups on the order of 3-5×).

### Why the acceptance rate is everything

The gain depends entirely on the fraction of proposed tokens the target model accepts. If the draft proposes `k` tokens and the mean acceptance rate is `α`, the expected number of tokens validated per pass of the large model is about `(1 − α^{k+1}) / (1 − α)`. At `α = 0.8` and `k = 4`, you validate ~3.4 tokens per pass instead of one — hence the speedup. But each large-model pass is still more expensive (verifying `k+1` positions) and the draft has its own cost: if `α` is low, speculative decoding can **slow** the service down. Hence the importance of a draft well aligned with the target.

Speculative decoding shines most at **low concurrency** (when per-request latency matters); at high throughput, continuous batching already fills the GPU and the marginal gain shrinks.

## Scaling out: tensor parallelism

When a model doesn't fit on one GPU (or to cut latency), you **split** it across several GPUs. **Tensor parallelism** (TP) distributes each weight matrix across ranks: with TP=8, each GPU holds 1/8 of the weights and computes its share, then ranks synchronize (all-reduce) at each layer. It's complementary to quantization: first reduce the per-GPU size, then parallelize what remains.

## The serving engines

These techniques ship in ready-to-use engines:

- **vLLM** — the open-source reference, native PagedAttention, continuous batching, broad quantization support (GPTQ, AWQ, FP8…), speculative decoding.
- **TensorRT-LLM** (NVIDIA) — top performance on NVIDIA GPUs, optimized kernels, native FP8.
- **TGI** (Text Generation Inference, Hugging Face) — turnkey serving, strong ecosystem integration.
- **SGLang** — geared toward structured LLM programs, very efficient prefix caching (RadixAttention).
- **llama.cpp** — the go-to on CPU / Apple Silicon / edge, the engine behind GGUF.

## Measuring the accuracy / size trade-off

You never quantize blind. Two families of measures:

1. **Perplexity (PPL)** — an intrinsic measure on a reference corpus (WikiText, C4): how "surprised" the model is by real text. Fast to compute, useful for spotting gross degradation, but a small PPL increase doesn't guarantee that **capabilities** are preserved.
- **Task evaluations** — real benchmarks (MMLU, GSM8K, HumanEval, etc.) that measure what truly counts: reasoning, code, knowledge. Slower, but they are ground truth.

A few robust empirical rules of thumb: INT8 and W4A16 (GPTQ/AWQ, group 128) generally degrade very little (often < 1 point on evals); below 4 bits quality falls faster; and **always measure on your own task and your own language** — a quantized model can hold up in English yet degrade more on a less-represented language.

## A memory-budget calculation

Before picking a format, a back-of-the-envelope calculation avoids nasty surprises. Here's a rough estimate of **weight** memory:

```py
def weight_memory_gb(params_billions: float, bits: int) -> float:
    bytes_per_param = bits / 8
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / (1024 ** 3)

for bits in (16, 8, 4):
    gb = weight_memory_gb(7, bits)
    print(f"7B @ {bits} bits: {gb:.1f} GB")
# 7B @ 16 bits: 13.0 GB
# 7B @ 8 bits :  6.5 GB
# 7B @ 4 bits :  3.3 GB
```

On top of that come the **KV cache** (proportional to `batch × context`) and a little overhead. A 24 GB GPU comfortably hosts a 7B in 4 bits with a long context, or a 13B tightly.

## Common pitfalls

- **Quantizing then not measuring**: perplexity alone isn't enough, check task evals.
- **Non-representative calibration set**: it skews the activation-range estimate.
- **Wrong test language**: validating only in English hides degradation on other languages.
- **Granularity too coarse**: a per-tensor scale in 4 bits often crushes quality; prefer `group_size = 128`.
- **Forgetting the KV cache**: on long contexts it can saturate VRAM even with compressed weights.
- **Believing speculative decoding is free at high throughput**: its gain fades once continuous batching already fills the GPU.

## A practical recipe

- **Start** in BF16 to validate behavior, then quantize.
- **GPU, low latency**: weight-only INT4 (AWQ or GPTQ, `group_size=128`) + vLLM/TensorRT-LLM; enable speculative decoding if concurrency is low.
- **GPU, high throughput**: consider W8A8 (SmoothQuant) or FP8 to exploit integer tensor cores, with continuous batching.
- **CPU / Mac / edge**: GGUF, target **Q4_K_M** by default, **Q5_K_M** if quality matters more.
- **Small-budget fine-tuning**: QLoRA (NF4 + double quantization + LoRA adapters).
- **Always**: validate with perplexity **and** task evals, on your data and your languages, before locking in the choice.

In short, quantization attacks the memory bottleneck (fewer bytes per token), while serving techniques (continuous batching, PagedAttention, speculative decoding, parallelism) saturate the GPU and amortize each weight read. It's the combination of the two — a compressed model **and** an efficient engine — that turns an LLM from a costly demo into a viable service at scale.