A pretrained large language model "knows" an enormous amount, yet it cannot spontaneously follow instructions, adopt a tone, or respect your business constraints. **Post-training** — everything that comes after pretraining — turns that reservoir of knowledge into a useful assistant. This article breaks down that chain: supervised fine-tuning (SFT), preference optimization (RLHF then DPO), and the techniques that make fine-tuning affordable (LoRA, QLoRA). We close on the practical question: do you really need to fine-tune, or are a good prompt and RAG enough?

The thread running through the article fits in one sentence: **post-training does not create knowledge, it sculpts behavior**. Pretraining has already filled the model with facts and patterns; post-training chooses which to bring forward, in what form, and within which limits. Keeping this distinction in mind avoids most costly mistakes — starting with fine-tuning to "learn facts" when RAG would do the job better and cheaper.

## The post-training pipeline in three stages

The modern recipe for an aligned assistant chains three distinct stages, each with its own objective and data.

1. **Pretraining**: the base model learns to predict the next token over a massive web corpus. It picks up grammar, facts, implicit reasoning — but it is a *text completer*, not an interlocutor.
2. **SFT (Supervised Fine-Tuning)**: the model is fine-tuned on **demonstrations** (instruction → ideal response pairs), often written by humans. This is *instruction tuning*: the model learns the "you ask, I answer" format.
3. **Preference optimization (alignment)**: learning from **comparisons** (preferred response ≻ rejected response) to refine style, safety, and helpfulness. Two main paths coexist: RLHF (a reward model + reinforcement learning) and DPO (direct optimization, no RL).

![Post-training pipeline: pretraining, then SFT, then preference optimization via RLHF or DPO.](/articles/fine-tuning-et-post-training-lora-dpo-rlhf/post-training-pipeline.svg)
*Figure: the post-training pipeline and its two alignment paths (RLHF vs DPO). Original diagram.*

The key point: these stages are **cumulative**. You start from a base model, run SFT on it, then align by preference on the SFT model. Each stage assumes the previous one — skipping SFT to jump straight to preference almost always yields unstable alignment, because you are optimizing a model that cannot yet hold the dialogue format.

Why this separation? Because the two learning signals are different in nature:

- SFT answers "**what** does a good response look like": it needs *positive* examples (demonstrations), and it learns by imitation, like any classic supervised learning.
- Preference optimization answers "**which** of two responses is better": it needs *comparisons*, a far cheaper signal to produce (judging is easier than writing) that captures nuances of taste no demonstration could express.

This complementarity is the pipeline's strength: you first teach the baseline behavior, then polish it by preference. The three stages can be summarized like this:

| Stage | Signal | Data | What it learns |
| --- | --- | --- | --- |
| Pretraining | Next token | Raw web corpus | Language, facts, implicit reasoning |
| SFT | Imitation | Demonstrations (instruction → response) | The format and baseline behavior |
| Preference | Comparison | Pairs (preferred ≻ rejected) | Fine style, safety, helpfulness |

## Full fine-tuning vs PEFT

Fine-tuning a model, in the simplest case, means **updating all of its weights** by gradient descent. That is *full fine-tuning*. The problem is memory. For a model with 7 to 70 billion parameters, you must hold simultaneously:

- the **weights** themselves;
- the **gradients** (one per weight);
- the **optimizer states** — Adam keeps two moments (mean and variance) per parameter.

In mixed-precision training with Adam, this adds up to roughly **16 bytes per parameter**, about eight times the size of the fp16 weights alone, before counting activations. A small calculation anchors the idea:

```text
7B model, full fine-tuning, Adam, mixed precision:
  weights (fp16)          : 7 G × 2 B  = 14 GB
  gradients (fp16)        : 7 G × 2 B  = 14 GB
  Adam states (m, v, fp32): 7 G × 8 B  = 56 GB
  fp32 master copy        : 7 G × 4 B  = 28 GB
  -----------------------------------------------
  ≈ 112 GB, excluding activations
```

At this rate even a 7B far exceeds a consumer GPU's 24 GB, a 13B needs about 208 GB, and a 70B demands a multi-GPU cluster. The entry cost is thus prohibitive for most teams.

**PEFT** (Parameter-Efficient Fine-Tuning) answers this: you **freeze** the vast majority of weights and train only a small number of added parameters. The benefits cascade:

- **Memory**: no gradients or optimizer states for the frozen weights — that is the bulk of the saving (the optimizer states, the biggest line item above, almost entirely disappear).
- **Storage**: you save only the small adapters (a few MB), not a full model copy per task.
- **Modularity**: you can load/unload adapters on the fly, even serve several on the same shared base model.
- **Iteration speed**: fewer parameters to update = shorter, cheaper experiment cycles.
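
To make the memory bullet concrete, here is a rough estimator that reuses the 16-bytes-per-trainable-parameter accounting from the calculation above. It is a back-of-the-envelope sketch, not a profiler: activations are ignored, and the 40 million adapter parameters are an assumed, illustrative figure.

```python
def training_memory_gb(n_frozen, n_trainable, frozen_bytes=2):
    """Rough training memory in GB, excluding activations.

    Each trainable parameter costs 16 bytes in mixed precision with Adam:
    fp16 weight (2) + fp16 gradient (2) + fp32 master copy (4) + Adam m, v (8).
    Each frozen parameter costs only its storage (fp16 by default).
    """
    bytes_per_trainable = 2 + 2 + 4 + 8
    total_bytes = n_frozen * frozen_bytes + n_trainable * bytes_per_trainable
    return total_bytes / 1e9


# Full fine-tuning of a 7B model: every weight is trainable.
full_ft = training_memory_gb(n_frozen=0, n_trainable=7e9)

# PEFT: the 7B base is frozen, only ~40M adapter parameters train (assumed figure).
peft = training_memory_gb(n_frozen=7e9, n_trainable=40e6)

print(f"full fine-tuning: {full_ft:.1f} GB")
print(f"PEFT, 40M trainable parameters: {peft:.2f} GB")
```

Under these assumptions the frozen fp16 weights dominate the PEFT budget, which is why the remaining lever is the storage precision of the base model itself.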

Beyond LoRA, the PEFT family includes methods like serial *adapters*, *prefix tuning* (learning context "pseudo-tokens"), or *BitFit* (training only the biases). But LoRA has become the de facto standard, thanks to its simplicity/quality balance and its absence of inference overhead. It is the one we detail.

## LoRA: the math of low-rank adaptation

LoRA (*Low-Rank Adaptation*, Hu et al., 2021) rests on a strong, empirically verified hypothesis: **the weight update needed to adapt to a task is intrinsically low-rank**. In other words, `ΔW` — the difference between the fine-tuned weights and the base weights — can be well approximated by a product of two small matrices.

For a frozen weight matrix `W` of size `d × d`, LoRA does not learn `ΔW` directly (that would be `d²` parameters) but factorizes it:

```text
ΔW = B · A
  where  A ∈ ℝ^(r × d)   (Gaussian init.)
         B ∈ ℝ^(d × r)   (zero init.)
         r ≪ d           (low rank, typically 4 to 64)

Forward pass:  h = W·x + (α / r) · B · A · x
```

Let us unpack:

- **Rank `r`** controls the adapter's capacity. `rank(B·A) ≤ r`, so small `r` = few parameters (`2·r·d` instead of `d²`). For GPT-3 175B, the paper reports up to **10,000× fewer** trainable parameters and **3× less** VRAM, at quality on par with or better than full fine-tuning.
- **`α` (alpha)** is a scaling factor. The ratio `α/r` sets the magnitude of the update; in practice one often sets `α = 2r` (or tunes `α/r` like an adapter learning rate).
- **Initialization** is deliberate: `A` is Gaussian, `B` is **zero**. At the start, `B·A = 0`, so the adapter does not perturb the model — training begins exactly from the pretrained model.
- **Which layers?** The original paper applies LoRA to the **query and value** projection matrices of attention. Today's common practice also targets key, output, and the MLP layers — more targets = more capacity, more cost.

![LoRA: the frozen weight W receives an update ΔW factorized into two small matrices B and A of rank r.](/articles/fine-tuning-et-post-training-lora-dpo-rlhf/lora-low-rank.svg)
*Figure: LoRA structure — W + (α/r)·B·A, with only A and B trained. Original diagram.*
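
The forward pass and the initialization can be sketched in a few lines of plain Python. This is a toy illustration with tiny dimensions and list-based matrices (the sizes, seed, and standard deviations are arbitrary choices), meant only to show that a zero-initialized `B` leaves the base model's output untouched.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """h = W·x + (alpha / r) · B · (A · x), with W frozen."""
    r = len(A)                        # rank = number of rows of A
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path, never builds a d x d matrix
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

h_base = matvec(W, x)
h_lora = lora_forward(W, A, B, x, alpha)
trainable = r * d + d * r  # parameters in A plus parameters in B

print(h_lora == h_base)   # the adapter is a no-op at initialization
print(trainable, "trainable parameters vs", d * d, "in W")
```

Once training moves `B` away from zero, the second term starts to contribute while `W` itself never changes.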

A worked figure makes the saving tangible. Take a single matrix with `d = 4096`:

```text
Full fine-tuning of this matrix:  d²     = 4096 × 4096   ≈ 16.8 M parameters
LoRA, rank r = 8:                 2·r·d  = 2 × 8 × 4096  ≈ 65.5 k parameters
                                  → ~256× fewer trainable parameters
```

Decisive advantage: **no added inference latency**. Unlike "serial" adapters, you can **merge** `B·A` into `W` (`W' = W + (α/r)·B·A`) before deployment. The served model has exactly the same shape as the original. And since each adapter weighs only a few megabytes, you can train one per task and swap them — a single base model serving dozens of specializations.
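
The merge itself is a single matrix addition. The sketch below uses a hand-picked 2 × 2 example (all values are illustrative) to check that the merged matrix gives the same output as the base-plus-adapter path.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(x_ik * Y[k][j] for k, x_ik in enumerate(row)) for j in range(len(Y[0]))]
        for row in X
    ]


def merge_lora(W, A, B, alpha):
    """Return W' = W + (alpha / r) · B · A."""
    scale = alpha / len(A)
    BA = matmul(B, A)
    return [
        [w + scale * ba for w, ba in zip(w_row, ba_row)]
        for w_row, ba_row in zip(W, BA)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weights (d = 2)
A = [[0.5, -1.0]]             # r x d, with r = 1
B = [[2.0], [1.0]]            # d x r
alpha = 2
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha)
h_merged = matvec(merged, x)

# Same computation with the adapter kept separate.
scale = alpha / len(A)
h_adapter = [b + scale * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

print(merged)
print(h_merged, h_adapter)
```

Because the merged matrix has the same shape as `W`, the serving stack needs no knowledge that LoRA was ever involved.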

How do you choose the rank? A few practical guidelines:

- For a **narrow task** (a format, a precise style), `r = 4` to `8` is often enough.
- For a **broader adaptation** (a whole domain, a rich behavior), go up to `r = 16` to `64`.
- Increasing `r` further yields diminishing returns and approaches the cost of full fine-tuning without guaranteeing its gain.

Rank is therefore the main capacity/cost lever; you tune it empirically, watching performance on a validation set. A stable heuristic: start at `r = 16`, `α = 32`, target query/key/value/output plus the MLP projections, then adjust by the validation curve rather than blindly.
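
To see how rank and target selection translate into trainable parameters, the helper below counts LoRA parameters for a list of targeted matrices. The layer dimensions are assumptions chosen to resemble a 7B-class decoder (hidden size 4096, MLP size 11008, 32 layers, full-size key/value projections); real architectures differ, for example when key/value projections are smaller.

```python
def lora_param_count(shapes, r):
    """Trainable LoRA parameters for matrices given as (d_out, d_in) pairs.

    Each matrix gets A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out).
    """
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)


# Hypothetical 7B-class dimensions, for illustration only.
d_model, d_ff, n_layers = 4096, 11008, 32
attention = [(d_model, d_model)] * 4                        # query, key, value, output
mlp = [(d_ff, d_model), (d_ff, d_model), (d_model, d_ff)]   # three MLP projections
per_layer = attention + mlp

totals = {}
for r in (4, 8, 16, 64):
    totals[r] = n_layers * lora_param_count(per_layer, r)
    print(f"r = {r:>2}: {totals[r] / 1e6:.1f} M trainable parameters")
```

The count grows linearly with `r`, so doubling the rank doubles adapter size and optimizer state, which is one reason to raise it only when the validation curve justifies it.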

## QLoRA: fine-tuning a 4-bit quantized model

LoRA reduces trainable parameters, but the base model still has to be loaded in memory. **QLoRA** (Dettmers et al., 2023) attacks this last wall by **quantizing the base model to 4 bits**, while keeping the LoRA adapters in higher precision. The gradient backpropagates through the frozen 4-bit model into the adapters. Three innovations make this work without loss of quality:

- **NF4 (4-bit NormalFloat)**: a 4-bit data type that the paper presents as **information-theoretically optimal** for normally distributed weights (which pretrained network weights approximately are). Rather than a linear grid, NF4 places its 16 quantization levels according to the **quantiles of a normal distribution**, so that each bin receives an equal expected number of values. Quantization thus matches the actual shape of the weight distribution.
- **Double quantization**: you **quantize the quantization constants** themselves. Since quantization is done block by block (each block has its own scale constant), those constants end up weighing something; requantizing them shaves off a few more tenths of a bit per parameter.
- **Paged optimizers**: you use NVIDIA unified memory to absorb the optimizer's memory **spikes** (e.g. on long sequences or during *gradient checkpointing*) by *paging* to CPU RAM instead of crashing on an OOM.

Result: QLoRA fine-tunes a **65-billion-parameter model on a single 48 GB GPU**, preserving the performance of 16-bit fine-tuning. The resulting **Guanaco** family reached **99.3%** of ChatGPT's performance on the Vicuna benchmark after only 24 h of training on one GPU. This is what democratized serious fine-tuning on modest hardware.
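
The 48 GB figure can be sanity-checked with a quick count of bits per parameter. The sketch assumes the block sizes reported in the QLoRA paper (64 weights per quantization block, 8-bit second-level constants grouped in blocks of 256); it covers only the frozen base weights, not adapters, optimizer states, or activations.

```python
def bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Storage cost per weight: 4-bit code plus its share of the block constants."""
    weight_bits = 4
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per group of constants.
        const_bits = 8 / block_size + 32 / (block_size * second_block)
    else:
        # One fp32 scale constant per block.
        const_bits = 32 / block_size
    return weight_bits + const_bits


plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)
base_gb = 65e9 * double / 8 / 1e9  # 65B parameters, bits -> bytes -> GB

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"65B base model: about {base_gb:.1f} GB")
```

Under these assumptions the frozen base takes roughly two thirds of a 48 GB card, leaving the rest for adapters, optimizer states, and activations.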

One important nuance: 4-bit quantization concerns only the **storage** of the frozen model. At compute time, the NF4 weights are **dequantized on the fly** to a more precise type (bf16) for the multiplication, then the result combines with the full-precision LoRA adapters. You thus trade a little compute (dequantization) for a massive memory saving — an almost always winning trade-off for fine-tuning.
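
The store-in-4-bit, compute-in-higher-precision cycle can be illustrated with a simplified quantile quantizer. This is a sketch of the idea only: the levels are midpoint quantiles of a standard normal rescaled to [-1, 1], which is not the exact NF4 code book, and the block of weights is made up.

```python
from statistics import NormalDist


def quantile_levels(n_levels=16):
    """Levels at the midpoint quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def quantize_block(block, levels):
    """Absmax-scale a block to [-1, 1], then keep the index of the nearest level."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero on an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Recover approximate weights just before they are used in a matmul."""
    return [levels[c] * scale for c in codes]


levels = quantile_levels()
weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.18, 0.02]  # one made-up block

codes, scale = quantize_block(weights, levels)       # what is stored: 4-bit codes + one scale
restored = dequantize_block(codes, scale, levels)    # what the computation sees
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Only `codes` and `scale` need to live in memory between steps; `restored` is transient, which is the trade of compute for memory described above.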

## SFT and instruction tuning: data first

Before any preference optimization, SFT lays the foundation. You fine-tune the model on `(instruction, response)` pairs that demonstrate the desired behavior: answer politely, follow a format, refuse a dangerous request, reason step by step.

The constant lesson from both literature and practice: **data quality beats quantity**. A few thousand careful, diverse, correct examples beat hundreds of thousands of noisy ones. Things to watch:

- **Diversity** of instructions (tasks, lengths, domains) to avoid overfitting to a single format.
- **Consistency** of response style: the model imitates what it sees, flaws included.
- **Loss masking** on the prompt: the loss is often trained only on the **response tokens**, not on the instruction.
- **Chat format**: respect the model's expected template (system/user/assistant role tags), or you teach a format inconsistent with inference.

Concretely, an SFT example in chat format looks like this; only the assistant's response tokens count toward the loss:

```text
<|system|> You are a concise, factual assistant.
<|user|>   Summarize what LoRA is in one sentence.
<|assistant|> LoRA fine-tunes a model by training only two small
              low-rank matrices, while the rest of the weights stay frozen.
   ▲ loss computed only here (response tokens)
```
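
In code, loss masking usually comes down to a label array in which prompt positions carry a sentinel value. The sketch below uses `-100`, a common convention for an ignore index, and made-up token ids and log-probabilities; it also skips the one-position shift between inputs and next-token targets to keep the idea visible.

```python
import math

IGNORE_INDEX = -100  # sentinel meaning "do not count this position" (assumed convention)


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask every prompt position in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked (response) positions only.

    token_log_probs[i] is the log-probability the model gave to the target at position i.
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


prompt_ids = [101, 7, 8, 9]   # hypothetical system + user tokens
response_ids = [42, 43, 2]    # hypothetical assistant tokens

input_ids, labels = build_labels(prompt_ids, response_ids)
token_log_probs = [math.log(0.1)] * 4 + [math.log(0.5), math.log(0.25), math.log(0.5)]
loss = masked_nll(token_log_probs, labels)

print(labels)
print(f"loss over response tokens only: {loss:.4f}")
```

The poor log-probabilities on the prompt positions have no effect on the loss, so the model is never pushed to reproduce the instruction itself.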

Where does this data come from? Three sources, often blended:

- **Human annotation**: the most expensive, but the most reliable for tone and safety.
- **Recycled existing data**: support tickets, FAQs, transcripts, reframed as instruction/response pairs.
- **Distillation**: generate responses with a stronger model, then filter the best — cheap, but mind licensing terms and the propagation of the teacher model's biases.

## RLHF: reward model + PPO, à la InstructGPT

After SFT, how do you encode subtle preferences ("this answer is more helpful than that one") that you cannot write as a demonstration? The historical answer: **RLHF** (*Reinforcement Learning from Human Feedback*), popularized by **InstructGPT** (Ouyang et al., 2022). Three steps:

1. **SFT** (already seen): the starting point.
2. **Reward model (RM)**: humans **rank** several model responses to the same prompt. A separate model (the RM) is trained to predict a scalar preference score, via a Bradley-Terry model over the pairs.
3. **RL with PPO**: optimize the policy (the LLM) to **maximize the reward** predicted by the RM, with a **KL penalty** that keeps the policy from drifting too far from the SFT model (otherwise it "cheats" by exploiting the RM's flaws — *reward hacking*).

The reward model itself is trained with a simple preference loss that pushes the preferred response `y_w`'s score above the rejected `y_l`'s:

```text
L_RM = − log σ( r(x, y_w) − r(x, y_l) )

  r(x, y) : scalar score predicted by the RM
  σ       : sigmoid (Bradley-Terry model)
```
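
As a minimal sketch, the loss above can be written for a single pair of scalar scores. The scores here are hand-picked numbers standing in for the output of a reward model.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -log_sigmoid(score_chosen - score_rejected)


undecided = reward_model_loss(1.0, 1.0)  # RM scores both responses equally
correct = reward_model_loss(2.0, 0.0)    # RM ranks the preferred response higher
wrong = reward_model_loss(0.0, 2.0)      # RM ranks the rejected response higher

print(f"{correct:.3f} < {undecided:.3f} < {wrong:.3f}")
```

Only the difference between the two scores matters, so the reward is defined up to an additive constant per prompt.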

The RL phase objective then combines reward and KL penalty:

```text
maximize  E[ r(x, y) ]  −  β · KL( π_θ(·|x) ‖ π_ref(·|x) )
```
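
For a response sampled from the policy, the KL term can be estimated by the log-probability gap between policy and reference. The sketch below shows this single-sample form with made-up numbers; averaging it over many sampled responses approximates the objective above.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Single-sample estimate: r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return reward - beta * (logp_policy - logp_ref)


# The policy assigns this response a higher log-probability than the SFT reference does.
drifted = kl_penalized_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)

# Policy and reference agree: no penalty is applied.
unchanged = kl_penalized_reward(reward=1.0, logp_policy=-12.0, logp_ref=-12.0, beta=0.1)

print(drifted, unchanged)
```

A larger `β` makes every unit of drift more expensive, which is the knob that trades reward gains against staying close to the SFT model.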

The striking result: a **1.3-billion**-parameter InstructGPT was **preferred** over the **175B** GPT-3 by human evaluators, despite 100× fewer parameters. Alignment matters as much as scale.

The downside: the RLHF pipeline is heavy. Concretely, you must:

- keep **several models** in memory at once (the trained policy, the reward model, the reference model for the KL penalty, and PPO's value critic);
- **sample** responses from the policy *inside* the training loop, which is slow;
- cope with **unstable** RL that is highly sensitive to hyperparameters (learning rate, KL coefficient, PPO clipping window).

On top of this comes the risk of *reward hacking*: the policy finds responses that maximize the RM's score without truly being better (overly long, sycophantic answers, or ones exploiting a blind spot in the annotation). The KL penalty curbs this, but does not eliminate it. This complexity is what motivated the search for simpler alternatives — and DPO is the most notable.

## DPO: direct preference optimization

**DPO** (*Direct Preference Optimization*, Rafailov et al., 2023) starts from an elegant observation: the constrained optimization problem that RLHF solves with PPO admits a **closed-form analytical solution**. The optimal policy is `π*(y|x) ∝ π_ref(y|x) · exp( r(x,y) / β )`. Inverting this relation expresses the **implicit reward** as a function of the policy itself — hence the paper's subtitle, "your language model is secretly a reward model":

```text
r(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )  +  β · log Z(x)
```

Plugging this implicit reward into the Bradley-Terry model, the partition constant `Z(x)` **cancels** between the two responses of the same pair. The consequence: **you no longer need to train a separate RM or run RL**. The loss becomes a simple **supervised classification** over preference pairs:

```text
For a pair (y_w preferred, y_l rejected) and a prompt x:

L_DPO = − log σ(  β · [ log( π_θ(y_w|x) / π_ref(y_w|x) )
                       − log( π_θ(y_l|x) / π_ref(y_l|x) ) ]  )

  π_θ   : the trained policy
  π_ref : the reference policy (the frozen SFT model)
  β     : tempers the tolerated drift from π_ref (the role of the KL penalty)
  σ     : sigmoid (Bradley-Terry model)
```
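
Written for one pair, the loss takes four sequence log-probabilities as input. The following is a minimal sketch with made-up values, where each `log π(y|x)` stands for the sum of token log-probabilities of a full response.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_w - ref_w     # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_l - ref_l   # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At the start of training the policy equals the reference: both log-ratios are zero.
at_init = dpo_loss(policy_w=-20.0, policy_l=-25.0, ref_w=-20.0, ref_l=-25.0)

# After some training the policy has shifted mass toward the preferred response.
improved = dpo_loss(policy_w=-15.0, policy_l=-30.0, ref_w=-20.0, ref_l=-25.0)

print(f"at init: {at_init:.4f}, after improvement: {improved:.4f}")
```

Note that the reference log-probabilities are fixed numbers that can be precomputed once per pair, since the reference model is frozen.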

Intuitively: DPO **raises** the probability of the preferred response and **lowers** that of the rejected one, both measured *relative to* the reference model rather than in absolute terms. The gradient automatically weighs harder on the pairs the model currently gets most wrong (where its implicit reward ranks the rejected response above the preferred one), giving a self-adjusting learning signal. No RL loop, no RM, just a gradient pass over pairs. The paper shows DPO **matches or beats** PPO on sentiment control, summarization, and dialogue, while being **much simpler and more stable** to train.
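
The self-adjusting weight can be made explicit: differentiating the DPO loss shows that each pair's gradient is scaled by the sigmoid of the negated margin. The sketch below computes that per-pair factor for made-up log-probabilities (the overall gradient also carries a constant factor `β`, omitted here).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_pair_weight(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Per-pair gradient scale: sigmoid of minus the implicit reward margin."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return sigmoid(-margin)


neutral = dpo_pair_weight(-20.0, -25.0, -20.0, -25.0)    # policy equals reference
mistaken = dpo_pair_weight(-30.0, -15.0, -20.0, -25.0)   # implicit reward favors the rejected response
confident = dpo_pair_weight(-15.0, -30.0, -20.0, -25.0)  # implicit reward already favors the preferred one

print(f"mistaken: {mistaken:.3f}, neutral: {neutral:.3f}, confident: {confident:.3f}")
```

Pairs the model already ranks correctly with a wide margin contribute little, so training effort concentrates on the remaining errors.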

The concrete advantages of DPO:

- **Only two models** in memory (the policy and the frozen reference), versus four for full RLHF.
- **No in-loop sampling**: you train directly on a fixed set of pairs.
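- **Stable, supervised-style training**: an ordinary gradient-descent loop over a fixed dataset, without the in-loop sampling noise and hyperparameter fragility of PPO.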

Its limits: DPO learns offline on fixed preferences, whereas online RLHF can keep exploring; and the choice of `β` remains sensitive (too small, the model drifts; too large, it barely learns). In practice, DPO has become the **pragmatic default** for preference alignment — often combined with LoRA to stay frugal, yielding a full SFT → DPO loop entirely doable on a single GPU. A whole family of variants has followed (IPO, KTO, ORPO, SimPO) to fix pathological cases of DPO, but the central idea — optimizing preference without RL — stays the same.

## Beyond: RLAIF, Constitutional AI, GRPO

The family has grown:

- **RLAIF / Constitutional AI**: replace (in part) human feedback with **AI** feedback. A critic model evaluates and revises responses according to a "constitution" — a set of explicit written principles. The appeal is threefold:
  - **cost**: large-scale annotation becomes nearly free;
  - **consistency**: principles are applied uniformly, without the variance of human annotators;
  - **safety**: you avoid exposing humans to toxic content to label it.
  The risk is that the critic model's biases and blind spots propagate; so a core of human supervision is often kept.
- **GRPO** (*Group Relative Policy Optimization*, DeepSeekMath, 2024): a PPO variant that **drops the value critic**. For each prompt, you sample a **group** of `G` responses, compute each one's reward, and **normalize within the group** to estimate each response's relative advantage:
  ```text
  For a group of rewards {r_1, …, r_G}:
    A_i = ( r_i − mean(r) ) / std(r)
  ```
  The advantage `A_i` replaces the estimate produced by PPO's value critic. Advantages:
  - **less memory**: one fewer model to train (no separate critic);
  - **more stability**: normalization dampens noisy or sparse rewards;
  - **ideal for verifiable rewards**: in math or code, you can automatically tell whether an answer is correct, with no annotator.
  It is the algorithm behind reasoning models like DeepSeek-R1.
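
The group normalization used by GRPO fits in a few lines. In this sketch the population standard deviation and the small `eps` added to the denominator are assumptions (the formula above does not specify either), and the rewards are a made-up example of a verifiable 0/1 signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r), computed within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids a division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt: 1.0 if verified correct, else 0.0.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If all answers score the same, no response stands out and the signal vanishes.
flat = group_advantages([1.0, 1.0, 1.0])
print(flat)
```

The second case shows a practical consequence: prompts that are always solved or never solved contribute no learning signal to the group.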

## Where to find preference data

Preference optimization (DPO as well as RLHF) needs **pairs** (preferred response, rejected response). You typically obtain them as follows:

- **Generation + human ranking**: sample several responses from the SFT model for one prompt, then have humans order them. This is the reference source for tone and safety.
- **Pairs from logs**: a regenerated response the user accepted (thumbs up) vs one they rejected (thumbs down) — a free, continuous signal in production.
- **Synthetic pairs (RLAIF)**: a judge model produces the preference verdict at scale, multiplying volume at the cost of a possible judge bias.

Prompt diversity and the reliability of the preference signal matter more than raw volume: ambiguous or contradictory pairs (where even a human hesitates) only provide a noisy gradient and can degrade the model. A smaller but clearly decided set of pairs is better.
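
Turning a human ranking into training pairs, then dropping the undecided ones, can be sketched as follows. The `prompt`/`chosen`/`rejected` field names and the 0.8 agreement threshold are illustrative assumptions, not a fixed standard.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking of K responses into K*(K-1)/2 preference pairs."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


def keep_decisive(voted_pairs, min_agreement=0.8):
    """Keep pairs where enough annotators agreed with the 'chosen' side.

    voted_pairs: list of (pair, votes_for_chosen, total_votes).
    """
    return [
        pair
        for pair, votes_for_chosen, total in voted_pairs
        if votes_for_chosen / total >= min_agreement
    ]


pairs = ranking_to_pairs("Explain LoRA briefly.", ["answer A", "answer B", "answer C"])
print(len(pairs), "pairs from one ranking of 3 responses")

# Hypothetical annotator votes: the second pair is a near coin flip and gets dropped.
voted = [(pairs[0], 5, 5), (pairs[1], 3, 5), (pairs[2], 4, 5)]
decisive = keep_decisive(voted)
print(len(decisive), "pairs kept after filtering")
```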

## To fine-tune, or not? Fine-tuning vs RAG vs prompt engineering

This is the most important decision — and the most often botched. General rule:

- **Prompt engineering** first: cheapest, instant, no training. Enough for many cases (format, tone, one-shot/few-shot tasks).
- **RAG** (Retrieval-Augmented Generation) for **knowledge**: if the need is "the model must know MY documents / up-to-date facts," RAG injects the context at query time. Fine-tuning does not reliably teach new facts — and facts change.
- **Fine-tuning** for **behavior**: if the need is "the model must *behave* a certain way" (brand style, strict format, specialized task, lower latency/cost via a small specialized model), fine-tuning is the right tool.

| Need | Recommended tool | Why |
| --- | --- | --- |
| Format, tone, one-off examples | Prompt engineering | Instant, zero training |
| Up-to-date facts, private documents | RAG | Context changes without retraining |
| Stable style, strict format, niche task | Fine-tuning | Behavior is baked into the weights |
| Cheap small specialized model | Fine-tuning (+ distillation) | Lower latency and inference cost |

A mnemonic: **RAG for what the model must know, fine-tuning for how it must act.** The two combine very well — a model fine-tuned for business behavior that queries an up-to-date RAG store is a very common pattern.

## Pitfalls: catastrophic forgetting, eval, data quality

- **Catastrophic forgetting**: overly aggressive fine-tuning erodes general capabilities (the model "forgets" what it knew outside your task). PEFT/LoRA mitigates the risk, since the base weights stay frozen; keep a moderate learning rate, limit the number of epochs, and mix in general data if needed.
- **Evaluation**: without a clear benchmark (ideally automatable) defined *before* you fine-tune, you will not know if you are improving. Measure on a held-out validation set, never on the training examples, and always compare against the base model and a simple prompt baseline.
- **Data quality and leakage**: noisy labels are learned as such; train/test leakage artificially inflates scores. Deduplicate, hand-check a sample, and always favor quality over volume.
- **Over-specialization**: a model fine-tuned on a narrow distribution becomes brittle outside it. Include adversarial examples and edge cases in the training set.
- **Mis-tuned LoRA hyperparameters**: too large an `α/r` makes training diverge; too small, and the adapter barely learns. Watch the validation loss, not just the training loss.
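
For the deduplication and leakage checks, a simple first pass is to compare normalized fingerprints of the examples. This sketch only catches exact matches after lowercasing and whitespace normalization; near-duplicates and paraphrases would need fuzzier methods, which are out of scope here.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(examples):
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen, unique = set(), []
    for example in examples:
        fp = fingerprint(example)
        if fp not in seen:
            seen.add(fp)
            unique.append(example)
    return unique


def remove_leakage(train, test):
    """Drop training examples whose fingerprint also appears in the test set."""
    test_fingerprints = {fingerprint(t) for t in test}
    return [ex for ex in train if fingerprint(ex) not in test_fingerprints]


train = ["Summarize  this ticket.", "summarize this ticket.", "Translate to French."]
test = ["TRANSLATE to french."]

clean_train = remove_leakage(deduplicate(train), test)
print(clean_train)
```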

### An end-to-end recipe, in brief

To anchor the ideas, a modern, affordable loop looks like this:

1. Load the base model **quantized to 4-bit** (NF4).
2. Run **SFT with LoRA** on a few thousand quality demonstrations.
3. Run **DPO with LoRA** on preference pairs to polish style and safety.
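4. **Merge** the adapters into the weights (ideally into a 16-bit copy of the base model, since merging into 4-bit weights adds requantization error) and **evaluate** on a held-out set.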
5. Deploy — or iterate if the eval is inconclusive.

The whole thing fits on a single GPU, which was unthinkable before LoRA/QLoRA and DPO.
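
As one possible starting point for steps 1 and 2, the sketch below loads a 4-bit NF4 base model and attaches LoRA adapters using the Hugging Face `transformers` and `peft` libraries. It is an untested outline under several assumptions: `model_id` is a placeholder for your checkpoint, the `target_modules` names follow a LLaMA-style naming scheme and vary by architecture, and the dropout value is arbitrary. Training itself (SFT, then DPO) is left out.

```python
def load_qlora_model(model_id, rank=16, alpha=32):
    """Load a 4-bit NF4 base model and wrap it with trainable LoRA adapters.

    Heavy dependencies are imported lazily so this sketch can be read and
    imported on a machine that does not have them installed.
    """
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Step 1: frozen base model stored in 4-bit NF4, computed in bf16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # Step 2: LoRA adapters on the attention projections (names are architecture-specific).
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=0.05,  # assumed value, tune as needed
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)
```

Calling the returned model's `print_trainable_parameters()` is a quick way to confirm that only the adapters will be updated.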

In short: post-training is a **chain** (SFT → preference) that LoRA/QLoRA make affordable, and where DPO has greatly simplified alignment compared to classic RLHF. But the best optimization is still to **not fine-tune when a prompt and RAG are enough**.