A pretrained large language model "knows" an enormous amount, yet it cannot spontaneously follow instructions, adopt a tone, or respect your business constraints. Post-training — everything that comes after pretraining — turns that reservoir of knowledge into a useful assistant. This article breaks down that chain: supervised fine-tuning (SFT), preference optimization (RLHF then DPO), and the techniques that make fine-tuning affordable (LoRA, QLoRA). We close on the practical question: do you really need to fine-tune, or are a good prompt and RAG enough?

The thread running through the article fits in one sentence: post-training does not create knowledge, it sculpts behavior. Pretraining has already filled the model with facts and patterns; post-training chooses which to bring forward, in what form, and within which limits. Keeping this distinction in mind avoids most costly mistakes — starting with fine-tuning to "learn facts" when RAG would do the job better and cheaper.

The post-training pipeline in three stages

The modern recipe for an aligned assistant chains three distinct stages, each with its own objective and data.

  1. Pretraining: the base model learns to predict the next token over a massive web corpus. It picks up grammar, facts, implicit reasoning — but it is a text completer, not an interlocutor.
  2. SFT (Supervised Fine-Tuning): the model is fine-tuned on demonstrations (instruction → ideal response pairs), often written by humans. This is instruction tuning: the model learns the "you ask, I answer" format.
  3. Preference optimization (alignment): learning from comparisons (preferred response ≻ rejected response) to refine style, safety, and helpfulness. Two main paths coexist: RLHF (a reward model + reinforcement learning) and DPO (direct optimization, no RL).

[Figure: the post-training pipeline (pretraining, then SFT, then preference optimization) and its two alignment paths (RLHF vs DPO). Original diagram.]

The key point: these stages are cumulative. You start from a base model, run SFT on it, then align by preference on the SFT model. Each stage assumes the previous one — skipping SFT to jump straight to preference almost always yields unstable alignment, because you are optimizing a model that cannot yet hold the dialogue format.

Why this separation? Because the two learning signals are different in nature:

  • SFT answers "what does a good response look like": it needs positive examples (demonstrations), and it learns by imitation, like any classic supervised learning.
  • Preference optimization answers "which of two responses is better": it needs comparisons, a far cheaper signal to produce (judging is easier than writing) that captures nuances of taste no demonstration could express.

This complementarity is the pipeline's strength: you first teach the baseline behavior, then polish it by preference. The three stages can be summarized like this:

Stage         Signal        Data                                      What it learns
--------------------------------------------------------------------------------------------------
Pretraining   Next token    Raw web corpus                            Language, facts, implicit reasoning
SFT           Imitation     Demonstrations (instruction → response)   The format and baseline behavior
Preference    Comparison    Pairs (preferred ≻ rejected)              Fine style, safety, helpfulness

Full fine-tuning vs PEFT

Fine-tuning a model, in the simplest case, means updating all of its weights by gradient descent. That is full fine-tuning. The problem is memory. For a model with 7 to 70 billion parameters, you must hold simultaneously:

  • the weights themselves;
  • the gradients (one per weight);
  • the optimizer states — Adam keeps two moments (mean and variance) per parameter.

With mixed-precision Adam, this comes to roughly 16 bytes per parameter, about eight times the size of the 16-bit weights alone, before counting activations. A small calculation anchors the idea:

7B model, full fine-tuning, Adam, mixed precision:
  weights (fp16)          : 7 G × 2 B  = 14 GB
  gradients (fp16)        : 7 G × 2 B  = 14 GB
  Adam states (m, v, fp32): 7 G × 8 B  = 56 GB
  fp32 master copy        : 7 G × 4 B  = 28 GB
  -----------------------------------------------
  ≈ 112 GB, excluding activations

Even the 7B case is far beyond a consumer GPU's 24 GB; a 13B roughly doubles the bill, and a 70B demands a cluster. The entry cost is thus prohibitive for most teams.
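The arithmetic above can be replayed for other model sizes with a few lines of Python. This is a minimal sketch: the function name and its keyword arguments are illustrative, the byte counts are the ones from the calculation above, 1 GB is taken as 10⁹ bytes, and activations are ignored.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2,
                            adam_bytes=8, master_bytes=4):
    """Estimate training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    per_param = {
        "weights": weight_bytes,      # fp16 working copy
        "gradients": grad_bytes,      # fp16, one per weight
        "adam_states": adam_bytes,    # two fp32 moments (m and v)
        "master_copy": master_bytes,  # fp32 master weights
    }
    breakdown = {name: n_params * b / 1e9 for name, b in per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


for n_params in (7e9, 13e9, 70e9):
    total = full_finetune_memory_gb(n_params)["total"]
    print(f"{n_params / 1e9:.0f}B parameters: about {total:.0f} GB")
```

Under these assumptions the estimate is about 112 GB for 7B, 208 GB for 13B, and 1,120 GB for 70B.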

PEFT (Parameter-Efficient Fine-Tuning) answers this: you freeze the vast majority of weights and train only a small number of added parameters. The benefits cascade:

  • Memory: no gradients or optimizer states for the frozen weights — that is the bulk of the saving (the optimizer states, the biggest line item above, almost entirely disappear).
  • Storage: you save only the small adapters (a few MB), not a full model copy per task.
  • Modularity: you can load/unload adapters on the fly, even serve several on the same shared base model.
  • Iteration speed: fewer parameters to update = shorter, cheaper experiment cycles.

Beyond LoRA, the PEFT family includes methods like serial adapters, prefix tuning (learning context "pseudo-tokens"), or BitFit (training only the biases). But LoRA has become the de facto standard, thanks to its simplicity/quality balance and its absence of inference overhead. It is the one we detail.

LoRA: the math of low-rank adaptation

LoRA (Low-Rank Adaptation, Hu et al., 2021) rests on a strong, empirically verified hypothesis: the weight update needed to adapt to a task is intrinsically low-rank. In other words, ΔW — the difference between the fine-tuned weights and the base weights — can be well approximated by a product of two small matrices.

For a frozen weight matrix W of size d × d, LoRA does not learn ΔW directly (that would be d² parameters) but factorizes it:

ΔW = B · A
  where  A ∈ ℝ^(r × d)   (Gaussian init.)
         B ∈ ℝ^(d × r)   (zero init.)
         r ≪ d           (low rank, typically 4 to 64)

Forward pass:  h = W·x + (α / r) · B · A · x

Let us unpack:

  • Rank r controls the adapter's capacity. rank(B·A) ≤ r, so small r = few parameters (2·r·d instead of d²). For GPT-3 175B, the paper reports up to 10,000× fewer trainable parameters and 3× less VRAM, at quality on par with or better than full fine-tuning.
  • α (alpha) is a scaling factor. The ratio α/r sets the magnitude of the update; in practice one often sets α = 2r (or tunes α/r like an adapter learning rate).
  • Initialization is deliberate: A is Gaussian, B is zero. At the start, B·A = 0, so the adapter does not perturb the model — training begins exactly from the pretrained model.
  • Which layers? The original paper applies LoRA to the query and value projection matrices of attention. Today's common practice also targets key, output, and the MLP layers — more targets = more capacity, more cost.

[Figure: LoRA structure. The frozen weight W receives an update ΔW factorized into two small matrices B and A of rank r: W + (α/r)·B·A, with only A and B trained. Original diagram.]

A worked figure makes the saving tangible. Take a single matrix with d = 4096:

Full fine-tuning of this matrix:  d²     = 4096 × 4096   ≈ 16.8 M parameters
LoRA, rank r = 8:                 2·r·d  = 2 × 8 × 4096  ≈ 65.5 k parameters
                                  → ~256× fewer trainable parameters
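The same count can be checked programmatically. In this small sketch the helper names are illustrative, and the matrix is allowed to be rectangular (d_out × d_in) so the formula can be reused on MLP projections.

```python
def full_param_count(d_in, d_out):
    """Parameters updated when the whole d_out x d_in matrix is trained."""
    return d_in * d_out


def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


d, r = 4096, 8
full = full_param_count(d, d)
lora = lora_param_count(d, d, r)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {full // lora}x")
```

Because the LoRA count grows linearly in r, doubling the rank doubles the adapter size and halves the ratio.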

Decisive advantage: no added inference latency. Unlike "serial" adapters, you can merge B·A into W (W' = W + (α/r)·B·A) before deployment. The served model has exactly the same shape as the original. And since each adapter weighs only a few megabytes, you can train one per task and swap them — a single base model serving dozens of specializations.
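Two properties claimed above can be checked numerically on a toy example: the adapter is a no-op at initialization (B = 0), and merging B·A into W gives the same output as keeping the adapter separate. This is a pure-Python sketch with tiny, arbitrary sizes and random values standing in for trained weights; a real implementation would use a tensor library.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrix P by matrix Q (both lists of rows)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def lora_forward(W, A, B, x, alpha, r):
    """h = W.x + (alpha / r) . B . (A . x), with W frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) . B . A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]


def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


rng = random.Random(0)
d, r, alpha = 6, 2, 4  # toy sizes, chosen arbitrarily
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, Gaussian
B = [[0.0] * r for _ in range(d)]                                # d x r, zero
x = [rng.gauss(0, 1) for _ in range(d)]

# At initialization the adapter contributes nothing.
diff_at_init = max_abs_diff(lora_forward(W, A, B, x, alpha, r), matvec(W, x))

# Random values stand in for a trained B; merged and unmerged outputs agree.
B = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
diff_after_merge = max_abs_diff(
    lora_forward(W, A, B, x, alpha, r),
    matvec(merge(W, A, B, alpha, r), x),
)
print(diff_at_init, diff_after_merge)
```

Both differences should be zero up to floating-point rounding, which is why the merged model can be served with the original architecture.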

How do you choose the rank? A few practical guidelines:

  • For a narrow task (a format, a precise style), r = 4 to 8 is often enough.
  • For a broader adaptation (a whole domain, a rich behavior), go up to r = 16 to 64.
  • Increasing r further yields diminishing returns and approaches the cost of full fine-tuning without guaranteeing its gain.

Rank is therefore the main capacity/cost lever; you tune it empirically, watching performance on a validation set. A reasonable starting point: r = 16, α = 32, targeting the query/key/value/output projections plus the MLP projections, then adjusting based on the validation curve rather than blindly.

QLoRA: fine-tuning a 4-bit quantized model

LoRA reduces trainable parameters, but the base model still has to be loaded in memory. QLoRA (Dettmers et al., 2023) attacks this last wall by quantizing the base model to 4 bits, while keeping the LoRA adapters in higher precision. The gradient backpropagates through the frozen 4-bit model into the adapters. Three innovations make this work without loss of quality:

  • NF4 (4-bit NormalFloat): a 4-bit data type that is information-theoretically optimal for normally distributed values, which pretrained network weights approximately are. Rather than a linear grid, NF4 places its 16 quantization levels according to the quantiles of a normal distribution, so that each bin receives an equal expected number of values; quantization thus matches the actual shape of the weight distribution.
  • Double quantization: you quantize the quantization constants themselves. Since quantization is done block by block (each block has its own scale constant), those constants end up weighing something; requantizing them shaves off a few more tenths of a bit per parameter.
  • Paged optimizers: you use NVIDIA unified memory to absorb the optimizer's memory spikes (e.g. on long sequences or during gradient checkpointing) by paging to CPU RAM instead of crashing on an OOM.
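The gain from double quantization is easy to quantify. The sketch below uses the settings reported in the QLoRA paper, which this article does not state: blocks of 64 weights, fp32 scale constants, and a second quantization of those constants to 8 bits in groups of 256. Treat these numbers as assumptions taken from that paper; the helper name is illustrative.

```python
def scale_overhead_bits(weights_per_constant, bits_per_constant):
    """Extra bits per weight spent on storing quantization constants."""
    return bits_per_constant / weights_per_constant


# One fp32 scale constant per block of 64 weights.
single = scale_overhead_bits(64, 32)

# Scales requantized to 8 bits, plus one fp32 constant per 256 scales.
double = scale_overhead_bits(64, 8) + scale_overhead_bits(64 * 256, 32)

saving = single - double
print(f"single: {single:.3f}  double: {double:.3f}  saving: {saving:.3f} bits/weight")
```

On a 65B model, a saving of about 0.37 bits per weight is roughly 3 GB, which matters when the target is a single 48 GB GPU.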

Result: QLoRA fine-tunes a 65-billion-parameter model on a single 48 GB GPU, preserving the performance of 16-bit fine-tuning. The resulting Guanaco family reached 99.3% of ChatGPT's performance on the Vicuna benchmark after only 24 h of training on one GPU. This is what democratized serious fine-tuning on modest hardware.

One important nuance: 4-bit quantization concerns only the storage of the frozen model. At compute time, the NF4 weights are dequantized on the fly to a more precise type (bf16) for the multiplication, then the result combines with the full-precision LoRA adapters. You thus trade a little compute (dequantization) for a massive memory saving — an almost always winning trade-off for fine-tuning.
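To make the store-in-4-bit, compute-in-higher-precision idea concrete, here is a simplified blockwise quantizer. It is not the exact NF4 codebook: the published construction differs in its details (for example, it represents zero exactly). Here the 16 levels are simply the midpoints of 16 equal-probability bins of a standard normal, rescaled to [-1, 1]. All names are illustrative.

```python
import random
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Levels at the midpoints of equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in raw)
    return [q / top for q in raw]


LEVELS = normal_quantile_levels()


def quantize_block(block):
    """Return one 4-bit code per weight plus the block's absmax scale constant."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate weights on the fly, as done at compute time."""
    return [LEVELS[c] * absmax for c in codes]


rng = random.Random(0)
block = [rng.gauss(0, 0.02) for _ in range(64)]  # toy block of 64 weights
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"absmax={absmax:.4f}  max reconstruction error={max_err:.4f}")
```

The levels are dense near zero, where most weights live, and sparse in the tails. That is the property the quantile-based design is after.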

SFT and instruction tuning: data first

Before any preference optimization, SFT lays the foundation. You fine-tune the model on (instruction, response) pairs that demonstrate the desired behavior: answer politely, follow a format, refuse a dangerous request, reason step by step.

The constant lesson from both literature and practice: data quality beats quantity. A few thousand careful, diverse, correct examples beat hundreds of thousands of noisy ones. Things to watch:

  • Diversity of instructions (tasks, lengths, domains) to avoid overfitting to a single format.
  • Consistency of response style: the model imitates what it sees, flaws included.
  • Loss masking on the prompt: the loss is often computed only on the response tokens, not on the instruction.
  • Chat format: respect the model's expected template (system/user/assistant role tags), or you teach a format inconsistent with inference.

Concretely, an SFT example in chat format looks like this; only the assistant's response tokens count toward the loss:

<|system|> You are a concise, factual assistant.
<|user|>   Summarize what LoRA is in one sentence.
<|assistant|> LoRA fine-tunes a model by training only two small
              low-rank matrices, while the rest of the weights stay frozen.
   ▲ loss computed only here (response tokens)
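The masking shown above can be sketched in a few lines. The token ids and log-probabilities below are made up for illustration, and the -100 sentinel mirrors the default ignore_index of PyTorch's cross-entropy loss. The sketch assumes token_logprobs[i] is already the log-probability the model assigns to token i given the preceding tokens, so the usual next-token shift is left out.

```python
import math

IGNORE_INDEX = -100  # same sentinel as PyTorch's default ignore_index


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over unmasked (response) tokens only."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


prompt_ids = [11, 12, 13]  # hypothetical ids for system + user turns
response_ids = [21, 22]    # hypothetical ids for the assistant turn
input_ids, labels = build_labels(prompt_ids, response_ids)

# Made-up per-token log-probabilities: poor on the prompt, better on the response.
token_logprobs = [math.log(0.1)] * 3 + [math.log(0.5), math.log(0.25)]
loss = masked_nll(token_logprobs, labels)
print(labels, round(loss, 4))
```

The poorly predicted prompt tokens have no effect on the loss; only the two response tokens are averaged.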

Where does this data come from? Three sources, often blended:

  • Human annotation: the most expensive, but the most reliable for tone and safety.
  • Recycled existing data: support tickets, FAQs, transcripts, reframed as instruction/response pairs.
  • Distillation: generate responses with a stronger model, then filter the best — cheap, but mind licensing terms and the propagation of the teacher model's biases.

RLHF: reward model + PPO, à la InstructGPT

After SFT, how do you encode subtle preferences ("this answer is more helpful than that one") that you cannot write as a demonstration? The historical answer: RLHF (Reinforcement Learning from Human Feedback), popularized by InstructGPT (Ouyang et al., 2022). Three steps:

  1. SFT (already seen): the starting point.
  2. Reward model (RM): humans rank several model responses to the same prompt. A separate model (the RM) is trained to predict a scalar preference score, via a Bradley-Terry model over the pairs.
  3. RL with PPO: optimize the policy (the LLM) to maximize the reward predicted by the RM, with a KL penalty that keeps the policy from drifting too far from the SFT model (otherwise it "cheats" by exploiting the RM's flaws — reward hacking).

The reward model itself is trained with a simple preference loss that pushes the preferred response y_w's score above the rejected y_l's:

L_RM = − log σ( r(x, y_w) − r(x, y_l) )

  r(x, y) : scalar score predicted by the RM
  σ       : sigmoid (Bradley-Terry model)
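This loss is short enough to write out directly. The sketch below works on plain floats (a real reward model would produce the scores r(x, y) from a network); the function names are illustrative, and log σ is computed in a numerically stable form.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(r_chosen, r_rejected):
    """L_RM = -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -log_sigmoid(r_chosen - r_rejected)


print(reward_model_loss(2.0, 0.0))  # RM agrees with the annotator: small loss
print(reward_model_loss(1.0, 1.0))  # RM is indifferent: log(2)
print(reward_model_loss(0.0, 2.0))  # RM disagrees with the annotator: large loss
```

Only the difference between the two scores matters, so the reward is defined up to an additive constant per prompt.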

The RL phase objective then combines reward and KL penalty:

maximize  E[ r(x, y) ]  −  β · KL( π_θ(·|x) ‖ π_ref(·|x) )
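A toy example shows how β arbitrates between reward and drift. Here a "policy" is just a probability distribution over three candidate responses to one prompt; the distributions, rewards, and β values are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_objective(policy, reference, rewards, beta):
    """E[r] - beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


reference = [0.5, 0.3, 0.2]   # SFT model's probabilities (made up)
rewards = [0.0, 1.0, 2.0]     # reward model scores (made up)
greedy = [0.01, 0.01, 0.98]   # policy that piles onto the top-reward response

for beta in (0.1, 2.0):
    stay = rlhf_objective(reference, reference, rewards, beta)
    drift = rlhf_objective(greedy, reference, rewards, beta)
    print(f"beta={beta}: stay={stay:.3f}  drift={drift:.3f}")
```

With a small β the greedy policy scores higher than staying put; with a large β the KL term dominates and drifting no longer pays.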

The striking result: a 1.3-billion-parameter InstructGPT was preferred over the 175B GPT-3 by human evaluators, despite 100× fewer parameters. Alignment matters as much as scale.

The downside: the RLHF pipeline is heavy. Concretely, you must:

  • keep several models in memory at once (the trained policy, the reward model, the reference model for the KL penalty, and PPO's value critic);
  • sample responses from the policy inside the training loop, which is slow;
  • cope with unstable RL that is highly sensitive to hyperparameters (learning rate, KL coefficient, PPO clipping window).

On top of this comes the risk of reward hacking: the policy finds responses that maximize the RM's score without truly being better (overly long, sycophantic answers, or ones exploiting a blind spot in the annotation). The KL penalty curbs this, but does not eliminate it. This complexity is what motivated the search for simpler alternatives — and DPO is the most notable.

DPO: direct preference optimization

DPO (Direct Preference Optimization, Rafailov et al., 2023) starts from an elegant observation: the constrained optimization problem that RLHF solves with PPO admits a closed-form analytical solution. The optimal policy is π*(y|x) ∝ π_ref(y|x) · exp( r(x,y) / β ). Inverting this relation expresses the implicit reward as a function of the policy itself — hence the paper's subtitle, "your language model is secretly a reward model":

r(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )  +  β · log Z(x)

Plugging this implicit reward into the Bradley-Terry model, the partition constant Z(x) cancels between the two responses of the same pair. The consequence: you no longer need to train a separate RM or run RL. The loss becomes a simple supervised classification over preference pairs:

For a pair (y_w preferred, y_l rejected) and a prompt x:

L_DPO = − log σ(  β · [ log( π_θ(y_w|x) / π_ref(y_w|x) )
                       − log( π_θ(y_l|x) / π_ref(y_l|x) ) ]  )

  π_θ   : the trained policy
  π_ref : the reference policy (the frozen SFT model)
  β     : tempers the tolerated drift from π_ref (the role of the KL penalty)
  σ     : sigmoid (Bradley-Terry model)

Intuitively: DPO raises the relative probability of the preferred response and lowers that of the rejected one, all weighted by the drift from the reference. The gradient implicitly weighs harder on the pairs where the model is most wrong (where it wrongly prefers the rejected response), giving a self-adjusting learning signal. No RL loop, no RM, just a gradient pass over pairs. The paper shows DPO matches or beats PPO on sentiment control, summarization, and dialogue, while being much simpler and more stable to train.
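The loss and its self-adjusting weighting can be sketched on plain numbers. The inputs are sequence-level log-probabilities (sums of token log-probs) under the policy and the frozen reference; the values below and the default β = 0.1 are illustrative assumptions. The gradient weight is the factor σ(−z) that multiplies the update, where z is the β-scaled margin.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """beta * [log-ratio of the preferred response - log-ratio of the rejected one]."""
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    return beta * (chosen_logratio - rejected_logratio)


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """L_DPO = -log sigmoid(margin)."""
    return -log_sigmoid(dpo_margin(policy_logp_w, policy_logp_l,
                                   ref_logp_w, ref_logp_l, beta))


def dpo_gradient_weight(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """sigmoid(-margin): large when the implicit reward ranks the pair wrongly."""
    return math.exp(log_sigmoid(-dpo_margin(policy_logp_w, policy_logp_l,
                                            ref_logp_w, ref_logp_l, beta)))


at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)            # policy == reference
wrong = dpo_gradient_weight(-12.0, -8.0, -10.0, -10.0)    # policy favors y_l
right = dpo_gradient_weight(-8.0, -12.0, -10.0, -10.0)    # policy favors y_w
print(round(at_init, 4), round(wrong, 3), round(right, 3))
```

At the start of training the policy equals the reference, so every pair has loss log 2; as training proceeds, pairs the model ranks wrongly receive the larger weight.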

The concrete advantages of DPO:

  • Only two models in memory (the policy and the frozen reference), versus four for full RLHF.
  • No in-loop sampling: you train directly on a fixed set of pairs.
  • Stable training with a plain supervised loss, free of the sampling variance of an RL loop.

Its limits: DPO learns offline on fixed preferences, whereas online RLHF can keep exploring; and the choice of β remains sensitive (too small, the model drifts; too large, it barely learns). In practice, DPO has become the pragmatic default for preference alignment — often combined with LoRA to stay frugal, yielding a full SFT → DPO loop entirely doable on a single GPU. A whole family of variants has followed (IPO, KTO, ORPO, SimPO) to fix pathological cases of DPO, but the central idea — optimizing preference without RL — stays the same.

Beyond: RLAIF, Constitutional AI, GRPO

The family has grown:

  • RLAIF / Constitutional AI: replace (in part) human feedback with AI feedback. A critic model evaluates and revises responses according to a "constitution" — a set of explicit written principles. The appeal is threefold:
    • cost: large-scale annotation becomes nearly free;
    • consistency: principles are applied uniformly, without the variance of human annotators;
    • safety: you avoid exposing humans to toxic content to label it. The risk is that the critic model's biases and blind spots propagate; so a core of human supervision is often kept.
  • GRPO (Group Relative Policy Optimization, DeepSeekMath, 2024): a PPO variant that drops the value critic. For each prompt, you sample a group of G responses, compute each one's reward, and normalize within the group to estimate each response's relative advantage:
    For a group of rewards {r_1, …, r_G}:
      A_i = ( r_i − mean(r) ) / std(r)
    The advantage A_i replaces the estimate produced by PPO's value critic. Advantages:
    • less memory: one fewer model to train (no separate critic);
    • more stability: normalization dampens noisy or sparse rewards;
    • ideal for verifiable rewards: in math or code, you can automatically tell whether an answer is correct, with no annotator. It is the algorithm behind reasoning models like DeepSeek-R1.
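The group-normalized advantage from the GRPO item above takes only a few lines. Two details are assumptions of this sketch rather than something the article specifies: the population standard deviation is used, and a small eps guards against a group where every response gets the same reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Verifiable reward on a math prompt: 1 if the answer is correct, else 0.
group = [1, 0, 0, 1]
print(group_relative_advantages(group))

# Degenerate group: all answers wrong, so no response stands out.
print(group_relative_advantages([0, 0, 0, 0]))
```

Correct answers get a positive advantage and incorrect ones a negative advantage of the same magnitude; a group with identical rewards yields zero advantage everywhere and therefore no learning signal for that prompt.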

Where to find preference data

Preference optimization (DPO as well as RLHF) needs pairs (preferred response, rejected response). You typically obtain them as follows:

  • Generation + human ranking: sample several responses from the SFT model for one prompt, then have humans order them. This is the reference source for tone and safety.
  • Pairs from logs: a regenerated response the user accepted (thumbs up) vs one they rejected (thumbs down) — a free, continuous signal in production.
  • Synthetic pairs (RLAIF): a judge model produces the preference verdict at scale, multiplying volume at the cost of a possible judge bias.

Prompt diversity and the reliability of the preference signal matter more than raw volume: ambiguous or contradictory pairs (where even a human hesitates) only provide a noisy gradient and can degrade the model. A smaller but clearly decided set of pairs is better.
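A human ranking of K responses expands into K·(K−1)/2 preference pairs, and unclear pairs can be filtered out before training. In the sketch below, the field names ("prompt", "chosen", "rejected", "agreement") and the 0.8 threshold are hypothetical choices for illustration, not a fixed format.

```python
from itertools import combinations


def pairs_from_ranking(prompt, ranked_responses):
    """Expand a best-first ranking into (chosen, rejected) pairs."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


def keep_clear_pairs(pairs, min_agreement=0.8):
    """Drop pairs whose annotator agreement falls below the threshold."""
    return [p for p in pairs if p.get("agreement", 1.0) >= min_agreement]


pairs = pairs_from_ranking("Explain LoRA briefly.", ["A", "B", "C"])  # A best, C worst
print(len(pairs), [(p["chosen"], p["rejected"]) for p in pairs])

# Hypothetical agreement scores: fraction of annotators who picked "chosen".
pairs[0]["agreement"] = 0.55  # A vs B: annotators were split
pairs[1]["agreement"] = 1.0   # A vs C: unanimous
pairs[2]["agreement"] = 0.9   # B vs C: near-unanimous
clear = keep_clear_pairs(pairs)
print([(p["chosen"], p["rejected"]) for p in clear])
```

Filtering trades volume for signal quality, in line with the guidance above.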

To fine-tune, or not? Fine-tuning vs RAG vs prompt engineering

This is the most important decision — and the most often botched. General rule:

  • Prompt engineering first: cheapest, instant, no training. Enough for many cases (format, tone, one-shot/few-shot tasks).
  • RAG (Retrieval-Augmented Generation) for knowledge: if the need is "the model must know MY documents / up-to-date facts," RAG injects the context at query time. Fine-tuning does not reliably teach new facts — and facts change.
  • Fine-tuning for behavior: if the need is "the model must behave a certain way" (brand style, strict format, specialized task, lower latency/cost via a small specialized model), fine-tuning is the right tool.

Need                                      Recommended tool               Why
----------------------------------------------------------------------------------------------------------
Format, tone, one-off examples            Prompt engineering             Instant, zero training
Up-to-date facts, private documents       RAG                            Context changes without retraining
Stable style, strict format, niche task   Fine-tuning                    Behavior is baked into the weights
Cheap small specialized model             Fine-tuning (+ distillation)   Lower latency and inference cost

A mnemonic: RAG for what the model must know, fine-tuning for how it must act. The two combine very well — a model fine-tuned for business behavior that queries an up-to-date RAG store is a very common pattern.

Pitfalls: catastrophic forgetting, eval, data quality

  • Catastrophic forgetting: overly aggressive fine-tuning erodes general capabilities (the model "forgets" what it knew outside your task). PEFT/LoRA mitigates the risk, since the base weights stay frozen; keep a moderate learning rate, limit the number of epochs, and mix in general data if needed.
  • Evaluation: without a clear benchmark (ideally automatable) defined before you fine-tune, you will not know if you are improving. Measure on a held-out validation set, never on the training examples, and always compare against the base model and a simple prompt baseline.
  • Data quality and leakage: noisy labels are learned as such; train/test leakage artificially inflates scores. Deduplicate, hand-check a sample, and always favor quality over volume.
  • Over-specialization: a model fine-tuned on a narrow distribution becomes brittle outside it. Include adversarial examples and edge cases in the training set.
  • Mis-tuned LoRA hyperparameters: too large an α/r makes training diverge; too small, and the adapter barely learns. Watch the validation loss, not just the training loss.
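The deduplication and leakage checks mentioned above can start very simply. This sketch only catches exact matches after lowercasing and whitespace normalization; near-duplicates (paraphrases, templated variants) would need fuzzier matching, which is outside its scope. Function names and sample data are illustrative.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(examples):
    """Keep the first occurrence of each normalized example, preserving order."""
    seen, unique = set(), []
    for example in examples:
        key = normalize(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def find_leaks(train_examples, test_examples):
    """Return test examples that also appear (after normalization) in train."""
    train_keys = {normalize(e) for e in train_examples}
    return [e for e in test_examples if normalize(e) in train_keys]


train = ["Summarize LoRA.", "summarize   lora.", "What is DPO?"]
test = ["What is  DPO?", "Explain GRPO."]

train = deduplicate(train)
leaks = find_leaks(train, test)
print(train, leaks)
```

Any leaked example should be removed from one side before scores are reported; otherwise the evaluation overstates the gain.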

An end-to-end recipe, in brief

To anchor the ideas, a modern, affordable loop looks like this:

  1. Load the base model quantized to 4-bit (NF4).
  2. Run SFT with LoRA on a few thousand quality demonstrations.
  3. Run DPO with LoRA on preference pairs to polish style and safety.
  4. Merge the adapters into the weights and evaluate on a held-out set.
  5. Deploy — or iterate if the eval is inconclusive.

The whole thing fits on a single GPU, which was unthinkable before LoRA/QLoRA and DPO.

In short: post-training is a chain (SFT → preference) that LoRA/QLoRA make affordable, and where DPO has greatly simplified alignment compared to classic RLHF. But the best optimization is still to not fine-tune when a prompt and RAG are enough.