Evaluating an LLM is hard because there is often no single right answer: a summary, an explanation, or an email can be correct in a thousand ways. The **LLM-as-a-judge** pattern, in which one model grades another's outputs, has become the dominant compromise between the scalability of automatic metrics and the nuance of human judgment. But it is a biased instrument: calibrate it against humans, and never trust it blindly.

## Why evaluation is intrinsically hard

The quality of open-ended output combines fluency, factuality, relevance, tone, and format: dimensions that are often in tension and cannot be reduced to one number. Add to that the lack of ground truth for generative tasks and non-determinism (the same input yields varying outputs). Finally, cost is multiplicative: *N* models × *M* prompts × *K* criteria. Evaluation isn't academic; it's continuous engineering: catch regressions on every deploy, compare models and prompts, monitor production.

## Classic automatic metrics, and their limits

**Exact-match / accuracy** is perfect for closed QA, classification, or single-answer math — and useless once phrasing varies. **BLEU** and **ROUGE** measure n-gram overlap with a reference: built for translation and summarization, they correlate weakly with human judgment on open text. **BERTScore** captures semantic similarity but still depends on a reference and ignores factuality. The core flaw: these metrics measure *resemblance to a reference*, not *absolute quality* — blind to correct paraphrases and to well-phrased hallucinations alike.
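
To make the "resemblance, not quality" flaw concrete, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplification (whitespace tokens, no stemming), not the official ROUGE implementation, and the three sentences are invented for illustration.

```python
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """Strict string equality after trimming and lowercasing."""
    return candidate.strip().lower() == reference.strip().lower()


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace tokens (simplified sketch)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the capital of france is paris"
paraphrase = "paris serves as the french capital city"  # correct, reworded
hallucination = "the capital of france is lyon"         # fluent, wrong

paraphrase_score = unigram_f1(paraphrase, reference)        # 6/13, about 0.46
hallucination_score = unigram_f1(hallucination, reference)  # 5/6, about 0.83
print(exact_match(paraphrase, reference), paraphrase_score, hallucination_score)
```

The wrong answer outscores the correct paraphrase because it copies five of the six reference tokens, which is exactly the blind spot described above.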

## Human evaluation: the gold standard, but…

Humans remain the ultimate reference: they capture nuance, factuality, and real user preference. But they are **costly, slow, and inconsistent**: even expert annotators agree with each other only about 80% of the time on open-ended comparisons, and annotation doesn't scale to a CI/CD cycle. We distinguish *pointwise* scoring (an absolute grade) from *pairwise* preference (A vs B). The latter is more reliable, since comparing two answers is easier than scoring one in the absolute.

## The LLM-as-a-judge pattern

The idea: hand evaluation to a strong LLM, guided by a prompt and a **scoring rubric**. Well-framed, it correlates strongly with humans while staying automatable. On MT-Bench, LLM-judge ↔ human agreement reaches ~85% (excluding ties), the same order as human ↔ human agreement (~81%). Treat it as a component you test and calibrate, not as an oracle.

![The LLM judge grades a response in three modes: pointwise, pairwise, and reference-guided.](/articles/evaluer-un-llm-llm-as-a-judge/llm-judge.svg)
*Diagram: the three modes of LLM-as-a-judge.*

### The three modes

- **Pointwise** (absolute grade): the judge assigns a score (e.g. 1–5) per a rubric. Fast, but poorly calibrated and unstable over time.
- **Pairwise** (comparison): the judge picks A vs B (or a tie). More robust because it's comparative; the basis of arenas, but O(n²) if you compare all pairs.
- **Reference-guided**: give the judge a reference answer (or have it answer first). Strongly reduces errors on reasoning and math.

## Writing a good judge prompt

Give the judge a role, **explicit criteria**, and a rubric with per-level descriptors — not a vague instruction. Make it **reason before scoring**, use a short integer scale (1–5, no floats), and require structured output for reliable parsing:

```text
Grade the response from 1 to 5 on:
- Factual accuracy (are the claims true?)
- Relevance (does it answer the question asked?)
- Clarity (is it understandable and well-structured?)
Justify each score in one sentence, THEN give the total.
```

Graded examples (*few-shot*) anchor the scale; for multiple criteria, decompose rather than scoring globally (RAGAS-style: faithfulness, answer relevancy, etc.).
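
A minimal sketch of the structured-output side of this advice, assuming the judge is asked to reply in JSON. The prompt wording, the `CRITERIA` keys, and `parse_judge_reply` are hypothetical names chosen for illustration; the call to the judge model itself is left out, and `sample_reply` stands in for what it might return.

```python
import json

# Hypothetical criterion keys, mirroring the rubric above.
CRITERIA = ("factual_accuracy", "relevance", "clarity")

JUDGE_PROMPT = """You are a strict grader. Grade the response from 1 to 5 on:
- factual_accuracy (are the claims true?)
- relevance (does it answer the question asked?)
- clarity (is it understandable and well-structured?)

Question: {question}
Response: {response}

Reply with a single JSON object whose keys are factual_accuracy, relevance, clarity.
Each value is an object with "reason" (one sentence) first, then "score" (integer 1 to 5).
"""


def parse_judge_reply(raw: str) -> dict[str, int]:
    """Validate the judge's JSON and return one integer score per criterion."""
    data = json.loads(raw)
    scores = {}
    for name in CRITERIA:
        score = data[name]["score"]
        # Reject floats, booleans, and out-of-range values instead of guessing.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
        scores[name] = score
    return scores


prompt = JUDGE_PROMPT.format(question="What is the capital of France?",
                             response="Paris is the capital of France.")

# Stand-in for a real judge reply (invented for this example).
sample_reply = json.dumps({
    "factual_accuracy": {"reason": "The claim is correct.", "score": 5},
    "relevance": {"reason": "It answers the question.", "score": 5},
    "clarity": {"reason": "Short and clear.", "score": 4},
})
scores = parse_judge_reply(sample_reply)
total = sum(scores.values())
print(scores, total)
```

Asking for the reason before the score keeps the "reason, then grade" ordering, and computing the total in code avoids trusting the judge's arithmetic.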

## The biases to neutralize

LLM judges have documented biases (the CALM taxonomy lists about a dozen):

- **Position**: preference for the first (or second) answer, regardless of quality.
- **Verbosity**: preference for longer answers, even without better content.
- **Self-preference**: a model scores its own outputs higher.
- **Format / authority**: sensitivity to polished markdown, citations (even fake), and tone.

Proven mitigations: **swap the order** (call the judge twice with A/B reversed, declare a winner only if it wins both ways); **never judge a model with itself** (use a third-party judge); add **few-shot** and chain-of-thought reasoning; and switch to **reference-guided** for verifiable tasks.
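
The order-swap mitigation can be sketched as follows. The judge signature `(question, shown_as_A, shown_as_B) -> "A" | "B" | "tie"` is an assumption made for this example, and the two stub judges are deliberately simplistic stand-ins for a real model call.

```python
from typing import Callable

# Hypothetical judge signature: returns "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Call the judge twice with the order reversed; a win must hold both ways."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in slot B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistent or tied verdicts are not counted as a win


def first_slot_judge(question: str, a: str, b: str) -> str:
    """Stub with pure position bias: always prefers whatever is shown first."""
    return "A"


def length_judge(question: str, a: str, b: str) -> str:
    """Stub with pure verbosity bias: always prefers the longer answer."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


q = "What is 2 + 2?"
position_result = debiased_compare(first_slot_judge, q, "4", "It is four.")
verbosity_result = debiased_compare(length_judge, q, "4", "It is four.")
print(position_result, verbosity_result)
```

The position-biased stub collapses to a tie, as intended. The verbosity-biased stub still picks the longer answer in both orders, which shows that swapping neutralizes position bias only; the other biases need the other mitigations.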

## Agreement with humans: the only real validity criterion

A judge is only worth something if it correlates with humans — and that agreement is **task- and domain-dependent**; it doesn't transfer automatically. Before trusting it, measure its agreement yourself on a human validation set (accuracy, Cohen's κ, Spearman correlation), then re-calibrate regularly. G-Eval, for instance, reaches a Spearman correlation of ~0.51 with humans on summarization — well above BLEU/ROUGE, but far from perfect.

## Arenas and Elo: human preference at scale

To compare models "in the wild," arenas (like Chatbot Arena) pit two anonymous answers against each other and collect a blind *pairwise* vote at massive scale. Rankings were first computed with Elo (borrowed from chess); they are now estimated with a **Bradley-Terry** model and bootstrap confidence intervals, which is more stable.

![The Elo system, imported from chess, underpins the pairwise-preference ranking of arenas.](/articles/evaluer-un-llm-llm-as-a-judge/elo-graph.svg)
*Figure: Elo score evolution. Source: Wikimedia Commons (CC BY-SA).*
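
For intuition, here is a minimal sketch of both ranking approaches: the standard online Elo update, and a Bradley-Terry fit using the classic minorization-maximization iteration. This is not the arena's actual code; the vote counts are invented, the K-factor of 32 is a conventional default, ties and bootstrap intervals are omitted, and the fit assumes every model has at least one win.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update. score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def bradley_terry(wins: dict[tuple[str, str], int], iterations: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths, where wins[(i, j)] = times i beat j.
    Assumes each model has at least one win (otherwise its strength hits 0)."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}  # normalize to sum to 1
    return strength


# Elo depends on vote order: each vote nudges the two ratings.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)  # (1016.0, 984.0)

# Bradley-Terry uses all votes at once (invented counts, 10 votes per pair).
votes = {("A", "B"): 7, ("B", "A"): 3,
         ("B", "C"): 6, ("C", "B"): 4,
         ("A", "C"): 8, ("C", "A"): 2}
strengths = bradley_terry(votes)
print(new_a, new_b, strengths)
```

Under Bradley-Terry, the modeled probability that *i* beats *j* is `strength[i] / (strength[i] + strength[j])`. Because the fit uses the whole vote set rather than replaying votes in sequence, the result does not depend on the order in which votes arrived.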

Limits: the arena population is self-selected and the prompt distribution isn't representative; arenas mostly measure perceived "helpfulness," not safety or factuality.

## Offline vs online, and when NOT to trust the judge

**Offline evaluation** (frozen test sets, judge in CI) catches regressions; **online evaluation** (real production preferences) is more representative but noisier. HELM is a reminder to be **holistic**: beyond accuracy, measure calibration, robustness, bias, toxicity, fairness.

Beware the *echo chamber* (an LLM judging an LLM reinforces shared biases), prompt injection through the judged content, and Goodhart's law (optimizing the judge ≠ optimizing the user). Switch to humans when the stakes involve safety, compliance, or critical factuality, when the domain has not been validated, or when judge ↔ human agreement is low.

## In practice: the tripod

Reliable evaluation combines three levels: **deterministic tests** for what's verifiable, **LLM-as-a-judge** for qualitative work at scale, and a **sample of human review** to calibrate the judge. It's this tripod — automatable but anchored in humans — that makes evaluation both scalable and trustworthy.
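
One way to wire the three levels together is sketched below. The item fields (`output`, `expected`), the `judge` callable, and the 5% review rate are all assumptions made for illustration, not a prescribed design.

```python
import random
from typing import Callable


def evaluate(item: dict, judge: Callable[[dict], int],
             rng: random.Random, human_rate: float = 0.05) -> dict:
    """Route one item: deterministic check if a known answer exists,
    otherwise LLM judge, with a random sample flagged for human review."""
    if item.get("expected") is not None:
        passed = item["output"].strip() == item["expected"].strip()
        return {"method": "deterministic", "passed": passed}
    result = {"method": "llm_judge", "score": judge(item)}
    # Flagged items feed the human validation set used to calibrate the judge.
    result["needs_human_review"] = rng.random() < human_rate
    return result


def stub_judge(item: dict) -> int:
    """Placeholder for a real judge call; always returns 4."""
    return 4


rng = random.Random(0)  # seeded so the review sample is reproducible
checked = evaluate({"output": "42", "expected": "42"}, stub_judge, rng)
judged = evaluate({"output": "A short summary."}, stub_judge, rng, human_rate=1.0)
print(checked, judged)
```

Sampling only judge-scored items keeps human effort focused where it calibrates the judge, while verifiable items never pay the cost of a model call.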