Is LLM-as-a-judge reliable?

Well-framed, LLM-judge ↔ human agreement reaches ~85% on MT-Bench (excluding ties), close to human ↔ human agreement. But it is task- and domain-dependent: measure agreement yourself on a human validation set before trusting the judge.

What biases do LLM judges have, and how do you neutralize them?

The main ones are position bias (favoring an answer because of its slot, usually the first), verbosity bias (favoring longer answers), and self-preference (favoring its own outputs). Mitigate by swapping A/B order, using a third-party judge, adding few-shot examples and chain-of-thought reasoning, and switching to reference-guided judging for verifiable tasks.

Evaluating an LLM: the LLM-as-a-judge pattern

Evaluating an LLM is hard because there is often no single right answer: a summary, an explanation, or an email can be correct in a thousand ways. The LLM-as-a-judge pattern, in which one model grades another's outputs, has become the dominant compromise between the scalability of automatic metrics and the nuance of human judgment. But it is a biased instrument: calibrate it against humans, and never trust it blindly.

Why evaluation is intrinsically hard

The quality of open-ended output combines fluency, factuality, relevance, tone, and format. These dimensions are often in tension and cannot be reduced to one number. Add the lack of ground truth for generative tasks and non-determinism (the same input yields varying outputs). Finally, cost multiplies: N models × M prompts × K criteria judge calls for a full sweep. Evaluation isn't academic; it's continuous engineering: catch regressions on every deploy, compare models and prompts, monitor production.

Classic automatic metrics, and their limits

Exact-match / accuracy is perfect for closed QA, classification, or single-answer math — and useless once phrasing varies. BLEU and ROUGE measure n-gram overlap with a reference: built for translation and summarization, they correlate weakly with human judgment on open text. BERTScore captures semantic similarity but still depends on a reference and ignores factuality. The core flaw: these metrics measure resemblance to a reference, not absolute quality — blind to correct paraphrases and to well-phrased hallucinations alike.
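To make the "resemblance, not quality" flaw concrete, here is a minimal sketch of exact match and a ROUGE-1-style unigram F1. It is a simplified stand-in (whitespace tokens, no stemming), not the official ROUGE implementation, and the example sentences are invented for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()


def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace tokens (simplified: no stemming)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"    # same idea, no shared words
hallucination = "the cat sat on the moon"    # wrong fact, similar words

print(unigram_f1(paraphrase, reference))     # 0.0: the paraphrase is invisible
print(unigram_f1(hallucination, reference))  # about 0.83: the error is rewarded
```

The overlap metric scores the factual error far above the acceptable paraphrase, which is exactly the blind spot described above.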

Human evaluation: the gold standard, but…

Humans remain the ultimate reference: they capture nuance, factuality, and real user preference. But they are costly, slow, and inconsistent: experts agree with each other only about 80% of the time (~81% on MT-Bench), and annotation doesn't scale to a CI/CD cycle. We distinguish pointwise scoring (an absolute grade) from pairwise preference (A vs B). The latter is more reliable, since comparing two answers is easier than grading one in the absolute.

The LLM-as-a-judge pattern

The idea: hand evaluation to a strong LLM, guided by a prompt and a scoring rubric. Well-framed, it correlates strongly with humans while staying automatable. On MT-Bench, LLM-judge ↔ human agreement reaches ~85% (excluding ties), the same order as human ↔ human agreement (~81%). Treat it as a component you test and calibrate, not as an oracle.

Figure: the three modes of LLM-as-a-judge (pointwise, pairwise, and reference-guided).

The three modes

Pointwise (absolute grade): the judge assigns a score (e.g. 1–5) per a rubric. Fast, but poorly calibrated and unstable over time.
Pairwise (comparison): the judge picks A vs B (or a tie). More robust because it's comparative; the basis of arenas, but the number of comparisons grows as O(n²) in the number of models if you compare all pairs.
Reference-guided: give the judge a reference answer (or have it write its own answer first, then grade against it). Strongly reduces errors on reasoning and math.

Writing a good judge prompt

Give the judge a role, explicit criteria, and a rubric with per-level descriptors — not a vague instruction. Make it reason before scoring, use a short integer scale (1–5, no floats), and require structured output for reliable parsing:

Grade the response from 1 to 5 on:
- Factual accuracy (are the claims true?)
- Relevance (does it answer the question asked?)
- Clarity (is it understandable and well-structured?)
Justify each score in one sentence, THEN give the total.

Graded examples (few-shot) anchor the scale; for multiple criteria, decompose rather than scoring globally (RAGAS-style: faithfulness, answer relevancy, etc.).
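The advice above (explicit criteria, reasoning before the score, integer scale, structured output) could be wired up roughly as follows. This is a sketch under assumptions: the JSON reply shape, the criterion keys, and the function names `build_judge_prompt` and `parse_judge_reply` are illustrative choices, and the actual call to the judge model is left out.

```python
import json

# Criteria taken from the example prompt; the snake_case keys are an assumption.
CRITERIA = {
    "factual_accuracy": "Are the claims true?",
    "relevance": "Does it answer the question asked?",
    "clarity": "Is it understandable and well-structured?",
}


def build_judge_prompt(question: str, response: str) -> str:
    """Assemble a decomposed, per-criterion judge prompt."""
    lines = [
        "You are a strict, impartial evaluator.",
        "Grade the response from 1 to 5 (integers only) on each criterion:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in CRITERIA.items()]
    lines += [
        "For each criterion, give a one-sentence reason FIRST, then the score.",
        'Reply with JSON only: {"<criterion>": {"reason": "...", "score": 3}, ...}',
        "",
        f"Question: {question}",
        f"Response: {response}",
    ]
    return "\n".join(lines)


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply and return one integer score per criterion."""
    data = json.loads(reply)
    scores = {}
    for name in CRITERIA:
        score = data[name]["score"]
        # Reject floats, booleans, and out-of-range values instead of guessing.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
        scores[name] = score
    return scores
```

Putting "reason" before "score" in the reply format nudges the judge to justify before grading, and strict parsing turns a malformed reply into an explicit error rather than a silent bad score.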

The biases to neutralize

LLM judges have documented biases (the CALM taxonomy lists about a dozen):

Position: preference for the first (or second) answer, regardless of quality.
Verbosity: preference for longer answers, even without better content.
Self-preference: a model scores its own outputs higher.
Format / authority: sensitivity to polished markdown, citations (even fake), and tone.

Proven mitigations: swap the order (call the judge twice with A/B reversed, and declare a winner only if it wins both ways; otherwise record a tie); never let a model judge its own outputs (use a third-party judge); add few-shot examples and chain-of-thought reasoning; and switch to reference-guided judging for verifiable tasks.
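The order-swap mitigation is simple enough to sketch. The `judge` callable here is hypothetical: it stands in for whatever function calls your judge model, and it is assumed to return "first", "second", or "tie" for the two answers in the order shown.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Run the judge in both orders; a winner must win in both, else it's a tie."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # inconsistent or tied verdicts carry no preference


# A stub judge with pure position bias: it always prefers whatever comes first.
def always_first(question: str, first: str, second: str) -> str:
    return "first"


print(debiased_pairwise(always_first, "q", "answer A", "answer B"))  # tie
```

A judge that only follows position contradicts itself across the two calls, so its votes collapse to ties instead of polluting the results. The price is two judge calls per comparison.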

Agreement with humans: the only real validity criterion

A judge is only worth something if it correlates with humans — and that agreement is task- and domain-dependent; it doesn't transfer automatically. Before trusting it, measure its agreement yourself on a human validation set (accuracy, Cohen's κ, Spearman correlation), then re-calibrate regularly. G-Eval, for instance, reaches a Spearman correlation of ~0.51 with humans on summarization — well above BLEU/ROUGE, but far from perfect.

Arenas and Elo: human preference at scale

To compare models "in the wild," arenas (like Chatbot Arena) pit two anonymous answers against each other and collect a blind pairwise vote at massive scale. Rankings were first computed with Elo (borrowed from chess) and are now estimated with a Bradley-Terry model with bootstrap confidence intervals, which gives more stable scores.

The Elo system, imported from chess, underpins the pairwise-preference ranking of arenas. Figure: Elo score evolution. Source: Wikimedia Commons (CC BY-SA).
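To show the difference between the two ranking approaches, here is a minimal sketch of a single Elo update and of a Bradley-Terry fit using the standard minorization-maximization iteration. It is not the arena's actual code: the K-factor of 32, the vote counts, and the function names are assumptions, ties are ignored, every model is assumed to have at least one win, and the bootstrap confidence intervals are omitted.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One online Elo update after a single vote (result depends on vote order)."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta


def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit strengths p so that P(i beats j) = p_i / (p_i + p_j).

    wins[(i, j)] is the number of times i beat j. Uses all votes at once,
    so the result does not depend on the order in which votes arrived.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if j != i and games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        # Rescale so the strengths sum to the number of models.
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength


print(elo_update(1500, 1500))  # (1516.0, 1484.0)

votes = {("a", "b"): 3, ("b", "a"): 1}  # invented counts: a beat b in 3 of 4 votes
p = bradley_terry(votes)
print(p["a"] / (p["a"] + p["b"]))       # 0.75, the observed win rate
```

Elo processes votes one at a time, so shuffling them changes the final ratings; the Bradley-Terry fit is a single estimate over the whole vote table, which is one reason it is more stable.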

Limits: the arena population is self-selected and the prompt distribution isn't representative; arenas mostly measure perceived "helpfulness," not safety or factuality.

Offline vs online; and when NOT to trust the judge

Offline evaluation (frozen test sets, judge in CI) catches regressions; online evaluation (real production preferences) is more representative but noisier. HELM is a reminder to be holistic: beyond accuracy, measure calibration, robustness, bias, toxicity, fairness.

Beware the echo chamber (an LLM judging an LLM reinforces shared biases), prompt injection hidden in the judged content, and Goodhart's law (optimizing for the judge's score is not the same as optimizing for the user). Switch to humans when the stakes involve safety, compliance, or critical factuality, when the domain is unvalidated, or when judge ↔ human agreement is low.

In practice: the tripod

Reliable evaluation combines three levels: deterministic tests for what's verifiable, LLM-as-a-judge for qualitative work at scale, and a sample of human review to calibrate the judge. It's this tripod — automatable but anchored in humans — that makes evaluation both scalable and trustworthy.
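One way to picture the tripod in code is a small routing function. Everything named here is hypothetical: `deterministic_check` and `judge_score` stand in for your own test and judge call, and the 5% review rate is an arbitrary placeholder, not a recommendation.

```python
import random
from typing import Callable, Optional


def evaluate_item(
    item: dict,
    deterministic_check: Callable[[dict], Optional[bool]],
    judge_score: Callable[[dict], int],
    human_queue: list,
    review_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> dict:
    """Route one item: deterministic test if possible, else judge, plus human sampling."""
    rng = rng or random.Random()
    verdict = deterministic_check(item)  # None means "not mechanically verifiable"
    if verdict is not None:
        return {"source": "deterministic", "passed": verdict}
    score = judge_score(item)
    if rng.random() < review_rate:
        # Sampled items get a human label later, to measure judge agreement.
        human_queue.append((item, score))
    return {"source": "judge", "score": score}


# Stubs for illustration: exact match when a reference exists, fixed judge score.
def check(item: dict) -> Optional[bool]:
    return item["output"] == item["expected"] if "expected" in item else None


queue: list = []
print(evaluate_item({"output": "4", "expected": "4"}, check, lambda it: 3, queue))
print(evaluate_item({"output": "a short summary"}, check, lambda it: 4, queue, review_rate=1.0))
```

The human queue is what keeps the judge honest: its labels feed the agreement measurement, so the judge's validity is re-checked as the data drifts.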