A reasoning model no longer just predicts the next token as fast as possible: it spends compute at inference to think before it answers, unrolling a long chain of thought. That is the test-time compute bet: with a frozen model, giving it more time to think can raise accuracy as much as — sometimes more than — making the model bigger. This article explains why it works, how these models are trained (o1/o3, DeepSeek-R1, RL and verifiable reward), which strategies exist at inference (voting, best-of-N, verifiers, tree search), and where the limits are.

From fast prediction to deliberate reasoning

A classic LLM answers in one shot: it emits the answer token by token, with no explicit thinking step. That works for simple tasks but breaks down on multi-step problems: competition math, proofs, planning, complex code.

The founding idea is chain-of-thought (CoT): instead of blurting the answer, the model writes its intermediate reasoning — "let X, then Y, therefore Z." This explicit reasoning acts as externalized working memory:

  • each step conditions the next, spreading computation across many tokens;
  • the model can catch and fix its own mistakes along the way;
  • the trace becomes inspectable, hence easier to debug than an opaque answer.

What is new about reasoning models (o1, o3, DeepSeek-R1, and their successors) is that this behavior is no longer merely triggered by a clever prompt ("think step by step"). It is now trained: the model learns to produce long thinking traces, to backtrack, to verify itself, because it is explicitly rewarded for doing so during training.

So there are two regimes:

  • prompt-triggered CoT on a general model, unreliable and sensitive to wording;
  • the learned reasoning of a dedicated model, robust, that knows when and how much to think.

Why a standard LLM plateaus in one shot

A mechanical reason: a Transformer spends a fixed amount of compute per token produced. For a question whose solution needs, say, the equivalent of ten deduction steps, demanding the answer in a single token means compressing those ten steps into one pass — which exceeds the network's effective depth. Chain-of-thought sidesteps this bound: by writing the intermediate steps, the model re-injects its own work into the context and grants itself, token after token, as many compute passes as there are steps.

That is also why "think step by step" works on a model that was never trained to reason, but only fragilely: the prompt allows spread-out compute without guaranteeing it is productive. RL, by contrast, shapes the policy of thinking: how many steps, when to backtrack, when to stop.

Why spending compute at inference helps

Pretraining shifts cost upstream: you spend a huge amount of FLOPs once, then each answer is cheap. Test-time compute partly inverts this: you keep a more modest model, but spend more on each hard query.

Why is this often a winning trade? Three reasons:

  • Many problems are verifiable or decomposable. Nine wrong attempts out of ten cost only compute if you can recognize the one that is right; sampling several solutions mechanically raises the chance that at least one is correct.
  • Sequential reasoning builds on itself. A long trace lets the model explore, try a lead, drop it, try another — exactly what a human does on a hard problem.
  • Allocation is adaptive. You can give little compute to easy questions and a lot to hard ones, instead of a uniform cost.

Snell et al. (arXiv 2408.03314) show that a compute-optimal allocation strategy — adapting method and budget to each prompt's difficulty — can be more than 4× more efficient than naive best-of-N. More striking still: on problems where a small model already has a non-trivial success rate, test-time compute can let it outperform a 14× larger model at an equal compute budget.

This does not mean inference replaces training. The two levers are complementary: a better base model raises the ceiling, and test-time compute extracts more from it on the spot. The real engineering question is: where to spend the next unit of compute?

pass@k versus pass@1: what coverage reveals

A key intuition lies in the gap between two metrics. pass@1 measures the probability that a single attempt is correct; pass@k, the probability that at least one of k attempts is. When pass@k far exceeds pass@1, the model can often produce the right answer, just not reliably on the first try. The whole point of parallel test-time compute is then to convert that coverage into accuracy: sample k traces (so that a correct one is likely among them), then select it with a vote or a verifier. If, on the contrary, pass@k stays close to pass@1, sampling is pointless: the problem is beyond the model, and only a better model (or deeper sequential reasoning) will help.
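The gap between the two metrics can be measured directly. The sketch below uses the unbiased pass@k estimator common in code-generation evaluations: draw n attempts per problem, count the c correct ones, and compute the chance that a random subset of k contains at least one. The function name and the sample numbers are illustrative assumptions, not taken from the papers cited here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct attempt.
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 20 samples, 3 of them correct.
coverage_gap = pass_at_k(20, 3, 10) - pass_at_k(20, 3, 1)
print(f"pass@1={pass_at_k(20, 3, 1):.3f}  pass@10={pass_at_k(20, 3, 10):.3f}  gap={coverage_gap:.3f}")
```

A large gap suggests that sampling plus selection has room to help; a gap near zero suggests it does not.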

[Figure: the two broad families of test-time compute strategies. Sequential scaling: one long trace with budget forcing. Parallel scaling: best-of-N, majority vote, search with a verifier.]

o1 and o3: learning to reason with reinforcement learning

OpenAI opened this era with o1 ("Learning to reason with LLMs"). The central idea: train the model with reinforcement learning (RL) to produce a productive chain of thought. Over training, the model learns to:

  • recognize its own mistakes and correct them;
  • try different approaches when the first one fails;
  • break a hard problem into tractable subproblems.

Two scaling laws then stack up:

  • Performance improves with train-time compute (more RL).
  • It also improves with inference-time compute (more time thinking per query).

This is a new frontier of the "bitter lesson": you trade inference compute for better decisions. o3 extends o1 by scaling up RL further; it is also widely reported, though not officially detailed, to add a form of test-time search and to draw part of its rapid gains from o1's full thinking traces used as synthetic training data.

A quirk: these models' raw chain of thought is partly hidden from the user (you only see a summary), for safety and IP reasons. That is precisely what DeepSeek-R1 made open — hence its importance for the community.

Two orthogonal compute axes

It helps to separate two dimensions that are often conflated:

  • train-time compute (pretraining + RL post-training), paid once and amortized over billions of queries;
  • inference-time compute (thinking tokens, samples, search), paid on every query.

The turning point of reasoning models is that they made inference compute productive: on a standard LLM, giving it 10× more tokens does not make it 10× better; on a model trained to reason, the accuracy/compute curve rises in an exploitable way. In other words, RL does not merely improve the model "on average" — it unlocks a second scaling law, the inference one, on top of the pretraining one.

DeepSeek-R1: open RL for reasoning

DeepSeek-R1 (arXiv 2501.12948) is the open replication that demystified the recipe. Its provocative starting point: R1-Zero, trained by pure RL, with no prior supervised fine-tuning (SFT) phase. The hypothesis: human-imposed reasoning patterns limit exploration; better to let the model discover for itself how to reason.

The reward signal is only the correctness of the final answer, plus a format bonus:

  • ground truth in math (is the numerical answer exact?);
  • running tests in code (does the program pass?);
  • a bonus if the reasoning stays neatly inside the expected tags.

Crucial point: no supervision of the reasoning itself. You never tell the model how to think, only whether the result is right.
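A rule-based reward of this kind fits in a few lines. The sketch below is an assumption-laden illustration, not DeepSeek's implementation: the `<think>` tag layout, the exact-string comparison, and the 0.1 bonus value are all hypothetical choices made for the example.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, final answer after it.
THINK_RE = re.compile(r"^<think>.*?</think>\s*(.*)$", re.DOTALL)

def rule_based_reward(output: str, ground_truth: str, format_bonus: float = 0.1) -> float:
    """Score one response: 1.0 for a correct final answer, plus a small format bonus."""
    match = THINK_RE.match(output.strip())
    # Without well-formed tags, treat the whole output as the answer.
    answer = match.group(1) if match else output
    # Exact string match is a simplification; real checkers normalize numbers.
    reward = 1.0 if answer.strip() == ground_truth.strip() else 0.0
    if match:
        reward += format_bonus  # the bonus rewards the format only, never the reasoning
    return reward
```

Note that nothing in the function inspects the content of the reasoning: only the final answer and the tag structure are scored.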

Result: on AIME 2024, R1-Zero's pass@1 score climbs from 15.6% to 71.0% over RL training, and reaches 86.7% with majority voting — on par with o1. The model spontaneously develops sophisticated behaviors: longer traces, backtracking, self-verification. The final R1 adds a small SFT cold start for readability, but the bulk of the capability comes from RL.

R1's multi-stage pipeline

R1-Zero proves pure RL is enough to reason, but its output reads poorly (mixed languages, erratic formatting). The "final" R1 fixes this with a four-stage pipeline, which is the real reproducible blueprint:

  1. Cold-start SFT: a small set of long, readable CoTs primes the model, stabilizing the format before RL.
  2. Reasoning-oriented RL: GRPO with a verifiable reward (math/code); a language-consistency reward is added to fight language mixing.
  3. Rejection sampling + SFT: sample massively with the RL model, keep the correct traces (filtered by verification), then re-SFT — including on non-reasoning data (writing, factuality).
  4. Final all-domain RL: a last RL round also aligns helpfulness and harmlessness, not just reasoning.

The lesson: alternating RL (which explores and discovers strategies) and SFT on filtered traces (which consolidates and cleans up) is more robust than either one alone.
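Stage 3 is the easiest to picture in code. A minimal sketch, assuming each sampled trace is a (reasoning, final answer) pair and that `is_correct` is whatever verifier the domain offers; both names are hypothetical.

```python
from typing import Callable, Iterable

def rejection_sample(
    traces: Iterable[tuple[str, str]],
    is_correct: Callable[[str], bool],
) -> list[str]:
    """Keep the reasoning of traces whose final answer verifies, dropping duplicates."""
    kept: list[str] = []
    seen: set[str] = set()
    for reasoning, answer in traces:
        if is_correct(answer) and reasoning not in seen:
            seen.add(reasoning)
            kept.append(reasoning)
    return kept

# Hypothetical samples for the question "120 km in 1 h 30: average speed?"
sampled = [("120 / 1.5 = 80", "80"), ("120 / 1.3 = 92", "92"), ("120 / 1.5 = 80", "80")]
sft_data = rejection_sample(sampled, lambda answer: answer == "80")
```

The surviving traces become SFT data; the wrong and duplicate ones are simply discarded.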

GRPO: RL without a value model

DeepSeek uses GRPO (Group Relative Policy Optimization), a PPO variant designed to be lighter. Instead of training a costly critic (value model) in parallel, GRPO samples a group of G responses for the same question, scores them, then computes each response's advantage relative to the group mean.

For a question q:
  sample G responses o_1 … o_G from the policy
  reward r_i  =  1 if o_i is correct, else 0; plus a format bonus
  advantage A_i  =  (r_i − mean(r)) / std(r)
  raise the probability of traces with A_i > 0,
  lower that of traces with A_i < 0

The trick: a response better than its group's average is reinforced, a worse one is penalized. No separate critic is needed, hence simpler, less memory-hungry training. Group normalization (centering and scaling the rewards) automatically rescales every group to zero mean and unit variance, which stabilizes gradients without having to tune a reward scale. One edge case: when every response in a group gets the same reward, std(r) is zero and the group carries no learning signal.
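The group-relative advantage is a one-liner once the rewards are in hand. A minimal sketch: the use of the population standard deviation and the small `eps` added to the denominator are assumptions made here to keep the example safe when all rewards in a group are equal.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center and scale the rewards of one group of G responses to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some codebases use sample std
    # eps avoids a division by zero when every response earns the same reward;
    # in that case every advantage is 0 and the group contributes no gradient.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four responses, two correct and two wrong.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses end up with a positive advantage and wrong ones with a negative advantage, whatever the absolute reward scale.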

The full objective looks like PPO: a clipped likelihood ratio (to avoid steps that are too large) plus a KL-divergence term toward the reference model (to keep the policy from drifting too far, i.e. from "breaking" the language acquired in pretraining):

maximize  E[ min( ρ_i · A_i,  clip(ρ_i, 1−ε, 1+ε) · A_i ) ]  −  β · KL(π_θ ‖ π_ref)
   with   ρ_i = π_θ(o_i | q) / π_old(o_i | q)        (importance ratio)

Two guardrails, then: the clip bounds the magnitude of an update, and the KL anchors the policy to the reference model. Both counter the classic instability of RL on LLMs (diversity collapse, catastrophic forgetting).
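To see what the clip does, here is the clipped term for a single response, written on scalars. It is a sketch under assumptions: sequence-level log-probabilities stand in for the per-token ratios used in practice, ε = 0.2 is a common PPO default rather than a value from the paper, and the KL penalty is left out.

```python
import math

def clipped_term(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sampled response."""
    ratio = math.exp(logp_new - logp_old)             # rho = pi_theta / pi_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # keep rho inside [1 - eps, 1 + eps]
    # Taking the min makes the objective pessimistic: gains from moving rho
    # far from 1 are capped, while losses are not.
    return min(ratio * advantage, clipped * advantage)

# A response whose probability doubled under the new policy (rho = 2):
capped_gain = clipped_term(math.log(2.0), 0.0, advantage=1.0)     # capped at 1 + eps
uncapped_loss = clipped_term(math.log(2.0), 0.0, advantage=-1.0)  # left at rho * A
```

The asymmetry is the point: once a good response has been pushed up by a factor of 1 + ε, the objective stops rewarding further movement within the same update.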

[Figure: the RL loop with a verifiable reward (GRPO / DeepSeek-R1 style). The policy samples a group of responses, a verifiable reward scores them, the group-relative advantage updates the policy, and long chains of thought emerge.]

The "aha moment"

The paper's most quoted moment: during training, the model on its own starts allocating more thinking time by re-evaluating its initial approach — it literally writes something like "wait, let's check this step." Nobody taught it this strategy; it emerges because thinking more earns more reward. It is the demonstration that good incentives alone can make advanced problem-solving strategies appear.

A measurable corollary comes with that moment: the average length of traces grows over RL. The model learns, without being told, that spending more tokens on hard problems pays off — this is the inference scaling law emerging in situ, observed during training.

Reward: verifiable first, modeled later

The strength of this recipe rests on the verifiable reward: in math and code, you can reliably tell whether the answer is right (run the tests, compare the result). For open-ended domains (writing, dialogue), you need a learned reward model — more fragile, and exposed to reward hacking (the model optimizes the score without solving the task). DeepSeek avoided neural rewards on pure reasoning precisely so the signal could not be gamed.

Underpinning all this is the concept of RLVR (Reinforcement Learning from Verifiable Rewards): replacing the human/neural judge with a deterministic verification function (test runner, formal checker, comparison to ground truth). The signal is sparse (one verdict per response) but reliable and much harder to flatter than a learned judge, and it is confined to domains where an automatic truth exists. Extending reasoning to "soft" domains remains an open problem, precisely because the reward there becomes learned again and thus gameable.

Once the model is trained, several levers let you invest compute at answer time. They fall into two families (see the first figure): parallel (independent samples then selection) and sequential (a single, longer trace).

Self-consistency (majority vote)

The simplest technique: sample N distinct chains of thought (temperature > 0), extract the final answer from each, and keep the most frequent one. Several reasoning paths often converge on the right answer even when they differ; voting filters out isolated errors. It is purely parallel, trivial to implement, and very effective on tasks with a single verifiable answer.

The assumption is statistical: if the right answer is the mode of the output distribution (each error being idiosyncratic), aggregating N draws makes the estimator converge to that mode. Hence its limits: on a problem where the model errs systematically in the same way, voting amplifies the error instead of correcting it; and on free-form tasks (no canonical answer form to compare), voting simply does not apply.
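Majority voting itself needs nothing beyond a counter. In this minimal sketch, the normalization step (strip and lowercase) is an assumption; real pipelines usually canonicalize numbers and units more carefully.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so that trivially different spellings count as one answer.
    normalized = [a.strip().lower() for a in answers]
    # On ties, Counter.most_common keeps the answer that was seen first.
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Hypothetical final answers extracted from four sampled traces.
answer, share = majority_vote(["80 km/h", "80 km/h ", "75 km/h", "80 KM/H"])
```

The vote share doubles as a rough confidence signal: a low share means the traces disagree.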

Best-of-N with a verifier

Rather than voting, you score each candidate with a verifier (a model trained to judge quality) and keep the best-scoring one. Two granularities:

  • ORM (Outcome Reward Model): judges the final answer, a binary "correct / incorrect" verdict.
  • PRM (Process Reward Model): judges each step of the reasoning, giving a dense signal. A PRM spots where the trace goes wrong, which guides finer selection and search.

PRMs are at the heart of Snell et al.'s result: searching against a per-step (process-based) verifier is markedly more efficient than simply multiplying samples.
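The two granularities lead to two selection rules. In the sketch below, `orm` and `prm` are hypothetical scoring callables standing in for trained verifiers, and scoring a trace by its weakest step is one possible aggregation, chosen here as an assumption; the last step's score or the product of step scores are alternatives.

```python
from typing import Callable, Sequence

def orm_select(candidates: Sequence[str], orm: Callable[[str], float]) -> str:
    """Best-of-N with an outcome verifier: keep the highest-scoring final answer."""
    return max(candidates, key=orm)

def prm_select(traces: Sequence[Sequence[str]], prm: Callable[[str], float]) -> Sequence[str]:
    """Best-of-N with a process verifier: a trace is only as good as its weakest step."""
    return max(traces, key=lambda steps: min(prm(step) for step in steps))

# Toy verifier (hypothetical): distrust any step that uses the wrong unit conversion.
toy_prm = lambda step: 0.1 if "1.3" in step else 0.9
traces = [
    ["1 h 30 = 1.3 h", "120 / 1.3 = 92"],
    ["1 h 30 = 1.5 h", "120 / 1.5 = 80"],
]
best = prm_select(traces, toy_prm)
```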

How do you train a PRM without hand-labeling every step? The common trick is the per-step roll-out: from a reasoning prefix, you sample several continuations through to the final answer; the fraction of continuations that reach the correct answer becomes a value label for that prefix. This yields a step signal with no human annotation — at the cost of a lot of generation compute.
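The roll-out labeling described above reduces to a Monte Carlo average. A minimal sketch, where `rollout` is a hypothetical sampler that completes a prefix and returns a final answer, and the default of 8 roll-outs is an arbitrary choice.

```python
from typing import Callable

def step_value_label(
    prefix: str,
    rollout: Callable[[str], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> float:
    """Label a reasoning prefix with the fraction of continuations that end correctly."""
    hits = sum(is_correct(rollout(prefix)) for _ in range(num_rollouts))
    return hits / num_rollouts

# Toy sampler (hypothetical): the answer depends only on the unit conversion in the prefix.
toy_rollout = lambda prefix: "80" if "1.5" in prefix else "92"
good_label = step_value_label("1 h 30 = 1.5 h", toy_rollout, lambda a: a == "80")
bad_label = step_value_label("1 h 30 = 1.3 h", toy_rollout, lambda a: a == "80")
```

Each label costs `num_rollouts` full generations, which is where the generation compute mentioned above goes.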

Tree search (Tree of Thoughts)

You can turn reasoning into search. Instead of a linear trace, you explore a tree of possible steps: at each node the model proposes several continuations, a verifier (or PRM) evaluates them, and you expand the promising branches (beam search, lookahead, even MCTS). It is more expensive but allows clean backtracking and avoids dead ends — useful on problems where a bad first step dooms everything.

Three variants are worth distinguishing:

  • PRM-guided beam search: keep the top b prefixes at each depth. Very effective at moderate budgets, but prone to verifier over-optimization on easy questions (you optimize the PRM's score, not the truth).
  • Lookahead: before scoring a node, simulate k steps ahead to assess its true potential. More informative but costly — often dominated by plain beam search at an equal budget.
  • MCTS: balances exploration and exploitation through successive trials; powerful but heavy to implement and tune.
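The first variant, PRM-guided beam search, can be sketched as follows. All three callables (`propose`, `score`, `is_final`) are hypothetical stand-ins for the model and the verifier, and the default width and depth are arbitrary.

```python
from typing import Callable

def prm_beam_search(
    question: str,
    propose: Callable[[list[str]], list[str]],
    score: Callable[[list[str]], float],
    is_final: Callable[[list[str]], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> list[str]:
    """Keep the beam_width best-scoring prefixes at each depth; return the top trace."""
    beams: list[list[str]] = [[question]]
    for _ in range(max_depth):
        expanded: list[list[str]] = []
        for prefix in beams:
            if is_final(prefix):
                expanded.append(prefix)  # carry finished traces forward unchanged
            else:
                expanded.extend(prefix + [step] for step in propose(prefix))
        if not expanded:
            break  # nothing left to expand: keep the previous beams
        beams = sorted(expanded, key=score, reverse=True)[:beam_width]
        if all(is_final(b) for b in beams):
            break
    return beams[0]

# Toy problem (hypothetical): two possible steps, the verifier prefers "b".
trace = prm_beam_search(
    "q",
    propose=lambda prefix: ["a", "b"],
    score=lambda prefix: sum(step == "b" for step in prefix),
    is_final=lambda prefix: len(prefix) == 4,
)
```

Because pruning relies entirely on `score`, any bias in the verifier compounds with depth, which is the over-optimization risk noted in the list above.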

Snell's lesson: beam search beats best-of-N at small budgets, but its gains erode as you sample more, sometimes ending below best-of-N — proof that no method dominates everywhere.

Budget forcing and "thinking" tokens

On the sequential side, s1 (arXiv 2501.19393) proposes budget forcing, a disarmingly simple technique. To make the model think more: when it tries to end its reasoning, you suppress the end token and insert "Wait", which pushes it to resume and often fix a mistake. To make it think less: you force the end once the budget is reached. You thus directly control the number of "thinking" tokens spent. The authors show this sequential budget forcing scales better than majority voting, because later steps build on earlier ones.

A striking detail: s1 reaches this result after SFT on only ~1,000 carefully chosen examples, with no RL at all. This suggests much of the "know-how to reason" is already latent in the base model, and that training mostly serves to activate an exploitable thinking format — not to teach mathematics from scratch.

Summary table

A cheat sheet for choosing:

Strategy             | Family     | Needs a verifier   | Best for
---------------------+------------+--------------------+--------------------------
Self-consistency     | Parallel   | No (vote)          | Single verifiable answer
Best-of-N + ORM/PRM  | Parallel   | Yes                | Math, code, factual QA
Tree of Thoughts     | Hybrid     | Yes (or heuristic) | Exploratory problems
Budget forcing       | Sequential | No                 | Deep iterative reasoning

None of these approaches is universally best: the right choice depends on the task, the budget, and whether the answer can be verified.

Scaling laws: compute versus accuracy

The central empirical fact: accuracy grows with inference compute, often log-linearly, until a plateau. But the slope and ceiling depend on the method and the difficulty:

  • On easy questions, refining a good starting answer (sequential) is enough; multiplying samples wastes compute.
  • On hard questions, you need broader search (parallel + verifier) because the first lead is often wrong.

Hence the compute-optimal strategy: estimate difficulty, then choose method and budget accordingly. This is what yields the 4× efficiency gain over best-of-N and lets a small model beat one 14× larger, as reported by Snell et al. The strategic lesson: for a share of workloads, it is better to invest at inference than to grow the model, but not for all of them.

Sequential or parallel: a difficulty-driven trade-off

Snell et al. formalize the idea with difficulty bins. On easy bins, sequential refinement dominates: the model starts near the right answer and polishes it. On hard bins, parallel (sample wide then select) dominates: diversity matters more than depth, because no single trace reaches the answer reliably. The real compute-optimal strategy combines both: for a fixed total budget, you choose the sequential/parallel ratio according to the prompt's estimated difficulty.

total budget B = (number of samples N) × (length per trace L)

easy  →  small N, large L    (few leads, dig deep)
hard  →  large N, moderate L  (many leads, cast wide)

A practical consequence follows: without a difficulty estimator, you cannot split B — and you fall back on uniform best-of-N, up to 4× less efficient.
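The split of B can be made concrete with a toy allocator. This is a sketch under strong assumptions: difficulty is a score in [0, 1] from some upstream estimator, the number of samples is interpolated linearly, and the minimum trace length and the cap on samples are arbitrary defaults. None of this is the allocation rule from Snell et al.

```python
def split_budget(
    total_tokens: int,
    difficulty: float,
    min_trace_len: int = 256,
    max_samples: int = 64,
) -> tuple[int, int]:
    """Split a token budget B into (N samples, L tokens per trace), with N * L <= B."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    # Largest N that still leaves every trace at least min_trace_len tokens.
    most = max(1, min(max_samples, total_tokens // min_trace_len))
    # Easy prompts get one long trace; hard prompts get many shorter ones.
    n = 1 + round(difficulty * (most - 1))
    return n, total_tokens // n

easy = split_budget(16384, difficulty=0.0)
hard = split_budget(16384, difficulty=1.0)
```

With these defaults an easy prompt gets a single 16,384-token trace and a hard one gets 64 traces of 256 tokens each; a real system would tune both ends on its own data.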

A concrete chain-of-thought example

To make this tangible, here is what a trace produced by a reasoning model looks like on a small problem. Note the self-check in the middle, the characteristic "wait."

Question: a train covers 120 km in 1 h 30. What is its average speed?

<think>
Speed = distance / time.
1 h 30 = 1.5 h.
120 / 1.5 = 80... wait, let's check: 80 × 1.5 = 120. OK.
So 80 km/h.
</think>

Answer: 80 km/h.

With self-consistency, you would sample several such traces; the "80 km/h" answer would come out as the majority. With a PRM, you would score each step ("1 h 30 = 1.5 h" is correct, etc.) to catch a trace that goes wrong as early as the unit conversion.

To illustrate budget forcing, suppose the model wants to conclude too early on a harder problem; you intercept the end and resume:

<think>
… so the answer is 42.
</think>          ← the decoder wants to stop here
Wait,            ← suppress the end and inject "Wait"
let's re-check the case where n is odd…  ← the model resumes and corrects

The same lever, inverted, serves to cap: as soon as the "thinking" token quota is reached, you force the end tag and request the answer, which bounds cost and latency.
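Both directions of the lever fit in one decoding loop. A minimal sketch, assuming a hypothetical `next_token` callable that maps the tokens so far to the next token; the `min_tokens` threshold used to decide when an end is "too early" is an assumption of this example, not the exact s1 procedure.

```python
from typing import Callable

END = "</think>"

def think_with_budget(
    next_token: Callable[[list[str]], str],
    min_tokens: int,
    max_tokens: int,
    end_token: str = END,
    nudge: str = "Wait",
) -> list[str]:
    """Decode thinking tokens, extending early stops and capping long traces."""
    tokens: list[str] = []
    while len(tokens) < max_tokens:
        token = next_token(tokens)
        if token == end_token:
            if len(tokens) >= min_tokens:
                break                # enough thinking: accept the end
            tokens.append(nudge)     # too early: suppress the end, inject the nudge
        else:
            tokens.append(token)
    tokens.append(end_token)         # natural end or budget cap: close the block
    return tokens

# Toy model (hypothetical): tries to stop after every single step.
toy = lambda context: END if len(context) % 2 == 1 else "step"
extended = think_with_budget(toy, min_tokens=4, max_tokens=10)
capped = think_with_budget(lambda context: "x", min_tokens=0, max_tokens=3)
```

Every iteration either appends a token or exits, so the loop always ends within `max_tokens` steps, which is what makes the latency bound enforceable.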

Distilling reasoning into smaller models

Running a large reasoning model is expensive. An elegant path: distillation. DeepSeek curated ~800,000 training samples with R1 (roughly 600,000 reasoning traces plus 200,000 non-reasoning examples), then fine-tuned smaller models (Qwen, Llama) on them with SFT only. Surprise: these small distilled models inherit good reasoning ability without doing RL themselves; the costly RL is done once, on the large model, then transferred.

A striking result from the paper: applying RL directly to a small base model performs worse than distilling from R1. In other words, the large model discovered reasoning patterns that a small model, lacking capacity, would not find on its own through RL — but that it can imitate once they are spelled out as traces. Distillation thus transfers a discovery, not just answers.

Limits of distillation:

  • Small models struggle to digest traces that are too long or too formal; naive SFT can even degrade their solve rate.
  • They inherit the teacher's flaws: a tendency to overthink and to hallucinate.
  • They are often less faithful: a distilled model articulates certain cues from its reasoning far less often than R1 itself.

Finer approaches (e.g. generating diverse traces via tree search, then filtering by verification before SFT) mitigate these issues, but distillation remains a trade-off: you gain in cost what you lose a little in robustness and transparency. In practice, many deployments combine: a distilled model for the bulk of traffic, and a call to the large model (or an increased thinking budget) reserved for queries the router deems hard.

Costs, latency, and limits

Test-time compute is not free. Its trade-offs are real:

  • Cost and latency. Multiplying thinking tokens multiplies the cost per query and the response delay. A trivial question handled like an olympiad problem is wasted money and seconds.
  • Overthinking. Reasoning models sometimes over-reason: they unroll pages of CoT on a simple question, or even talk themselves out of a correct first intuition. Hence the value of controlling the budget.
  • Reasoning faithfulness. A delicate, counterintuitive point: the displayed chain of thought does not always reflect the model's true internal computation. Studies show models can produce plausible reasoning that rationalizes an answer decided otherwise, without mentioning the cues actually used. A readable CoT is therefore no guarantee of transparency or correctness.
  • Reward hacking. Outside verifiable reward, the model can learn to please the verifier rather than solve the problem.
  • Verifier over-optimization. Even with a PRM, pushing search too far optimizes the verifier's score rather than the truth — hence the collapse of beam search at large budgets on easy questions.
  • Saturation. Beyond a certain budget, each extra token earns less and less — you have to know when to stop.
  • Variance. Two runs on the same question can take very different paths; voting or best-of-N stabilizes the output but multiplies the cost.

These limits do not disqualify reasoning: they frame its use. A mature system decides when to reason, not just how.

In practice: when to turn reasoning on

A few guideposts for a real system:

  • Route by difficulty. Detect simple queries and handle them in fast mode; reserve long reasoning for hard problems.
  • Pick the family by task. Verification possible (math, code, factual QA) → best-of-N + verifier. Single canonical answer but no verifier → self-consistency. Exploratory problem → tree search.
  • Cap the budget. Set a maximum number of thinking tokens to bound cost and latency; budget forcing is a direct lever.
  • Distill for production. When volume is high, distilling a large model's reasoning into a smaller one sharply cuts unit cost.
  • Don't confuse readable with reliable. Treat CoT as a reasoning aid, not an audit trail — verify conclusions independently when the stakes are high.
  • Measure, don't assume. Plot accuracy against the token budget on your data: the curve tells you where saturation sits, hence the default budget to set.
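The last guidepost can be automated once the curve has been measured. A minimal sketch: the curve points are hypothetical, and the rule (stop at the first budget where the next step gains less than a threshold) is one simple heuristic among many.

```python
def pick_default_budget(curve: list[tuple[int, float]], min_gain: float = 0.01) -> int:
    """Return the smallest budget after which the next budget step gains < min_gain accuracy."""
    points = sorted(curve)
    if not points:
        raise ValueError("need at least one (budget, accuracy) point")
    for (budget, accuracy), (_, next_accuracy) in zip(points, points[1:]):
        if next_accuracy - accuracy < min_gain:
            return budget  # saturation starts here
    return points[-1][0]   # still improving at the largest budget measured

# Hypothetical measurements: (thinking-token budget, accuracy on your eval set).
measured = [(512, 0.40), (1024, 0.52), (2048, 0.60), (4096, 0.605), (8192, 0.606)]
default_budget = pick_default_budget(measured)
```

The heuristic assumes budgets are sampled on a roughly geometric grid, so that each step represents a comparable increase in cost.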

These principles combine: a difficulty router up front, a strategy family matched to each task type, a budget cap, and a distilled model for the bulk of traffic form an inference architecture that is both performant and cost-controlled.

Production checklist

To move from prototype to service, a concrete checklist:

  • Estimate difficulty up front (length, keywords, a small classifier's score, or disagreement from a mini-vote) to route fast vs. reasoning.
  • Bound the "thinking" tokens per query (a hard budget) and cut off beyond it, to guarantee a latency SLA.
  • Choose the selector: majority vote if the answer has a canonical form; a verifier (ORM/PRM) if quality must be judged; nothing if a single trace suffices.
  • Watch for reward hacking outside verifiable domains: audit a sample of outputs, not just the verifier's score.
  • Cache answers to frequent queries so you don't re-pay for thinking every time.
  • Plot the accuracy/budget curve on your data and set the default budget before saturation, not after.
  • Plan a fallback: if thinking exceeds the budget without converging, return the best partial answer rather than waiting indefinitely.
  • Log consumed budget, method, and outcome per query, to measure real cost and tune the router.
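As an illustration of the first item, here is a router driven by disagreement in a mini-vote. It is a sketch under assumptions: the answers come from a few cheap samples of a fast model, and the 0.8 agreement threshold is a hypothetical value to tune on logged traffic.

```python
from collections import Counter

def route(mini_vote_answers: list[str], agree_threshold: float = 0.8) -> str:
    """Route to 'fast' when cheap samples agree, to 'reasoning' when they disagree."""
    if not mini_vote_answers:
        raise ValueError("need at least one answer")
    top_count = Counter(mini_vote_answers).most_common(1)[0][1]
    agreement = top_count / len(mini_vote_answers)
    return "fast" if agreement >= agree_threshold else "reasoning"

easy_route = route(["80", "80", "80", "80", "80"])  # unanimous samples
hard_route = route(["80", "75", "80", "92", "80"])  # only 3 of 5 agree
```

The mini-vote has its own cost, so it pays off mainly when long reasoning is much more expensive than a handful of short samples.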

In summary

Reasoning models turn inference compute into a quality lever: you teach the model, via RL and verifiable reward, to think more when the problem demands it (o1/o3, DeepSeek-R1 and its GRPO, the "aha moment"). At inference, a range of strategies — majority vote, best-of-N with a PRM, tree search, budget forcing — lets you trade FLOPs for accuracy, ideally in a compute-optimal way. The whole thing distills into smaller models for production.

But inference compute is paid for in cost, in latency, and runs into overthinking and the uncertain faithfulness of the displayed reasoning: the real skill is not to think the most, but to think just enough.