"Naive" RAG — retrieve the top-k, stuff the context, generate — hits a ceiling the moment a question requires **reasoning**: comparing two sources, aggregating scattered facts, or making sense of an entire corpus. This article assumes you already know embeddings and vector search (see the dedicated article) and sits one layer **above**: how an agent turns retrieval into a reasoning loop (Agentic RAG), and how GraphRAG structures the corpus into a knowledge graph to answer the global questions that vector RAG misses.

## Why naive RAG hits a wall

The classic pipeline makes **a single pass**: one query, one retrieval, one generation. It does not iterate, does not self-correct, and never checks whether the retrieved documents actually answer the question. Four blind spots recur constantly:

- **One-shot retrieval.** If the top-k is bad (ambiguous query, vocabulary mismatch, poorly cut chunk), the answer is bad — with no recourse.
- **No multi-step reasoning.** A "multi-hop" question ("Who is the CEO of the company that acquired X in 2021?") needs two chained retrievals; one-shot does only one.
- **No self-evaluation.** The LLM receives the context and generates, whether it is relevant or not. No guardrail catches an off-topic retrieval before it pollutes the answer.
- **Global questions.** "What are the main themes of this corpus?" has no answer inside a subset of k passages: the information is **spread** across the whole corpus, not concentrated in a few chunks.

The typical symptom is **hallucination from poor context**: lacking relevant evidence, the model fills the gaps. Naive RAG remains excellent for **simple factual questions** where a single passage is enough — that is precisely where you should keep it. The problem isn't RAG, it's the absence of a loop when the question demands one.

![A comparison between a linear one-pass naive RAG pipeline and an agentic loop with evaluation, web fallback, and self-reflection.](/articles/agentic-rag-et-graphrag/naive-vs-agentic-rag.svg)
*Figure: naive RAG (one-shot) vs agentic RAG (retrieve–reason loop). Original diagram.*

## The idea of Agentic RAG

Agentic RAG replaces the frozen pipeline with an **agent** that decides, at each step, what to do: reformulate the query, choose a source, retrieve again, evaluate what it got, and only then generate. Retrieval becomes a **tool** called inside a reasoning loop, not a single upstream step. It is often formalized as a **state machine** (a graph of nodes joined by conditional edges): retrieve → grade documents → rewrite → generate → check — with feedback loops when a step fails.

### Query rewriting and decomposition

The user's query is rarely the best search query. Three key transformations:

- **Query rewriting**: reformulate to match the corpus vocabulary, resolve pronouns, expand acronyms.
- **Decomposition (multi-hop)**: break a complex question into sub-questions, retrieved **independently** when they don't depend on each other or chained when one answer feeds the next hop, then recompose. Cost grows linearly with the "fan-out," but it is what unlocks comparisons and aggregations.
- **Expansion (HyDE)**: generate a *hypothetical* answer to the question, then use it as the search query — often closer to the target documents than the raw question.

These transformations are cheap (one short LLM call) and fix the most common cause of failure: a query that doesn't "speak the language" of the indexed corpus.
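
As an illustration, decomposition and HyDE can be combined in a few lines. This is a minimal sketch, not a reference implementation: `LLM` is a hypothetical callable (prompt in, completion out), the prompts are placeholders, and the function names are invented for the example.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def decompose(question: str, llm: LLM) -> List[str]:
    """Ask the model for one sub-question per line and parse the result."""
    raw = llm(f"Split into independent sub-questions, one per line:\n{question}")
    subs = [line.strip("-• \t") for line in raw.splitlines()]
    subs = [s for s in subs if s]
    return subs or [question]  # fall back to the original question


def hyde_query(question: str, llm: LLM) -> str:
    """HyDE: search with a hypothetical answer instead of the raw question."""
    return llm(f"Write a short passage that plausibly answers:\n{question}")


def transformed_queries(question: str, llm: LLM) -> List[str]:
    """One HyDE query per sub-question; the fan-out sets the retrieval cost."""
    return [hyde_query(sub, llm) for sub in decompose(question, llm)]
```

Each returned string is then sent to the retriever in place of the raw question, so a question split into three sub-questions costs three retrievals.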

### Routing

Before retrieving, a **router** directs the query to the right source: vector DB for unstructured data, SQL for structured, web for current events, a specific business tool. **Adaptive RAG** pushes the idea further: a lightweight classifier predicts the query's **difficulty** (no retrieval / single-step / multi-step) and picks the pipeline depth — cutting 30–50% of average cost on mixed traffic by skipping the agent for easy lookups.

Routed well, a system can answer a plain "hello" with no retrieval at all, and engage the full agentic machinery only on the questions that deserve it.
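
A difficulty-based router can be sketched as a thin dispatch layer. The three labels mirror the classes named above; the classifier and the pipelines are hypothetical callables that the application would supply.

```python
from typing import Callable, Dict

# Hypothetical pipelines: each takes the question and returns an answer string.
Pipeline = Callable[[str], str]


def make_adaptive_router(classify: Callable[[str], str],
                         pipelines: Dict[str, Pipeline],
                         default: str = "single_step") -> Pipeline:
    """Route each question to the pipeline matching its predicted difficulty."""
    if default not in pipelines:
        raise ValueError(f"default pipeline {default!r} is not registered")

    def route(question: str) -> str:
        label = classify(question)
        # Unknown labels fall back to the default rather than failing the request.
        return pipelines.get(label, pipelines[default])(question)

    return route
```

In this sketch an unrecognized label degrades to plain single-step RAG, which keeps a misbehaving classifier from taking the whole service down.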

### Retrieve–reason loops (ReAct)

The **ReAct** pattern alternates *reason* and *act*: each turn the agent thinks, calls a tool (retrieval, SQL, web, MCP server), observes the result, and repeats until it can answer. This is essential whenever you mix unstructured sources, structured data, and current events. **Mandatory guardrail**: an iteration cap (5–6) to avoid infinite loops, and a bound on tool depth.

Concretely, the loop is modeled as an explicit **state graph**, which tools like LangGraph make natural:

- **nodes** = steps (retrieve, grade, rewrite, generate, check);
- **conditional edges** = decisions ("relevant documents?" → generate / rewrite / web);
- **shared state** = the current query, the kept documents, the iteration counter.

This "flow engineering" makes the logic **auditable**: each decision is a named point in the graph, traceable and testable — far more maintainable than a monolithic prompt that "decides on its own."

Here is the skeleton of such a loop, where **retrieval is a node** called again until the checks pass:

```python
# Agentic RAG state graph (pseudo-LangGraph).
# retrieve, grade_documents, web_search, rewrite_query, generate, is_grounded,
# answers and escalate_to_human are placeholders supplied by the application.
def agent_loop(question, max_iters=5):
    state = {"question": question, "iters": 0, "documents": []}
    while state["iters"] < max_iters:
        state["iters"] += 1
        state["documents"] = retrieve(state["question"])        # node: retrieve
        verdict = grade_documents(state)                        # node: grade
        if verdict["route"] == "web_search":
            state["documents"] = web_search(state["question"])  # external fallback (CRAG)
        elif verdict["route"] == "rewrite":
            state["question"] = rewrite_query(state)            # conditional edge
            continue
        else:
            state["documents"] = verdict["documents"]           # keep only graded documents
        answer = generate(state)                                # node: generate
        # Check the answer against the ORIGINAL question, not the rewritten one.
        if is_grounded(answer, state["documents"]) and answers(answer, question):
            return answer                                       # Self-RAG checks pass
        state["question"] = rewrite_query(state)                # else: loop again
    return escalate_to_human(question)                          # guardrail: cap reached
```

The output is never "the first generation that came out": it must pass two checks (grounding + answering the question) inherited from Self-RAG, otherwise the loop rewrites the query and retries — up to the cap, where you **escalate to a human** rather than spinning forever.

## Self-RAG: reflection via tokens

**Self-RAG** trains the model to emit **reflection tokens** that make it self-evaluate at each step:

- `Retrieve` — should it retrieve (yes / no / continue)?
- `ISREL` — are the retrieved passages **relevant**?
- `ISSUP` — is the generation **supported** by the passages (anti-hallucination)?
- `ISUSE` — is the answer **useful** for the question (1–5 rating)?

The original paper actually trains **two models**: a *critic* that learns to predict these tokens, and a *generator* that emits them as it decodes. At inference, a **segment-level beam search** selects the continuation that maximizes a linear combination of text likelihood and the probabilities of "desirable" tokens (`ISREL` = relevant, `ISSUP` = supported, high `ISUSE`). The weight of each token is a **tunable dial** at inference time: harden `ISSUP` for a sensitive medical task, relax it for creative writing — no retraining needed.
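
To make the "tunable dial" concrete, here is a minimal sketch of the scoring rule described above: text likelihood plus a weighted sum of the probabilities of the desirable reflection tokens. Using the log-likelihood for the text term, the default weights, and the function names are all assumptions of this example, not the paper's exact formulation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical default weights; the text treats them as inference-time dials.
DEFAULT_WEIGHTS = {"ISREL": 1.0, "ISSUP": 1.0, "ISUSE": 0.5}


def segment_score(log_likelihood: float,
                  desirable_probs: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Text log-likelihood plus a weighted sum of desirable-token probabilities."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    critique = sum(w * desirable_probs.get(name, 0.0) for name, w in weights.items())
    return log_likelihood + critique


def best_segment(candidates: List[Tuple[str, float, Dict[str, float]]],
                 weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the candidate segment (text, log-likelihood, probs) with the top score."""
    return max(candidates, key=lambda c: segment_score(c[1], c[2], weights))[0]
```

Raising the `ISSUP` weight makes a slightly less fluent but well-supported segment win over a fluent unsupported one; setting it to zero reverses that preference, with no retraining involved.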

In practice (e.g. with LangGraph), *grading* nodes implement this logic without a specially trained model: you grade document relevance; if all fail, you **rewrite** the query and retrieve again. After generation, two checks: is the answer grounded in the documents (anti-hallucination) and does it actually answer the question? If either fails, the loop restarts with an improved query.

## Corrective RAG (CRAG): grade, then correct

**CRAG** inserts a **retrieval evaluator** that assigns a confidence score to the documents and routes to three actions:

1. **Correct** (at least one document above the upper threshold) → use the documents, but **refine** them first.
2. **Incorrect** (all documents below the lower threshold) → discard them and trigger a **web search** (external fallback, e.g. via a search API).
3. **Ambiguous** (in between) → a mixed action, **combining** internal documents and a web supplement.

Refinement uses a **decompose-then-recompose** strategy: each document is split into **knowledge strips**, each strip is re-scored by the evaluator, off-topic strips are filtered out, and only the relevant ones are stitched back. This avoids polluting the context with neighboring passages that, even if correct, do not serve the question. That is what makes CRAG **plug-and-play**: it slots in front of any existing RAG pipeline.
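
The three-way routing and the strip-level refinement can be sketched as follows. The thresholds (0.7, 0.3, 0.5), the sentence-level splitting, and the `Scorer` callable are assumptions chosen for illustration; a real evaluator would be a trained model.

```python
from typing import Callable, List

# Hypothetical evaluator: (question, text) -> confidence score in [0, 1].
Scorer = Callable[[str, str], float]


def crag_action(scores: List[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Map evaluator scores to one of the three CRAG actions."""
    if any(s > upper for s in scores):
        return "correct"
    if all(s < lower for s in scores):   # also true when nothing was retrieved
        return "incorrect"
    return "ambiguous"


def refine(question: str, document: str, score: Scorer, keep_above: float = 0.5) -> str:
    """Decompose into sentence-level strips, filter them, then recompose."""
    strips = [s.strip() for s in document.split(".") if s.strip()]
    kept = [s for s in strips if score(question, s) > keep_above]
    return ". ".join(kept) + ("." if kept else "")
```

Splitting on periods is the crudest possible notion of a strip; it is enough to show the decompose, filter, recompose shape.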

Self-RAG and CRAG are **complementary**: CRAG improves the **quality of the evidence** (the input), Self-RAG improves the **way of reasoning** over that evidence (the output). You can chain both: CRAG cleans/corrects the inputs, then Self-RAG reflects on the final answer.

### Pattern recap

Production systems rarely use a single pattern. Here is how they position themselves:

- **Self-RAG** — the model emits reflection tokens and self-grades. Shines when retrieval signals are **noisy**; wastes tokens on a reliable corpus.
- **CRAG** — an external evaluator sorts into three paths (keep / web / hybrid). Ideal for **heterogeneous** bases mixing product docs and forum threads.
- **Adaptive RAG** — a classifier predicts difficulty upfront and picks the depth. Cuts 30–50% of average cost on varied traffic.
- **ReAct** — reason/act loop with heterogeneous tools (vectors, SQL, web, MCP). Essential in multi-source environments.
- **Multi-hop decomposition** — parallel sub-questions then recomposition. Linear cost with fan-out; perfect for comparisons and aggregations.

The practical rule: start simple (hybrid + rerank), add **Adaptive routing** to preserve speed, then enable Self-RAG/CRAG/multi-hop only on the paths that need it.

For reference, the `grade_documents` node called by the agent loop shown earlier can be sketched as a simplified CRAG-style evaluator:

```python
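# Sketch of a CRAG-style evaluation node (pseudo-LangGraph).
# `grader` is a placeholder for an LLM or classifier that labels one document.
def grade_documents(state):
    question, docs = state["question"], state["documents"]
    kept = [d for d in docs if grader(question, d).score == "relevant"]
    if not kept:
        return {"route": "web_search", "documents": []}    # nothing relevant: external fallback
    if len(kept) < len(docs):
        # Simplification: CRAG's "ambiguous" action mixes the kept documents with
        # a web supplement; this sketch rewrites the query and retrieves again.
        return {"route": "rewrite", "documents": kept}      # medium confidence
    return {"route": "generate", "documents": kept}         # high confidence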
```

## Quick recap: hybrid + reranking

Before reasoning, you must **retrieve well**. Without re-explaining embeddings, keep the two-stage pipeline that has become standard:

1. **Hybrid search** — fuse **lexical** (BM25, robust on rare terms, codes, proper nouns) and **dense** (semantic) retrieval; return a wide top-50.
2. **Reranking** — a **cross-encoder** (Cohere, Voyage…) finely re-scores the (query, passage) pair and keeps only the top-5. The gain is clear: Recall@5 typically rises from ~0.70 (hybrid alone) to ~0.82 with reranking.

Dense alone misses exact lexical matches; BM25 alone misses synonyms. Hybrid covers both, the reranker sorts it out. This is the foundation the agentic loops operate on.

A few practical reminders on chunking, without reopening the embeddings debate:

- **Semantic chunking** (512–1024 tokens with a 50–100 token overlap) generally beats fixed-size splitting: it respects meaning boundaries.
- **Metadata** on each chunk (source, date, section) for **filtering** before vector search.
- **Score fusion** (Reciprocal Rank Fusion) to combine BM25 and dense rankings robustly, without calibrating incompatible scales.
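
Reciprocal Rank Fusion fits in a few lines, because it only looks at ranks and never at the raw BM25 or cosine scores. The sketch below uses `k = 60`, the constant commonly used with RRF; the function name and the tie-breaking rule are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With a lexical ranking `["a", "b", "c"]` and a dense ranking `["c", "a", "d"]`, documents found by both retrievers (`a`, `c`) come out ahead of those found by only one.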

A reranker adds one extra scoring stage per query (the cross-encoder scores each of the 50 candidate pairs, usually in one batched call), but by narrowing a top-50 to a top-5 it directly improves downstream **faithfulness**: less noise in the context, fewer hallucinations to correct later.

## GraphRAG: structuring the corpus into a graph

Vector RAG fails on **global** questions because it retrieves "a subset of individually relevant passages", yet a synthesis question has no single passage that contains it. **GraphRAG** (Microsoft, *From Local to Global*) answers by pre-building a **knowledge graph**, then summarizing it hierarchically. The pipeline has six stages, the first four at indexing time and the last two at query time:

1. **Chunking + extraction.** The LLM reads each chunk and extracts **triples** (entity, relationship, entity) and claims — e.g. *Steve Jobs — founded — Apple*. Tailored prompts and **self-reflection** (the LLM rereads for missed entities) allow larger chunks without recall loss.
2. **Graph construction.** Entity instances are **deduplicated** and aggregated: nodes are entities, edges are relationships, weighted by their frequency across the corpus.
3. **Community detection (Leiden).** The **Leiden** algorithm (an improved Louvain) recursively partitions the graph into **nested communities**: leaf level (highly related entities), intermediate levels, root level — a true **tree** of communities.
4. **Community summaries.** Each community gets a report-like **summary**, generated bottom-up; higher levels reuse the lower-level summaries under space constraints.
5. **Query (map-reduce).** For a global question, you **map** across community summaries in parallel (partial answers + helpfulness score), then **reduce** (sort by score → final synthesis).
6. **Local query.** For a precise factual question, local search starts from the entities matching the query, **expands to neighbors** and relationships, retrieves the associated chunks, and answers over this combined context.
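
Step 6 is essentially a bounded graph expansion. A minimal sketch, assuming the graph is held as a plain list of triples and that a hypothetical `chunks_by_entity` index maps each entity to the text chunks that mention it:

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def local_context(query_entities: Set[str], edges: List[Edge],
                  chunks_by_entity: Dict[str, List[str]], hops: int = 1) -> List[str]:
    """Expand from the matched entities to their neighbours, then collect chunks."""
    frontier, seen = set(query_entities), set(query_entities)
    for _ in range(hops):
        neighbours = {t for h, _, t in edges if h in frontier}
        neighbours |= {h for h, _, t in edges if t in frontier}
        frontier = neighbours - seen
        seen |= frontier
    # Deduplicate chunks while keeping a stable order.
    ordered = [c for e in sorted(seen) for c in chunks_by_entity.get(e, [])]
    return list(dict.fromkeys(ordered))
```

The `hops` parameter is the knob for multi-hop questions: one hop reaches the acquirer of X, two hops reach the acquirer's CEO.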

**Extraction quality** is the critical step: a graph built from poorly extracted or poorly deduplicated entities propagates its errors everywhere. Three levers:

- **Domain-tailored prompts**, with *few-shot* examples, to frame the expected entity types.
- **Extraction self-reflection**: the LLM rereads the chunk to spot missed entities, which allows larger chunks without recall loss.
- **Entity resolution** (deduplication): merge "Apple," "Apple Inc.," and "the Cupertino firm" into a single node, or the graph fragments.

Note the parallel: self-reflection here serves **indexing**, whereas in Agentic RAG it serves **querying**. The same idea — the model rereads itself — operates at two different moments of the pipeline.
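
Graph construction with entity resolution can be sketched with nothing more than a counter. The alias table is an assumption of this example (in practice it would come from an LLM or a dedicated resolution step), and its keys are expected in lowercase.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship, tail entity)


def build_graph(triples: Iterable[Triple], aliases: Dict[str, str]) -> Counter:
    """Return weighted edges {(head, relation, tail): count} over canonical names."""
    def canonical(name: str) -> str:
        key = name.strip().lower()
        return aliases.get(key, key)   # unresolved names keep their normalized form

    edges: Counter = Counter()
    for head, relation, tail in triples:
        edges[(canonical(head), relation.strip().lower(), canonical(tail))] += 1
    return edges
```

Without the alias table, "Apple" and "Apple Inc." would yield two separate nodes and the edge weight would be split between them, which is exactly the fragmentation described above.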

![An entity graph partitioned into Leiden communities, and the hierarchical summary tree (root, intermediate, leaf) used for global queries via map-reduce.](/articles/agentic-rag-et-graphrag/graphrag-communities.svg)
*Figure: entity graph, Leiden communities, and hierarchical summaries. Original diagram.*

### Global vs local: when to use which

- **Local query** — targeted factual questions ("Who leads X?"). Combines an entity subgraph + text chunks. Close to vector RAG, but better grounded in relationships.
- **Global query** — **synthesis** questions ("What are the dominant themes?", "What tensions run through this corpus?"). Uses community summaries via map-reduce.

The community **level** queried is a dial: the root (C0) gives a very condensed, cheap view; the leaves give detail at the cost of far more tokens. In the Microsoft study, the **intermediate level** offers the best comprehensiveness/cost balance — a good starting default.
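
The global query path can be sketched as a plain map-reduce over the summaries of the chosen level. Here `map_fn` and `reduce_fn` stand in for LLM calls, and the 0 to 100 helpfulness scale, the cut-off at zero, and `max_partials` are assumptions of this example.

```python
from typing import Callable, List, Tuple

# Hypothetical map step: (question, summary) -> (partial answer, helpfulness 0-100).
MapFn = Callable[[str, str], Tuple[str, int]]
# Hypothetical reduce step: (question, partial answers) -> final answer.
ReduceFn = Callable[[str, List[str]], str]


def global_query(question: str, summaries: List[str], map_fn: MapFn,
                 reduce_fn: ReduceFn, max_partials: int = 8) -> str:
    """Map over community summaries, keep the most helpful partials, reduce."""
    scored = [map_fn(question, s) for s in summaries]   # parallelizable in practice
    useful = [(a, h) for a, h in scored if h > 0]        # drop unhelpful partials
    useful.sort(key=lambda pair: pair[1], reverse=True)
    return reduce_fn(question, [a for a, _ in useful[:max_partials]])
```

Switching community level only changes the `summaries` argument: a handful of root summaries for a cheap overview, many leaf summaries for detail.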

### When GraphRAG beats vector RAG

On global **sensemaking** questions, GraphRAG clearly dominates vector RAG: in the Microsoft study, win rates of **72–83% on comprehensiveness** and **62–82% on diversity** of perspectives, on corpora of about one million tokens. Better still: querying the **root summaries** (C0) consumes **9× to 43× fewer** context tokens than the "all source text" approach, while keeping a ~72% comprehensiveness win rate over vector RAG. **Intermediate-level** summaries offer the best trade-off.

The downside is **indexing cost**: building the graph is heavy (LLM extraction on every chunk, summaries at each level) — the study reports ~281 minutes of indexing for a "podcast" corpus with GPT-4-turbo. So GraphRAG is not a universal replacement for vector RAG: it is the tool for **global** questions and corpora where **relational structure** carries the meaning.

### Combining GraphRAG and Agentic RAG

The two approaches don't compete: an **agentic router** can pick the strategy based on the question. A frequent 2026 pattern:

1. The router classifies the query: targeted factual, relational multi-hop, or global synthesis.
2. **Factual** → hybrid vector RAG + rerank (fast, cheap).
3. **Relational / multi-hop** → GraphRAG **local** query (follow the graph's edges over several hops).
4. **Global** → GraphRAG **global** query (map-reduce over community summaries).
5. A Self-RAG/CRAG loop wraps it all to verify and, if needed, correct or fall back to the web.

The graph thus becomes one **tool** among others in the ReAct loop, invoked only when relational structure matters.

## Evaluating: RAGAS and beyond

You don't run an agentic system without measurement. **RAGAS** provides per-query metrics, computed by an LLM judge:

- **Faithfulness** (target ≥ 0.9) — are the answer's claims **supported** by the retrieved context? This is the anti-hallucination metric.
- **Answer relevancy** (≥ 0.85) — does the answer match the question?
- **Context precision / relevancy** (≥ 0.8) — were the **right** chunks retrieved?
- **Context recall** — does the context cover **all** the information needed for the reference answer?

These metrics are computed **per query** on a test set, which lets you objectively compare two configurations (with/without rerank, with/without CRAG) before touching production.
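
Once the per-query scores exist, comparing two configurations is simple aggregation. The sketch below deliberately does not call the RAGAS library: it assumes the scores have already been computed and exported as one dictionary per query, and it gates only the three metrics that have a stated target.

```python
from statistics import mean
from typing import Dict, List

# Targets quoted above; context recall has no stated target, so it is not gated.
TARGETS = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_precision": 0.80}


def summarize(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each gated metric over the test set."""
    return {m: mean(row[m] for row in per_query) for m in TARGETS}


def failing_metrics(per_query: List[Dict[str, float]]) -> List[str]:
    """Names of the metrics whose mean falls below its target."""
    means = summarize(per_query)
    return [m for m, target in TARGETS.items() if means[m] < target]
```

Running `failing_metrics` on the same frozen test set for each configuration gives a simple pass/fail signal before a change reaches production.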

Three operational precautions: (1) the **evaluator paradox** — asking the same fallible LLM to grade itself creates circularity; mitigate by **ensembling** several judges, freezing a *golden set*, and having humans review ~5% of traces; (2) **trajectory tracing** (Phoenix, Langfuse, OpenTelemetry) to debug each agent iteration; (3) **drift monitoring** (knowledge base, embeddings, evaluation) weekly against a frozen reference set.

## Costs and operational trade-offs

Agentic RAG is powerful but **expensive**: expect **3× to 10×** more tokens than plain RAG, and **2× to 5×** the latency (from ~1–2 s to ~8–12 s per query). At 10,000 queries/day, a ~$500/day cost can climb to $1,500–5,000/day without optimization. The levers: **adaptive routing** (run the loop only on ambiguous questions), **semantic caching**, **iteration caps**, and per-query token budgets.
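
The arithmetic behind these figures, and the effect of adaptive routing on them, can be checked with a small cost model. The $0.05 base cost per query follows from $500/day at 10,000 queries; the `agentic_share` parameter (the fraction of traffic sent through the loop) is a hypothetical knob introduced for this example.

```python
def daily_cost(queries_per_day: int, base_cost_per_query: float,
               agentic_multiplier: float, agentic_share: float = 1.0) -> float:
    """Blended daily cost when only `agentic_share` of traffic runs the loop."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    # Non-agentic traffic costs 1x the base; agentic traffic costs the multiplier.
    blended = (1.0 - agentic_share) + agentic_share * agentic_multiplier
    return queries_per_day * base_cost_per_query * blended
```

With every query agentic, multipliers of 3 and 10 reproduce the $1,500 and $5,000 bounds. Routing only 20% of traffic through a 10× loop gives a blended factor of 2.8, so about $1,400/day under these assumptions.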

When **not** to be agentic: FAQs, single-fact lookups, sub-3-second UX constraints. The right reflex is **Adaptive RAG** — preserve speed on easy questions, reason in a loop only on those that truly require it.

Recurring production gotchas to anticipate from day one:

- **Infinite loops** — cap at 5–6 iterations then escalate to a human.
- **Knowledge-base drift** — track Recall@5 weekly against a frozen set.
- **Cost tail** — track p99 cost, not just the mean or median: the tail is far worse than either.
- **Tool-call cascades** — bound the tool-graph depth (≤ 3).
- **Evaluator paradox** — ensemble the judges, freeze the golden set, review ~5% by hand.
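
Several of these guardrails can live in one small per-query object that the loop consults at each step. This is a sketch under stated assumptions: the class and method names are invented for the example, and the 20,000-token budget is a placeholder, not a recommendation.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a per-query limit is hit; the caller should escalate."""


@dataclass
class QueryBudget:
    max_iters: int = 5          # iteration cap from the list above
    max_tool_depth: int = 3     # tool-graph depth bound from the list above
    max_tokens: int = 20_000    # hypothetical per-query token budget
    iters: int = 0
    tokens: int = 0

    def start_iteration(self) -> None:
        self.iters += 1
        if self.iters > self.max_iters:
            raise BudgetExceeded("iteration cap reached")

    def spend(self, tokens: int) -> None:
        self.tokens += tokens
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")

    def check_depth(self, depth: int) -> None:
        if depth > self.max_tool_depth:
            raise BudgetExceeded("tool-call depth exceeded")
```

Catching `BudgetExceeded` in one place, at the top of the agent loop, gives a single escalation path for all three limits.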

### Implementation checklist

A pragmatic order of march to go from prototype to production:

1. **Foundation**: hybrid BM25 + dense, semantic chunking, cross-encoder reranking.
2. **Measurement**: wire up RAGAS (faithfulness, relevancy, context precision) + tracing (OpenTelemetry) from the start.
3. **Adaptive routing**: a difficulty classifier so you only pay for reasoning when it pays off.
4. **Loops**: add Self-RAG (reflection) and CRAG (correction + web fallback) on the hard paths.
5. **Graph**: introduce GraphRAG for global / relational questions, as a tool of the loop.
6. **Guardrails**: iteration caps, token budgets, semantic caching, drift alerts.

## In summary

Three layers stack up. **Retrieval**: hybrid (BM25 + dense) + reranking, the foundation. **Agentic RAG**: rewriting/decomposition, routing, ReAct loops, Self-RAG (reflection) and CRAG (correction + web fallback) to turn retrieval into verified reasoning. **GraphRAG**: a knowledge graph + Leiden communities + hierarchical summaries for the global questions vector search cannot cover.

None of these layers is free: every loop, every graph, every judge adds tokens and latency. Maturity therefore means **measuring before adding** — instrument with RAGAS and tracing, route by difficulty, and invest in reasoning only where the quality gain justifies the overhead. Naive RAG is not obsolete: it remains the right choice for the majority of factual queries. The art, in 2026, is to **spend reasoning where it pays off**, and stay one-shot everywhere else.