"Naive" RAG — retrieve the top-k, stuff the context, generate — hits a ceiling the moment a question requires reasoning: comparing two sources, aggregating scattered facts, or making sense of an entire corpus. This article assumes you already know embeddings and vector search (see the dedicated article) and sits one layer above: how an agent turns retrieval into a reasoning loop (Agentic RAG), and how GraphRAG structures the corpus into a knowledge graph to answer the global questions that vector RAG misses.
Why naive RAG hits a wall
The classic pipeline makes a single pass: one query, one retrieval, one generation. It does not iterate, does not self-correct, and never checks whether the retrieved documents actually answer the question. Four blind spots recur constantly:
- One-shot retrieval. If the top-k is bad (ambiguous query, vocabulary mismatch, poorly cut chunk), the answer is bad — with no recourse.
- No multi-step reasoning. A "multi-hop" question ("Who is the CEO of the company that acquired X in 2021?") needs two chained retrievals; one-shot does only one.
- No self-evaluation. The LLM receives the context and generates, whether it is relevant or not. No guardrail catches an off-topic retrieval before it pollutes the answer.
- Global questions. "What are the main themes of this corpus?" has no answer inside a subset of k passages: the information is spread across the whole corpus, not concentrated in a few chunks.
The typical symptom is hallucination from poor context: lacking relevant evidence, the model fills the gaps. Naive RAG remains excellent for simple factual questions where a single passage is enough — that is precisely where you should keep it. The problem isn't RAG, it's the absence of a loop when the question demands one.
Figure: naive RAG (one-shot) vs agentic RAG (retrieve–reason loop). Original diagram.
The idea of Agentic RAG
Agentic RAG replaces the frozen pipeline with an agent that decides, at each step, what to do: reformulate the query, choose a source, retrieve again, evaluate what it got, and only then generate. Retrieval becomes a tool called inside a reasoning loop, not a single upstream step. It is often formalized as a state machine (a graph of nodes joined by conditional edges): retrieve → grade documents → rewrite → generate → check — with feedback loops when a step fails.
Query rewriting and decomposition
The user's query is rarely the best search query. Three key transformations:
- Query rewriting: reformulate to match the corpus vocabulary, resolve pronouns, expand acronyms.
- Decomposition (multi-hop): break a complex question into sub-questions, retrieved in parallel when they are independent or chained when one answer feeds the next, then recompose. Cost grows linearly with the "fan-out," but it is what unlocks comparisons and aggregations.
- Expansion (HyDE): generate a hypothetical answer to the question, then use it as the search query — often closer to the target documents than the raw question.
These transformations are cheap (one short LLM call) and fix the most common cause of failure: a query that doesn't "speak the language" of the indexed corpus.
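As a rough sketch of how decomposition and recomposition fit together, the snippet below fans out one search per sub-question and merges the results. The `llm` callable (prompt in, text out) and the `search` callable (query in, list of document IDs out) are assumptions made for illustration, not a real library API; the toy stand-ins exist only so the example runs.

```python
from typing import Callable, List

def decompose(question: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for one sub-question per line; fall back to the original."""
    prompt = (
        "Break the question into independent sub-questions, one per line.\n"
        f"Question: {question}"
    )
    lines = [line.strip("-* \t") for line in llm(prompt).splitlines()]
    return [line for line in lines if line] or [question]

def retrieve_multi(question: str, llm: Callable[[str], str],
                   search: Callable[[str], List[str]], k: int = 3) -> List[str]:
    """Fan out one search per sub-question, then merge without duplicates."""
    merged: List[str] = []
    for sub in decompose(question, llm):
        for doc_id in search(sub)[:k]:
            if doc_id not in merged:
                merged.append(doc_id)
    return merged

# Toy stand-ins (hypothetical) so the sketch runs without a model or an index.
def fake_llm(prompt: str) -> str:
    return "- What does plan Alpha cost?\n- What does plan Beta cost?"

def fake_search(query: str) -> List[str]:
    return ["doc_alpha", "doc_shared"] if "Alpha" in query else ["doc_beta", "doc_shared"]

print(retrieve_multi("Compare the prices of Alpha and Beta", fake_llm, fake_search))
```

The number of searches equals the number of sub-questions, which is the linear fan-out cost mentioned above.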
Routing
Before retrieving, a router directs the query to the right source: vector DB for unstructured data, SQL for structured, web for current events, a specific business tool. Adaptive RAG pushes the idea further: a lightweight classifier predicts the query's difficulty (no retrieval / single-step / multi-step) and picks the pipeline depth — cutting 30–50% of average cost on mixed traffic by skipping the agent for easy lookups.
Routed well, a system can answer a plain "hello" with no retrieval at all, and engage the full agentic machinery only on the questions that deserve it.
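To make the routing idea concrete, here is a deliberately naive depth router. It is a keyword heuristic standing in for the lightweight classifier described above; the cue lists and the three labels (`none`, `single`, `multi`) are assumptions chosen for the example, and a real system would more likely train a small model on labeled traffic.

```python
import re

# Hypothetical cue lists, for illustration only.
MULTI_STEP_CUES = ("compare", "versus", " vs ", "difference between", "trend")
SMALL_TALK = {"hello", "hi", "thanks", "thank you", "bye"}

def route_depth(query: str) -> str:
    """Return 'none', 'single' or 'multi' (toy heuristic, not a trained classifier)."""
    text = re.sub(r"[^\w\s]", "", query.lower()).strip()
    if text in SMALL_TALK:
        return "none"      # answer directly, no retrieval
    padded = f" {text} "
    if any(cue in padded for cue in MULTI_STEP_CUES):
        return "multi"     # engage the agentic loop
    return "single"        # one-shot retrieval is enough

for q in ["Hello!", "What is the refund policy?", "Compare the 2021 and 2022 reports"]:
    print(q, "->", route_depth(q))
```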
Retrieve–reason loops (ReAct)
The ReAct pattern alternates reason and act: each turn the agent thinks, calls a tool (retrieval, SQL, web, MCP server), observes the result, and repeats until it can answer. This is essential whenever you mix unstructured sources, structured data, and current events. Two guardrails are mandatory: an iteration cap (5–6) to avoid infinite loops, and a bound on tool-call depth.
Concretely, the loop is modeled as an explicit state graph, which tools like LangGraph make natural:
- nodes = steps (retrieve, grade, rewrite, generate, check);
- conditional edges = decisions ("relevant documents?" → generate / rewrite / web);
- shared state = the current query, the kept documents, the iteration counter.
This "flow engineering" makes the logic auditable: each decision is a named point in the graph, traceable and testable — far more maintainable than a monolithic prompt that "decides on its own."
Here is the skeleton of such a loop, where retrieval is a node called again until the checks pass:
```python
# Agentic RAG state graph (pseudo-LangGraph).
# retrieve, grade_documents, web_search, rewrite_query, generate, is_grounded,
# answers and escalate_to_human are placeholders supplied by the application.
def agent_loop(question, max_iters=5):
    state = {"question": question, "iters": 0, "docs": []}
    while state["iters"] < max_iters:
        state["iters"] += 1
        state["docs"] = retrieve(state["question"])        # node: retrieve
        verdict = grade_documents(state)                   # node: grade
        if verdict["route"] == "web_search":
            state["docs"] = web_search(state["question"])  # external fallback (CRAG)
        elif verdict["route"] == "rewrite":
            state["question"] = rewrite_query(state)       # conditional edge
            continue
        answer = generate(state)                           # node: generate
        if is_grounded(answer, state["docs"]) and answers(answer, question):
            return answer                                  # Self-RAG checks pass
        state["question"] = rewrite_query(state)           # else: loop again
    return escalate_to_human(question)                     # guardrail: cap reached
```
The output is never "the first generation that came out": it must pass two checks (grounding + answering the question) inherited from Self-RAG, otherwise the loop rewrites the query and retries — up to the cap, where you escalate to a human rather than spinning forever.
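The same logic can also be written as explicit nodes and conditional edges, which is closer to how a state-graph framework presents it. The runner below is a minimal plain-Python sketch and not LangGraph's API; `run_graph`, the node names, and the toy retriever are all assumptions made for illustration.

```python
from typing import Callable, Dict

State = dict
Node = Callable[[State], State]
Edge = Callable[[State], str]  # returns the name of the next node, or "END"

def run_graph(nodes: Dict[str, Node], edges: Dict[str, Edge],
              state: State, start: str, max_steps: int = 20) -> State:
    """Walk the graph until an edge returns "END" or the step cap is hit."""
    current = start
    for _ in range(max_steps):
        state = nodes[current](state)
        state.setdefault("trace", []).append(current)  # auditable path
        current = edges[current](state)
        if current == "END":
            return state
    state["escalated"] = True  # guardrail: cap reached
    return state

def retrieve(state: State) -> State:
    # Toy retriever: only the rewritten query finds a document.
    state["docs"] = ["doc_1"] if state["query"].endswith("(rewritten)") else []
    return state

def rewrite(state: State) -> State:
    state["query"] += " (rewritten)"
    return state

def generate(state: State) -> State:
    state["answer"] = f"Answer based on {state['docs']}"
    return state

nodes = {"retrieve": retrieve, "rewrite": rewrite, "generate": generate}
edges = {
    "retrieve": lambda s: "generate" if s["docs"] else "rewrite",
    "rewrite": lambda s: "retrieve",
    "generate": lambda s: "END",
}

final = run_graph(nodes, edges, {"query": "reset procedure"}, start="retrieve")
print(final["trace"])
```

Each decision lives in one named edge function, so it can be unit-tested on its own, and the recorded trace shows which path a given query took.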
Self-RAG: reflection via tokens
Self-RAG trains the model to emit reflection tokens that make it self-evaluate at each step:
- Retrieve: should it retrieve (yes / no / continue)?
- ISREL: are the retrieved passages relevant?
- ISSUP: is the generation supported by the passages (anti-hallucination)?
- ISUSE: is the answer useful for the question (1–5 rating)?
The original paper actually trains two models: a critic that learns to predict these tokens, and a generator that emits them as it decodes. At inference, a segment-level beam search selects the continuation that maximizes a linear combination of text likelihood and the probabilities of "desirable" tokens (ISREL = relevant, ISSUP = supported, high ISUSE). The weight of each token is a tunable dial at inference time: harden ISSUP for a sensitive medical task, relax it for creative writing — no retraining needed.
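A small numeric sketch may help show how the inference-time weights change which continuation wins. The scoring below follows the description above (text likelihood plus a weighted sum of desirable-token probabilities); the candidate values, the weight settings, and the use of a log-likelihood as the likelihood term are illustrative assumptions, not figures from the paper.

```python
from typing import Dict, List

def segment_score(likelihood: float, critique: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Text likelihood plus a weighted sum of desirable-token probabilities."""
    return likelihood + sum(w * critique.get(name, 0.0) for name, w in weights.items())

def pick_best(candidates: List[dict], weights: Dict[str, float]) -> dict:
    """Select the candidate segment with the highest combined score."""
    return max(candidates,
               key=lambda c: segment_score(c["likelihood"], c["critique"], weights))

# Made-up candidates: probabilities of the "desirable" value of each token.
candidates = [
    {"text": "fluent but weakly supported", "likelihood": -1.0,
     "critique": {"ISREL": 0.9, "ISSUP": 0.4, "ISUSE": 0.8}},
    {"text": "less fluent, well supported", "likelihood": -1.3,
     "critique": {"ISREL": 0.8, "ISSUP": 0.95, "ISUSE": 0.7}},
]

relaxed = {"ISREL": 1.0, "ISSUP": 0.2, "ISUSE": 0.5}  # e.g. creative writing
strict = {"ISREL": 1.0, "ISSUP": 2.0, "ISUSE": 0.5}   # e.g. sensitive medical task

print(pick_best(candidates, relaxed)["text"])
print(pick_best(candidates, strict)["text"])
```

With the relaxed weights the more fluent candidate wins; raising the ISSUP weight flips the choice to the better-supported one, with no change to the model itself.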
In practice (e.g. with LangGraph), grading nodes implement this logic without a specially trained model: you grade document relevance; if all fail, you rewrite the query and retrieve again. After generation, two checks: is the answer grounded in the documents (anti-hallucination) and does it actually answer the question? If either fails, the loop restarts with an improved query.
Corrective RAG (CRAG): grade, then correct
CRAG inserts a retrieval evaluator that assigns a confidence score to the documents and routes to three actions:
- Correct (at least one document above the upper threshold) → use the documents, but refine them first.
- Incorrect (all documents below the lower threshold) → discard them and trigger a web search (external fallback, e.g. via a search API).
- Ambiguous (in between) → a mixed action, combining internal documents and a web supplement.
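The three-way decision above reduces to a comparison against two thresholds. In this sketch the threshold values (0.7 and 0.3) and the choice to treat an empty retrieval as "incorrect" are assumptions for illustration; in practice they would be tuned per corpus.

```python
from typing import List

def crag_action(scores: List[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Map evaluator confidence scores to 'correct', 'incorrect' or 'ambiguous'."""
    if not scores:
        return "incorrect"   # assumption: nothing retrieved counts as a miss
    best = max(scores)
    if best > upper:
        return "correct"     # at least one document above the upper threshold
    if best < lower:
        return "incorrect"   # every document below the lower threshold
    return "ambiguous"       # in between: combine internal docs and web results

print(crag_action([0.9, 0.1]), crag_action([0.1, 0.2]), crag_action([0.5, 0.2]))
```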
Refinement uses a decompose-then-recompose strategy: each document is split into knowledge strips, each strip is re-scored by the evaluator, off-topic strips are filtered out, and only the relevant ones are stitched back. This avoids polluting the context with neighboring passages that, even if correct, do not serve the question. Because the evaluator and the refinement step sit between retrieval and generation, CRAG is plug-and-play: it slots into any existing RAG pipeline.
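The decompose-then-recompose step can be sketched in a few lines. Here a strip is simply a sentence, and the evaluator is a toy word-overlap score; both are simplifying assumptions, since a real system would use the trained retrieval evaluator and possibly a different strip granularity.

```python
import re
from typing import Callable

def refine(document: str, question: str,
           score: Callable[[str, str], float], threshold: float = 0.5) -> str:
    """Split into sentence-level strips, keep the relevant ones, stitch them back."""
    strips = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    kept = [s for s in strips if score(question, s) >= threshold]
    return " ".join(kept)

def overlap_score(question: str, strip: str) -> float:
    """Toy evaluator: fraction of question words that appear in the strip."""
    q = set(re.findall(r"\w+", question.lower()))
    s = set(re.findall(r"\w+", strip.lower()))
    return len(q & s) / len(q) if q else 0.0

doc = ("The warranty lasts two years. Our office is in Lyon. "
       "Warranty claims require a receipt.")
print(refine(doc, "warranty claims", overlap_score))
```

The sentence about the office is correct but does not serve the question, so it is dropped before the context reaches the generator.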
Self-RAG and CRAG are complementary: CRAG improves the quality of the evidence (the input), while Self-RAG improves how the model reasons over that evidence (the output). You can chain both: CRAG cleans and corrects the inputs, then Self-RAG reflects on the final answer.
Pattern recap
Production systems rarely use a single pattern. Here is how they position themselves:
- Self-RAG — the model emits reflection tokens and self-grades. Shines when retrieval signals are noisy; wastes tokens on a reliable corpus.
- CRAG — an external evaluator sorts into three paths (keep / web / hybrid). Ideal for heterogeneous bases mixing product docs and forum threads.
- Adaptive RAG — a classifier predicts difficulty upfront and picks the depth. Cuts 30–50% of average cost on varied traffic.
- ReAct — reason/act loop with heterogeneous tools (vectors, SQL, web, MCP). Essential in multi-source environments.
- Multi-hop decomposition — parallel sub-questions then recomposition. Linear cost with fan-out; perfect for comparisons and aggregations.
The practical rule: start simple (hybrid + rerank), add Adaptive routing to preserve speed, then enable Self-RAG/CRAG/multi-hop only on the paths that need it.
```python
# Sketch of a CRAG evaluation node (pseudo-LangGraph).
# `grader` is a placeholder for an LLM or classifier that labels one (question, document) pair.
def grade_documents(state):
    question, docs = state["question"], state["docs"]    # same state keys as the agent loop
    kept = [d for d in docs if grader(question, d).score == "relevant"]
    if len(kept) == 0:
        return {"route": "web_search", "documents": []}  # nothing relevant: external fallback
    if len(kept) < len(docs):
        return {"route": "rewrite", "documents": kept}   # medium confidence
    return {"route": "generate", "documents": kept}      # high confidence
```
Quick recap: hybrid + reranking
Before reasoning, you must retrieve well. Without re-explaining embeddings, keep the two-stage pipeline that has become standard:
- Hybrid search — fuse lexical (BM25, robust on rare terms, codes, proper nouns) and dense (semantic) retrieval; return a wide top-50.
- Reranking — a cross-encoder (Cohere, Voyage…) finely re-scores the (query, passage) pair and keeps only the top-5. The gain is clear: Recall@5 typically rises from ~0.70 (hybrid alone) to ~0.82 with reranking.
Dense alone misses exact lexical matches; BM25 alone misses synonyms. Hybrid covers both, the reranker sorts it out. This is the foundation the agentic loops operate on.
A few practical reminders on chunking, without reopening the embeddings debate:
- Semantic chunking (512–1024 tokens, with 50–100 tokens of overlap) usually beats fixed-size chunking because it respects meaning boundaries.
- Metadata on each chunk (source, date, section) for filtering before vector search.
- Score fusion (Reciprocal Rank Fusion) to combine BM25 and dense rankings robustly, without calibrating incompatible scales.
A reranker costs one extra model call per query, but on a top-50 narrowed to top-5 it directly improves downstream faithfulness: less noise in the context, fewer hallucinations to correct later.
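Reciprocal Rank Fusion, mentioned above, is short enough to show in full. Each document receives 1 / (k + rank) from every ranking it appears in, and the contributions are summed; k = 60 is the commonly used default. The document IDs are made up, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) per document (rank starts at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["d3", "d1", "d2"]    # lexical ranking (toy data)
dense_ranking = ["d1", "d4", "d3"]   # semantic ranking (toy data)
print(rrf([bm25_ranking, dense_ranking]))
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale.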
GraphRAG: structuring the corpus into a graph
Vector RAG fails on global questions because it retrieves "a subset of individually relevant passages," yet a synthesis question has no single passage that contains it. GraphRAG (Microsoft, From Local to Global) answers by pre-building a knowledge graph and then summarizing it hierarchically. The pipeline has six stages: four at indexing time, followed by two query modes:
- Chunking + extraction. The LLM reads each chunk and extracts triples (entity, relationship, entity) and claims — e.g. Steve Jobs — founded — Apple. Tailored prompts and self-reflection (the LLM rereads for missed entities) allow larger chunks without recall loss.
- Graph construction. Entity instances are deduplicated and aggregated: nodes are entities, edges are relationships, weighted by their frequency across the corpus.
- Community detection (Leiden). The Leiden algorithm (an improved Louvain) recursively partitions the graph into nested communities: leaf level (highly related entities), intermediate levels, root level — a true tree of communities.
- Community summaries. Each community gets a report-like summary, generated bottom-up; higher levels reuse the lower-level summaries under space constraints.
- Query (map-reduce). For a global question, you map across community summaries in parallel (partial answers + helpfulness score), then reduce (sort by score → final synthesis).
- Local query. For a precise factual question, local search starts from the entities matching the query, expands to neighbors and relationships, retrieves the associated chunks, and answers over this combined context.
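The graph-construction and local-query stages can be illustrated without any graph library. In this sketch, duplicate triples are aggregated into weighted edges and a local query expands from seed entities to their neighbors. The triples are toy data, the function names are assumptions, and community detection (Leiden) is deliberately left out because it needs a dedicated library.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (entity, relationship, entity)

def build_graph(triples: Iterable[Triple]):
    """Aggregate duplicate triples: edge weight = number of occurrences."""
    weights = Counter(triples)
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for head, _, tail in weights:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    return weights, neighbors

def local_neighborhood(neighbors: Dict[str, Set[str]],
                       seeds: Iterable[str], hops: int = 1) -> Set[str]:
    """Expand from the query's entities to everything within `hops` edges."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in neighbors.get(e, ())} - seen
        seen |= frontier
    return seen

# Toy triples, as if extracted from several chunks.
triples = [
    ("Steve Jobs", "founded", "Apple"),
    ("Steve Jobs", "founded", "Apple"),   # same fact seen in a second chunk
    ("Apple", "acquired", "NeXT"),
    ("NeXT", "developed", "NeXTSTEP"),
]
weights, neighbors = build_graph(triples)
print(weights[("Steve Jobs", "founded", "Apple")])
print(sorted(local_neighborhood(neighbors, ["Steve Jobs"], hops=2)))
```

A real local search would then pull the text chunks attached to the selected entities and relationships into the context.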
Extraction quality is the critical step: a graph built from poorly extracted or poorly deduplicated entities propagates its errors everywhere. Three levers:
- Domain-tailored prompts, with few-shot examples, to frame the expected entity types.
- Extraction self-reflection: the LLM rereads the chunk to spot missed entities, which allows larger chunks without recall loss.
- Entity resolution (deduplication): merge "Apple," "Apple Inc.," and "the Cupertino firm" into a single node, or the graph fragments.
Note the parallel: self-reflection here serves indexing, whereas in Agentic RAG it serves querying. The same idea — the model rereads itself — operates at two different moments of the pipeline.
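Entity resolution is easiest to picture as a canonicalization pass over the extracted triples. The hand-written alias table below is a stand-in assumption; production systems typically combine string similarity, embeddings, or an LLM pass to discover such aliases.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def resolve_entities(triples: List[Triple], aliases: Dict[str, str]) -> List[Triple]:
    """Rewrite every entity mention to its canonical name (lookup is case-insensitive)."""
    def canon(name: str) -> str:
        return aliases.get(name.strip().lower(), name.strip())
    return [(canon(head), rel, canon(tail)) for head, rel, tail in triples]

# Hypothetical alias table, keyed by lowercased surface form.
ALIASES = {"apple": "Apple", "apple inc.": "Apple", "the cupertino firm": "Apple"}

raw = [
    ("Apple Inc.", "released", "iPhone"),
    ("the Cupertino firm", "released", "iPhone"),
]
print(resolve_entities(raw, ALIASES))
```

After resolution both mentions point at the same node, so the two triples aggregate into a single weighted edge instead of fragmenting the graph.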
Figure: entity graph, Leiden communities, and hierarchical summaries. Original diagram.
Global vs local: when to use which
- Local query — targeted factual questions ("Who leads X?"). Combines an entity subgraph + text chunks. Close to vector RAG, but better grounded in relationships.
- Global query — synthesis questions ("What are the dominant themes?", "What tensions run through this corpus?"). Uses community summaries via map-reduce.
The community level queried is a dial: the root (C0) gives a very condensed, cheap view; the leaves give detail at the cost of far more tokens. In the Microsoft study, the intermediate level offers the best comprehensiveness/cost balance — a good starting default.
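The global query is a plain map-reduce over community summaries. The sketch below assumes two injected callables: `map_fn`, returning a partial answer with a helpfulness score, and `reduce_fn`, producing the final synthesis. Both would be LLM calls in practice; the word-count scorer, the string-joining reducer, and the decision to drop zero-score partials are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Partial = Tuple[str, int]  # (partial answer, helpfulness score)

def global_query(question: str, summaries: List[str],
                 map_fn: Callable[[str, str], Partial],
                 reduce_fn: Callable[[str, List[str]], str],
                 top_n: int = 3) -> str:
    """Map over community summaries in parallel, then reduce the best partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: map_fn(question, s), summaries))
    useful = [p for p in partials if p[1] > 0]       # assumption: drop unhelpful partials
    useful.sort(key=lambda p: p[1], reverse=True)    # sort by helpfulness
    return reduce_fn(question, [text for text, _ in useful[:top_n]])

# Toy stand-ins for the two LLM calls.
def toy_map(question: str, summary: str) -> Partial:
    words = question.lower().split()
    return summary, sum(word in summary.lower() for word in words)

def toy_reduce(question: str, partial_answers: List[str]) -> str:
    return " | ".join(partial_answers)

summaries = [
    "Community A: pricing disputes between vendors.",
    "Community B: hiring and onboarding.",
    "Community C: pricing pressure from regulators and vendors.",
]
print(global_query("pricing regulators vendors", summaries, toy_map, toy_reduce))
```

Choosing a community level amounts to choosing which list of summaries is passed in: fewer, broader summaries near the root, or many detailed ones near the leaves.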
When GraphRAG beats vector RAG
On global sensemaking questions, GraphRAG clearly dominates vector RAG: in the Microsoft study, win rates of 72–83% on comprehensiveness and 62–82% on diversity of perspectives, on corpora of about one million tokens. Better still: querying the root summaries (C0) consumes 9× to 43× fewer context tokens than the "all source text" approach, while keeping a ~72% comprehensiveness advantage over vector RAG. Intermediate-level summaries offer the best trade-off.
The downside is indexing cost: building the graph is heavy (LLM extraction on every chunk, summaries at each level) — the study reports ~281 minutes of indexing for a "podcast" corpus with GPT-4-turbo. So GraphRAG is not a universal replacement for vector RAG: it is the tool for global questions and corpora where relational structure carries the meaning.
Combining GraphRAG and Agentic RAG
The two approaches don't compete: an agentic router can pick the strategy based on the question. A frequent 2026 pattern:
- The router classifies the query: targeted factual, relational multi-hop, or global synthesis.
- Factual → hybrid vector RAG + rerank (fast, cheap).
- Relational / multi-hop → GraphRAG local query (follow the graph's edges over several hops).
- Global → GraphRAG global query (map-reduce over community summaries).
- A Self-RAG/CRAG loop wraps it all to verify and, if needed, correct or fall back to the web.
The graph thus becomes one tool among others in the ReAct loop, invoked only when relational structure matters.
Evaluating: RAGAS and beyond
You don't run an agentic system without measurement. RAGAS provides per-query metrics, computed by an LLM judge:
- Faithfulness (target ≥ 0.9) — are the answer's claims supported by the retrieved context? This is the anti-hallucination metric.
- Answer relevancy (≥ 0.85) — does the answer match the question?
- Context precision / relevancy (≥ 0.8) — were the right chunks retrieved?
- Context recall — does the context cover all the information needed for the reference answer?
These metrics are computed per query on a test set, which lets you objectively compare two configurations (with/without rerank, with/without CRAG) before touching production.
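To show what a faithfulness-style score measures, the sketch below computes the share of answer claims supported by the context and compares metrics against the targets listed above. This is not the RAGAS API: claim extraction and the support check are LLM steps in practice, and the substring judge here is a toy assumption.

```python
from typing import Callable, Dict, List

# Targets taken from the list above.
TARGETS = {"faithfulness": 0.9, "answer_relevancy": 0.85, "context_precision": 0.8}

def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Share of the answer's claims that the retrieved context supports."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c, context)) / len(claims)

def below_target(metrics: Dict[str, float]) -> List[str]:
    """Names of the metrics that miss their target, sorted alphabetically."""
    return sorted(name for name, target in TARGETS.items()
                  if metrics.get(name, 0.0) < target)

def substring_judge(claim: str, context: str) -> bool:
    """Toy judge: a claim counts as supported if it appears verbatim in the context."""
    return claim.lower() in context.lower()

context = "The warranty lasts two years and covers parts."
claims = ["the warranty lasts two years", "covers labor"]
score = faithfulness(claims, context, substring_judge)
print(score, below_target({"faithfulness": score,
                           "answer_relevancy": 0.9,
                           "context_precision": 0.85}))
```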
Three operational precautions: (1) the evaluator paradox — asking the same fallible LLM to grade itself creates circularity; mitigate by ensembling several judges, freezing a golden set, and having humans review ~5% of traces; (2) trajectory tracing (Phoenix, Langfuse, OpenTelemetry) to debug each agent iteration; (3) drift monitoring (knowledge base, embeddings, evaluation) weekly against a frozen reference set.
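One simple way to ensemble judges is a strict majority vote that sends disagreements to a human. The verdict labels and the `needs_human_review` fallback are assumptions made for this sketch.

```python
from collections import Counter
from typing import List

def ensemble_verdict(verdicts: List[str]) -> str:
    """Return the strict-majority verdict, or flag the trace for human review."""
    if not verdicts:
        return "needs_human_review"
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else "needs_human_review"

print(ensemble_verdict(["pass", "pass", "fail"]))
print(ensemble_verdict(["pass", "fail"]))
```

Using judges from different model families reduces the risk that all of them share the same blind spot.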
Costs and operational trade-offs
Agentic RAG is powerful but expensive: expect 3× to 10× more tokens than plain RAG, and 2× to 5× the latency (from ~1–2 s to ~8–12 s per query). At 10,000 queries/day, a ~$500/day cost can climb to $1,500–5,000/day without optimization. The levers: adaptive routing (run the loop only on ambiguous questions), semantic caching, iteration caps, and per-query token budgets.
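The cost figures above follow from simple arithmetic, which also shows why routing is the main lever. The sketch assumes cost scales linearly with tokens and that a fraction `agentic_share` of traffic goes through the loop; both are simplifications, and the 25% share is an arbitrary example value.

```python
def daily_cost(baseline_per_day: float, token_multiplier: float,
               agentic_share: float = 1.0) -> float:
    """Blend plain-RAG cost with agentic cost for the routed share of traffic."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    return baseline_per_day * ((1.0 - agentic_share) + agentic_share * token_multiplier)

print(daily_cost(500, 3))          # every query agentic, 3x tokens
print(daily_cost(500, 10))         # every query agentic, 10x tokens
print(daily_cost(500, 10, 0.25))   # only a quarter of queries routed to the loop
```

With everything routed to the agent, $500/day becomes $1,500 to $5,000/day; sending only a quarter of the traffic through the loop keeps even the 10× case at $1,625/day.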
When not to be agentic: FAQs, single-fact lookups, sub-3-second UX constraints. The right reflex is Adaptive RAG — preserve speed on easy questions, reason in a loop only on those that truly require it.
Recurring production gotchas to anticipate from day one:
- Infinite loops — cap at 5–6 iterations then escalate to a human.
- Knowledge-base drift — track Recall@5 weekly against a frozen set.
- Cost tail: watch the p99, which is far worse than the mean, rather than the median cost alone.
- Tool-call cascades — bound the tool-graph depth (≤ 3).
- Evaluator paradox — ensemble the judges, freeze the golden set, review ~5% by hand.
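The cost-tail point is easy to see on a toy distribution. The nearest-rank percentile below and the per-query costs are illustrative assumptions; real monitoring would read these values from traces.

```python
import math
from statistics import mean
from typing import List

def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Toy per-query costs in dollars: most queries are cheap, a few loop many times.
costs = [0.01] * 98 + [0.50, 2.00]

print("median:", percentile(costs, 50))
print("mean:  ", round(mean(costs), 4))
print("p99:   ", percentile(costs, 99))
```

The median says nothing about the expensive tail; alerting on the p99 (and on the maximum) catches runaway loops that a median dashboard hides.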
Implementation checklist
A pragmatic sequence for moving from prototype to production:
- Foundation: hybrid BM25 + dense, semantic chunking, cross-encoder reranking.
- Measurement: wire up RAGAS (faithfulness, relevancy, context precision) + tracing (OpenTelemetry) from the start.
- Adaptive routing: a difficulty classifier so you only pay for reasoning when it pays off.
- Loops: add Self-RAG (reflection) and CRAG (correction + web fallback) on the hard paths.
- Graph: introduce GraphRAG for global / relational questions, as a tool of the loop.
- Guardrails: iteration caps, token budgets, semantic caching, drift alerts.
In summary
Three layers stack up. Retrieval: hybrid (BM25 + dense) + reranking, the foundation. Agentic RAG: rewriting/decomposition, routing, ReAct loops, Self-RAG (reflection) and CRAG (correction + web fallback) to turn retrieval into verified reasoning. GraphRAG: a knowledge graph + Leiden communities + hierarchical summaries for the global questions vector search cannot cover.
None of these layers is free: every loop, every graph, every judge adds tokens and latency. Maturity therefore means measuring before adding — instrument with RAGAS and tracing, route by difficulty, and invest in reasoning only where the quality gain justifies the overhead. Naive RAG is not obsolete: it remains the right choice for the majority of factual queries. The art, in 2026, is to spend reasoning where it pays off, and stay one-shot everywhere else.