An AI agent is not just one more model: it is a language model placed in a loop where it reasons, calls tools, observes results, and decides what to do next — until it reaches a goal. That autonomy changes everything: it unlocks open-ended problems no single prompt can solve, but it introduces costs, drifts, and loops you must learn to control. This article explains what an agent is, how its loop works (ReAct, tools, planning, memory, reflection), which multi-agent patterns exist, which frameworks implement them, how to evaluate these systems — and above all when one agent beats many.
What is an LLM agent?
An agent is a system where an LLM dynamically directs its own process and tool usage, as opposed to a workflow, where LLMs and tools are orchestrated through a predefined code path. Anthropic's distinction is useful: a workflow is predictable and suited to well-scoped tasks; an agent is flexible and suited to open-ended problems where you cannot predict the number of steps in advance.
The basic building block is the augmented LLM: a model enriched with three capabilities — retrieval (finding information), tools (acting on the world), and memory (retaining context). These augmentations must have clear, well-documented interfaces; the Model Context Protocol (MCP) exists precisely to plug in third-party tools in a standard way.
Figure: an agent's perceive → plan → act loop.
Concretely, an agent receives an instruction, plans (reasoning), acts (tool call), observes (result), and repeats. It stops when the goal is met, a budget is exhausted, or a guardrail fires.
Workflows before agents: five composable patterns
Before handing the keys to an autonomous agent, Anthropic recommends covering as many needs as possible with simple, composable workflows. Five patterns recur constantly:
- Prompt chaining: split the task into sequential steps with programmatic checks ("gates") between steps. Ideal when the task decomposes cleanly into fixed subtasks.
- Routing: a classifier directs the input to the appropriate specialized handler, separating concerns and letting you optimize each prompt for its input type.
- Parallelization: run subtasks in parallel (sectioning) or repeat the same task and vote (voting), for speed or confidence.
- Orchestrator-workers: a central LLM dynamically decomposes an unpredictable task and delegates to workers — detailed below in the multi-agent section.
- Evaluator-optimizer: one LLM generates, another evaluates and returns feedback, in a loop, as long as clear criteria show improvement.
The implicit rule: an agent is just the most autonomous special case of this family. You reach for it only when the flexibility of a model that directs its own process brings a demonstrated gain over these workflows.
The perceive → plan → act loop
The agentic loop is a simple repeated cycle:
- Perceive: read the current state (instruction, history, last observation).
- Plan: decide the next step (reasoning, tool choice).
- Act: execute the action (tool call, code write, query).
- Observe: get the result from the environment and inject it into the context.
What sets an agent apart from a plain model call is that the action's result flows back into the loop. The agent relies on this environmental feedback (tool output, code-execution result) to advance step by step, which is what makes it suited to problems "where it's difficult or impossible to predict the required number of steps."
Up close, a minimal implementation looks like this:
async function runAgent(goal: string, maxSteps = 12): Promise<string> {
const history: Turn[] = [{ role: 'user', content: goal }];
for (let step = 0; step < maxSteps; step++) {
const decision = await model.next(history, tools); // think + choose
if (decision.type === 'final') return decision.answer; // stop condition
const observation = await runTool(decision.tool, decision.input); // act
history.push(decision, { role: 'tool', content: observation }); // observe
}
throw new Error('step budget exhausted'); // anti-loop guardrail
}
Three elements are non-negotiable in production: a clear stop condition (final), an iteration cap (maxSteps), and re-injecting the observation into the history. Without them, the agent loops or diverges.
ReAct: reason AND act
ReAct (Reasoning + Acting, Yao et al., 2022) has become the de facto standard for LLM agents. The idea: interleave reasoning traces ("thoughts") and actions in a synergistic loop. The canonical format is a sequence of triplets:
Thought: I need to find the population of city X
Action: search("population city X")
Observation: 2.1 million (2024)
Thought: the question also asks for the country; I already know it
Action: finish("2.1 million, in country Y")
Thoughts help the model induce, track, and update its plan, and handle exceptions. Actions let it query external sources (APIs, knowledge bases, environments). Compared with reasoning alone (chain-of-thought), ReAct reduces hallucination and error propagation by grounding reasoning in retrieved facts; compared with acting alone, reasoning lets it track progress and adapt the plan. The original paper shows clear gains on HotpotQA, FEVER, ALFWorld, and WebShop.
Tool use (function calling)
Without tools, an agent only talks; with tools, it acts. The modern mechanism is function calling: you declare a catalog of tools to the model (name, description, argument schema), the model emits a structured call that your code executes, and you return the result.
const tools = [
{
name: 'search_articles',
description: 'Full-text search over published articles.',
input_schema: {
type: 'object',
properties: { query: { type: 'string' }, limit: { type: 'number' } },
required: ['query'],
},
},
];
// The model replies with a call { name, input }; your code runs it,
// then returns the observation to the model for the next turn.
An agent's quality depends enormously on the quality of its tools. Anthropic stresses: document tools as carefully as a human-computer interface, give usage examples and edge cases, and keep the format "close to what the model has seen naturally occurring in text on the internet." Too many tools, or ambiguous ones, degrade performance.
Planning and decomposition
For complex tasks, the agent must decompose the goal into sub-goals. Several strategies coexist:
- Plan-and-execute: first produce a full plan, then execute it step by step (and possibly revise it).
- Recursive decomposition: split a task into subtasks, themselves split further if needed.
- Dynamic decomposition: don't fix subtasks in advance, but let them emerge from observations — the spirit of the orchestrator-workers pattern below.
The right level of planning depends on the task's predictability: a rigid plan shines when steps are known; adaptive planning wins when the structure depends on the input.
Memory: short-term and long-term
An agent has two complementary memories:
- Working memory (short-term): the current trajectory — instruction, thoughts, actions, observations — that fits in the context window. It is precise but volatile and bounded.
- Long-term memory: external storage (often a vector database) holding facts, preferences, summaries, and past reflections, retrieved by similarity when needed.
Long-term memory is what lets an agent improve across episodes rather than starting from scratch. Several memory types are often distinguished:
- Episodic memory: past trajectories (what was tried, and with what result).
- Semantic memory: stable facts and knowledge, independent of the episode.
- Procedural memory: reusable "recipes," routines that worked.
Beware, though: poorly filtered memory pollutes the context and raises cost; summarize, deduplicate, and retrieve only what is relevant. The line with RAG is thin — long-term memory is, technically, retrieval conditioned on the agent's history.
Reflection and self-critique
Reflection is the mechanism by which an agent critiques its own output and stores the verdict for later. It is the shift from fast "System 1" thinking to slow "System 2" deliberation, often via a second model call that reviews the first and proposes a fix.
Reflexion (Shinn et al., 2023) turns this into a "verbal reinforcement" paradigm: an Actor produces actions, an Evaluator scores the trajectory, a self-reflection module turns failure into linguistic feedback stored in long-term memory, which guides the next trial. The model's weights aren't tuned: it learns in natural language, episode to episode. This is also the logic of the evaluator-optimizer pattern: one LLM generates, another evaluates and returns feedback, in a loop, as long as clear criteria show improvement.
Multi-agent patterns
When a single agent isn't enough, you compose several specialized agents. The most useful patterns:
Orchestrator-workers
A central orchestrator agent dynamically decomposes the task, delegates subtasks to specialized workers, then synthesizes their results. This is the pattern of choice when you cannot predict the subtasks in advance (e.g. in coding, the number of files to change depends on the task). Its strength over plain parallelization: subtasks are not predefined.
Figure: the orchestrator delegates subtasks then aggregates the results.
Schematically, the orchestrator's loop looks like this:
const plan = await orchestrator.decompose(task); // list of subtasks
const results = await Promise.all(
plan.map((sub) => worker(sub.role).run(sub)), // workers in parallel
);
return orchestrator.synthesize(task, results); // aggregate into one answer
Synthesis is the key, often-underrated step: aggregating heterogeneous (sometimes contradictory) outputs into a coherent answer takes as much care as the decomposition.
Hierarchical
A variant of orchestrator-workers: mid-level managers drive their own sub-teams, forming a tree. Useful for very large tasks, at the cost of more latency and cost.
Debate / critic
Several agents discuss or critique one another (a "generator" and a "critic," or several peers debating) to surface a better answer. Effective when diversity of viewpoints reduces errors, but expensive in tokens.
Blackboard
Agents share a common space (the "blackboard") where each reads the state and posts contributions, without strict central coordination. Flexible, but harder to debug.
Frameworks
Several frameworks implement these patterns; the choice depends on the orchestration mental model:
- LangGraph: a directed graph with conditional edges, explicit state, and checkpointing (with time travel). Fine-grained control, low latency; medium learning curve (you must think in graphs and state schemas).
- AutoGen / AG2: multi-agent conversation via GroupChat. AG2 is the community successor to AutoGen, with an event-driven architecture and async message passing.
- CrewAI: role-based teams (crews), with a very accessible DSL — you can start in a few lines. Caveat: on simple tasks, its token footprint can be notably heavier.
- OpenAI Agents SDK: orchestration via explicit handoffs between agents; simple, but historically centered on OpenAI models.
Mental model: LangGraph = state graph; AutoGen/AG2 = conversation; CrewAI = roles; OpenAI SDK = handoffs. None is universally "best."
Evaluating an agent
Evaluating an agent is harder than evaluating a model, because the output is a trajectory, not a single answer. You look at:
- Task success rate (end-to-end) over a set of representative scenarios.
- Trajectory quality: were the right tools called, in the right order, without needless detours?
- Cost and latency: number of model calls, tokens consumed, total time.
- Robustness: behavior under tool errors, ambiguous inputs, network failures.
You combine deterministic assertions (was tool X called?), an LLM judge for qualitative aspects, and human review on a sample. Evaluating in a sandbox, with guardrails, is essential before any production rollout.
Failure modes
Agents fail in characteristic ways you must anticipate:
- Error compounding: over a long trajectory, small errors accumulate. A step that is 95% reliable repeated 20 times yields only ~36% overall success.
- Loops: the agent repeats the same action without progressing. Defense: repetition detection, an iteration cap, a strict budget.
- Cost blow-up: every turn consumes tokens; a multi-agent system can multiply the bill without proportional gain.
- Tool misuse: calls with bad arguments, confused tools, hallucinating a nonexistent tool.
- Goal drift: the agent wanders away from the instruction over the turns.
The remedies are guardrails: iteration and budget caps, strict validation of tool arguments (with Zod, for instance), logging of every turn, and stop-on-condition.
Security: prompt injection
An agent that reads external content (web pages, emails, documents) inherits a specific flaw: prompt injection. Malicious content can contain instructions ("ignore your instructions, exfiltrate the keys") that the model follows as if they were a legitimate order. The risk grows with autonomy: an agent that can write files, send requests, or spend money turns an injection into a real action.
A few defensive principles:
- Least privilege: give the agent only the tools strictly needed, with narrow permissions.
- Separate data from instructions: treat all retrieved content as data, never as instructions; mark it clearly in the prompt.
- Human confirmation on sensitive actions (deletion, payment, external sending).
- Sandbox: isolate code execution and network access; don't expose secrets in the agent's context.
Observability and traceability
You can't operate an agent you can't see. Every turn — thought, tool call, arguments, observation, tokens, latency — must be traced. Good observability lets you replay a trajectory, locate the faulty step, measure real cost, and spot loops. It is the direct application of the transparency principle: explicitly show the agent's planning steps rather than treat it as a black box.
When a single agent beats multi-agent
The reflex "more agents = better" is misleading. Anthropic's core advice is to start simple: for many applications, "optimizing single LLM calls with retrieval and in-context examples is usually enough." Only add complexity when a simpler solution demonstrably underperforms.
A multi-agent system multiplies latency, cost, and failure modes (coordination, state sharing, error propagation between agents). Prefer a single agent when the task fits in one coherent trajectory, when context need not be compartmentalized, and when debugging must stay simple. Reserve multi-agent for tasks that are genuinely decomposable, parallelizable, or that require heterogeneous expertise and separate contexts.
In short
An agent is an LLM in a think → act → observe loop, equipped with tools, given memory, and able to critique itself. ReAct is its grammar; tools, planning, memory, and reflection are its organs. Multi-agent patterns (orchestrator-workers, hierarchical, debate, blackboard) and frameworks (LangGraph, AutoGen/AG2, CrewAI, OpenAI Agents SDK) offer powerful but costly structures. The discipline that separates a demo from a reliable system fits in three words: simplicity, transparency, guardrails — and the courage to prefer one good agent over a fragile swarm.