Agentic RAG is a pattern that gives an autonomous agent control over the retrieval process. Driven by a reasoning loop, the agent decides when to search, what queries to run, which sources to consult, and when it has gathered enough evidence to synthesize an answer.
What problem does Agentic RAG solve?
Traditional RAG follows a rigid sequence. Query comes in, retrieval runs, chunks go to the LLM, answer comes out. Every query gets the same treatment regardless of what it actually needs. A simple factual question ("What is our refund policy?") goes through the same pipeline as a complex analytical question ("How has our refund rate changed since we updated the policy last March, and what do customers say about it?").
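The rigid sequence can be made concrete with a minimal sketch. Everything here is illustrative: `retrieve` stands in for an embedding lookup against a real index, and `generate` stands in for an LLM call.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in: a real system would embed the query and search a vector index.
    corpus = {
        "refund": ["Refunds are issued within 14 days of purchase."],
        "shipping": ["Standard shipping takes 3-5 business days."],
    }
    return [doc for key, docs in corpus.items()
            if key in query.lower() for doc in docs][:k]

def generate(query: str, chunks: list[str]) -> str:
    # Stand-in: a real system would prompt an LLM with the retrieved chunks.
    return f"Answer based on {len(chunks)} chunk(s): {' '.join(chunks)}"

def fixed_rag(query: str) -> str:
    chunks = retrieve(query)        # always runs, whether needed or not
    return generate(query, chunks)  # always a single pass, no iteration
```

Note that `fixed_rag` has no branch points: the simple refund question and the complex analytical one both take exactly this path.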
This rigidity creates two problems. Simple questions pay an unnecessary latency tax from retrieval steps they do not need, especially when the LLM already knows the answer from context or prior conversation. Complex questions get insufficient retrieval because a single pass through one data source cannot gather everything they require.
The deeper issue is that the system has no judgment about its own retrieval process. It cannot decide that a query needs data from both the knowledge base and a SQL database. It cannot recognize that retrieved results are poor and try a different search strategy. It cannot skip retrieval entirely when the answer is already available. The retrieval logic is hardcoded, and the LLM is a passive consumer of whatever the pipeline delivers.
How does Agentic RAG work?
Agentic RAG gives an AI agent control over the retrieval process itself. Instead of a fixed pipeline, the agent decides at each step whether to search, what to search for, which tools to use, and when it has gathered enough information to answer. The retrieval strategy emerges from the agent's reasoning about the specific question rather than from a predetermined flow.
The agent has access to a set of retrieval tools. A vector search tool for the knowledge base. A web search tool for current information. A SQL query tool for structured data. An API tool for pulling from internal systems. A code search tool for navigating repositories. Each tool has a description that tells the agent what kind of information it can provide and when it is useful.
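One way to sketch such a tool set is a registry that pairs each callable with the description the agent reads when choosing among them. The tool names, descriptions, and stub implementations below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str  # this text is what the agent actually reasons over
    run: Callable[[str], str]

TOOLS = {
    "vector_search": Tool(
        "vector_search",
        "Semantic search over the internal knowledge base. Best for "
        "policies, documentation, and how-to questions.",
        lambda q: f"[kb results for: {q}]",
    ),
    "web_search": Tool(
        "web_search",
        "Live web search. Best for current events and external facts.",
        lambda q: f"[web results for: {q}]",
    ),
    "sql_query": Tool(
        "sql_query",
        "Read-only SQL against the analytics warehouse. Best for "
        "metrics, counts, and trends over time.",
        lambda q: f"[rows for: {q}]",
    ),
}

def tool_manifest() -> str:
    # The manifest is injected into the agent's prompt at each decision point.
    return "\n".join(f"{t.name}: {t.description}" for t in TOOLS.values())
```

The quality of the `description` fields matters more than it looks: they are the only signal the agent has for routing, which is why vague descriptions lead to vague routing decisions (see the pitfalls below).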
When a query arrives, the agent first reasons about what information it needs. For "What is our refund policy?" it might decide that a single vector search against the policy documents is sufficient. For the complex refund analysis question, it might plan a multi-step approach: first query the knowledge base for the current and previous refund policies, then run a SQL query to pull refund rate metrics, then search customer feedback data for sentiment about the policy change.
The agent executes its plan iteratively. After each tool call, it evaluates the results and decides what to do next. If the vector search returns low-relevance chunks, it can reformulate the query and try again. If the SQL query reveals an unexpected spike in refunds during a specific week, it can search for internal communications from that week to understand why. This adaptive behavior is the key difference from a fixed pipeline.
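The execute-evaluate loop can be sketched as follows. Here `decide_next_step` stands in for an LLM call that inspects the evidence gathered so far and either requests another tool call or declares the evidence sufficient; the stand-in policy (gather two pieces, then answer) is purely illustrative.

```python
def decide_next_step(question: str, evidence: list[str]) -> dict:
    # Stand-in for the agent's reasoning step (an LLM call in practice).
    if len(evidence) < 2:
        return {"action": "tool", "tool": "vector_search", "query": question}
    return {"action": "answer"}

def run_tool(name: str, query: str) -> str:
    # Stand-in for invoking a real retrieval tool.
    return f"[{name} results for: {query}]"

def agent_loop(question: str, max_steps: int = 6) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        step = decide_next_step(question, evidence)
        if step["action"] == "answer":
            return f"Answer from {len(evidence)} pieces of evidence."
        evidence.append(run_tool(step["tool"], step["query"]))
    return "Budget exhausted; answering with partial evidence."
```

The adaptive behavior lives entirely in `decide_next_step`: because it sees the accumulated evidence each iteration, it can reformulate a weak query or pivot to a different tool mid-run.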
Routing is a simpler form of the same idea. Instead of giving the agent full autonomy, you let it choose which retrieval path to take based on query classification. A question about current events routes to web search. A question about internal processes routes to the knowledge base. A question about metrics routes to the analytics database. This is less flexible than full agent control but easier to implement and reason about.
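A routing layer can be as small as a classifier plus a dispatch table. In this sketch, `classify` is a keyword stand-in for what would usually be an LLM call or a lightweight trained classifier, and the route names are illustrative.

```python
def classify(query: str) -> str:
    # Stand-in classifier; production systems typically use an LLM here.
    q = query.lower()
    if any(w in q for w in ("rate", "metric", "trend", "how many")):
        return "analytics"
    if any(w in q for w in ("today", "news", "latest")):
        return "web"
    return "knowledge_base"

ROUTES = {
    "analytics": lambda q: f"SQL path: {q}",
    "web": lambda q: f"web search path: {q}",
    "knowledge_base": lambda q: f"vector search path: {q}",
}

def route(query: str) -> str:
    return ROUTES[classify(query)](query)
```

Because each route is itself a fixed pipeline, latency and failure modes stay predictable per route, which is exactly the "easier to reason about" property the text describes.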
The agent can also decide not to retrieve at all. If the conversation already contains the necessary context, or if the question is about something the agent can reason about directly, skipping retrieval saves time and avoids introducing noise. This judgment call is something a fixed pipeline cannot make.
Sub-query decomposition happens naturally in this pattern. The agent breaks complex questions into parts, addresses each part with the most appropriate tool, and synthesizes the results. Unlike Deep Search, where the iteration loop is a structural pattern, here the decomposition and iteration emerge from the agent's reasoning. The agent might decompose a question into two sub-queries or seven, depending on what it discovers along the way.
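Decomposition followed by per-part retrieval and synthesis might look like the sketch below. `decompose` stands in for an LLM call that splits the compound refund question from earlier; its output here is hardcoded for illustration, and a real agent would produce a different split per question.

```python
def decompose(question: str) -> list[tuple[str, str]]:
    # Stand-in LLM output: (tool, sub-query) pairs for the refund example.
    return [
        ("vector_search", "current and previous refund policy"),
        ("sql_query", "refund rate by month since last March"),
        ("vector_search", "customer feedback on the refund policy change"),
    ]

def run_tool(name: str, query: str) -> str:
    # Stand-in for invoking a real retrieval tool.
    return f"[{name}: {query}]"

def answer(question: str) -> str:
    parts = [run_tool(tool, q) for tool, q in decompose(question)]
    # A real system would synthesize with an LLM; joining suffices here.
    return " | ".join(parts)
```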
When should you use Agentic RAG?
Agentic RAG is the right pattern when your system needs to serve diverse query types that require fundamentally different retrieval strategies. If some questions need vector search, others need SQL, others need web search, and others need no retrieval at all, an agent that can choose the right approach per query will outperform any fixed pipeline.
It is also the right choice when your data lives in multiple systems that cannot be unified into a single index. Enterprise environments often have knowledge spread across wikis, databases, ticketing systems, code repositories, and cloud storage. An agent that can reach into each system as needed is more practical than trying to index everything into one vector store.
Consider Agentic RAG when your users ask unpredictable questions. If you can enumerate all the query types your system needs to handle, a fixed pipeline with routing might be simpler. But if users regularly surprise you with questions that do not fit your existing retrieval patterns, an agent's flexibility becomes valuable.
Do not start here. Begin with Basic RAG, move to Hybrid Retrieval, then Retrieval Refinement, and only then consider whether you need agent-level control. Each step up adds complexity. Make sure the simpler patterns are genuinely insufficient before introducing an agent.
What are the common pitfalls?
The agent can make poor tool choices. If it decides to run a SQL query when a vector search would be faster and more appropriate, you get worse results at higher latency. Tool descriptions need to be clear and specific so the agent can make informed decisions. Vague descriptions lead to vague routing decisions.
Unbounded exploration is a risk. Without cost and time limits, an agent can chain together dozens of tool calls, each one seeming reasonable in isolation but collectively burning through budget and patience. You need hard limits on total tool calls per query, total tokens consumed, and wall-clock time.
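The three hard limits can be enforced with a small budget guard that the loop consults before every tool call. The limit values and class shape are illustrative defaults, not recommendations.

```python
import time

class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, max_calls: int = 8, max_tokens: int = 20_000,
                 max_seconds: float = 30.0):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.calls = 0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, tokens: int) -> None:
        # Called once per tool call, with that call's token cost.
        self.calls += 1
        self.tokens += tokens
        if self.calls > self.max_calls:
            raise BudgetExceeded("too many tool calls")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("wall-clock limit hit")
```

The agent loop catches `BudgetExceeded` and returns a partial answer rather than silently continuing, which keeps the worst case bounded even when every individual step looked reasonable.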
Error handling gets complicated. When a tool call fails, the agent needs to recover gracefully. It might retry with different parameters, fall back to a different tool, or decide it cannot answer that part of the question. Each failure path needs to be handled. In a fixed pipeline, failure modes are predictable. With an agent, they are combinatorial.
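One failure path from the list above, sketched: retry the primary tool once, fall back to a secondary tool, and finally surface the gap explicitly rather than hide it. The retry count and fallback order are illustrative choices.

```python
from typing import Callable

def call_with_fallback(primary: Callable[[str], str],
                       fallback: Callable[[str], str],
                       query: str) -> str:
    for _ in range(2):               # retry the primary tool once
        try:
            return primary(query)
        except Exception:
            continue
    try:
        return fallback(query)       # degrade to the fallback tool
    except Exception:
        # Mark the sub-question as unanswered so synthesis can say so,
        # instead of fabricating an answer from missing evidence.
        return f"UNANSWERED: {query}"
```

The combinatorial problem the text mentions shows up as soon as several sub-queries each carry their own retry/fallback chain; making the "unanswered" state explicit keeps those paths at least observable.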
The agent can develop blind spots. If it learns that vector search usually works, it might default to vector search even when SQL would be better. Testing needs to cover diverse query types to ensure the agent actually uses the full range of tools available to it.
Observability is harder than with fixed pipelines. Each query can take a different path through the system, making it difficult to aggregate metrics, identify bottlenecks, or reproduce issues. You need detailed tracing that captures every decision the agent makes and every tool call it executes.
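A minimal trace record per query might look like the sketch below: every decision and tool call is appended as a structured event so a single query's path can be reconstructed later. In production this would feed a tracing backend (OpenTelemetry spans, for example) rather than an in-memory list; the event fields are illustrative.

```python
import json
import time
import uuid

class Trace:
    def __init__(self, query: str):
        self.trace_id = str(uuid.uuid4())
        self.query = query
        self.events: list[dict] = []

    def log(self, kind: str, **fields) -> None:
        # One event per agent decision or tool invocation.
        self.events.append({"t": time.time(), "kind": kind, **fields})

    def dump(self) -> str:
        return json.dumps({"trace_id": self.trace_id,
                           "query": self.query,
                           "events": self.events}, indent=2)

trace = Trace("What is our refund policy?")
trace.log("decision", action="tool", tool="vector_search")
trace.log("tool_call", tool="vector_search", latency_ms=42)
trace.log("decision", action="answer")
```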
What are the trade-offs?
Flexibility comes at the cost of predictability. A fixed pipeline produces consistent latency and consistent resource usage per query. An agentic system might answer one query in 500 milliseconds with a single tool call and another in 20 seconds with eight tool calls. This variability is hard to plan for in terms of infrastructure and user experience.
The LLM is in the hot path for every decision. Each routing choice, each query reformulation, each quality evaluation requires an LLM call. This means LLM costs scale with the complexity of queries, not just their volume. For simple queries that the agent handles with a single tool call, the overhead of the agent's reasoning step may exceed the cost of just running a fixed pipeline.
Building and maintaining the tool set is ongoing work. Each tool needs a clear interface, good error handling, and accurate documentation that the agent can understand. When you add a new data source, you add a new tool. When a data source changes its schema, you update the tool. The agent's effectiveness is bounded by the quality of its tools.
Testing is significantly more involved. You cannot just test that retrieval returns good results. You need to test that the agent chooses the right tools, decomposes queries appropriately, handles failures, respects budget limits, and produces good final answers across a wide range of question types. This requires comprehensive evaluation datasets that cover the full space of possible agent behaviors.
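One slice of such an evaluation, sketched: beyond grading answers, assert that the agent picked the expected tool for each query type, which also guards against the blind spots described earlier. `choose_tool` is a stand-in for the agent's routing decision, and the eval set and expected tools are illustrative.

```python
def choose_tool(query: str) -> str:
    # Stand-in for the agent's tool-selection step.
    q = query.lower()
    if "rate" in q or "trend" in q:
        return "sql_query"
    if "latest" in q or "news" in q:
        return "web_search"
    return "vector_search"

# (query, expected tool) pairs covering distinct query types.
EVAL_SET = [
    ("What is our refund policy?", "vector_search"),
    ("What is the refund rate trend this quarter?", "sql_query"),
    ("What is the latest news about our competitor?", "web_search"),
]

def tool_choice_accuracy() -> float:
    hits = sum(choose_tool(q) == expected for q, expected in EVAL_SET)
    return hits / len(EVAL_SET)
```

A real eval set would also include failure-injection cases and budget-limit cases, so that the fallback and budget behavior from the pitfalls above is exercised, not just the happy path.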
The reward for this complexity is a system that genuinely adapts to each question. When it works well, it feels like having a knowledgeable research assistant who knows where to look for every kind of information. When it works poorly, it feels like an unreliable system that takes too long and sometimes goes down rabbit holes.
Goes Well With
Deep Search shares the iterative retrieval philosophy but with more structure. In practice, Deep Search can be implemented as a specific behavior mode within an Agentic RAG system, triggered when the agent recognizes a question requires multi-hop reasoning.
Tool Calling is the underlying mechanism that makes Agentic RAG work. The agent's ability to choose and invoke retrieval tools depends on a robust tool-calling framework with clear interfaces and error handling.
Basic RAG remains relevant as one of the tools in the agent's toolkit. Vector search against a knowledge base is still the right retrieval strategy for many queries. The agent just gets to decide when to use it instead of using it every time.
References
- Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.