How do they differ?
Basic RAG and Deep Search represent two levels of sophistication in retrieval-augmented generation. Basic RAG follows a simple, linear pipeline: receive a query, retrieve relevant documents, generate a response from those documents. One retrieval pass, one generation pass, done. Deep Search wraps this pipeline in a reasoning loop that retrieves, analyzes the results for gaps, formulates new queries to fill those gaps, retrieves again, and repeats until the information is sufficient.
The difference in output quality is dramatic for certain types of questions. Ask "What is the capital of France?" and Basic RAG will retrieve a relevant passage and answer correctly in under a second. Ask "How do the economic policies of the EU's five largest economies differ in their approach to AI regulation, and what are the implications for cross-border AI companies?" and Basic RAG will retrieve a handful of passages that touch on parts of the question but miss others entirely. Deep Search will systematically research each economy's AI policy, compare the regulatory frameworks, identify the implications, and synthesize a comprehensive answer.
The tradeoff is straightforward. Basic RAG is fast, cheap, and sufficient for most questions. Deep Search is slow, expensive, and necessary for complex ones. Knowing which questions need which approach is the key engineering decision.
| Dimension | Basic RAG | Deep Search |
|---|---|---|
| Retrieval passes | One | Multiple (typically 3-10 iterations) |
| Query strategy | Single query (possibly rewritten) | Multiple queries, dynamically generated |
| Reasoning | Minimal. Retrieve then generate. | Extensive. Analyze gaps, plan next query, synthesize. |
| Latency | Sub-second to low seconds | 10 seconds to several minutes |
| Cost | One embedding lookup + one LLM call | Multiple lookups + multiple LLM calls per iteration |
| Answer depth | Surface-level for complex questions | Comprehensive, multi-faceted |
| Failure mode | Incomplete or shallow answers | Over-research, looping, high cost |
| Transparency | Low. Single black-box retrieval step. | High. Each iteration's reasoning is visible. |
| Implementation complexity | Low. Standard pipeline. | High. Requires orchestration, loop control, deduplication. |
How Basic RAG works
The Basic RAG pipeline has three stages, and each one is well understood.
Query processing. The user's question arrives. Optionally, the system rewrites it for better retrieval (removing conversational fluff, expanding abbreviations, or transforming it into a better search query). This is a single transformation step.
Retrieval. The processed query is converted to an embedding and used to search a vector store, run as a keyword query against a BM25 index, or both (hybrid search). The system returns the top-k most relevant chunks, typically 3 to 10 passages.
Generation. The retrieved chunks are inserted into the prompt context alongside the original question. The LLM generates a response grounded in those chunks. Good implementations include instructions to cite sources and to acknowledge when the retrieved information does not fully answer the question.
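The three stages can be sketched as a minimal pipeline. This is an illustrative toy, not a real implementation: the "embedding" retrieval is replaced by simple lexical overlap, and the generation stage just assembles the grounded prompt that a real system would send to an LLM.

```python
def process_query(raw: str) -> str:
    # Stage 1: query processing -- strip conversational fluff.
    # A real system might use an LLM rewrite here.
    return raw.removeprefix("hey, ").strip()

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 2: retrieval -- toy lexical overlap standing in for
    # embedding similarity or BM25 scoring.
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: -len(terms & set(c.lower().split())))
    return ranked[:k]

def generate(question: str, chunks: list[str]) -> str:
    # Stage 3: generation -- a real system would call an LLM with the
    # chunks in the prompt; here we just assemble the grounded context.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Question: {question}\nContext:\n{context}"

def basic_rag(raw_query: str, corpus: list[str]) -> str:
    query = process_query(raw_query)
    chunks = retrieve(query, corpus)
    return generate(query, chunks)
```

The point of the sketch is the shape: one transformation, one retrieval, one generation, with no loop anywhere.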
This pipeline handles the majority of real-world questions effectively. Factual lookups, definition queries, how-to questions, and any question whose answer exists in a single passage or a small number of related passages are well served by Basic RAG.
The limitation is structural. Basic RAG gets one shot at retrieval. If the initial query does not surface the right documents, the generated answer will be incomplete or wrong. There is no mechanism to recognize that the retrieved context is insufficient and try again with a different approach.
How Deep Search works
Deep Search adds a reasoning layer that turns retrieval into an iterative research process. The architecture varies across implementations, but the core loop is consistent.
Step 1: Initial retrieval. The system performs a standard retrieval pass, similar to Basic RAG. It gets an initial set of relevant documents.
Step 2: Gap analysis. An LLM examines the retrieved documents against the original question and identifies what information is still missing. "I have information about EU AI regulation in general, but nothing specific about Germany's approach or the cross-border implications. I need to search for those specifically."
Step 3: Query generation. Based on the identified gaps, the system generates one or more new, targeted queries. These queries are specific and focused, designed to fill the exact gaps identified in the previous step.
Step 4: Additional retrieval. The new queries are executed against the document store (and sometimes against external sources like web search). New documents are added to the accumulated context.
Step 5: Sufficiency check. The system evaluates whether the accumulated information is now sufficient to answer the original question comprehensively. If yes, it proceeds to generation. If no, it loops back to step 2.
Step 6: Synthesis. With a rich, multi-source context, the LLM generates a comprehensive answer that synthesizes information across all retrieved documents. This is often a significantly longer and more nuanced response than Basic RAG would produce.
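The six steps above reduce to a compact loop skeleton. Here `find_gaps`, `make_queries`, and `synthesize` are hypothetical callables standing in for LLM calls, and `retrieve` is any retrieval function; the skeleton shows only the control flow, including the hard iteration limit discussed later.

```python
def deep_search(question, retrieve, find_gaps, make_queries, synthesize,
                max_iters: int = 5):
    context = retrieve(question)             # Step 1: initial retrieval
    for _ in range(max_iters):
        gaps = find_gaps(question, context)  # Step 2: gap analysis
        if not gaps:                         # Step 5: sufficiency check
            break
        for q in make_queries(gaps):         # Step 3: query generation
            context += retrieve(q)           # Step 4: additional retrieval
    return synthesize(question, context)     # Step 6: synthesis
```

In a production system each of the three LLM-backed callables would carry its own prompt, and the loop would also deduplicate and compact `context` between iterations.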
Perplexity, Google's AI Overviews, and various open-source implementations (like LangChain's "Adaptive RAG" and LlamaIndex's "Sub-Question Query Engine") all implement variations of this pattern.
When to use Basic RAG
Basic RAG should be your default. Most questions in most applications do not require iterative research.
Factual lookup questions. "What is our refund policy?" "When was this feature released?" "What are the system requirements?" These questions have answers that exist in a single document passage. One retrieval pass finds it.
Chatbot and customer support. Users ask relatively focused questions. They expect quick answers. The latency of Deep Search (10+ seconds) is unacceptable for conversational interfaces where users expect sub-second responses.
Internal knowledge bases. Employees searching for company policies, procedures, or documentation are usually looking for a specific piece of information. Basic RAG with good chunking and a decent embedding model handles this well.
High-volume applications. If your system handles thousands of queries per minute, the cost multiplier of Deep Search (5 to 20x per query) makes it impractical for every request. Basic RAG keeps costs linear and predictable.
Questions within a single domain. When the answer lives entirely within one document or one section of a knowledge base, there is nothing to synthesize across sources. Basic RAG retrieves the relevant section and generates the answer.
Real-time or streaming use cases. Basic RAG can begin streaming the response within a second or two. Deep Search needs to complete multiple retrieval cycles before it can start generating, which means the user stares at a loading spinner for 10 to 30 seconds.
When to use Deep Search
Deep Search earns its cost for questions that are genuinely complex and benefit from thoroughness.
Research questions requiring multi-source synthesis. "Compare the approaches to AI safety taken by Anthropic, OpenAI, and Google DeepMind." This requires finding separate information about each company's approach and then comparing them. No single retrieval pass will surface all of this.
Questions with hidden complexity. "Should we use PostgreSQL or MongoDB for this project?" seems simple but actually requires understanding the project's requirements, the strengths of each database for those requirements, operational considerations, team expertise, and more. Deep Search can iteratively explore each dimension.
Investigative and analytical tasks. Due diligence research, competitive analysis, literature reviews, and similar tasks where completeness matters more than speed. These are tasks where a user is willing to wait a minute for a thorough answer rather than get a superficial one instantly.
Multi-hop reasoning questions. "Which companies in our portfolio have exposure to the new EU AI Act, and what is the estimated compliance cost?" This requires identifying portfolio companies, determining which ones deploy AI in the EU, finding the relevant AI Act provisions for each, and estimating costs. Each step depends on the previous one.
Long-form content generation. When the output is a report, briefing, or analysis that needs to be comprehensive and well-sourced, Deep Search provides the rich context that makes such documents useful. A report grounded in a single retrieval pass will have obvious gaps.
Can they work together?
The most practical architecture uses Basic RAG as the default and escalates to Deep Search when needed. There are several strategies for deciding when to escalate.
Query complexity classification. Before retrieval, classify the incoming query as simple or complex. Simple queries go through Basic RAG. Complex queries go through Deep Search. The classifier can be rule-based (questions with multiple clauses, comparison words, or research-oriented language get escalated) or model-based (a lightweight LLM classifies the query).
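A rule-based version of that classifier can be very small. The marker words and the clause-counting heuristic below are illustrative assumptions, not a validated rule set; a model-based classifier would replace this function with a lightweight LLM call.

```python
# Illustrative trigger vocabulary for "research-oriented" queries.
COMPLEX_MARKERS = {"compare", "versus", "vs", "implications", "analyze",
                   "differences", "tradeoffs", "across"}

def is_complex(query: str) -> bool:
    words = set(query.lower().replace("?", "").split())
    has_marker = bool(words & COMPLEX_MARKERS)
    # Heuristic: two or more clause separators suggests a multi-part question.
    many_clauses = query.count(",") + query.count(" and ") >= 2
    return has_marker or many_clauses
```

Queries where `is_complex` returns `True` would be routed to Deep Search; everything else takes the Basic RAG path.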
Confidence-based escalation. Run Basic RAG first. If the retrieved documents cover the question well (measured by relevance scores or an LLM confidence check), return the Basic RAG answer. If the coverage is poor (low relevance scores, the answer includes hedging language like "based on available information"), escalate to Deep Search.
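Confidence-based escalation can be sketched as a thin router over the two pipelines. The `basic_rag` and `deep_search` callables and the `0.7` relevance threshold are hypothetical; in practice the threshold is tuned on labeled queries, or replaced with an LLM self-check.

```python
def answer(question, retrieve_scored, basic_rag, deep_search,
           min_score: float = 0.7):
    # retrieve_scored returns [(chunk, relevance_score), ...]
    hits = retrieve_scored(question)
    best = max((score for _, score in hits), default=0.0)
    if best >= min_score:
        # Coverage looks good: answer from the single retrieval pass.
        return basic_rag(question, [chunk for chunk, _ in hits])
    # Coverage is poor: escalate to the iterative pipeline.
    return deep_search(question)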
User-controlled depth. Give users a choice. A "Quick answer" mode uses Basic RAG. A "Deep research" mode uses Deep Search. Perplexity does this with its standard vs. Pro search modes. This is the simplest approach and lets users decide the tradeoff between speed and thoroughness.
Streaming with progressive enhancement. Start with Basic RAG and stream the initial answer immediately. In parallel, kick off a Deep Search pass. If Deep Search finds significantly more information, append an "additional findings" section after the initial response.
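That progressive-enhancement flow is straightforward with a background worker. In this sketch, `basic_rag` and `deep_search` are hypothetical callables and `emit` is whatever delivers text to the user (a streaming response, a websocket, a UI callback).

```python
from concurrent.futures import ThreadPoolExecutor

def answer_progressive(question, basic_rag, deep_search, emit):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off Deep Search in the background immediately.
        future = pool.submit(deep_search, question)
        # Deliver the fast Basic RAG answer while Deep Search runs.
        emit(basic_rag(question))
        # Append additional findings once the deep pass completes.
        extra = future.result()
        if extra:
            emit("Additional findings: " + extra)
```

The tradeoff is that the "additional findings" section may partially contradict or restate the quick answer, so production systems often have the synthesis step explicitly reference what was already shown.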
Common mistakes
Using Deep Search for simple questions. Sending "What is our return policy?" through five retrieval iterations is wasteful. It burns tokens, adds latency, and often produces a worse answer than Basic RAG because the additional retrieved context introduces irrelevant information that confuses the generation step.
Not setting iteration limits on Deep Search. Without a maximum iteration count, Deep Search can loop indefinitely, especially on questions that have no complete answer in the corpus. Always set a hard limit (typically 5 to 10 iterations) and a time budget.
Ignoring deduplication in Deep Search. Multiple retrieval passes often surface the same documents. If duplicate chunks accumulate in the context, they waste tokens and bias the generation toward repeated information. Deduplicate retrieved chunks by content hash or document ID between iterations.
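Deduplication between iterations is a few lines with a content hash. The normalization step (lowercasing and collapsing whitespace) is an assumption; some systems instead deduplicate on document IDs or use near-duplicate detection.

```python
import hashlib

def dedupe(accumulated: list[str], new_chunks: list[str]) -> list[str]:
    def key(chunk: str) -> str:
        # Normalize so trivially different copies hash identically.
        normalized = " ".join(chunk.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    seen = {key(c) for c in accumulated}
    merged = list(accumulated)
    for chunk in new_chunks:
        if key(chunk) not in seen:
            seen.add(key(chunk))
            merged.append(chunk)
    return merged
```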
Treating Basic RAG as a solved problem. Teams rush to build Deep Search while their Basic RAG has fundamental issues: bad chunking, a weak embedding model, no hybrid search, no re-ranking. Fix Basic RAG first. A strong single-pass retrieval eliminates the need for Deep Search on many queries that currently fail.
Not showing reasoning in Deep Search. Users waiting 30 seconds for an answer want to know something is happening. Display the intermediate steps: "Searching for EU AI regulation...", "Found 3 sources. Looking for Germany-specific policies...", "Comparing regulatory frameworks..." This transforms waiting from frustrating to informative.
Accumulating too much context. Deep Search can accumulate 50,000+ tokens of retrieved content across multiple iterations. This can exceed context limits or degrade generation quality. Implement progressive summarization: after each iteration, summarize the accumulated findings into a compact representation before adding new content.
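Progressive summarization can be applied as a compaction pass between iterations. In this sketch, `summarize` is a hypothetical callable standing in for an LLM summarization call, and token counts are approximated by whitespace splitting; a real system would use the model's tokenizer.

```python
def compact(context: list[str], summarize, max_tokens: int = 4000) -> list[str]:
    # Approximate token count by word count (assumption for the sketch).
    total = sum(len(chunk.split()) for chunk in context)
    if total <= max_tokens:
        return context
    # Over budget: collapse everything accumulated so far into one
    # compact summary before the next iteration adds new chunks.
    return [summarize("\n".join(context))]
```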
Not measuring the value of each iteration. Track how much new, relevant information each Deep Search iteration adds. If iteration 4 adds nothing that iterations 1 through 3 did not already cover, your stopping criteria are too loose. Most of the value comes from the first 2 to 3 iterations.
References
- Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401.
- Jiang, Z. et al. (2023). "Active Retrieval Augmented Generation." arXiv:2305.06983.
- Asai, A. et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." arXiv:2310.11511.
- LangChain documentation on Adaptive RAG and Self-Corrective RAG.
- LlamaIndex documentation on Sub-Question Query Engine.
- Perplexity AI technical blog on search architecture.