How do they differ?
RAG pipelines can fail in two distinct ways. First, the retrieval step might miss relevant documents entirely because the query and the documents use different vocabulary, different levels of specificity, or different framing. Second, the retrieval step might find too many documents, including irrelevant ones that dilute the context and confuse the generator.
Hybrid Retrieval fixes the first problem. It operates on the query side, transforming or expanding the query so that it matches relevant documents that a naive search would miss. Retrieval Refinement fixes the second problem. It operates on the result side, filtering, reranking, and compressing retrieved documents so that only the most relevant content reaches the generator.
Think of it as the difference between asking a better question at the library versus sorting through the books the librarian brought you. Both improve the quality of what you read, but they attack the problem from opposite ends.
| Dimension | Hybrid Retrieval | Retrieval Refinement |
|---|---|---|
| Operates on | The query (before retrieval) | The results (after retrieval) |
| Core problem solved | Vocabulary mismatch, query-document gap | Irrelevant results, poor ranking |
| Techniques | HyDE, query expansion, hybrid search, query decomposition | Reranking, compression, filtering, deduplication |
| When it runs | Before the vector/keyword search | After the vector/keyword search |
| Failure it prevents | Missing relevant documents | Including irrelevant documents |
| Latency cost | Adds time before search (query transformation) | Adds time after search (result processing) |
| LLM calls | Often 1 (for query transformation) | 0-1 (rerankers can be cross-encoders, not LLMs) |
| Impact on recall | Increases recall (find more relevant docs) | May trade some recall for much higher precision (keep only the best) |
The retrieval quality equation
A useful mental model: final context quality = recall (did we find it?) multiplied by precision (is what we found relevant?).
Hybrid Retrieval boosts recall. Without it, a query like "How do I fix memory leaks in my Node app?" might miss a document titled "V8 Heap Management and Garbage Collection Optimization" because the vocabulary does not overlap. Query expansion or HyDE (Hypothetical Document Embeddings) bridges that gap by generating a hypothetical answer that is more likely to match the document's language.
Retrieval Refinement boosts precision. After retrieval returns 20 chunks, a reranker scores each one against the original query and keeps the top 5. A compressor extracts only the relevant sentences from each chunk, removing padding content. A filter drops chunks below a relevance threshold.
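The recall/precision decomposition above is easy to compute once you have labeled relevance judgments. A minimal sketch, using hypothetical document IDs:

```python
def recall_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Recall: fraction of relevant docs we found.
    Precision: fraction of retrieved docs that are relevant."""
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy example: doc5 was missed (recall gap), doc3/doc4 are noise (precision gap).
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc2", "doc5"}
r, p = recall_precision(retrieved, relevant)
# r = 2/3, p = 2/4
```

Hybrid Retrieval raises `r`; Retrieval Refinement raises `p`.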
Together, they ensure the generator sees all the relevant information and none of the irrelevant information. Separately, each one leaves a gap.
When to use Hybrid Retrieval
Hybrid Retrieval is the right choice when your retrieval step is missing documents you know are relevant.
- Vocabulary mismatch between users and documents. Users ask questions in casual language. Documents are written in technical jargon. Or the reverse: users use technical terms that the documents explain in plain English. Query expansion and synonym injection bridge this gap.
- Short, ambiguous queries. A query like "pricing" is too vague for embedding similarity to work well. Query expansion can transform it into multiple specific queries: "subscription pricing tiers," "enterprise pricing," "pricing comparison with competitors."
- Cross-lingual retrieval. When users query in one language but documents exist in another. Query translation or multilingual embeddings are forms of Hybrid Retrieval.
- Multi-aspect questions. A question like "Compare the security and performance of PostgreSQL vs MySQL for high-write workloads" touches multiple topics. Query decomposition splits it into sub-queries that each retrieve focused results.
- When embedding similarity is not enough. Pure vector search misses exact keyword matches that matter (product names, error codes, version numbers). Hybrid search combining dense vectors with sparse keyword matching (BM25) catches both semantic and lexical matches.
Key techniques
HyDE (Hypothetical Document Embeddings). Instead of embedding the query directly, ask an LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothetical answer uses the same language and structure as real documents, so embedding similarity works better. This adds one LLM call and one embedding call before search.
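The HyDE flow is only a few steps. In this sketch, `fake_llm` and `fake_embed` are stand-ins (assumptions, not real APIs) for an actual LLM call and embedding model; the structure is what matters:

```python
import hashlib

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call (hypothetical canned output).
    return "V8 manages heap memory via generational garbage collection..."

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: deterministic toy vector.
    h = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in h[:8]]

def hyde_query_vector(user_query: str) -> list[float]:
    # 1. Ask the LLM to write a hypothetical answer document.
    hypothetical = fake_llm(f"Write a short passage that answers: {user_query}")
    # 2. Embed the hypothetical answer, NOT the raw query.
    #    The answer's vocabulary matches real documents better.
    return fake_embed(hypothetical)

vec = hyde_query_vector("How do I fix memory leaks in my Node app?")
```

The resulting vector is then used for the nearest-neighbor search in place of the query embedding.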
Query expansion. Generate multiple reformulations of the query and retrieve for each one. Merge the results. This casts a wider net. Some implementations use an LLM for expansion; others use a thesaurus or query logs.
Hybrid search. Combine dense (vector) retrieval with sparse (keyword) retrieval. Use reciprocal rank fusion or a learned combination to merge results. This catches documents that match semantically and documents that match on specific terms.
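Reciprocal rank fusion is simple enough to sketch in full. Each document's fused score is the sum of `1 / (k + rank)` over every list it appears in; `k = 60` is the constant from the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector search results
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
# doc_a and doc_c rank highest: each appears near the top of both lists
```

Documents that show up in both lists outscore documents that rank highly in only one, which is exactly the behavior you want from a dense + sparse merge.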
Query decomposition. Break a complex query into simpler sub-queries. Retrieve for each sub-query independently. Merge results. This works well for multi-hop questions where no single query captures all the information needed.
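The decomposition loop can be sketched as follows. Here `decompose` and `FAKE_INDEX` are stand-ins: a real pipeline would call an LLM to split the query and a vector store to retrieve, and the sub-queries and doc IDs shown are hypothetical:

```python
FAKE_INDEX = {
    "PostgreSQL security for high-write workloads": ["pg-sec-01"],
    "MySQL security for high-write workloads": ["my-sec-02"],
    "PostgreSQL vs MySQL write performance benchmarks": ["bench-03", "pg-sec-01"],
}

def decompose(query: str) -> list[str]:
    # Stand-in for an LLM call that splits a complex question
    # into focused sub-queries (canned output for illustration).
    return list(FAKE_INDEX.keys())

def retrieve(sub_query: str) -> list[str]:
    # Stand-in for a vector-store search over a tiny canned index.
    return FAKE_INDEX.get(sub_query, [])

def decomposed_search(query: str) -> list[str]:
    seen: set[str] = set()
    merged: list[str] = []
    for sub in decompose(query):
        for doc_id in retrieve(sub):
            if doc_id not in seen:        # dedupe across sub-queries
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

results = decomposed_search(
    "Compare the security and performance of PostgreSQL vs MySQL"
)
```

Note the dedup step: sub-queries often re-retrieve the same document, and the merge should not hand the generator duplicates.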
When to use Retrieval Refinement
Retrieval Refinement is the right choice when retrieval returns too many results or the wrong results rank too high.
- Large retrieval sets. When you retrieve 20-50 chunks to ensure high recall but the generator's context window can only handle 5-10, you need to rank and prune.
- Noisy indexes. When your document collection includes content of varying quality and relevance, and the embedding model cannot distinguish between highly relevant and tangentially related content.
- Chunk boundary problems. When relevant information spans chunk boundaries and some retrieved chunks contain only partial context. A compressor can extract just the relevant sentences.
- Duplicate and near-duplicate content. When the same information appears in multiple documents (versioned docs, copied content), deduplication prevents the generator from seeing the same thing multiple times.
- Multi-source retrieval. When you retrieve from multiple indexes (product docs, support tickets, community forums), results need reranking to create a unified relevance ordering.
Key techniques
Cross-encoder reranking. A cross-encoder model (like Cohere Rerank, BGE Reranker, or a fine-tuned BERT) takes the query and each retrieved chunk as a pair and produces a relevance score. This is much more accurate than embedding cosine similarity because the model sees the query and document together. Rerankers are the single highest-impact postprocessing technique for most RAG pipelines.
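The reranking step itself is a score-sort-truncate loop. In this sketch the scorer is a toy token-overlap function standing in for a real cross-encoder (a real pipeline would score each (query, chunk) pair with a model such as a fine-tuned BERT):

```python
def cross_encoder_score(query: str, chunk: str) -> float:
    # Stand-in for a real cross-encoder that sees query and chunk
    # together. Here: toy Jaccard overlap over lowercase tokens.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair, keep the top_k chunks.
    return sorted(chunks,
                  key=lambda ch: cross_encoder_score(query, ch),
                  reverse=True)[:top_k]

chunks = [
    "pricing for enterprise plans",
    "how to reset your password",
    "enterprise pricing tiers and discounts",
]
top = rerank("enterprise pricing", chunks, top_k=2)
```

Swapping the toy scorer for a hosted reranker or a local cross-encoder model changes one function; the surrounding pipeline stays the same.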
LLM-based reranking. Use an LLM to score or rank retrieved chunks by relevance. More expensive than cross-encoder reranking but can handle more nuanced relevance judgments. Useful when the concept of "relevance" is complex or domain-specific.
Contextual compression. Pass each retrieved chunk through an LLM with the instruction "Extract only the sentences relevant to the query." This reduces chunk sizes dramatically and removes padding content that wastes context window space.
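The shape of the compression step looks like this. A real implementation would make the LLM extraction call described above; the sentence filter here is a keyword-overlap heuristic standing in for it:

```python
import re

def compress(query: str, chunk: str) -> str:
    """Keep only sentences that share vocabulary with the query.
    Stand-in heuristic for an LLM prompted with 'Extract only the
    sentences relevant to the query.'"""
    q_tokens = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [s for s in sentences if q_tokens & set(s.lower().split())]
    return " ".join(kept)

chunk = ("Our platform launched in 2019. Subscription pricing starts "
         "at $29 per month. The office is in Berlin.")
out = compress("subscription pricing", chunk)
# keeps only the pricing sentence
```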
Relevance filtering. Drop chunks below a relevance score threshold. Better to give the generator five highly relevant chunks than ten chunks where half are noise. The threshold needs tuning per use case.
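The filter itself is one line over (chunk, score) pairs; the work is in tuning the threshold, as noted above. A minimal sketch with made-up scores:

```python
def filter_by_threshold(scored_chunks: list[tuple[str, float]],
                        threshold: float = 0.3) -> list[tuple[str, float]]:
    # Drop anything scored below the threshold; tune per domain.
    return [(chunk, s) for chunk, s in scored_chunks if s >= threshold]

scored = [("chunk A", 0.91), ("chunk B", 0.42), ("chunk C", 0.12)]
kept = filter_by_threshold(scored)   # chunk C is dropped
```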
Deduplication. Identify near-duplicate chunks (using MinHash, embedding similarity, or exact overlap detection) and keep only the best representative. Prevents the generator from reading the same information multiple times.
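For the small candidate sets typical after retrieval, exact Jaccard similarity over word shingles is often enough; MinHash only becomes necessary at larger scale. A sketch:

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    # Word n-grams. MinHash approximates Jaccard over these at
    # scale; exact Jaccard is fine for a few dozen chunks.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedupe(chunks: list[str], threshold: float = 0.8) -> list[str]:
    kept: list[str] = []
    for chunk in chunks:
        s = shingles(chunk)
        is_dup = any(
            len(s & shingles(k)) / len(s | shingles(k)) >= threshold
            for k in kept if s | shingles(k)
        )
        if not is_dup:
            kept.append(chunk)
    return kept

chunks = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",  # near-duplicate
    "completely different content about pricing tiers",
]
unique = dedupe(chunks)   # the near-duplicate is dropped
```

The first occurrence wins here; a production version would instead keep the highest-scoring representative of each duplicate group.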
Can they work together?
They should work together. This is not a case of choosing one or the other. In most production RAG pipelines, both patterns are present because they solve complementary problems.
The standard production architecture looks like this:
User Query
│
▼
┌───────────────────┐
│ Hybrid │
│ Retrieval │
│ (query side) │
│ │
│ • Query expansion │
│ • HyDE │
│ • Hybrid search │
└────────┬──────────┘
│ 20-50 raw chunks
▼
┌───────────────────┐
│ Retrieval │
│ Refinement │
│ (result side) │
│ │
│ • Reranking │
│ • Compression │
│ • Deduplication │
│ • Filtering │
└────────┬──────────┘
│ 5-8 refined chunks
▼
┌───────────────────┐
│ Generator LLM │
└───────────────────┘
The pipeline logic: cast a wide net (Hybrid Retrieval ensures high recall), then refine aggressively (Retrieval Refinement ensures high precision). The generator sees a small, highly relevant context.
Where to invest first
If you are building a new RAG pipeline and need to prioritize, start with Retrieval Refinement (specifically cross-encoder reranking). The reason is practical: adding a reranker to an existing pipeline is simple (it slots in between retrieval and generation) and the quality improvement is immediate and measurable. You do not need to rebuild your index or change your embedding model.
Hybrid Retrieval improvements (HyDE, hybrid search, query expansion) require more infrastructure changes and more tuning. They are worth the investment once you have established a solid postprocessing pipeline and identified that recall (not precision) is your bottleneck.
How do you know which problem you have? Measure it. Take a sample of queries, retrieve a large set of results, and have a human label relevance. If the relevant documents are in the large set but not in the top K, you have a ranking problem (Retrieval Refinement). If the relevant documents are not in the large set at all, you have a recall problem (Hybrid Retrieval).
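That diagnostic can be automated per labeled query. A sketch, assuming you have a wide retrieval (e.g. top 50), the top-K slice the generator would see, and a human-labeled relevant set:

```python
def diagnose(retrieved_large: list[str], top_k: list[str],
             relevant: set[str]) -> str:
    """Classify the failure mode for one labeled query."""
    found = relevant & set(retrieved_large)
    if found < relevant:
        # Some relevant docs never entered the candidate set.
        return "recall problem -> Hybrid Retrieval"
    if not relevant <= set(top_k):
        # Everything relevant was found, but ranked too low.
        return "ranking problem -> Retrieval Refinement"
    return "retrieval looks healthy"

# Relevant doc d3 is in the wide set but missing from the top-K:
verdict = diagnose(
    retrieved_large=["d1", "d2", "d3", "d9"],
    top_k=["d1", "d2"],
    relevant={"d3"},
)
```

Aggregating these verdicts over a query sample tells you where to invest first.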
Common mistakes
Skipping reranking because embeddings "should be good enough." Bi-encoder embeddings (the kind you get from embedding models) are fast but sacrifice accuracy compared to cross-encoders. A retrieval pipeline that retrieves 20 chunks by embedding similarity and passes all 20 to the generator will almost always perform worse than one that reranks those 20 and passes the top 5. The cost of a reranker is tiny compared to the LLM generation call.
Over-investing in query transformation without measuring recall. HyDE and query expansion add latency and cost. If your baseline retrieval already finds the right documents (you just need better ranking), these techniques add overhead without improving results. Measure recall before deciding you need Hybrid Retrieval techniques.
Retrieving too few initial results. A common mistake is retrieving only 5 chunks and then trying to rerank them. Reranking works best when it has a larger candidate set to work with. Retrieve 20-50 and let the reranker pick the best 5-8. The marginal cost of retrieving more chunks from a vector database is negligible.
Not tuning the relevance threshold. Filtering chunks below a relevance score is effective, but the threshold varies by domain. A threshold that works for technical documentation might be too aggressive for conversational content. Tune it using labeled relevance data from your specific domain.
Applying HyDE to all queries. HyDE works well for knowledge-seeking queries where the hypothetical answer resembles actual documents. It works poorly for navigational queries ("show me the API reference for batch processing") or very specific lookups ("error code E-4021"). Apply it selectively, or use a classifier to decide when HyDE is appropriate.
Ignoring chunk quality at indexing time. Both patterns work on top of your existing chunks. If your chunking strategy is poor (splitting mid-sentence, chunks too small for context, chunks too large for specificity), no amount of query transformation or postprocessing will fully compensate. Fix chunking first.
Compressing away useful context. Contextual compression is powerful but aggressive compression can remove sentences that provide necessary background for understanding the relevant sentences. Test compression quality on your specific document types before deploying.
References
- Hybrid Retrieval pattern on genaipatterns.dev
- Retrieval Refinement pattern on genaipatterns.dev
- Gao et al., "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE, 2022)
- LlamaIndex documentation on query transformations and node postprocessors
- Cohere Rerank documentation and benchmarks
- MTEB (Massive Text Embedding Benchmark) for evaluating retrieval and reranking models
- LangChain documentation on retrievers and document transformers