What You Are Building
The goal is a search pipeline that takes a corpus of documents, makes them searchable by meaning rather than just keywords, and generates answers grounded in the actual source material. A user asks a question, the system finds the most relevant passages, and an LLM synthesizes a response with citations pointing back to the originals. When the pipeline works as designed, nothing is hallucinated: every claim traces to a chunk the retriever surfaced.
This is not a toy demo where you stuff a single PDF into a vector store and call it done. A production search pipeline handles thousands of documents, serves concurrent users, and degrades gracefully when the retriever draws a blank. It needs an ingestion side that processes new documents continuously and a query side that responds in under two seconds. The two sides share a vector index but operate on different schedules and have different performance characteristics.
What makes this a composition guide rather than a single-pattern walkthrough is that the pipeline draws on two distinct patterns working together. Basic RAG gives you the overall retrieve-then-generate architecture. Semantic Indexing replaces the naive keyword lookup with embedding-based search so the system understands what the user means, not just what they typed.
Architecture Overview
The pipeline splits into two halves that share a vector store as their interface boundary. On the ingestion side, raw documents flow through a loader, a chunker, an embedding model, and finally into the vector index. On the query side, a user question passes through preprocessing, retrieval, optional reranking, context assembly, and LLM generation before a cited response comes back.
The ingestion pipeline runs on its own schedule. You might trigger it when new documents land in a storage bucket, or run it as a nightly batch. The loader reads PDFs, HTML, markdown, or whatever your corpus contains. The chunker splits each document into passages, typically 200 to 500 tokens each with some overlap between neighbors so that no sentence gets cut in half without context. Each chunk gets embedded into a dense vector and written into the index along with metadata like the source document URL, the chunk position, and a timestamp.
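The chunking step above can be sketched as a small function. This is a minimal illustration that approximates tokens with whitespace-separated words; a real pipeline would count tokens with the embedding model's own tokenizer, and the metadata fields shown are assumptions, not a fixed schema.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks.

    Tokens are approximated by whitespace-separated words here; swap in
    your embedding model's tokenizer for accurate counts.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap          # how far the window advances each time
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(piece),
            "position": len(chunks),     # chunk index within the document
        })
        if start + chunk_size >= len(words):
            break                        # final window reached the end
    return chunks
```

Each returned dict would then be embedded and written to the index alongside source metadata such as the document URL and a timestamp.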
The query pipeline is the latency-sensitive path. When a question comes in, you preprocess it (expand abbreviations, normalize whitespace, sometimes rephrase it for better retrieval). The preprocessed query gets embedded with the same model used during ingestion, then you run a nearest-neighbor search against the vector index. The top results, usually 5 to 20 chunks, form the candidate set. If you have a reranker, it rescores the candidates using a cross-encoder that looks at the full query-chunk pair. The surviving chunks get assembled into a prompt context window, and the LLM generates a response that references specific chunks by ID or title.
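The query-side stages can be wired together as a small function. The `embed_fn`, `search_fn`, and `generate_fn` parameters are hypothetical stand-ins for your embedding model, vector index, and LLM call; only the preprocessing and context assembly are concrete here.

```python
import re

def preprocess(query: str) -> str:
    """Normalize whitespace and casing. Real systems may also expand
    abbreviations or rephrase the query for better retrieval."""
    return re.sub(r"\s+", " ", query).strip().lower()

def answer_query(query, embed_fn, search_fn, generate_fn, k=10):
    """Query path: preprocess -> embed -> retrieve top-k -> generate.

    search_fn is assumed to return a list of (chunk_id, text, score)
    tuples; generate_fn receives the question and assembled context.
    """
    cleaned = preprocess(query)
    query_vector = embed_fn(cleaned)
    candidates = search_fn(query_vector, k)
    context = "\n\n".join(f"[{cid}] {text}" for cid, text, _ in candidates)
    return generate_fn(cleaned, context)
```

An optional reranking step would slot in between `search_fn` and the context assembly.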
This is where the two patterns lock together. Basic RAG defines the two-pipeline shape and the generate-from-retrieved-context loop. Semantic Indexing is what makes the retrieval step actually work well. Without semantic embeddings you would be doing BM25 or TF-IDF, which fails when the user phrases a question differently from how the documents describe the answer. With embeddings, "how do I handle errors in async code" matches a chunk titled "exception management in concurrent workflows" because the concepts overlap in vector space.
Pattern Walkthrough
Basic RAG
Basic RAG provides the structural backbone. It says: do not ask the LLM to answer from its training data alone. Instead, retrieve relevant documents first, inject them into the prompt, and instruct the model to answer only from the provided context. This pattern solves the hallucination problem at the architecture level. If the model does not have a relevant chunk in its context, it should say so rather than invent an answer.
In this pipeline, Basic RAG manifests as the split between the ingestion and query paths and the strict rule that every generated answer must reference at least one retrieved chunk. The prompt template typically includes an instruction like "answer the question using only the following passages" followed by the assembled context. The output format asks the model to include bracketed citations linking each claim to a specific passage ID. This traceability is what makes the system trustworthy for users who need to verify answers against primary sources.
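A prompt builder along those lines might look like the following. The exact instruction wording and the bracketed citation format are illustrative, not a canonical template.

```python
def build_rag_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (passage_id, text) pairs.

    The model is instructed to answer only from the provided passages,
    cite passage IDs in brackets, and refuse when the context is thin.
    """
    context = "\n\n".join(f"[{pid}] {text}" for pid, text in passages)
    return (
        "Answer the question using only the following passages. "
        "After each claim, cite the supporting passage ID in brackets. "
        "If the passages do not contain the answer, say so instead of guessing.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```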
Semantic Indexing
Semantic Indexing is the pattern that upgrades retrieval from lexical matching to conceptual matching. In a keyword system, a query for "cancellation policy" would miss a chunk that discusses "how to terminate your subscription" because the words are different. Semantic Indexing fixes this by converting both queries and documents into dense vectors that encode meaning, then searching by vector similarity.
The indexing side runs an embedding model over every chunk and stores the resulting vectors in a purpose-built database, whether that is Pinecone, Weaviate, Qdrant, pgvector, or something else. At query time, the same embedding model converts the question into a vector, and the database returns the nearest neighbors. The quality of retrieval depends heavily on the embedding model. A model trained on your domain will outperform a generic one, and a model that understands your users' vocabulary will surface more relevant results.
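The core mechanic, storing vectors and searching by similarity, can be shown with an exact cosine search over an in-memory list. This is a teaching sketch: a production vector database replaces the full scan with an approximate nearest-neighbor structure such as HNSW or IVF.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class InMemoryIndex:
    """Minimal semantic index: exact cosine search over stored rows."""

    def __init__(self):
        self.rows = []                       # (chunk_id, vector, metadata)

    def add(self, chunk_id, vector, metadata=None):
        self.rows.append((chunk_id, vector, metadata or {}))

    def search(self, query_vector, k=5):
        scored = [(cosine(query_vector, vec), cid, meta)
                  for cid, vec, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

The critical invariant is that the same embedding model produces both the stored vectors and the query vector; mixing models breaks the geometry the search depends on.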
Decision Points
Chunk size and overlap. Smaller chunks (150 to 250 tokens) give you more precise retrieval. The vector for a short, focused passage is a tighter match to a specific question. Larger chunks (400 to 600 tokens) give the LLM more surrounding context, which sometimes helps it generate a better answer. Overlap of 10 to 20 percent between consecutive chunks prevents hard cuts in the middle of a paragraph. There is no universally correct setting. Start with 300 tokens and 50-token overlap, measure retrieval quality, and adjust from there.
Embedding model selection. The tradeoff is between speed, quality, and language coverage. Smaller models like those in the 384-dimension range embed faster and use less storage but may miss nuanced similarity. Larger models in the 1024-dimension range capture more meaning but cost more to run and store. If your corpus is multilingual, you need a model trained on multiple languages or you will see degraded retrieval for non-English content. Run a retrieval evaluation on a sample of real queries before committing to a model.
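Such an evaluation can be as simple as recall@k over a labeled set of query-to-relevant-chunk pairs. The `search_fn` parameter is a hypothetical retrieval function returning ranked chunk IDs for a query.

```python
def recall_at_k(eval_set, search_fn, k=5):
    """Fraction of queries whose known-relevant chunk appears in the top k.

    eval_set: list of (query, relevant_chunk_id) pairs.
    search_fn: returns a ranked list of chunk IDs for a query.
    """
    hits = sum(
        1 for query, relevant_id in eval_set
        if relevant_id in search_fn(query)[:k]
    )
    return hits / len(eval_set)
```

Running this against two or three candidate embedding models on the same evaluation set gives a direct, comparable quality number before you commit.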
Vector database choice. Managed services like Pinecone or Zilliz handle scaling and availability for you but come with per-query costs and vendor lock-in. Self-hosted options like Qdrant, Milvus, or pgvector give you more control and can be cheaper at scale, but you own the infrastructure. If your index is under a million vectors, pgvector on an existing Postgres instance is often the simplest starting point. Beyond that, a dedicated vector database is worth the operational overhead.
Reranking. A cross-encoder reranker takes the top candidates from vector search and rescores them by looking at the full query-passage pair jointly. This catches cases where the embedding-based first pass retrieves a passage that is superficially similar but not actually relevant. Reranking adds 50 to 200 milliseconds of latency, so it is a tradeoff. For high-stakes use cases like legal or medical search, the precision gain is worth the latency. For a chatbot answering general questions, you might skip it.
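The reranking step itself is a simple rescore-and-truncate, independent of which cross-encoder you use. Here `score_fn` is a hypothetical stand-in for a cross-encoder model that scores a (query, passage) pair jointly.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Rescore first-pass candidates with a cross-encoder style score_fn,
    then keep only the highest-scoring few for the prompt context."""
    rescored = sorted(
        candidates,
        key=lambda passage: score_fn(query, passage),
        reverse=True,
    )
    return rescored[:keep]
```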
Citation strategy. Extractive citations pull exact quotes from the source chunks, which is easy to verify but can feel choppy. Abstractive citations let the model paraphrase while pointing to the source, which reads more naturally but is harder to verify. A middle ground is to have the model include both a natural answer and a list of source references with page numbers or URLs so the user can check for themselves.
Production Considerations
Caching. In most search systems, a small set of queries accounts for a large fraction of traffic. Caching the retrieval results and generated answers for frequent queries can cut your costs and latency dramatically. Use a TTL-based cache keyed on the normalized query string. Invalidate cached entries when the underlying documents are updated. A simple Redis or in-memory LRU cache works well for this.
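A minimal in-process version of that cache might look like this; a production deployment would typically back the same logic with Redis. The one-hour default TTL is illustrative.

```python
import time

class QueryCache:
    """TTL cache keyed on the normalized query string."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}                # key -> (expires_at, answer)

    @staticmethod
    def _key(query):
        """Normalize so trivially different phrasings share an entry."""
        return " ".join(query.lower().split())

    def get(self, query):
        entry = self.entries.get(self._key(query))
        if entry and entry[0] > time.time():
            return entry[1]
        return None                      # missing or expired

    def put(self, query, answer):
        self.entries[self._key(query)] = (time.time() + self.ttl, answer)

    def invalidate_all(self):
        """Call when the underlying documents are updated."""
        self.entries.clear()
```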
Monitoring retrieval quality. You need to know whether the retriever is actually finding relevant passages. Track the average similarity score of the top retrieved chunks. If scores are consistently low, your embedding model may not fit your domain, or your chunking strategy may be producing fragments that do not carry enough meaning. Log a sample of queries alongside their retrieved chunks and review them periodically. Build a small evaluation set of query-answer pairs and run automated retrieval quality checks on every index update.
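Tracking that score signal can be done with a small rolling-window monitor. The 0.5 alert threshold here is purely illustrative; the right value depends on your embedding model and score distribution.

```python
from collections import deque

class RetrievalMonitor:
    """Rolling average of top-chunk similarity scores per query.

    A sustained drop suggests the embedding model or chunking strategy
    no longer fits the corpus or the queries.
    """

    def __init__(self, window=100, alert_below=0.5):
        self.per_query_means = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, top_scores):
        """Record the similarity scores of one query's top chunks."""
        self.per_query_means.append(sum(top_scores) / len(top_scores))

    def is_degraded(self):
        if not self.per_query_means:
            return False
        avg = sum(self.per_query_means) / len(self.per_query_means)
        return avg < self.alert_below
```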
Index freshness. Documents change. New ones arrive, old ones get updated or deleted. Your ingestion pipeline needs to handle incremental updates without rebuilding the entire index from scratch. This means tracking document versions, deleting stale vectors when a source document changes, and re-embedding only the modified chunks. For most systems, a combination of real-time ingestion for new documents and a nightly reconciliation job to catch deletions works well.
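One common way to implement that version tracking is a content hash per document: re-embed only when the hash changes, and delete stale vectors first. The `index` object here is hypothetical, assumed to expose `delete_chunks` and `add_document` methods.

```python
import hashlib

def sync_document(doc_id, text, seen_hashes, index):
    """(Re)index a document only when its content has changed.

    seen_hashes: dict mapping doc_id -> last indexed content hash.
    index: hypothetical object with delete_chunks(doc_id) and
           add_document(doc_id, text) methods.
    Returns True if the document was (re)indexed, False if skipped.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False                     # unchanged, skip re-embedding
    index.delete_chunks(doc_id)          # drop stale vectors first
    index.add_document(doc_id, text)
    seen_hashes[doc_id] = digest
    return True
```

A nightly reconciliation job can reuse the same hash map to find documents that disappeared from the source and delete their vectors.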
Fallback behavior and cost optimization. When retrieval returns nothing above your similarity threshold, the system should not attempt to generate an answer from empty context. Instead, return a message explaining that no relevant information was found and suggest the user rephrase their question. This is better than a hallucinated response that looks authoritative but is wrong. On the cost side, embedding API calls and vector database queries are your main recurring expenses. Batch embedding during ingestion rather than embedding one chunk at a time. Use approximate nearest neighbor search rather than exact search once your index grows past a few hundred thousand vectors. Monitor your monthly embedding and query costs and set alerts if they spike unexpectedly.
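The fallback check reduces to a threshold gate in front of generation. The 0.35 threshold and the `generate_fn` parameter are illustrative assumptions; tune the threshold against your embedding model's score distribution.

```python
def answer_or_fallback(candidates, generate_fn, min_score=0.35):
    """Refuse to generate when nothing retrieved clears the threshold.

    candidates: list of (similarity_score, chunk_text) from retrieval.
    generate_fn: hypothetical LLM call over the usable chunks.
    """
    usable = [text for score, text in candidates if score >= min_score]
    if not usable:
        return ("No relevant information was found for this question. "
                "Try rephrasing it or using different terms.")
    return generate_fn(usable)
```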