Prompt Caching is a pattern that stores and reuses the processed representation of common prompt prefixes to avoid redundant computation. When multiple requests share the same system prompt or context, the cached prefix eliminates re-processing, reducing both latency and cost.
What problem does Prompt Caching solve?
Your application handles 100,000 requests per day. You look at the logs and notice something interesting. A significant portion of those requests are variations of the same questions. "What is your return policy?" and "How do I return an item?" and "Can I send this back?" all need essentially the same answer. Yet each one triggers a full model inference, consuming compute, adding latency, and running up your API bill.
Even beyond exact duplicates, many production workloads share structural similarity. Every request to your coding assistant starts with the same 2,000-token system prompt. Every customer service query includes the same company knowledge base in the context. You are paying to process identical token sequences thousands of times per day, and the model does the same computational work each time as if it had never seen those tokens before.
At scale, this redundancy can dominate your costs. The model does not remember previous requests; each API call starts from scratch. Without intervention, you are leaving significant cost savings and latency improvements on the table.
How does Prompt Caching work?
Prompt caching eliminates redundant computation by reusing work that has already been done. There are two fundamentally different approaches, and they operate at different levels of the stack.
Client-side caching works at the response level. You store the full response for a given prompt and serve it directly when you see the same (or sufficiently similar) prompt again. The simplest version is exact-match caching. Hash the prompt, check if you have a cached response, and return it if you do. This works well for FAQ-style workloads where the same questions recur frequently.
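The hash-and-lookup flow above fits in a few lines. A minimal in-memory sketch (a production version would typically back this with Redis or a similar key-value store, and the light normalization shown is an assumption, not a requirement):

```python
import hashlib

class ExactMatchCache:
    """Minimal in-memory exact-match response cache keyed by a prompt hash."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Light normalization (case, whitespace) widens "exact" slightly
        # without changing meaning; hashing gives a fixed-size key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response

cache = ExactMatchCache()
cache.put("What is your return policy?", "Items can be returned within 30 days.")
print(cache.get("what is  your return policy?"))  # hit after normalization
print(cache.get("How do I return an item?"))      # paraphrase: miss, prints None
```

Note the limitation this illustrates: a paraphrase is a miss. That is exactly the gap semantic caching closes.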
Semantic caching extends this idea to handle paraphrases. Instead of hashing the raw prompt text, you compute an embedding vector and search for cached responses whose prompts are semantically similar. "What is your return policy?" matches "How do returns work?" because the embeddings are close in vector space. You set a similarity threshold, and any prompt within that threshold gets the cached response. This dramatically increases your cache hit rate but requires a vector store and an embedding model.
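The embed-and-compare loop can be sketched as follows. To stay self-contained, this uses a toy bag-of-words vector as a stand-in for a real embedding model, and a linear scan instead of a vector store; the threshold value is an illustrative assumption that would need tuning against real traffic:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an
    # embedding model and store dense vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, prompt: str):
        query = embed(prompt)
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(query, emb)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = response, sim
        return best  # None on a miss

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.6)
cache.put("what is your return policy", "Items can be returned within 30 days.")
print(cache.get("what is the return policy"))       # close enough: hit
print(cache.get("how do i reset my password"))      # unrelated: miss, None
```

With real embeddings the structure is identical; only `embed` and the nearest-neighbor search change.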
Server-side caching works at the computation level, inside the model's inference pipeline. When a model processes a sequence of tokens, it builds up internal representations called key-value (KV) states for each token. Prefix caching, offered by providers like Anthropic and Google, saves these KV states for token sequences that appear at the start of your prompts. When a new request shares the same prefix (your system prompt, for example), the provider skips recomputing those states and starts from the cached intermediate result. You see this as reduced latency and lower per-token costs, often 50-90% cheaper for the cached portion.
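In practice, using prefix caching is mostly a matter of request structure. As a sketch, Anthropic's Messages API lets you mark a cacheable block with `cache_control` (the field names below follow Anthropic's documented API; the model name is a placeholder, and other providers such as OpenAI cache long prefixes automatically with no markup at all). Building the request as a plain dict keeps the point visible: everything stable comes first, only the user query varies:

```python
def build_request(system_prompt: str, knowledge_base: str, user_query: str) -> dict:
    """Place stable content first so the provider can reuse its KV states.

    Only `user_query` changes between requests; the system prompt and
    knowledge base form a constant prefix.
    """
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            # cache_control marks the boundary up to which the prefix
            # should be cached.
            {"type": "text", "text": knowledge_base,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": user_query}],
    }

a = build_request("You are a support agent.", "Returns: 30 days...", "Can I send this back?")
b = build_request("You are a support agent.", "Returns: 30 days...", "Do you ship to Canada?")
print(a["system"] == b["system"])  # True: identical prefix, cacheable across calls
```

If the knowledge base were interleaved with the user query, the shared prefix would end early and most of the benefit would be lost.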
The two approaches are complementary. Server-side prefix caching reduces the cost of processing shared context. Client-side semantic caching eliminates the model call entirely for repeated queries. A production system can use both.
When should you use Prompt Caching?
Start with prompt caching when your API costs are a concern and your workload has any repetition. Look at your logs. If more than 10-15% of requests are semantically similar, caching will pay for itself quickly.
Prefix caching is nearly free to adopt if your provider supports it. Many providers automatically cache prefixes beyond a certain length. You just need to structure your prompts so that the shared content (system instructions, knowledge base, few-shot examples) appears at the beginning. This is a prompt engineering change, not an infrastructure project.
Semantic caching makes sense when your users ask the same types of questions in different ways. Customer support, FAQ bots, and documentation assistants are ideal candidates. The investment is moderate: you need an embedding model and a vector store, but these are standard components in most AI stacks.
Exact-match caching is the lowest-effort option and works well for programmatic use cases where the same prompts recur literally. Batch processing jobs, automated report generation, and CI/CD pipelines that run the same analysis repeatedly all benefit from simple hash-based caching.
What are the common pitfalls?
Stale caches are the primary risk. If the correct answer changes (your return policy updates, product information changes, prices shift) but the cache still holds the old response, users get outdated information. You need a cache invalidation strategy. Time-based expiration is the simplest approach. Event-driven invalidation (clearing relevant cache entries when underlying data changes) is more precise but harder to implement.
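Time-based expiration can be layered onto any cache with a stored timestamp per entry. A minimal sketch combining a TTL with an explicit `invalidate` hook for the event-driven case:

```python
import time

class TTLCache:
    """Exact-match cache whose entries expire after ttl_seconds,
    bounding how stale a served response can be."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, stored_at)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and treat as a miss
            return None
        return response

    def put(self, key: str, response: str) -> None:
        self._store[key] = (response, time.monotonic())

    def invalidate(self, key: str) -> None:
        # Event-driven path: call this when the underlying data changes,
        # e.g. when the return policy document is edited.
        self._store.pop(key, None)
```

A lazy check on read, as here, is simpler than a background sweeper; the trade-off is that expired entries linger in memory until someone asks for them.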
Semantic cache matching can produce false positives. Two prompts that are semantically similar but require different answers will return the wrong cached response. "What is the price of Product A?" and "What is the price of Product B?" might have similar embeddings but need different answers. Your similarity threshold needs careful tuning, and you may need to include structured metadata (product ID, user segment) in your cache key alongside the semantic embedding.
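One way to enforce that separation is a composite key: structured fields guard the textual match, so near-identical prompts about different products can never share an entry. The field names below (`product_id`, `user_segment`) are illustrative; in a semantic cache the same idea becomes partitioning the embedding search by those fields.

```python
import hashlib

def cache_key(prompt: str, product_id: str, user_segment: str) -> str:
    """Composite cache key: metadata is part of the key, so the wording
    match alone can never cause a cross-product collision."""
    normalized = " ".join(prompt.lower().split())
    raw = f"{product_id}|{user_segment}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cache = {}
cache[cache_key("What is the price?", "product-a", "retail")] = "$19"
cache[cache_key("What is the price?", "product-b", "retail")] = "$29"
# Identical wording, different products, distinct entries:
print(cache[cache_key("What is the price?", "product-a", "retail")])  # $19
print(cache[cache_key("What is the price?", "product-b", "retail")])  # $29
```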
Cache poisoning is a concern in adversarial environments. If an attacker can craft a prompt that gets cached and then served to other users, they can influence the responses those users see. This is mainly a risk with shared caches across users. Per-user caches avoid this but reduce hit rates.
Over-caching non-deterministic responses can hurt quality. If your application benefits from response variety (creative writing, brainstorming), caching kills that variety. Not every workload benefits from caching.
What are the trade-offs?
Semantic caching requires infrastructure. You need an embedding model, a vector store, and the operational overhead of maintaining both. For small-scale applications, this overhead may exceed the cost savings. Simple exact-match caching with a key-value store is often a better starting point.
Cache hit rate determines the value of the entire system. If your workload is highly diverse with few repeated patterns, your hit rate will be low and the infrastructure cost will not be justified. Measure your actual repetition rate before investing heavily in caching.
Freshness and cost savings are in tension. Shorter cache TTLs keep responses fresh but reduce hit rates. Longer TTLs maximize savings but increase the risk of serving stale data. The right balance depends on how frequently your underlying information changes.
Prefix caching is largely free of trade-offs if your provider offers it, though some providers charge a small premium to write a prefix into the cache, so a prefix that is almost never reused can cost slightly more than no caching at all. The main consideration is prompt structure. You get maximum benefit when the shared prefix is long and the variable suffix is short. Reorganizing your prompts to front-load shared content is usually straightforward.
Goes Well With
Small Language Models pair well with caching for a layered cost optimization strategy. Cache the most common requests to avoid model calls entirely. Route the remaining requests to a smaller, cheaper model when possible. Reserve the large model for the long tail of complex queries. Each layer reduces the load on the next.
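The layering can be sketched as a simple router. Here `small_model`, `large_model`, and `is_complex` are stand-ins for real model clients and a routing heuristic; `cache` is any object with `get`/`put`:

```python
class DictCache:
    """Trivial cache with the get/put interface the router expects."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def put(self, key, value):
        self._d[key] = value

def answer(query, cache, small_model, large_model, is_complex):
    """Cheapest layer that can handle the request wins."""
    cached = cache.get(query)
    if cached is not None:
        return cached                      # layer 1: no model call at all
    model = large_model if is_complex(query) else small_model
    response = model(query)                # layer 2 or 3
    cache.put(query, response)             # future duplicates hit layer 1
    return response

# Usage with stub models standing in for real API clients:
cache = DictCache()
small = lambda q: "small-model answer"
large = lambda q: "large-model answer"
is_complex = lambda q: len(q.split()) > 8

print(answer("What is your return policy?", cache, small, large, is_complex))
print(answer("What is your return policy?", cache, small, large, is_complex))  # cache hit
```

The routing heuristic is the weak point of any such sketch; in production it might itself be a small classifier rather than a length check.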
Inference Optimization addresses the same cost and latency concerns from the infrastructure side. Caching eliminates redundant calls. Inference optimization makes the remaining calls faster and cheaper. Together, they provide compounding improvements.
Basic RAG systems benefit significantly from prefix caching. The retrieved context often shares a common structure, and the system prompt plus retrieval instructions can be cached as a prefix. This reduces the per-query overhead of the RAG pipeline.
References
- Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.