Prompt Chaining is a pattern that breaks a complex task into a sequence of simpler LLM calls, where each call's output feeds into the next call's input. This decomposition makes each step easier to debug, validate, and optimize independently.
What problem does Prompt Chaining solve?
You have a task that involves multiple distinct operations. Maybe you need to extract information from a document, categorize it, and then generate a summary based on the categories. You try to do it all in a single prompt, with detailed instructions covering every step. The result is inconsistent. Sometimes the extraction is good but the categorization is off. Other times the summary ignores half the extracted data.
The root cause is that a single prompt asking the model to juggle multiple objectives creates competing pressures. The model is trying to be thorough with extraction, precise with classification, and concise with summarization, all at the same time. It does not have a way to focus on one step, verify the result, and then move on. Everything happens in a single pass, and the quality of each subtask suffers.
This gets worse as complexity increases. A three-step task might work in a single prompt if each step is simple. A five-step task with interdependencies almost never will. The model loses track of intermediate results, confuses instructions from different steps, and produces output that is a muddy compromise rather than a clean result.
How does Prompt Chaining work?
Prompt chaining decomposes a complex task into a pipeline of simpler prompts. Each prompt handles one well-defined step. The output of one step becomes the input to the next. Instead of one prompt doing five things poorly, you have five prompts each doing one thing well.
Consider a content moderation pipeline. Step one extracts potentially problematic phrases from a piece of text. Step two classifies each phrase according to your policy categories. Step three decides on an overall action based on the classifications. Each step has a clear input, a clear output, and a focused instruction set. The model at each stage can dedicate its full attention to a single task.
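The three-step moderation pipeline above can be sketched in a few lines. The `call_llm` function here is a stand-in for a real model client, stubbed with trivial canned responses so the structure of the chain itself is runnable; every name in it is illustrative, not part of any real API.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; routes on a tag in the prompt.
    if prompt.startswith("EXTRACT"):
        return "free money now, click this link"
    if prompt.startswith("CLASSIFY"):
        return "spam"
    return "remove"

def moderate(text: str) -> dict:
    # Step 1: extract potentially problematic phrases.
    phrases = call_llm(f"EXTRACT problematic phrases from:\n{text}")
    # Step 2: classify each phrase against policy categories.
    categories = call_llm(f"CLASSIFY these phrases:\n{phrases}")
    # Step 3: decide an overall action from the classifications.
    action = call_llm(f"DECIDE an action for categories:\n{categories}")
    # Every intermediate output is kept, giving natural inspection points.
    return {"phrases": phrases, "categories": categories, "action": action}

result = moderate("free money now, click this link")
```

Returning the intermediate outputs alongside the final action is what makes each stage inspectable later.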
The power of this approach goes beyond just splitting work. Because each step produces an explicit intermediate output, you get natural inspection points. You can look at the extracted phrases before classification happens. If the extraction missed something, you know exactly where the problem is. With a monolithic prompt, debugging means re-reading the entire output and guessing which part of the instruction the model mishandled.
There is another advantage that is easy to overlook. Different steps in your pipeline can use different models. A cheap, fast model might handle straightforward extraction while a more capable model handles nuanced classification. You can also add non-LLM steps into the chain. A database lookup, a rules-based filter, a formatting function. The chain does not need to be LLM calls all the way through.
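One way to sketch this heterogeneity, under the assumption that every step shares a string-in, string-out signature: a chain is just an ordered list of functions, where some would call a model and others are plain code. The model calls are stubbed here so only the chaining structure is shown.

```python
from typing import Callable

# A step is any function from string to string; some call a model,
# others are plain code (a rules-based filter, a formatting function).
Step = Callable[[str], str]

def cheap_model_extract(text: str) -> str:
    # Imagine a fast, inexpensive model handling extraction here.
    return text.upper()

def rules_filter(text: str) -> str:
    # A non-LLM step: deterministic code in the middle of the chain.
    return text.replace("!", "")

def capable_model_classify(text: str) -> str:
    # Imagine a more capable model handling nuanced classification here.
    return f"label:{text}"

def run_chain(steps: list[Step], data: str) -> str:
    for step in steps:
        data = step(data)
    return data

out = run_chain([cheap_model_extract, rules_filter, capable_model_classify], "hi!")
```

Because every step has the same signature, swapping a model call for a rules-based function is a one-line change.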
When should you use Prompt Chaining?
The clearest signal is when you have a task with naturally separable stages. If you can draw a flowchart of the process with distinct boxes and arrows between them, prompt chaining is likely the right approach.
Another strong signal is when different parts of the task have different reliability requirements. If extraction needs to be exhaustive but summarization can be lossy, separating them lets you tune each step independently. You might retry the extraction step if it looks incomplete without re-running the summarization.
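The step-level retry idea can be sketched as follows. The extraction function and the completeness check are both hypothetical stubs; the point is that only the unreliable step loops, while downstream steps never re-run.

```python
def extract_facts(text: str, attempt: int) -> list[str]:
    # Stub extraction: pretend later attempts recover more facts.
    return text.split()[: attempt + 1]

def looks_complete(facts: list[str]) -> bool:
    # Cheap completeness check used to decide whether to retry.
    return len(facts) >= 3

def extract_with_retry(text: str, max_tries: int = 3) -> list[str]:
    # Retry only the extraction step; summarization is untouched.
    facts: list[str] = []
    for attempt in range(max_tries):
        facts = extract_facts(text, attempt)
        if looks_complete(facts):
            break
    return facts

facts = extract_with_retry("acme acquired widgetco for ten million")
```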
Prompt chaining also makes sense when you need deterministic control flow. If the second step should only run when the first step finds certain conditions, you need the ability to branch. A single prompt cannot conditionally skip parts of its own execution, but a pipeline can route outputs to different next steps based on intermediate results.
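A minimal sketch of that branching, with the model call stubbed out: the escalation step runs only when the first step finds triggering conditions, which is exactly the control flow a single prompt cannot express.

```python
def find_policy_hits(text: str) -> list[str]:
    # Stand-in for step one: a real version would call a model here.
    return [w for w in text.split() if w == "spam"]

def escalate(hits: list[str]) -> str:
    # Step two, run only when step one found something.
    return f"escalated:{len(hits)}"

def approve(text: str) -> str:
    return "approved"

def moderate(text: str) -> str:
    # Deterministic branch on the intermediate result.
    hits = find_policy_hits(text)
    if hits:
        return escalate(hits)
    return approve(text)
```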
Avoid prompt chaining for tasks that are genuinely atomic. If you are asking the model to translate a paragraph or answer a simple question, splitting the task into steps adds latency and complexity for no benefit.
Implementation
# Using OpenAI SDK for illustration — swap client for any provider
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def extract_then_summarize(article: str) -> str:
    """Two-step chain: extract key facts, then summarize them."""
    # Step 1: Extract structured facts
    facts = call_llm(
        f"Extract the 5 most important facts from this article as a numbered list.\n\n{article}"
    )
    # Step 2: Summarize the extracted facts
    summary = call_llm(
        f"Write a 2-sentence executive summary based on these facts:\n\n{facts}"
    )
    return summary

# Usage
article = "OpenAI announced GPT-5 today with 2x performance improvements..."
print(extract_then_summarize(article))
What are the common pitfalls?
Error propagation is the primary risk. A mistake in step one cascades through every subsequent step. If the extraction misses a key entity, the classification step will never see it, and the summary will be incomplete. Each step trusts the output of the previous step completely, so one weak link compromises the whole chain.
Format coupling between steps is a common source of bugs. Step one needs to output data in exactly the format step two expects. If the extraction step returns entities as a comma-separated list but the classification step expects JSON, the chain breaks silently. You end up spending significant effort on the interface contracts between steps.
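One way to harden that interface contract, sketched here with a hypothetical parser: validate the upstream output before the next step consumes it, so a format mismatch fails loudly instead of silently.

```python
import json

def parse_entities(raw: str) -> list[str]:
    # Enforce the contract between steps: the classification step
    # expects a JSON list of strings. Failing loudly here is better
    # than letting a format mismatch break the chain silently.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"step output is not valid JSON: {raw!r}") from exc
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError(f"expected a JSON list of strings, got: {raw!r}")
    return data

entities = parse_entities('["acme corp", "widgetco"]')
```

A comma-separated list from the extraction step now raises immediately, pinpointing the broken interface rather than corrupting downstream steps.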
Latency adds up. Each step is a separate API call with its own round-trip time. A five-step chain might take five times as long as a single prompt, and if any step needs retrying, it takes even longer. For user-facing applications, this cumulative latency can push response times past acceptable limits.
Context loss is another failure mode. Each step only sees its own input, not the full original context. If step three needs information from the original input that step one did not pass through, it is simply unavailable. You need to think carefully about what context each step requires and make sure it is carried forward.
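One common remedy, sketched with stubbed steps: pass a shared state dict through the chain so each step adds its output without discarding the original input. The step functions here are placeholders for real model calls.

```python
def extract_step(state: dict) -> dict:
    # Adds its output to the state instead of replacing the context.
    state["facts"] = state["original"].split(".")[0]
    return state

def summarize_step(state: dict) -> dict:
    # A later step can still read the original input, not just the
    # previous step's output.
    state["summary"] = f"{state['facts']} ({len(state['original'])} chars)"
    return state

state = {"original": "First fact. More detail follows."}
for step in (extract_step, summarize_step):
    state = step(state)
```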
What are the trade-offs?
You trade simplicity for control. A single prompt is easy to understand and maintain. A pipeline has multiple prompts, data transformations between them, error handling at each step, and retry logic. The operational complexity is real, and it scales with the number of steps.
Cost is higher in terms of total tokens consumed but potentially lower in terms of cost per correct output. A single prompt that fails 30% of the time and needs re-running might actually cost more than a three-step chain that succeeds 95% of the time on the first try. It depends on your failure rates and retry strategies.
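The arithmetic behind that comparison can be made concrete. All numbers below are illustrative assumptions, not measurements: the single prompt succeeds 70% of the time, while the chain spends roughly 1.2x the tokens per attempt but succeeds 95% of the time.

```python
def expected_cost(cost_per_attempt: float, success_rate: float) -> float:
    # With independent retries until success, expected attempts = 1 / p,
    # so expected cost = cost per attempt / success rate.
    return cost_per_attempt / success_rate

single_prompt = expected_cost(cost_per_attempt=1.0, success_rate=0.70)
chained = expected_cost(cost_per_attempt=1.2, success_rate=0.95)
```

Under these made-up numbers the chain comes out cheaper per correct output despite consuming more tokens per attempt; with different failure rates the comparison can flip.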
You gain debuggability but lose atomicity. A single prompt either works or it does not. A pipeline can partially succeed, which means you need to handle partial failures. What do you do when step three fails but steps one and two succeeded? Do you retry from step three or start over?
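One answer to the partial-failure question is checkpointing, sketched here with stubbed steps: persist each step's output so a retry resumes from the last good result instead of starting the chain over. The transient failure is simulated so the resume path is exercised.

```python
from typing import Callable

Step = Callable[[str], str]

def run_with_checkpoints(steps: list[Step], data: str, checkpoints: dict) -> str:
    # Save each step's output; on retry, completed steps are skipped.
    for i, step in enumerate(steps):
        if i in checkpoints:
            data = checkpoints[i]  # resume from the saved output
            continue
        data = step(data)
        checkpoints[i] = data
    return data

def extract(x: str) -> str:
    return x + "-extracted"

def classify(x: str) -> str:
    return x + "-classified"

calls = {"n": 0}
def summarize(x: str) -> str:
    # Fails once to simulate a transient error at the final step.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return x + "-summarized"

saved: dict = {}
try:
    run_with_checkpoints([extract, classify, summarize], "doc", saved)
except RuntimeError:
    pass  # steps one and two are already checkpointed

# The retry skips the completed steps and re-runs only the failed one.
result = run_with_checkpoints([extract, classify, summarize], "doc", saved)
```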
Development speed is slower upfront but faster for iteration. Changing one step in a pipeline does not require re-testing the entire chain from scratch. You can swap out the classification model, or adjust the extraction prompt, and only validate that specific step plus any downstream effects.
Goes Well With
Chain-of-Thought can be applied within individual steps of the chain. Each step gets its own reasoning trace, making the overall pipeline both modular and transparent. The chain handles decomposition across steps while CoT handles reasoning within each step.
Few-Shot Prompting works well at each stage of the pipeline. Because each step has a focused task, the few-shot examples can be highly targeted. The extraction step gets extraction examples, the classification step gets classification examples. This is more effective than trying to create examples that cover the entire end-to-end process.
Prompt Optimization becomes more practical with prompt chaining because you can optimize each step independently. Instead of optimizing one massive prompt, you optimize several small ones, each against its own evaluation criteria. This is a more tractable search problem and tends to converge faster.
References
- Wu, T., et al. (2022). AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. CHI 2022.