Reflection is a pattern where an LLM critiques and iteratively improves its own output. After generating an initial response, the model evaluates it against quality criteria, identifies weaknesses, and produces a revised version, repeating until the output meets the desired standard.
What problem does Reflection solve?
A single pass through a language model is often not enough. The first response might miss a requirement, contain a factual error, lack depth in a critical area, or simply be mediocre when you need it to be good. You can see the problems when you read the output, and you know that with specific feedback the model could do better. But feeding the same prompt again just rolls the dice on a different mediocre response.
This is not a limitation of the model's knowledge. It is a limitation of single-pass generation. The model produces output left to right, one token at a time, and it cannot go back and revise earlier decisions based on how the rest of the output turned out. A human writer drafts, re-reads, notices problems, and revises. The model only drafts.
The gap becomes especially visible with complex outputs. A long technical explanation might have a solid opening but lose coherence in the middle section. A code generation response might get the algorithm right but miss an edge case. A summarization might capture the main points but omit a critical detail. In each case, the output is 80% of the way there but that last 20% requires the kind of revision that single-pass generation cannot provide.
How does Reflection work?
Reflection implements the draft-review-revise cycle that good human work goes through. The process has four stages that repeat. First, generate an initial response. Second, send that response to an evaluator that identifies specific problems, gaps, or areas for improvement. Third, use the critique to construct a revised prompt that asks the model to fix the identified issues. Fourth, generate an improved response. Repeat until the output meets your quality bar or you hit a maximum number of iterations.
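The four stages can be sketched as a plain loop. The `generate`, `evaluate`, and `revise` functions below are hypothetical stubs standing in for real model calls, just to make the control flow concrete:

```python
# Minimal sketch of the four-stage loop. generate, evaluate, and revise are
# hypothetical stubs standing in for real model calls.
def generate(task):
    return f"draft for: {task}"

def evaluate(draft):
    # Stage 2: return a list of specific issues; empty means the bar is met.
    return ["too vague"] if "revised" not in draft else []

def revise(draft, issues):
    # Stages 3-4: a targeted rewrite that addresses the listed issues.
    return draft + " (revised to address: " + "; ".join(issues) + ")"

def reflection_loop(task, max_rounds=3):
    draft = generate(task)            # stage 1: initial response
    for _ in range(max_rounds):
        issues = evaluate(draft)      # stage 2: critique
        if not issues:
            break                     # quality bar met
        draft = revise(draft, issues) # stages 3-4: targeted revision
    return draft
```

The shape is the whole pattern: the only thing that changes between applications is what sits behind `evaluate`.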
The evaluator is the engine of the whole loop. It can be another LLM prompted with a rubric, a rule-based checker, an external tool that validates factual claims, or even a human reviewer inserted at a checkpoint. What matters is that it produces specific, actionable critique rather than a generic quality score. "The second paragraph incorrectly states that Python uses static typing" is useful feedback. "Score: 3 out of 5" is not, because the generator has nothing concrete to act on.
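A rule-based checker of this kind can be as simple as a checklist scan. The topic list and message format here are illustrative, not part of any library:

```python
# Hypothetical rule-based evaluator: check a draft against a checklist of
# required topics and return specific, actionable feedback.
def checklist_critique(draft, required_topics):
    missing = [t for t in required_topics if t.lower() not in draft.lower()]
    if not missing:
        return "NO ISSUES"
    return "Missing required topics: " + ", ".join(missing)
```

Because the feedback names the exact topics that are absent, the generator knows precisely what to add in the next revision, unlike a bare numeric score.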
A crucial design choice is whether the evaluator is the same model as the generator. Using the same model is simpler but creates a blind spot. The model may not catch its own mistakes because it has the same biases and knowledge gaps that produced those mistakes in the first place. Using a different model, or a different evaluation approach entirely, introduces a genuinely independent perspective. A fact-checking tool that verifies claims against a database catches errors that no language model would notice through text analysis alone.
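A non-LLM evaluator of that kind might look like the following sketch, where `FACTS` is a toy stand-in for a real reference database:

```python
# Hypothetical tool-based evaluator: verify claims against a reference table
# instead of asking a model to assess its own output. FACTS is a toy
# stand-in for a real database lookup.
FACTS = {"python typing": "dynamic", "default http port": "80"}

def fact_check(claims):
    # claims: dict mapping a topic to the value the draft asserts.
    errors = []
    for topic, asserted in claims.items():
        known = FACTS.get(topic)
        if known is not None and known != asserted:
            errors.append(f"Claim about {topic!r} is wrong: should be {known!r}")
    return errors
```

Extracting structured claims from free text is its own problem, but once extracted, this kind of check is independent of the generator's biases by construction.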
The loop tends to converge because each iteration addresses specific identified issues rather than generating from scratch. The revised prompt carries forward what was good about the previous response and targets what was wrong. Progress is cumulative. The first iteration might fix factual errors. The second might improve structure. The third might add missing details. When the loop is working, each cycle improves the output along the dimensions the evaluator checks.
When should you use Reflection?
Reflection shines when output quality has a high ceiling and the first draft consistently falls short. Writing tasks, code generation, detailed analysis, and complex summarization all benefit from iterative refinement.
It is especially useful when you can define clear, checkable quality criteria. If you can say "the output must mention all five of these topics" or "the code must handle these three edge cases," those criteria become the evaluator's checklist. Each iteration brings the output closer to meeting all criteria.
Use reflection when the cost of a wrong or mediocre output exceeds the cost of multiple API calls. If you are generating a customer-facing report that will be read by executives, spending three iterations to get it right is a better investment than shipping a first-draft-quality document.
It is also valuable during development and prompt tuning. Running a reflection loop on sample inputs shows you what kinds of errors your prompts tend to produce, which informs how to improve the prompts themselves.
Avoid reflection for tasks where the first response is already good enough. Simple factual questions, straightforward formatting tasks, and basic classification do not need iterative refinement. The overhead is not justified.
Implementation
```python
# Using OpenAI SDK for illustration — swap client for any provider
from openai import OpenAI

client = OpenAI()


def reflect_and_revise(task: str, max_rounds: int = 3) -> str:
    """Generate, critique, and revise in a loop."""
    # Step 1: Initial generation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
    )
    draft = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Step 2: Critique the current draft
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Critique this response. List specific weaknesses and suggest improvements. If it's good enough, say 'NO ISSUES'.\n\nTask: {task}\nResponse:\n{draft}",
            }],
        )
        feedback = critique.choices[0].message.content
        if "NO ISSUES" in feedback:
            break

        # Step 3: Revise based on critique
        revision = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Revise this response based on the feedback.\n\nOriginal task: {task}\nCurrent draft:\n{draft}\nFeedback:\n{feedback}",
            }],
        )
        draft = revision.choices[0].message.content

    return draft


# Usage
result = reflect_and_revise(
    "Write a clear explanation of how HTTPS works for a non-technical audience."
)
print(result)
```
What are the common pitfalls?
The most frustrating failure mode is oscillation. The evaluator flags an issue, the generator fixes it, and in doing so introduces a different issue that the next evaluation round flags. The output bounces between two imperfect states without converging on a good one. This usually indicates that the evaluation criteria are conflicting or that the generator cannot satisfy all criteria simultaneously.
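One cheap, hypothetical safeguard is to detect when the loop revisits a state it has already produced, which is the signature of oscillation:

```python
# Hypothetical oscillation guard: if a draft repeats (tracked by hash), the
# loop is bouncing between states rather than converging, so stop early.
import hashlib

def is_oscillating(draft, seen_hashes):
    digest = hashlib.sha256(draft.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Exact-hash matching only catches verbatim repeats; near-duplicates need a fuzzier similarity check, but even the exact version catches the common two-state bounce.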
Over-iteration degrades quality instead of improving it. After a certain number of rounds, the model starts making changes for the sake of change rather than genuine improvement. It might add unnecessary hedging language, restructure paragraphs without benefit, or introduce new errors while trying to address minor critique points. Setting a maximum iteration count and stopping when improvements become marginal prevents this.
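If the evaluator also emits a numeric score alongside its written critique (an assumption, not something the basic loop provides), the stopping rule can be made explicit:

```python
# Hypothetical stopping rule, assuming the evaluator emits a numeric score
# per round: halt once the round-over-round gain falls below a threshold.
def should_stop(scores, min_delta=0.05):
    if len(scores) < 2:
        return False  # not enough history to measure improvement
    return (scores[-1] - scores[-2]) < min_delta
```

This complements the maximum iteration count: the hard cap bounds the worst case, while the marginal-improvement check ends well-behaved runs early.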
Evaluator quality is the ceiling for the entire loop. If the evaluator misses a class of errors, no amount of iteration will fix them. If the evaluator flags things that are not actually problems, the generator will waste cycles making unnecessary changes. The evaluator needs to be at least as discerning as the quality bar you are trying to meet.
Context window pressure builds with each iteration. The revised prompt needs to include the previous output plus the critique plus new instructions. After several rounds, the accumulated context can crowd out important details or push past the model's effective context length. Summarizing previous feedback rather than including it verbatim helps manage this.
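A minimal sketch of that summarization step, keeping only the most recent feedback verbatim and collapsing earlier rounds to their first line (the compression heuristic is an illustrative choice, not a standard):

```python
# Hypothetical history compressor: keep the most recent feedback verbatim
# and reduce earlier rounds to their first line to limit prompt growth.
def compress_history(feedback_rounds, keep_last=1):
    older = feedback_rounds[:-keep_last]
    recent = feedback_rounds[-keep_last:]
    summaries = [
        f"Round {i + 1}: {fb.splitlines()[0]}" for i, fb in enumerate(older)
    ]
    return "\n".join(summaries + recent)
```

An LLM call can do the summarizing instead of a first-line truncation, at the price of one extra call per round.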
What are the trade-offs?
Latency multiplies with each iteration. A three-round reflection loop takes roughly three times as long as a single generation. For real-time applications, this may be too slow. You can mitigate this by running reflection asynchronously and presenting an initial response while refinement happens in the background, but this adds complexity.
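The background-refinement idea can be sketched with asyncio; `generate` and `refine` here are hypothetical stubs standing in for the model call and the full reflection loop:

```python
# Hypothetical async variant: hand the first draft back immediately and let
# the reflection loop finish in the background. generate and refine are
# stubs standing in for real model calls.
import asyncio

async def generate(task):
    return f"draft: {task}"

async def refine(draft):
    await asyncio.sleep(0)  # stands in for the full reflection loop
    return draft + " (refined)"

async def respond(task):
    draft = await generate(task)
    refinement = asyncio.create_task(refine(draft))  # runs in background
    return draft, refinement  # show the draft now, await refinement later

async def main():
    draft, refinement = await respond("explain HTTPS")
    # ...present draft to the user here...
    final = await refinement
    return draft, final
```

The added complexity the text warns about shows up in the caller: it now has to decide when and how to swap the refined version in for the draft already shown.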
Cost scales linearly with the number of iterations, and potentially more if the evaluator is also an LLM. Each cycle involves at least one generation call and one evaluation call. For a three-round loop with separate evaluation, that is up to seven API calls per output: the initial generation plus a critique and a revision in each round.
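Counting the calls in the implementation above gives a simple worst-case bound, one initial generation plus a critique call and a revision call for every round that does not converge:

```python
# Worst-case API call count for a reflection loop with a separate LLM
# evaluator: 1 initial generation + 2 calls (critique, revise) per round.
def max_api_calls(max_rounds: int) -> int:
    return 1 + 2 * max_rounds
```

The best case is three calls regardless of the cap: one generation plus one critique that returns no issues, then the loop exits.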
The quality ceiling depends on the weakest component. If the generator cannot produce good output even with perfect feedback, more iterations will not help. If the evaluator cannot identify the real problems, the generator has nothing useful to work with. Both components need to be good enough for the loop to converge on quality.
You trade variable quality for variable latency and cost. A single pass has fixed cost but variable quality. A reflection loop has more consistent quality but variable cost, since some inputs converge in one iteration while others need the full maximum.
Goes Well With
LLM-as-Judge provides a structured, rubric-based evaluator for the reflection loop. Instead of ad-hoc critique, the judge scores the output on specific dimensions and explains where it falls short. This gives the generator precise, consistent feedback to work with.
Chain-of-Thought improves both the generator and the evaluator within the loop. The generator can reason through its revisions step by step, producing more thoughtful improvements. The evaluator can reason through its critique, catching subtle issues that a quick assessment would miss.
Prompt Optimization can tune the prompts used within the reflection loop. The evaluation prompt, the revision prompt, and even the initial generation prompt can all be optimized against a dataset. This is meta-optimization, tuning the loop itself rather than just the output.
References
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
- Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
Further Reading
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023) — Introduces a framework where agents reflect on task feedback through verbal self-critique, converting failed attempts into learning signals that improve subsequent tries. arXiv:2303.11366