## How do they differ?
Both patterns exist because a single LLM call is unreliable. One pass through a model can produce a wrong answer, a shallow analysis, or a subtly flawed piece of code. Self-Consistency and Reflection both address this, but they operate on different theories of why models fail.
Self-Consistency assumes that errors are stochastic. The model knows the right answer but sometimes takes a wrong reasoning path due to sampling randomness. The fix: sample multiple times and let the majority vote win. If three out of five attempts produce the same answer, that answer is probably correct.
Reflection assumes that errors are systematic. The model produces a flawed output not because of bad luck but because it did not think carefully enough. The fix: have the model critique its own output and try again, informed by the critique. Each iteration should be better than the last because the model learns from its own mistakes.
These are different failure models, and they lead to different architectures.
| Dimension | Self-Consistency | Reflection |
|---|---|---|
| Execution model | Parallel (multiple samples at once) | Sequential (generate, critique, regenerate) |
| Latency | One round-trip (parallel calls) | Multiple round-trips (2-4 iterations typical) |
| Cost | N parallel calls (N typically 3-7) | ~2 calls per iteration (generation + critique), so 4-8 calls over 2-4 iterations |
| Best for | Tasks with a clear "correct" answer | Tasks where quality is subjective or multi-dimensional |
| Error type caught | Stochastic (random reasoning failures) | Systematic (shallow analysis, missed edge cases) |
| Aggregation method | Majority vote, weighted vote, or best-of-N | Take final iteration output |
| Diminishing returns | After ~5-7 samples | After ~2-3 iterations |
| Parallelizable | Fully | Not at all (each step depends on the previous) |
## The fundamental tradeoff
Self-Consistency trades cost for reliability at constant latency. You pay for N calls but they all run simultaneously, so wall-clock time barely increases. Reflection trades latency for quality. You pay for sequential round-trips, each one building on the previous output, so total time grows linearly with the number of iterations.
This means Self-Consistency fits naturally into latency-sensitive applications where you can afford the extra API cost. Reflection fits into quality-sensitive applications where users are willing to wait a few extra seconds for a better result.
## When to use Self-Consistency
Self-Consistency works best when there is a verifiable or at least consensus-checkable answer.
- Math and logic problems. When the model solves a math problem, it might take a wrong step in its chain of thought. But if you sample five times and four attempts arrive at the same numerical answer, you can be confident in that answer. The original Self-Consistency paper by Wang et al. (2022) showed significant accuracy improvements on arithmetic and commonsense reasoning benchmarks.
- Code generation with test cases. Generate multiple code solutions and run them against test cases. Pick the one that passes the most tests. This is a concrete form of Self-Consistency where the "vote" is automated by test execution.
- Classification tasks. When you need the model to classify a document, label a support ticket, or categorize an intent, sampling multiple times and taking the majority label smooths out random misclassifications.
- Factual question answering. For questions with a single factual answer, multiple samples reduce the chance of a confidently wrong response.
- Extraction tasks. Pulling structured data (names, dates, amounts) from unstructured text. If three out of five extractions agree on the dollar amount, that is likely the correct value.
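The test-case variant above can be sketched in a few lines. This is a minimal illustration, not a production harness: it assumes the candidate solutions have already been parsed into callables and that tests arrive as (args, expected) pairs. A real pipeline would compile generated source and sandbox its execution.

```python
def pick_by_tests(candidates, tests):
    """Select the candidate solution that passes the most test cases.

    candidates: callables (assumed already parsed from model output)
    tests: iterable of (args_tuple, expected_result) pairs
    """
    def score(fn):
        passed = 0
        for args, expected in tests:
            try:
                if fn(*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crash counts as a failed test, not an error
        return passed
    return max(candidates, key=score)
```

Here the "vote" is fully automated: test execution replaces majority agreement as the consensus signal.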
The pattern breaks down when there is no meaningful concept of consensus. If you ask the model to write a poem, five different poems are neither "wrong" nor "right." Majority voting does not apply.
### Implementation considerations
The most common implementation samples N completions with temperature > 0 (typically 0.5-0.8 for good diversity) and applies majority voting on the final answer. For structured outputs, exact match voting works well. For free-text answers, you might need semantic similarity clustering to group equivalent answers that are phrased differently.
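A minimal sketch of that voting loop, assuming a `sample_fn` placeholder that wraps your model call (client setup, prompt template, and final-answer extraction are all left to the caller):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def self_consistency(sample_fn, prompt, n=5):
    """Sample n completions in parallel and return the majority answer.

    sample_fn: placeholder for a model call (prompt -> extracted answer),
    typically invoked with temperature in the 0.5-0.8 range for diversity.
    Returns the winning answer plus the agreement ratio.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda _: sample_fn(prompt), range(n)))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n
```

Returning the agreement ratio alongside the answer is a cheap confidence proxy: a 3/5 split can trigger a fallback (more samples, or human review) that a 5/5 consensus would not.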
A more sophisticated variant, sometimes called Universal Self-Consistency, uses an LLM to identify the consensus answer across samples rather than relying on exact string matching. This handles paraphrases and different formatting but adds an extra LLM call.
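A sketch of the consensus-selection prompt such a variant might build; the wording here is illustrative, not the exact prompt from the paper:

```python
def usc_prompt(samples):
    """Build a consensus-selection prompt (hypothetical wording) for a
    Universal Self-Consistency step: one extra LLM call picks the answer
    most samples agree with, instead of exact string matching."""
    numbered = "\n".join(f"Response {i + 1}: {s}" for i, s in enumerate(samples))
    return (
        "Below are several candidate responses to the same question.\n"
        f"{numbered}\n"
        "Select the response that is most consistent with the majority "
        "of the candidates. Reply with only the response number."
    )
```

The model's reply (a response number) then indexes back into the sample list, so paraphrased or differently formatted answers can still form a majority.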
The cost scales linearly with N. Five samples cost five times as much as one. But since all samples run in parallel, latency stays roughly constant (limited by the slowest sample). For many production systems, the cost increase is acceptable because the reliability improvement is substantial.
## When to use Reflection
Reflection works best when quality comes from depth of analysis, not from consensus on a single answer.
- Long-form writing. A first draft of an essay, report, or documentation is rarely as good as a second draft where the model identifies weak arguments, missing sections, or unclear explanations and then rewrites.
- Complex code generation. For non-trivial code where edge cases matter, a reflection step that reviews the code for bugs, missing error handling, and performance issues produces better results than any single pass.
- Strategic analysis. When you ask a model to analyze a business situation, the first response often misses important angles. A critique step that asks "What did this analysis miss?" followed by a revised analysis consistently produces more thorough output.
- Prompt refinement. Using the model to critique and improve its own prompts through iterative reflection. Each iteration tests the prompt against edge cases and refines it.
- Multi-constraint satisfaction. When the output must satisfy many constraints simultaneously (tone, length, factual accuracy, audience appropriateness), a single pass often nails some constraints and misses others. Reflection lets the model fix the missed constraints in subsequent passes.
### The critique quality matters enormously
The effectiveness of Reflection depends entirely on the quality of the critique step. A vague critique like "this could be better" does not help. An effective critique identifies specific issues: "The third paragraph claims revenue grew 15% but the data table shows 12%. The conclusion does not address the risk factors mentioned in the introduction."
Good implementations structure the critique with specific evaluation criteria. Instead of asking "Is this good?" you ask "Check for: factual consistency with the provided data, coverage of all required topics, appropriate tone for the target audience, and logical flow between sections."
Some implementations use a different model or a different temperature for the critique step than for the generation step. The intuition is that a model critiquing its own output with the same configuration might have the same blind spots. Varying the approach introduces useful diversity.
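Putting these pieces together, a generate-critique-revise loop might look like the following sketch. `generate` and `critique` are assumed wrappers around your model calls (the critique wrapper would embed the structured criteria discussed above, and may use a different model or temperature than generation); the `done_marker` convention is one simple way to implement early stopping.

```python
def reflect(generate, critique, task, max_iters=3, done_marker="NO ISSUES"):
    """Generate-critique-revise loop (sketch, not a fixed API).

    generate: (task, feedback) -> draft; feedback is None on the first pass.
    critique: (task, draft) -> specific, criteria-based feedback, or a
              string containing done_marker when nothing significant remains.
    Stops early when the critique signals done, and always caps iterations
    to avoid cycling or degrading the output.
    """
    draft = generate(task, feedback=None)
    for _ in range(max_iters - 1):
        feedback = critique(task, draft)
        if done_marker in feedback:
            break  # critique found no significant issues; stop revising
        draft = generate(task, feedback=feedback)
    return draft
```

Both the iteration cap and the done-marker check matter: the cap bounds cost and latency, while the early exit avoids paying for revisions the critique no longer justifies.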
## Can they work together?
Yes, and the combination can be powerful.
Self-Consistency within Reflection iterations. At each reflection step, generate multiple candidates and pick the best before moving to the critique phase. This gives you both breadth (multiple samples) and depth (iterative improvement).
Reflection to improve Self-Consistency voting. Instead of simple majority vote, use a reflection step to analyze the N samples, identify their strengths and weaknesses, and synthesize the best answer from the available options. This is sometimes called "meta-reasoning" over the sample set.
Parallel Reflection chains with voting. Run multiple independent Reflection chains in parallel, each producing a refined output through its own critique cycles. Then vote across the final outputs. This is expensive but produces very high quality for critical applications.
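The third combination can be sketched in a few lines, assuming a `run_chain` callable (a placeholder, not a fixed API) that wraps one complete Reflection cycle and returns a final answer string:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def reflect_then_vote(run_chain, task, n_chains=3):
    """Parallel Reflection chains with voting (sketch).

    run_chain: placeholder wrapping a full generate-critique-revise cycle,
    task -> final answer. Chains run in parallel, so wall-clock time is
    roughly one chain's latency; a majority vote picks across the finals.
    """
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        finals = list(pool.map(lambda _: run_chain(task), range(n_chains)))
    return Counter(finals).most_common(1)[0][0]
```

Note the cost: n_chains times the full Reflection cost, which is why this combination is reserved for high-stakes, low-volume work.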
The practical question is whether the quality improvement justifies the cost and latency. For most applications, one pattern or the other is sufficient. The combination makes sense for high-stakes, low-volume tasks like generating legal analyses, medical recommendations, or complex financial models.
## Common mistakes
Using Self-Consistency for open-ended generation. If you sample five creative stories and try to pick the "best" one by vote, you are not using Self-Consistency correctly. There is no consensus to find. Use Reflection instead, or use best-of-N with a scoring model.
Using Reflection for simple factual tasks. If the question has a clear right answer, iterative reflection is slower and more expensive than sampling multiple times. A model that got a math problem wrong is unlikely to find the error in its own work through self-critique. It is more likely to convince itself the wrong answer is right.
Too many Reflection iterations. Quality improvements from Reflection follow diminishing returns. The jump from one iteration to two is usually significant. From two to three is modest. Beyond three, you are often just cycling between similar outputs or even degrading quality as the model overthinks. Set a maximum iteration count and stop when the critique finds no significant issues.
Too few Self-Consistency samples. Three samples is often the minimum for meaningful consensus. With only two, a disagreement gives you no signal. Five to seven samples is the sweet spot for most tasks. Beyond that, the marginal accuracy gain per sample drops sharply.
Not distinguishing between stochastic and systematic errors. If your model consistently gets a certain type of question wrong (systematic error), adding more Self-Consistency samples will not help because all samples will make the same mistake. You need Reflection, prompt engineering, or fine-tuning. Conversely, if your model usually gets things right but occasionally stumbles (stochastic error), Reflection is overkill. Sample a few times and vote.
Ignoring the cost of parallel calls. Self-Consistency seems "free" on latency, but five parallel calls to a large model cost five times as much. For high-volume applications, this can be significant. Monitor your per-request cost and ensure the reliability gain justifies the spend.
## References
- Self-Consistency pattern on genaipatterns.dev
- Reflection pattern on genaipatterns.dev
- Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (2022)
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023)
- Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback" (2023)
- Chen et al., "Universal Self-Consistency for Large Language Model Generation" (2023)