Chain-of-Thought (CoT) prompting is a technique that improves LLM reasoning by instructing the model to show its work step by step before giving a final answer. By decomposing complex problems into intermediate reasoning steps, the model produces more accurate results on math, logic, and multi-step tasks.
What problem does Chain-of-Thought solve?
Ask a language model to solve a word problem, a logic puzzle, or a multi-step calculation and you will often get a confident, wrong answer. The model skips straight to a conclusion without working through the intermediate pieces. It is not that the model lacks the ability to reason. It is that the default generation behavior favors short, direct responses.
This matters most when the task involves several dependent steps. A math problem that requires unit conversion before multiplication. A scheduling question where constraints interact. A code debugging scenario where you need to trace execution order. In all of these cases, the model has the knowledge to get things right, but it collapses the reasoning into a single leap and lands somewhere incorrect.
The frustrating part is that if you sit down and walk through the problem yourself, each individual step is straightforward. The difficulty is not in any single step. It is in chaining them together without dropping context along the way.
How does Chain-of-Thought work?
Chain-of-Thought prompting is deceptively simple. You ask the model to show its work. That is the entire technique at its most basic level. By generating intermediate reasoning steps before arriving at a final answer, the model allocates more computation to the problem and keeps track of partial results in its own output.
There are three main variants worth knowing about. The first is zero-shot Chain-of-Thought, where you append something like "Let's think step by step" to your prompt. No examples needed. This alone can dramatically improve accuracy on reasoning tasks because it shifts the model out of its default shortcut behavior. The second variant is few-shot Chain-of-Thought. Here you provide a handful of worked examples that demonstrate the step-by-step reasoning format you want. The model picks up on the pattern and applies it to new inputs. This tends to be more reliable than zero-shot because the model has a concrete template to follow. The third variant, sometimes called auto-CoT, automates example selection: you build a pool of verified reasoning traces and select relevant ones dynamically based on the input. This requires more infrastructure but scales well when you are handling diverse problem types.
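As a concrete sketch of the few-shot variant, you prepend one or more solved problems in the format you want the model to imitate. The worked example and the `build_few_shot_prompt` name here are invented for illustration; in practice you would curate traces from your own task domain.

```python
from textwrap import dedent

# A hypothetical worked example demonstrating the reasoning format we want.
FEW_SHOT_EXAMPLES = dedent("""\
    Q: A train travels 60 miles in 1.5 hours. What is its speed in mph?
    A: Speed is distance divided by time. 60 / 1.5 = 40. ANSWER: 40
""")

def build_few_shot_prompt(question: str) -> str:
    """Prepend worked examples so the model imitates the step-by-step format."""
    return f"{FEW_SHOT_EXAMPLES}\nQ: {question}\nA:"
```

The trailing "A:" cues the model to continue in the same reasoning style as the examples above it.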
The key insight is that you are not teaching the model new reasoning skills. You are unlocking capabilities it already has by changing the generation pattern. When the model writes out "First, I need to convert miles to kilometers" before doing the conversion, it is giving itself a working memory that persists across tokens. Each step constrains the next, reducing the chance of a wrong final answer.
When should you use Chain-of-Thought?
Chain-of-Thought works best when the task has a clear sequence of logical steps. Math problems, multi-hop question answering, code analysis, and planning tasks are all good candidates. If you find yourself thinking "the model should be able to do this" but it keeps getting the answer wrong, that is a strong signal to try CoT.
It is also useful when you need to audit the model's reasoning. A bare answer gives you nothing to debug. A step-by-step trace lets you see exactly where the logic went off track, which makes it far easier to fix your prompt or identify edge cases.
You probably do not need CoT for simple factual retrieval, straightforward classification, or tasks where the model already performs well. Adding "think step by step" to a sentiment analysis prompt is unlikely to help and will just burn extra tokens.
Implementation
```python
# Using OpenAI SDK for illustration — swap client for any provider
from openai import OpenAI
import re

client = OpenAI()

def chain_of_thought(question: str) -> str:
    """Zero-shot CoT: ask the model to think step by step."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nThink step by step, then give your final answer after 'ANSWER:'.",
        }],
    )
    full_response = response.choices[0].message.content
    # Extract the final answer after the reasoning
    match = re.search(r"ANSWER:\s*(.+)", full_response, re.DOTALL)
    return match.group(1).strip() if match else full_response

# Usage
result = chain_of_thought("If a store has 3 shelves with 8 books each, and 5 books are removed, how many remain?")
print(result)  # 19
```
What are the common pitfalls?
The most common failure mode is verbose but wrong reasoning. The model generates a plausible-looking chain of steps that contains a subtle error early on, and then the rest of the reasoning faithfully builds on that mistake. The step-by-step format can actually make this harder to catch because the output looks so thorough and confident.
Another issue is faithfulness. The reasoning the model writes out is not necessarily the reasoning it is actually using internally. Sometimes the model arrives at an answer through pattern matching and then constructs a post-hoc justification. The steps might look logical but they are a narrative, not a computation trace.
Over-reasoning is a real problem too. For simple tasks, forcing step-by-step output can lead the model down rabbit holes. It starts considering edge cases that do not apply, second-guessing itself, and eventually producing a worse answer than it would have with a direct response.
Finally, watch out for prompt sensitivity. Small changes in how you phrase the CoT instruction can lead to very different reasoning patterns. "Think step by step" and "Break this down into steps" and "Show your work" can all produce different quality outputs depending on the model.
What are the trade-offs?
The obvious cost is token usage. A step-by-step response is typically three to ten times longer than a direct answer. That means higher latency and higher API costs. For a single query this is negligible, but at scale it adds up fast.
There is also a reliability trade-off. CoT improves average accuracy but introduces more variance in output format. You need to parse the final answer out of a longer response, which means you need either a reliable extraction step or a consistent output format.
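A small helper can make that extraction step less brittle. This is a sketch (the `extract_answer` name is illustrative): it takes the text after the last "ANSWER:" marker and, when the model omits the marker entirely, falls back to the final non-empty line of the response.

```python
import re

def extract_answer(full_response: str) -> str:
    """Pull the text after the last 'ANSWER:' marker, with a fallback."""
    matches = re.findall(r"ANSWER:\s*(.+)", full_response)
    if matches:
        return matches[-1].strip()
    # No marker found: fall back to the final non-empty line of the response
    lines = [line.strip() for line in full_response.splitlines() if line.strip()]
    return lines[-1] if lines else full_response.strip()
```

Taking the last match rather than the first guards against the model mentioning "ANSWER:" mid-reasoning before committing to its final line.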
The debugging advantage is real but comes with a caveat. You are debugging the model's stated reasoning, which may not reflect its actual reasoning process. Treat the chain of thought as a useful signal, not ground truth about the model's internals.
For latency-sensitive applications, the extra generation time may be a dealbreaker. Consider whether you can run CoT offline for prompt development and then distill the results into a more direct prompt for production.
Goes Well With
Self-Consistency is the natural companion to Chain-of-Thought. Generate multiple reasoning paths for the same problem and take the majority answer. This directly addresses the variance problem, since a single chain might go wrong but the consensus across many chains is much more reliable.
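The voting step itself is simple. A minimal sketch, assuming you have already sampled several chains (e.g. by calling the model repeatedly with a nonzero temperature) and extracted each final answer:

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Return the most common final answer across independently sampled chains."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]
```

Ties go to whichever answer was seen first, so an odd number of samples is a sensible default. Normalization matters here: "19" and "19 " should count as the same vote, and your task may need more aggressive canonicalization than this sketch applies.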
ReAct Loop extends Chain-of-Thought by interleaving reasoning steps with tool use. Instead of just thinking through a problem, the model can pause to look something up, run a calculation, or check a fact before continuing its reasoning chain.
Prompt Optimization can automate the search for the best CoT format. Rather than manually tweaking your "think step by step" instruction, let an optimizer find the phrasing and few-shot examples that produce the most accurate chains for your specific task.
References
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
Further Reading
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022) — The original paper demonstrating that prompting models to show intermediate reasoning steps dramatically improves accuracy on arithmetic, commonsense, and symbolic reasoning benchmarks. arXiv:2201.11903