Few-Shot Prompting is a technique that provides the LLM with a small number of input-output examples before the actual query. These demonstrations teach the model the expected format, reasoning style, and task boundaries without any fine-tuning or weight updates.
What problem does Few-Shot Prompting solve?
You write a carefully worded instruction telling the model exactly what you want. The output comes back and it is close but not quite right. The format is off, or the tone is wrong, or it handles edge cases differently than you expected. So you add more instructions. More detail about the output structure. More caveats about what to include and what to skip. The prompt grows longer and the results get marginally better, but you never reach the consistency you need.
The fundamental issue is that natural language instructions are ambiguous. When you say "extract the key entities from this text," you and the model may have very different ideas about what counts as a key entity, how to format the output, and how much detail to include. You can try to specify all of this explicitly, but there are always gaps between what you mean and what the words convey.
This problem gets worse when you need a very specific output format. JSON with particular field names, a classification from a fixed set of labels, text in a particular style. Describing the format in words is both tedious and unreliable.
How does Few-Shot Prompting work?
Instead of describing what you want, show it. Include a handful of input-output pairs in your prompt that demonstrate the exact behavior you expect. The model picks up on the pattern and generalizes it to new inputs. This is called in-context learning, and it is one of the most reliable techniques for getting consistent, correctly formatted output.
Here is what makes few-shot prompting powerful. A single example communicates format, style, level of detail, and edge case handling all at once, without you having to articulate any of those things explicitly. Three to five well-chosen examples can replace paragraphs of instructions. The model learns not just what to output, but how to handle the subtle decisions that are hard to express in rules.
The examples do real work. They anchor the model's behavior in concrete instances rather than abstract descriptions. If your examples all return JSON with snake_case keys, the model will use snake_case keys. If your examples handle ambiguous inputs by returning a default value, the model will do the same. You are programming by demonstration rather than by specification.
Choosing good examples matters more than choosing many examples. Three diverse, representative examples typically outperform ten similar ones. You want your examples to cover the range of inputs the model will see, including at least one edge case or tricky input. If your task involves classification, include at least one example of each category.
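That coverage requirement can be enforced mechanically. The sketch below picks a balanced few-shot set from a labeled pool, taking at least one example per category; the helper name and the pool contents are illustrative, not part of any library:

```python
# Hypothetical helper: build a diverse few-shot set from a labeled pool,
# guaranteeing at least one example per category.
def select_examples(pool, per_label=1):
    """pool: list of (text, label) pairs; returns a balanced subset."""
    by_label = {}
    for text, label in pool:
        by_label.setdefault(label, []).append((text, label))
    selected = []
    for label, items in by_label.items():
        selected.extend(items[:per_label])  # first N of each category
    return selected

pool = [
    ("Best meal I've had in years.", "positive"),
    ("Incredible service all around.", "positive"),
    ("The order was wrong. Never coming back.", "negative"),
    ("The restaurant opens at 11am.", "neutral"),
]
examples = select_examples(pool)  # one positive, one negative, one neutral
```

A real selection step would also check input length and difficulty, but even this minimal balancing avoids the single-category bias discussed below.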
When should you use Few-Shot Prompting?
Few-shot prompting is the right choice when you need consistent output formatting. If the model keeps returning results in slightly different structures, examples will lock down the format faster than instructions will.
It works well for classification tasks where you have a fixed label set. Show the model a few inputs mapped to their correct labels and it will generalize the classification logic. This is often more reliable than explaining the classification criteria in words, especially when the categories involve subjective judgment.
Style matching is another strong use case. If you need the model to write in a particular voice, match a specific tone, or follow a house style, a few examples of the desired style teach the model more effectively than a style guide would.
Information extraction benefits enormously from few-shot examples. Show the model three inputs with the entities highlighted and the output structured, and it will consistently extract the same types of information from new text.
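As a concrete sketch of that setup, the snippet below assembles an extraction prompt as alternating user/assistant messages, with each assistant turn showing the structured output. The field names (`people`, `places`) and the examples are assumptions for illustration, not a fixed schema:

```python
import json

# Illustrative few-shot examples for entity extraction:
# (input text, expected structured output) pairs.
EXTRACTION_EXAMPLES = [
    ("Tim Cook announced the partnership in Cupertino.",
     {"people": ["Tim Cook"], "places": ["Cupertino"]}),
    ("The conference moved to Berlin this year.",
     {"people": [], "places": ["Berlin"]}),
]

def build_extraction_messages(text):
    """Build a chat message list demonstrating the extraction format."""
    messages = [{"role": "system",
                 "content": "Extract people and places from the text as JSON."}]
    for example_input, example_output in EXTRACTION_EXAMPLES:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant",
                         "content": json.dumps(example_output)})
    messages.append({"role": "user", "content": text})
    return messages
```

Because every demonstrated output is serialized JSON with the same keys, the model sees the schema three times before it ever touches the real input.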
Skip few-shot prompting when the task is straightforward enough that zero-shot instructions work reliably, or when the context window is too limited to accommodate examples alongside the actual input.
Implementation
# Using OpenAI SDK for illustration — swap client for any provider
from openai import OpenAI

client = OpenAI()

def few_shot_classify(text: str) -> str:
    """Classify sentiment using few-shot examples."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral."},
            {"role": "user", "content": "The food was absolutely incredible, best meal I've had in years."},
            {"role": "assistant", "content": "positive"},
            {"role": "user", "content": "Waited 45 minutes and the order was wrong. Never coming back."},
            {"role": "assistant", "content": "negative"},
            {"role": "user", "content": "The restaurant is located on Main Street and opens at 11am."},
            {"role": "assistant", "content": "neutral"},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# Usage
print(few_shot_classify("The service was slow but the dessert made up for it."))
What are the common pitfalls?
The biggest risk is biased or unrepresentative examples. If all your examples involve short inputs and the real data has long inputs, the model may struggle with the length difference. If your examples all fall into one category, the model may be biased toward that category. Example selection shapes behavior more than most people realize.
Order effects are real. The model pays more attention to examples that appear later in the prompt. If you put your easiest example last, the model may perform worse on hard inputs. Shuffle or deliberately order your examples if you notice inconsistent behavior.
Overfitting to surface patterns is a subtle failure mode. The model might latch onto incidental features of your examples rather than the underlying logic. If all your positive classification examples happen to contain the word "excellent," the model might use that word as a shortcut rather than understanding the actual classification criteria.
Too many examples can crowd out the actual input, especially with smaller context windows. Each example takes tokens that could be used for the input or for the model's response. There is a diminishing return curve, and you will hit it sooner than you think.
What are the trade-offs?
Token cost is the most obvious trade-off. Each example adds to your prompt length, which means higher latency and higher cost per request. For high-volume applications, this matters. Five examples at 200 tokens each adds a thousand tokens to every single request.
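The arithmetic from that paragraph is worth making explicit; the daily request volume below is a hypothetical figure to show how the overhead compounds:

```python
# Prompt overhead from few-shot examples, per the numbers above.
examples = 5
tokens_per_example = 200
requests_per_day = 100_000  # hypothetical volume for illustration

overhead_per_request = examples * tokens_per_example  # 1,000 extra tokens
daily_overhead = overhead_per_request * requests_per_day  # 100M tokens/day
```

At that scale, trimming two marginal examples saves 40 million tokens a day, which is why the example count deserves the same scrutiny as the model choice.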
Maintenance is an underappreciated cost. When your requirements change, you need to update all your examples. If the output format evolves, every example needs to be revised. This is manageable with three examples but becomes a burden with more.
Few-shot prompting trades flexibility for consistency. A zero-shot prompt can handle a wider range of unexpected inputs because it relies on the model's general capabilities. A few-shot prompt constrains the model to behave like the examples, which is great when you want consistency but can be limiting when inputs diverge significantly from what the examples cover.
There is also the question of example curation effort. Finding or creating good examples takes time. For some tasks, this is trivial. For others, especially tasks requiring domain expertise, creating high-quality examples is a significant investment.
Goes Well With
Chain-of-Thought combines naturally with few-shot prompting. Instead of showing just input-output pairs, show input-reasoning-output triples. The model learns both the reasoning process and the output format from your examples. This is the few-shot CoT variant and it tends to be the most reliable form of Chain-of-Thought prompting.
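A minimal sketch of those triples, extending the sentiment example from the Implementation section; the reasoning wording is illustrative:

```python
# Few-shot CoT: each assistant turn demonstrates visible reasoning
# followed by the final label, so the model imitates both.
cot_messages = [
    {"role": "system",
     "content": "Classify the sentiment. Reason step by step, then give the label."},
    {"role": "user",
     "content": "Waited 45 minutes and the order was wrong."},
    {"role": "assistant",
     "content": "Reasoning: a long wait and a wrong order are both complaints. Label: negative"},
    {"role": "user",
     "content": "The service was slow but the dessert made up for it."},
    {"role": "assistant",
     "content": "Reasoning: a complaint offset by stronger praise, net favorable. Label: positive"},
]
```

Because every demonstration ends with a consistent `Label:` line, the final answer stays trivially parseable even though the response now includes free-form reasoning.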
Prompt Chaining benefits from few-shot examples at each step in the chain. Each focused prompt in the pipeline can have its own set of examples tailored to that specific subtask. This keeps the examples relevant and concise while the overall pipeline handles complexity.
Prompt Optimization can automate example selection. Rather than manually curating your few-shot examples, an optimization framework can search through a pool of candidate examples and find the subset that maximizes performance on your evaluation set. This removes the guesswork from example curation.
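A toy version of that search is sketched below. The `evaluate` function is a stand-in for a real evaluation harness that would score each candidate subset against a held-out set by calling the model; here it simply rewards label coverage so the sketch runs offline:

```python
from itertools import combinations

def evaluate(subset):
    """Stand-in scorer: a real harness would run the model on an eval set.
    Here, higher score = more distinct labels covered."""
    return len({label for _, label in subset})

def best_subset(candidates, k):
    """Exhaustively search all k-sized subsets for the highest score."""
    return max(combinations(candidates, k), key=evaluate)

candidates = [
    ("Best meal ever.", "positive"),
    ("Great atmosphere.", "positive"),
    ("The food was cold.", "negative"),
    ("Opens at 11am.", "neutral"),
]
chosen = best_subset(candidates, 3)  # covers all three labels
```

Exhaustive search is fine for a small pool; with hundreds of candidates, optimization frameworks use greedy or sampled search over the same objective.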
Further Reading
- Brown et al., "Language Models are Few-Shot Learners" (2020) — The GPT-3 paper that demonstrated in-context learning: large language models can learn tasks from a handful of examples provided in the prompt, without any gradient updates. arXiv:2005.14165