Few-Shot Prompting

Why this is the technique to learn first

If you only internalize one prompting technique, make it this one. In the Prompt Report taxonomy (Schulhoff et al.), few-shot prompting is technique #1 not by accident of ordering but because it is the most reliable lever you have. The reason is structural: large language models are pattern-completion engines. A description of a task ("classify the sentiment, but be lenient on sarcasm, and output JSON") forces the model to infer a pattern from abstract instructions. A few worked examples are the pattern. You stop telling and start showing.

The distinction matters most for anything where the desired output has a shape the model can't guess from instructions alone: a specific JSON schema, a house tone of voice, an unusual classification boundary, a domain-specific labeling convention. In these cases two or three examples routinely outperform a paragraph of careful description, and they do it with less prompt-engineering effort.

Zero-shot vs. one-shot vs. few-shot

The terminology is just a count of examples (shots) you put in the prompt:

Zero-shot: instructions only, no examples. "Classify this review as positive or negative."
One-shot: exactly one example. Often enough to pin down output format alone.
Few-shot: typically two to five examples. This is where the real gains live for nuanced tasks.

Modern instruction-tuned models are strong at zero-shot, so don't reach for examples reflexively. But the moment output format, edge-case handling, or tone is non-obvious, examples earn their tokens.
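To make the counting concrete, here is a minimal sketch in Python. The helper name build_prompt and the Review/Sentiment template are illustrative assumptions, not a standard API; the only thing that changes between the three variants is how many worked examples are passed in.

```python
def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble instruction, then k worked examples, then the open slot."""
    parts = [instruction, ""]
    for review, label in examples:
        parts.append(f"Review: {review}")
        parts.append(f"Sentiment: {label}")
        parts.append("")
    # The live input uses the same template, with the label left blank.
    parts.append(f"Review: {query}")
    parts.append("Sentiment:")
    return "\n".join(parts)


INSTRUCTION = "Classify this review as positive or negative."
SHOTS = [
    ("Arrived early and works perfectly.", "positive"),
    ("Refund requested. Worst purchase of the year.", "negative"),
]
QUERY = "It's fine. Does the job, nothing special."

zero_shot = build_prompt(INSTRUCTION, [], QUERY)        # instructions only
one_shot = build_prompt(INSTRUCTION, SHOTS[:1], QUERY)  # exactly one example
few_shot = build_prompt(INSTRUCTION, SHOTS, QUERY)      # two or more examples
```

Keeping the examples as data rather than hand-typed text also makes it cheap to add or remove shots while testing.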

How to do it well

1. Make the format identical across every example

This is the part people get wrong most often. The model is learning from the surface form, so your examples must be rigorously consistent: same delimiters, same field order, same casing, same spacing. If one example writes Sentiment: Positive and the next writes sentiment - positive, you've taught the model that formatting is negotiable, and it will improvise. Pick one template and stamp it out.

Review: The plot dragged but the cast was excellent.
Sentiment: mixed

Review: Refund requested. Worst purchase of the year.
Sentiment: negative

Review: Arrived early and works perfectly.
Sentiment: positive

Review: It's fine. Does the job, nothing special.
Sentiment:

Note the trailing Sentiment: with no value. You're handing the model the exact slot to complete, and that detail alone often fixes "the model added a preamble" problems.
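One way to treat the template like code is to check it mechanically. The sketch below assumes the Review/Sentiment template shown above and a three-label set; the function name find_format_drift is hypothetical. It flags any example that deviates from the template in casing, delimiter, or label spelling.

```python
import re

# The single template every example must match exactly (assumed label set).
TEMPLATE = re.compile(r"Review: .+\nSentiment: (positive|negative|mixed)")


def find_format_drift(examples_block: str) -> list[str]:
    """Return the example chunks that do not match the template exactly."""
    chunks = [c for c in examples_block.strip().split("\n\n") if c]
    return [c for c in chunks if not TEMPLATE.fullmatch(c)]


consistent = (
    "Review: The plot dragged but the cast was excellent.\n"
    "Sentiment: mixed\n\n"
    "Review: Arrived early and works perfectly.\n"
    "Sentiment: positive"
)
# Same examples plus one that changes casing and delimiter.
drifting = consistent + "\n\nReview: Broke in a week.\nsentiment - negative"

print(find_format_drift(consistent))  # []
print(find_format_drift(drifting))    # lists only the drifting example
```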

2. Choose examples that cover the decision boundary

Don't pick three easy, obviously-positive cases. Pick examples that demonstrate the hard distinctions: the sarcastic review, the mixed one, the edge case you actually care about. In the block above, the final input ("It's fine. Does the job, nothing special.") is deliberately ambiguous, and the earlier "mixed" example teaches the model that a middle category exists. Your examples are where you encode judgment the instructions can't.

3. Balance your label distribution

For classification, the empirical prompt-engineering literature documents a real majority-label bias: if four of your five examples are labeled "positive," the model becomes more likely to answer "positive" regardless of the input. Keep label counts roughly balanced across the classes you expect, and don't let an accidental skew in your examples become a thumb on the scale.
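A balance check is easy to automate. The following is a rough sketch; the function names and the max_gap tolerance of 1 are assumptions chosen for illustration, not values taken from the literature.

```python
from collections import Counter


def label_counts(labels: list[str], expected_classes: list[str]) -> dict[str, int]:
    """Count examples per class, including classes with zero examples."""
    counts = Counter(labels)
    return {cls: counts.get(cls, 0) for cls in expected_classes}


def is_balanced(labels: list[str], expected_classes: list[str], max_gap: int = 1) -> bool:
    """True if the largest and smallest class counts differ by at most max_gap."""
    counts = label_counts(labels, expected_classes)
    return max(counts.values()) - min(counts.values()) <= max_gap


classes = ["positive", "negative", "mixed"]
print(is_balanced(["mixed", "negative", "positive"], classes))             # True
print(is_balanced(["positive"] * 4 + ["negative"], ["positive", "negative"]))  # False
```

Passing the expected classes explicitly matters: a class with no examples at all is the most extreme skew, and a plain count of the labels present would miss it.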

4. Mind the ordering

Example order has a measurable but somewhat unpredictable effect. The Prompt Report and related work note that order can shift outputs, with a tendency for the model to weight examples near the end of the prompt more heavily (recency). The honest summary: order matters, the optimal order is task-specific, and the practical move is to interleave labels rather than group them (don't list every positive example, then every negative). If a task is high-stakes, test a couple of orderings rather than assuming.
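Both moves can be sketched in a few lines of Python. The helper names are hypothetical, and round-robin interleaving is just one reasonable way to avoid label blocks, not a method prescribed by the cited work.

```python
import random
from collections import defaultdict
from itertools import zip_longest

Example = tuple[str, str]  # (input text, label)


def interleave_by_label(examples: list[Example]) -> list[Example]:
    """Round-robin across labels so no label appears as one contiguous block."""
    groups: dict[str, list[Example]] = defaultdict(list)
    for example in examples:
        groups[example[1]].append(example)
    mixed: list[Example] = []
    for row in zip_longest(*groups.values()):
        mixed.extend(item for item in row if item is not None)
    return mixed


def candidate_orderings(examples: list[Example], n: int = 3, seed: int = 0) -> list[list[Example]]:
    """Produce n reproducible shuffles to compare on a dev set."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(n):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        orderings.append(shuffled)
    return orderings


grouped = [("a", "positive"), ("b", "positive"), ("c", "negative"), ("d", "negative")]
print([label for _, label in interleave_by_label(grouped)])
# ['positive', 'negative', 'positive', 'negative']
```

The fixed seed keeps the candidate orderings reproducible, so a difference in measured quality can be attributed to the order rather than to a fresh shuffle.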

Pitfalls

Format drift. The number-one silent failure. Inconsistent examples produce inconsistent output. Treat your template like code.
Label skew. An unbalanced set of example labels biases predictions toward the majority class.
Too few or too many. One example may only fix format; piling on twenty examples burns tokens, can dilute the signal, and rarely beats a well-chosen five. Start at two to five.
Examples that leak the answer. If your examples are too similar to the live input, you're testing memorization, not generalization. Cover the space, don't hand over the test case.
Stale examples. Few-shot examples are part of your prompt's maintenance surface. When requirements change, the examples have to change with them; an out-of-date example actively teaches the wrong thing.
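The "examples that leak the answer" pitfall can be screened for with a crude similarity check. This sketch uses difflib.SequenceMatcher from the Python standard library; the function name too_similar and the 0.8 threshold are assumptions that would need tuning per task.

```python
from difflib import SequenceMatcher


def too_similar(example_text: str, live_input: str, threshold: float = 0.8) -> bool:
    """Flag an example whose text nearly duplicates the live input."""
    ratio = SequenceMatcher(None, example_text.lower(), live_input.lower()).ratio()
    return ratio >= threshold


example = "Arrived early and works perfectly."
print(too_similar(example, "Arrived early and works perfectly!"))             # True
print(too_similar(example, "Refund requested. Worst purchase of the year."))  # False
```

Character-level similarity only catches near-duplicates. It says nothing about examples that are paraphrases of the test input, so treat it as a first filter rather than a guarantee.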

One counterintuitive finding worth knowing: research on few-shot classification has shown that models can still benefit from examples even when some of the labels in those examples are wrong. This suggests the model leans heavily on the demonstrated format and label space, not only on correct input-to-output mappings. Don't take this as license to use sloppy labels; take it as a reminder that format consistency is doing more work than you might assume.

The practical loop

Start zero-shot. If the output is wrong in shape (format, tone, schema) or in judgment (it mishandles the edge cases you care about), add two to five consistent, balanced, boundary-covering examples and test again. This trial-and-error loop (change one thing, measure, keep what works) is the whole empirical philosophy of prompt engineering in miniature, and few-shot is where it pays off fastest.
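The loop can be sketched as a tiny evaluation harness. Everything named here is an assumption for illustration: fake_model is a stand-in written so the example runs offline, not a real model API, and the dev set is invented. In practice you would swap in your own model call and a larger labeled set.

```python
from typing import Callable

FEW_SHOT_EXAMPLES = (
    "Review: The plot dragged but the cast was excellent.\nSentiment: mixed\n\n"
    "Review: Refund requested. Broke on day two.\nSentiment: negative\n\n"
    "Review: Exactly what I needed.\nSentiment: positive\n\n"
)


def zero_shot_prompt(text: str) -> str:
    return f"Classify the sentiment of this review.\n\nReview: {text}\nSentiment:"


def few_shot_prompt(text: str) -> str:
    return f"{FEW_SHOT_EXAMPLES}Review: {text}\nSentiment:"


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (NOT a real API).

    It only answers "mixed" when the prompt has demonstrated that label.
    """
    query = prompt.rsplit("Review: ", 1)[-1].lower()
    if "worst" in query:
        return "negative"
    if "Sentiment: mixed" in prompt and "fine" in query:
        return "mixed"
    return "positive"


def accuracy(complete: Callable[[str], str],
             make_prompt: Callable[[str], str],
             dev_set: list[tuple[str, str]]) -> float:
    """Fraction of dev-set items whose completion equals the expected label."""
    hits = sum(complete(make_prompt(text)).strip().lower() == expected
               for text, expected in dev_set)
    return hits / len(dev_set)


DEV_SET = [
    ("Worst purchase of the year.", "negative"),
    ("It's fine. Does the job, nothing special.", "mixed"),
    ("Arrived early and works perfectly.", "positive"),
]

zero_shot_score = accuracy(fake_model, zero_shot_prompt, DEV_SET)
few_shot_score = accuracy(fake_model, few_shot_prompt, DEV_SET)
print(zero_shot_score, few_shot_score)
```

The scores here only reflect how the stand-in was written. The part worth reusing is the shape: one prompt variant per function, one fixed dev set, one number per variant.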

Extracting structured data from support tickets

Why it's better: The bad prompt describes the schema in prose, so the model guesses the JSON key names, the value casing, and how to score urgency, and it will guess differently across runs. The good prompt demonstrates the exact schema, shows all three urgency levels (balanced labels across the boundary), keeps every example byte-for-byte consistent in format, and ends with a trailing open slot so the model completes rather than preambles.
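As an illustration of the contrast described above, here is a hypothetical pair of prompts. The ticket texts, the key names (product, issue, urgency), and the Ticket/JSON template are all invented for this sketch. Serializing each example with json.dumps is one way to keep key order, quoting, and spacing identical across examples.

```python
import json

# Prose-only version: key names, casing, and urgency scale are left to guesswork.
BAD_PROMPT = (
    "Extract the product, the issue, and how urgent it is from this ticket "
    "and return JSON.\n\nTicket: {ticket}"
)

# Hypothetical examples, one per urgency level.
EXAMPLES = [
    ("Can you add dark mode to the mobile app someday?",
     {"product": "mobile app", "issue": "feature request: dark mode", "urgency": "low"}),
    ("Invoices export with the wrong currency symbol.",
     {"product": "billing", "issue": "wrong currency symbol in export", "urgency": "medium"}),
    ("Checkout is down and no customer can pay.",
     {"product": "checkout", "issue": "payments failing", "urgency": "high"}),
]


def good_prompt(ticket: str) -> str:
    """Few-shot version: identical template per example, open JSON slot last."""
    blocks = [f"Ticket: {text}\nJSON: {json.dumps(record)}" for text, record in EXAMPLES]
    blocks.append(f"Ticket: {ticket}\nJSON:")
    return "\n\n".join(blocks)


print(good_prompt("Password reset email never arrives."))
```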

Enforcing a brand voice in generated copy

Why it's better: "Playful but not cheesy" is an instruction the model cannot reliably operationalize, because every model has a different idea of where that line sits. The examples define the voice concretely: short, one emoji max, a light twist in the second sentence. The model now imitates a demonstrated style instead of interpreting an adjective, and the trailing label slot tells it exactly what to produce.
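A sketch of what such a prompt might look like follows. The product names and copy lines are invented for illustration, the Product/Copy template is an assumption, and count_emoji is a rough heuristic (it counts code points at or above U+1F300) rather than a complete emoji detector.

```python
# Hypothetical voice examples: two short sentences, twist in the second, at most one emoji.
VOICE_EXAMPLES = [
    ("insulated water bottle",
     "Keeps drinks cold for 24 hours. Your excuses for skipping water just melted. 💧"),
    ("noise-cancelling headphones",
     "Silence, on demand. Your open-plan office never saw it coming."),
    ("standing desk",
     "Goes from sitting to standing in seconds. Your chair will cope. 🪑"),
]


def count_emoji(text: str) -> int:
    """Rough heuristic: count code points in the high emoji planes."""
    return sum(1 for ch in text if ord(ch) >= 0x1F300)


def voice_prompt(product: str) -> str:
    """Stamp every example from one template and end on the open Copy slot."""
    blocks = [f"Product: {name}\nCopy: {copy}" for name, copy in VOICE_EXAMPLES]
    blocks.append(f"Product: {product}\nCopy:")
    return "\n\n".join(blocks)


# Check the examples against the voice rules before they go into a prompt.
assert all(count_emoji(copy) <= 1 for _, copy in VOICE_EXAMPLES)
print(voice_prompt("travel mug"))
```

Checking the examples against the stated rules is worth the few lines: an example that breaks the voice teaches the model that the rule is optional.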