Technique #1 of 15

Few-Shot Prompting

Show the model two to five concrete examples of the input-output pattern you want instead of describing it in prose.

Why this is the technique to learn first

If you only internalize one prompting technique, make it this one. In the Prompt Report taxonomy (Schulhoff et al.), few-shot prompting is technique #1 not by accident of ordering but because it is the most reliable lever you have. The reason is structural: large language models are pattern-completion engines. A description of a task ("classify the sentiment, but be lenient on sarcasm, and output JSON") forces the model to infer a pattern from abstract instructions. A few worked examples are the pattern. You stop telling and start showing.

The distinction matters most for anything where the desired output has a shape the model can't guess from instructions alone: a specific JSON schema, a house tone of voice, an unusual classification boundary, a domain-specific labeling convention. In these cases two or three examples routinely outperform a paragraph of careful description, and they do it with less prompt-engineering effort.

Zero-shot vs. one-shot vs. few-shot

The terminology is just a count of examples (shots) you put in the prompt:

  • Zero-shot: instructions only, no examples. "Classify this review as positive or negative."
  • One-shot: exactly one example. Often enough to pin down output format alone.
  • Few-shot: typically two to five examples. This is where the real gains live for nuanced tasks.

Modern instruction-tuned models are strong at zero-shot, so don't reach for examples reflexively — but the moment output format, edge-case handling, or tone is non-obvious, examples earn their tokens.

How to do it well

1. Make the format identical across every example

This is the part people get wrong most often. The model is learning from the surface form, so your examples must be rigorously consistent: same delimiters, same field order, same casing, same spacing. If one example writes "Sentiment: Positive" and the next writes "sentiment - positive", you've taught the model that formatting is negotiable, and it will improvise. Pick one template and stamp it out.

Review: The plot dragged but the cast was excellent.
Sentiment: mixed

Review: Refund requested. Worst purchase of the year.
Sentiment: negative

Review: Arrived early and works perfectly.
Sentiment: positive

Review: It's fine. Does the job, nothing special.
Sentiment:

Note the trailing "Sentiment:" with no value: you're handing the model the exact slot to complete. That last detail alone often fixes "the model added a preamble" problems.
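One way to guarantee that consistency is to generate the prompt from a single template instead of typing each example by hand. The sketch below is a minimal illustration; the function name build_prompt and the tuple layout are assumptions made for this example, not part of any library.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Render every example through one template and end with an open slot."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    # The live input uses the same template, but its label is left blank
    # so the model's natural continuation is the label itself.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [
    ("The plot dragged but the cast was excellent.", "mixed"),
    ("Refund requested. Worst purchase of the year.", "negative"),
    ("Arrived early and works perfectly.", "positive"),
]

prompt = build_prompt(examples, "It's fine. Does the job, nothing special.")
print(prompt)
```

Because every block passes through the same f-string, delimiters, casing, and spacing cannot drift between examples.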

2. Choose examples that cover the decision boundary

Don't pick three easy, obviously-positive cases. Pick examples that demonstrate the hard distinctions: the sarcastic review, the mixed one, the edge case you actually care about. In the block above, "It's fine... nothing special" is deliberately ambiguous, and the preceding "mixed" example teaches the model that a middle category exists. Your examples are where you encode judgment the instructions can't.

3. Balance your label distribution

For classification, the empirical prompt-engineering literature documents a real majority-label bias: if four of your five examples are labeled "positive," the model becomes more likely to answer "positive" regardless of the input. Keep label counts roughly balanced across the classes you expect, and don't let an accidental skew in your examples become a thumb on the scale.
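A quick count of the example labels catches accidental skew before the prompt is sent. This is a minimal sketch: the helper name label_skew and the 2.0 ratio threshold are arbitrary assumptions chosen for illustration, not values taken from the literature.

```python
from collections import Counter
from typing import Iterable, Tuple


def label_skew(labels: Iterable[str], max_ratio: float = 2.0) -> Tuple[Counter, bool]:
    """Count labels and flag the set when the most common label
    appears more than max_ratio times as often as the least common."""
    counts = Counter(labels)
    most = max(counts.values())
    least = min(counts.values())
    return counts, most / least > max_ratio


counts, skewed = label_skew(["positive", "positive", "positive", "positive", "negative"])
print(counts, skewed)
```

Note that the check only sees labels that appear at least once, so a class missing from the examples entirely has to be caught separately.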

4. Mind the ordering

Ordering of examples has a measurable but somewhat unpredictable effect — the Prompt Report and related work note that example order can shift outputs, with a tendency for the model to weight examples near the end of the prompt more heavily (recency). The honest summary: order matters, the optimal order is task-specific, and the practical move is to avoid grouping all of one label together (e.g., don't list every positive example, then every negative). Interleave them. If a task is high-stakes, test a couple of orderings rather than assuming.
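Interleaving can be done mechanically with a round-robin over the labels. The sketch below assumes examples are (text, label) tuples; interleave_by_label is a hypothetical helper name. It is one reasonable ordering, not a known optimum, so it is still worth testing alternatives on high-stakes tasks.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, label)


def interleave_by_label(examples: List[Example]) -> List[Example]:
    """Reorder examples round-robin across labels so no label is grouped together."""
    groups = defaultdict(list)
    for text, label in examples:
        groups[label].append((text, label))
    ordered: List[Example] = []
    # zip_longest takes one example per label per round and pads
    # exhausted labels with None, which is skipped.
    for round_ in zip_longest(*groups.values()):
        ordered.extend(item for item in round_ if item is not None)
    return ordered


grouped = [
    ("Loved it.", "positive"),
    ("Works perfectly.", "positive"),
    ("Broke in a week.", "negative"),
    ("Never arrived.", "negative"),
]
for text, label in interleave_by_label(grouped):
    print(label, "|", text)
```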

Pitfalls

  • Format drift. The number-one silent failure. Inconsistent examples produce inconsistent output. Treat your template like code.
  • Label skew. An unbalanced set of example labels biases predictions toward the majority class.
  • Too few or too many. One example may only fix format; piling on twenty examples burns tokens, can dilute the signal, and rarely beats a well-chosen five. Start at two to five.
  • Examples that leak the answer. If your examples are too similar to the live input, you're testing memorization, not generalization. Cover the space, don't hand over the test case.
  • Stale examples. Few-shot examples are part of your prompt's maintenance surface. When requirements change, the examples have to change with them — an out-of-date example actively teaches the wrong thing.
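Two of the pitfalls above, format drift and examples that leak the answer, can be checked mechanically. The following is a rough sketch using only the standard library. The template regex matches the review/sentiment layout used earlier, and the 0.8 similarity threshold is an arbitrary assumption; character-level similarity is a crude proxy and will miss paraphrases.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumed template: the "Review: ... / Sentiment: ..." layout used in this section.
TEMPLATE = re.compile(r"Review: .+\nSentiment: (positive|negative|mixed)")


def find_format_drift(blocks: List[str]) -> List[str]:
    """Return the example blocks that do not match the template exactly."""
    return [block for block in blocks if not TEMPLATE.fullmatch(block)]


def too_similar(example_text: str, live_input: str, threshold: float = 0.8) -> bool:
    """Flag an example whose text is nearly identical to the live input."""
    ratio = SequenceMatcher(None, example_text.lower(), live_input.lower()).ratio()
    return ratio >= threshold


blocks = [
    "Review: Arrived early and works perfectly.\nSentiment: positive",
    "Review: Broke in a week.\nsentiment - negative",  # drifted casing and delimiter
]
print(find_format_drift(blocks))
print(too_similar("It's fine. Does the job.", "It's fine. Does the job."))
```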

One counterintuitive finding worth knowing: research on few-shot classification has shown models can still benefit from examples even when some of the labels in those examples are wrong — suggesting the model leans heavily on the demonstrated format and label space, not only on correct input-to-output mappings. Don't take this as license to use sloppy labels; take it as a reminder that format consistency is doing more work than you might assume.

The practical loop

Start zero-shot. If the output is wrong in shape (format, tone, schema) or in judgment (it mishandles the edge cases you care about), add two to five consistent, balanced, boundary-covering examples and test again. This trial-and-error loop (change one thing, measure, keep what works) is the whole empirical philosophy of prompt engineering in miniature, and few-shot is where it pays off fastest.
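The "measure" step can be as small as an accuracy function run over a handful of hand-labeled inputs. The sketch below shows only the mechanics: stub_model is a hypothetical stand-in that always answers "positive" so the code runs offline, and the two-item test set with its labels is illustrative. In practice you would swap in a real model call and a larger set.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (review text, expected label)


def zero_shot(review: str) -> str:
    return (
        "Classify this review as positive, negative, or mixed.\n\n"
        f"Review: {review}\nSentiment:"
    )


def few_shot(review: str) -> str:
    shots = (
        "Review: The plot dragged but the cast was excellent.\nSentiment: mixed\n\n"
        "Review: Refund requested. Worst purchase of the year.\nSentiment: negative\n\n"
        "Review: Arrived early and works perfectly.\nSentiment: positive\n\n"
    )
    return f"{shots}Review: {review}\nSentiment:"


def accuracy(
    model: Callable[[str], str],
    build_prompt: Callable[[str], str],
    test_set: List[Example],
) -> float:
    """Fraction of test inputs whose normalized answer equals the expected label."""
    hits = sum(
        model(build_prompt(text)).strip().lower() == expected
        for text, expected in test_set
    )
    return hits / len(test_set)


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in, NOT a real model: it ignores the prompt entirely.
    return " positive"


test_set = [
    ("Love it, would buy again.", "positive"),
    ("It's fine. Does the job, nothing special.", "mixed"),
]
for name, builder in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    print(name, accuracy(stub_model, builder, test_set))
```

Keeping the prompt builder as a parameter means each experiment changes exactly one thing while the test set and scoring stay fixed.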

Extracting structured data from support tickets

✕ Weaker

Read this support ticket and pull out the customer's issue, the product they're talking about, and how urgent it seems. Return it as JSON.

Ticket: "Hi, my Model X600 router keeps dropping the connection every few minutes and I have a client call in an hour. Please help ASAP."

✓ Stronger

Extract fields from support tickets. Match this format exactly.

Ticket: "The login page won't load on the dashboard app, been down all morning, blocking my whole team."
{"product": "dashboard", "issue": "login page won't load", "urgency": "high"}

Ticket: "Quick question — does the X600 router support 5GHz? No rush."
{"product": "X600 router", "issue": "5GHz support question", "urgency": "low"}

Ticket: "Billing charged me twice this month, would like one refunded when you get a chance."
{"product": "billing", "issue": "duplicate charge", "urgency": "medium"}

Ticket: "Hi, my Model X600 router keeps dropping the connection every few minutes and I have a client call in an hour. Please help ASAP."

Why it's better: The weaker prompt only names the fields in prose, so the model has to guess the JSON key names, the value casing, and how to grade urgency, and it may guess differently across runs. The stronger prompt demonstrates the exact schema, shows all three urgency levels once each (balanced labels that span the boundary), keeps key order, casing, and layout identical in every example, and ends with the live ticket in the same position as the example tickets, so the natural continuation is the JSON line rather than a preamble.
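For JSON targets, serializing the example records in code keeps key order and spacing identical, and a small validator catches responses that stray from the schema. This is a sketch under assumptions: the field names and urgency levels come from the examples above, while render_example and validate are hypothetical helper names.

```python
import json
from typing import Dict

FIELDS = ("product", "issue", "urgency")
URGENCY_LEVELS = {"low", "medium", "high"}


def render_example(ticket: str, record: Dict[str, str]) -> str:
    """Render one ticket and its JSON line with a fixed key order."""
    ordered = {key: record[key] for key in FIELDS}
    return f'Ticket: "{ticket}"\n{json.dumps(ordered, ensure_ascii=False)}'


def validate(raw: str) -> Dict[str, str]:
    """Parse a model response and check it against the demonstrated schema."""
    data = json.loads(raw)
    if set(data) != set(FIELDS):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if data["urgency"] not in URGENCY_LEVELS:
        raise ValueError(f"unexpected urgency: {data['urgency']!r}")
    return data


print(render_example(
    "Quick question, does the X600 router support 5GHz? No rush.",
    {"urgency": "low", "issue": "5GHz support question", "product": "X600 router"},
))
print(validate('{"product": "X600 router", "issue": "connection drops", "urgency": "high"}'))
```

The input dictionary above is deliberately out of order to show that the rendered line still follows the fixed field order.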

Enforcing a brand voice in generated copy

✕ Weaker

Write a push notification announcing our new dark mode feature. Keep it short, friendly, and on-brand — playful but not cheesy.

✓ Stronger

Write push notifications in our house voice. Match the style of these examples.

Feature: Offline mode
Notification: "No signal? No problem. Your stuff works on the subway now. 🚇"

Feature: Faster sync
Notification: "Sync just got quicker. You won't notice the wait, because there isn't one."

Feature: Group folders
Notification: "Folders, but make it teamwork. Share a folder, share the chaos. 📂"

Feature: Dark mode
Notification:

Why it's better: "Playful but not cheesy" is an instruction the model cannot reliably operationalize — every model has a different idea of where that line sits. The examples define the voice concretely: short, one emoji max, a light twist in the second sentence. The model now imitates a demonstrated style instead of interpreting an adjective, and the trailing label slot tells it exactly what to produce.

Key takeaways

  • Show, don't tell: a few worked examples encode a pattern more reliably than a paragraph describing it — this is the highest-leverage technique you have.
  • Format consistency is non-negotiable. The model learns from surface form, so identical delimiters, field order, and casing across every example are what make few-shot work.
  • Balance your label distribution. Skewed example labels create a majority-label bias that pushes predictions toward whichever class you over-represented.
  • Cover the decision boundary, not the easy cases — pick examples that demonstrate the hard distinctions and edge cases your instructions can't express.
  • Start zero-shot; add two to five examples only when output shape or judgment is wrong. Order matters, often with a bias toward the most recent examples, so interleave labels rather than grouping them.

Further reading