Error Handling & Edge Cases

Why this matters

When you write a prompt, you naturally write it for the input you have in front of you. But in production the model will see inputs you never tested: an empty field, a customer message in a language you didn't expect, a question that has nothing to do with the task, a document that's truncated halfway through. A model with no instructions for these cases does not stop — it does its best to complete the pattern, which usually means inventing a plausible answer. A summarizer handed an empty document will write a summary of nothing. A classifier handed an out-of-scope item will force it into one of your existing labels. This is the failure mode that turns a demo that works into a system that quietly corrupts data.

The empirical prompt-engineering literature, including The Prompt Report (Schulhoff et al.), is consistent on one uncomfortable point: models do not reliably abstain on their own. Left to their defaults, they favor confident completion over honest uncertainty. Error handling closes that gap by making the failure paths explicit, enumerated, and machine-readable.

How to do it

Treat your prompt like a function and ask: what are the inputs that violate my assumptions, and what should the output be for each? There are three categories worth handling explicitly.

1. Missing or malformed input

State what counts as valid input and what to emit when it's absent. Don't leave the empty case to inference.

If the review_text field is empty, contains only whitespace, or is not a product review, do not attempt a sentiment score. Return exactly: {"status": "no_input"}.
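The empty and whitespace-only cases can also be caught in application code before the model is called at all, which leaves only the "not a product review" judgment to the prompt. The following is a minimal sketch under stated assumptions: `score_review` and the `call_model` callable are hypothetical names standing in for whatever client you use, and the `malformed_output` status is an illustrative addition that does not come from the prompt above.

```python
import json
from typing import Callable, Dict, Optional

NO_INPUT = {"status": "no_input"}


def score_review(review_text: Optional[str], call_model: Callable[[str], str]) -> Dict:
    """Return the parsed model reply, or a sentinel when there is nothing to score.

    `call_model` is a hypothetical stand-in: it takes the review text and
    returns the model's raw string reply.
    """
    # Empty or whitespace-only input: short-circuit without calling the model.
    if review_text is None or not review_text.strip():
        return dict(NO_INPUT)

    raw = call_model(review_text)
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: unparseable output gets its own detectable status.
        return {"status": "malformed_output"}
    if not isinstance(result, dict):
        return {"status": "malformed_output"}
    return result
```

Keeping the instruction in the prompt as well is still worthwhile, since the guard only covers the cases code can check cheaply.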

2. Ambiguity

Decide in advance whether ambiguity should trigger a clarifying question or a best-effort answer with a flagged caveat. In an interactive chat assistant, asking is often right. In a batch pipeline where no human is in the loop, a clarifying question is a dead end — there's no one to answer it, so you want a flagged best-effort result instead. Specify which world you're in.
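One way to make "which world you're in" explicit is to select the ambiguity instruction from the deployment mode when the prompt is assembled. This is only a sketch: the `Mode` enum, the `build_prompt` helper, and the instruction wording (including the `ambiguous` and `caveat` fields) are illustrative assumptions, not a prescribed format.

```python
from enum import Enum


class Mode(Enum):
    INTERACTIVE = "interactive"  # a human can answer a follow-up question
    BATCH = "batch"              # no human in the loop


# Hypothetical instruction wording; adapt it to your task and output schema.
AMBIGUITY_INSTRUCTIONS = {
    Mode.INTERACTIVE: (
        "If the request is ambiguous, ask one clarifying question "
        "and do not answer until the user replies."
    ),
    Mode.BATCH: (
        "If the request is ambiguous, do not ask questions. Give your "
        'best-effort answer, set "ambiguous": true, and add a one-sentence "caveat".'
    ),
}


def build_prompt(task_instructions: str, mode: Mode) -> str:
    """Append the ambiguity policy that matches the deployment mode."""
    policy = AMBIGUITY_INSTRUCTIONS[mode]
    return f"{task_instructions}\n\nAmbiguity policy:\n{policy}"
```

Because the policy is chosen in code, the same task instructions can serve both the chat assistant and the batch job without either one inheriting the wrong behavior.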

3. Out-of-scope and unknown

This is where the "if you do not know, say so" instruction lives — and where it's weakest if you're vague. "Don't make things up" is nearly useless because the model has no operational definition of making things up. Give it a concrete trigger and a concrete output. The strongest pattern is to tie abstention to the source of truth, not to the model's internal confidence, which is poorly calibrated:

Answer using only the provided context. If the context does not contain the information needed to answer, respond with exactly: I don't have that information in the provided documents. Do not use prior knowledge to fill gaps.
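A literal abstention string only pays off if the application checks for it. A minimal sketch of that check follows; the `is_abstention` helper and its normalization rules (apostrophe style, case, surrounding quotes, trailing period) are assumptions about how a model might vary the string, not observed behavior.

```python
ABSTENTION = "I don't have that information in the provided documents."


def _normalize(text: str) -> str:
    # Tolerate curly apostrophes, surrounding whitespace or quotes,
    # a missing trailing period, and case differences.
    text = text.replace("\u2019", "'").strip().strip('"')
    return text.rstrip(".").strip().lower()


def is_abstention(reply: str) -> bool:
    """True only when the whole reply is the abstention string."""
    return _normalize(reply) == _normalize(ABSTENTION)
```

The comparison is deliberately against the whole reply: an abstention followed by a guess is not an abstention, and should be treated as a policy violation rather than a clean "no answer".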

Make the failure output structured

An error path that returns prose is hard for downstream code to detect. Reserve a distinct, parseable shape for every failure so your application can branch on it. A worked example for an extraction task:

Extract the invoice total and due date from the text below.

Return JSON: {"total": number, "due_date": "YYYY-MM-DD", "status": "ok"}

Error handling:
- If the text is not an invoice: {"status": "not_an_invoice"}
- If a required field is present but unreadable: set that field
  to null and use {"status": "partial"}
- If the document appears truncated: {"status": "truncated"}
- Never guess a value. A guessed total is worse than a null.

That last line matters more than it looks. You are explicitly telling the model the cost asymmetry: a missing value is recoverable, a fabricated one silently poisons the dataset. Models respond to that framing better than to a bare "be accurate."
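On the consuming side, the status field is what the application branches on. The sketch below parses a reply to the invoice prompt above and refuses to trust fields that fail validation. The function name `parse_invoice_reply` and the two extra statuses `invalid_json` and `unknown_status` are assumptions added for illustration; they cover replies that break the contract entirely.

```python
import json
from datetime import date
from typing import Any, Dict

FAILURE_STATUSES = {"not_an_invoice", "truncated"}


def parse_invoice_reply(raw: str) -> Dict[str, Any]:
    """Validate a model reply against the invoice contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "invalid_json"}  # assumption: prose or broken JSON
    if not isinstance(data, dict):
        return {"status": "invalid_json"}

    status = data.get("status")
    if status in FAILURE_STATUSES:
        return {"status": status}
    if status not in {"ok", "partial"}:
        return {"status": "unknown_status"}  # assumption: off-contract status

    total = data.get("total")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        total = None

    due = data.get("due_date")
    try:
        due = date.fromisoformat(due).isoformat() if isinstance(due, str) else None
    except ValueError:
        due = None

    # An "ok" reply with an unusable field is downgraded, never trusted.
    if status == "ok" and (total is None or due is None):
        status = "partial"
    return {"status": status, "total": total, "due_date": due}
```

The downgrade from `ok` to `partial` mirrors the cost asymmetry in the prompt: the code also prefers a null it can detect over a value it cannot verify.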

Pitfalls

Over-refusal. Aggressive abstention instructions can swing too far, making the model decline inputs it could have handled (answering "I don't have enough information" when the information is in fact there). This is a real tension, and the evidence is mixed on where the line sits; it depends on your model and domain. Test for both kinds of error, false answers and false refusals, and don't tune for one while ignoring the other.
Confidence is not calibration. Asking the model "are you sure?" or "rate your confidence 0–100" produces a number, but that number is not a reliable probability. Prefer grounding abstention in verifiable conditions (is the fact in the context? is the field non-empty?) over the model's self-assessment.
Unenumerated categories. A classifier told to pick from {positive, negative, neutral} has nowhere to put garbage input, so it picks one. Always add an explicit escape label like out_of_scope or unclear — otherwise your error cases get laundered into valid-looking labels.
Vague error instructions. "Handle errors gracefully" gives the model nothing to act on. Every failure path needs a trigger condition and a literal output. If you can't write the exact string the model should return, you haven't specified the case.
Untested paths. Error handling is only real once you've run bad inputs through it. Build a small test set of empty, gibberish, off-topic, and truncated cases and run them deliberately. The happy path will take care of itself; these won't.
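The test set from the last pitfall can be as small as a table of bad inputs and the status each one should produce. The following is a sketch, not a full evaluation harness: the inputs are made-up fixtures, the expected statuses reuse the status names from the examples earlier in this section, and `pipeline` is a hypothetical callable that wraps your prompt plus parsing and returns a dict with a `status` key.

```python
from typing import Callable, Dict, List, Tuple

# (case name, input text, expected status) -- illustrative fixtures only.
EDGE_CASES: List[Tuple[str, str, str]] = [
    ("empty", "", "no_input"),
    ("whitespace", "   \n\t", "no_input"),
    ("gibberish", "asdf qwer zxcv 12345", "not_an_invoice"),
    ("off_topic", "What is the capital of France?", "not_an_invoice"),
    ("truncated", "Invoice 1042. Amount due: 1,2", "truncated"),
]


def run_edge_cases(
    pipeline: Callable[[str], Dict],
    cases: List[Tuple[str, str, str]] = EDGE_CASES,
) -> List[str]:
    """Return one message per case whose status differs from the expectation."""
    failures = []
    for name, text, expected in cases:
        got = pipeline(text).get("status")
        if got != expected:
            failures.append(f"{name}: expected {expected!r}, got {got!r}")
    return failures
```

An empty list means every failure path landed in its intended state. To cover over-refusal as well, the same table can hold valid inputs whose expected status is `ok`.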

Done well, error handling is what separates a prompt that works in a notebook from one you can put behind an API. The model will still encounter inputs you didn't imagine — but a good failure policy means it fails into a known, detectable state instead of a confident wrong answer.

RAG question-answering with abstention

Why it's better: The weak version invites the model to answer from its parametric knowledge whenever the context is thin, which is the classic source of hallucination in grounded RAG. The strong version restricts the source of truth to the context, gives a literal abstention string the application can detect, and names the two specific failure behaviors (outside knowledge, guessing) instead of a vague "be accurate."

Support-ticket classifier with an escape hatch

Why it's better: The weak prompt has no home for empty, spam, or ambiguous tickets, so the model picks the nearest of four labels and your metrics silently absorb the errors. The strong prompt adds explicit unclear and out_of_scope escapes, tells the model not to force-fit, and returns structured output so the pipeline can route low-confidence and off-topic items to a human instead of acting on a fabricated label.
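The routing step described above might look like the following in the consuming code. This is a sketch under loud assumptions: the four task labels are invented placeholders (the real ones depend on your ticket taxonomy), the reply is assumed to be JSON with a `label` key, and `route_ticket` and the `invalid_output` marker are hypothetical names.

```python
import json
from typing import Dict

# Hypothetical task labels; substitute your own taxonomy.
TASK_LABELS = {"billing", "technical", "account", "shipping"}
# Escape labels named in the text above.
ESCAPE_LABELS = {"unclear", "out_of_scope"}


def route_ticket(raw_reply: str) -> Dict[str, str]:
    """Send only well-formed, in-taxonomy labels down the automatic path."""
    try:
        label = json.loads(raw_reply).get("label")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object.
        label = None
    if not isinstance(label, str):
        label = None

    if label in TASK_LABELS:
        return {"route": "auto", "label": label}
    if label in ESCAPE_LABELS:
        return {"route": "human_review", "label": label}
    # Anything else, including labels the model invented, goes to a person.
    return {"route": "human_review", "label": "invalid_output"}
```

Treating an unrecognized label as a failure rather than mapping it to the nearest valid one keeps the code from doing the same force-fitting the prompt was written to prevent.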