Prompt Chaining

Why chaining matters

The instinct of most people writing a hard prompt is to keep adding to it: extract the data, then validate it, then summarize it, then format it as JSON, then tailor the tone — all in one giant instruction block. This mega-prompt works in demos and degrades in production. When a model has to hold many sub-goals in working memory at once, it skips steps, blends instructions, and produces output that is wrong in ways you cannot localize. You get a bad answer and no idea which of the eight requirements broke it.

Prompt chaining is the alternative: decompose the workflow into discrete stages, give each its own prompt, and feed the output of one stage as the input to the next. Instead of asking one prompt to do five things, you run five prompts that each do one thing well. The Prompt Report (Schulhoff et al.) and the broader empirical prompt-engineering literature generally find that decomposition-style techniques improve accuracy on multi-step tasks. Just as importantly, they make failures observable.

The core trade-off

A mega-prompt optimizes for the fewest API calls and the least orchestration code. A chain optimizes for reliability and debuggability, the two properties chaining buys you:

Reliability. Each stage faces a narrower task with a smaller instruction surface, so it follows instructions more faithfully. You can also pick a different model, temperature, or output format per stage.
Debuggability. When the final output is wrong, you can inspect every intermediate artifact and pinpoint the failing stage. With a mega-prompt the intermediate reasoning is invisible, so you are reduced to re-rolling and hoping.

How to build a chain

Map the workflow into stages. Write the task as a sequence of verbs: extract → classify → draft → critique → revise. Each verb that has its own success criterion is a candidate stage.
Define the contract between stages. Decide the exact shape of data passing between them — usually structured (JSON, a delimited list, a typed field). A clean contract is what lets you test a stage in isolation.
Write a focused prompt per stage. Each prompt assumes its input is already in the agreed shape and produces output in the next agreed shape. Apply the other techniques (few-shot, role context, explicit output format) within a stage.
Add validation between stages. Cheap deterministic checks — does the JSON parse? is the required field present? — catch a broken stage before it poisons everything downstream.
Orchestrate in code, not in the prompt. The loop, the branching, the retries live in your application. The model handles judgment; your code handles control flow.
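
The five steps above can be sketched as a small orchestrator. This is a minimal sketch, not a definitive implementation: `Stage`, `StageError`, `validate`, and `run_chain` are hypothetical names, and `call_model` stands in for whatever client you use (assumed here to be a function from a prompt string to the model's text). The contract between stages is assumed to be a JSON object with a known set of fields.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: any function that takes a prompt string and returns model text.
ModelFn = Callable[[str], str]


class StageError(Exception):
    """Raised when a stage's output breaks its contract."""


@dataclass
class Stage:
    name: str
    prompt_template: str            # contains the placeholder "{input}"
    required_fields: Sequence[str]  # output contract for this stage


def validate(raw: str, required_fields: Sequence[str]) -> dict:
    """Cheap deterministic check: does it parse, and are the fields present?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise StageError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise StageError("output is not a JSON object")
    missing = [f for f in required_fields if f not in data]
    if missing:
        raise StageError(f"missing fields: {missing}")
    return data


def run_chain(stages: Sequence[Stage], payload: dict,
              call_model: ModelFn, max_attempts: int = 2) -> dict:
    """Run stages in order; loop, retries, and failure handling live in code."""
    for stage in stages:
        # str.replace avoids str.format choking on literal braces in a template.
        prompt = stage.prompt_template.replace("{input}", json.dumps(payload))
        last_error = None
        for _ in range(max_attempts):
            try:
                payload = validate(call_model(prompt), stage.required_fields)
                break
            except StageError as exc:
                last_error = exc
        else:
            # Every attempt failed: stop before bad data reaches later stages.
            raise StageError(f"stage {stage.name!r} failed: {last_error}")
    return payload
```

Because the model is passed in as a plain function, each stage can be tested in isolation with a stub that returns canned text, and a different model or configuration per stage is a matter of giving `Stage` its own callable.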

A worked example: a support-ticket triage pipeline

Suppose you want to turn an inbound support email into a routed, drafted reply. As a mega-prompt this is "read this email, decide the category, judge urgency, look up the relevant policy, and write a reply in our brand voice." As a chain:

Stage 1 — Extract. Input: raw email. Output: {"product": ..., "issue_summary": ..., "customer_sentiment": ...}. One job: pull structured facts.
Stage 2 — Classify & route. Input: the extracted JSON. Output: {"category": "billing|bug|how-to", "priority": "P1|P2|P3"}. This stage never sees the prose, only clean fields, so its decision is consistent.
Stage 3 — Retrieve. Your code (not the model) uses category to fetch the right policy snippet from a knowledge base.
Stage 4 — Draft. Input: summary + category + retrieved policy. Output: a reply in brand voice.
Stage 5 — Critique & revise. A separate prompt checks the draft against a checklist (no promises about refunds, correct customer name) and rewrites if needed.

If replies start citing the wrong policy, you know the bug is in Stage 2 or 3 — you can log and replay just those. In a mega-prompt the same failure is a black box.
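
The pipeline above might look roughly like this in code. It is a sketch under stated assumptions: `call_model` is a hypothetical function from prompt text to model text, `POLICIES` is a placeholder dictionary standing in for a real knowledge base, and the prompt wording is illustrative only. The function returns every intermediate artifact so a single stage can be logged and replayed.

```python
import json
from typing import Callable

# Assumption: any function that takes a prompt string and returns model text.
ModelFn = Callable[[str], str]

# Placeholder knowledge base; a real one would be a database or search index.
POLICIES = {
    "billing": "PLACEHOLDER billing policy text",
    "bug": "PLACEHOLDER bug policy text",
    "how-to": "PLACEHOLDER how-to policy text",
}


def ask_json(call_model: ModelFn, prompt: str, required: list) -> dict:
    """Call the model and enforce the stage contract on its output."""
    data = json.loads(call_model(prompt))  # raises ValueError on non-JSON
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def triage(email: str, call_model: ModelFn) -> dict:
    trace = {}  # every intermediate artifact, keyed by stage name

    # Stage 1: extract structured facts from the raw email.
    trace["extract"] = ask_json(
        call_model,
        "Extract product, issue_summary and customer_sentiment as JSON.\n\n"
        "Email:\n" + email,
        ["product", "issue_summary", "customer_sentiment"],
    )

    # Stage 2: classify from clean fields only; the prose is not passed along.
    trace["classify"] = ask_json(
        call_model,
        "Classify this ticket. Return JSON with category "
        "(billing|bug|how-to) and priority (P1|P2|P3).\n\n"
        "Fields:\n" + json.dumps(trace["extract"]),
        ["category", "priority"],
    )

    # Stage 3: retrieval is plain code, not a model call.
    category = trace["classify"]["category"]
    if category not in POLICIES:
        raise ValueError(f"unknown category: {category!r}")
    trace["retrieve"] = {"policy": POLICIES[category]}

    # Stage 4: draft from summary + category + retrieved policy.
    trace["draft"] = call_model(
        "Draft a reply in our brand voice.\n\n"
        f"Summary: {trace['extract']['issue_summary']}\n"
        f"Category: {category}\n"
        f"Policy: {trace['retrieve']['policy']}"
    )

    # Stage 5: critique against the checklist and rewrite if needed.
    trace["final"] = call_model(
        "Check the draft: no promises about refunds, correct customer name. "
        "Return it unchanged if it passes, otherwise a corrected version.\n\n"
        "Draft:\n" + trace["draft"]
    )
    return trace
```

If a reply cites the wrong policy, comparing `trace["classify"]` with `trace["retrieve"]` tells you whether the category or the lookup was at fault.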

Chaining vs. a single mega-prompt

Dimension	Mega-prompt	Chain
Instruction-following on multi-step tasks	Degrades as steps pile up	Holds up; each stage is narrow
Failure localization	Opaque — whole output is suspect	Intermediate artifacts are inspectable
Cost & latency	Lower (one call)	Higher (multiple calls, more tokens)
Per-step tuning	One model, one config	Different model/format per stage

This is a genuine trade-off, not a free win. Chains cost more tokens and add latency and orchestration code. For a simple task a mega-prompt is the right call. Reach for chaining when the task has multiple distinct decisions, when correctness matters enough to justify the overhead, or when you are already firefighting an unreliable single prompt.

Pitfalls

Error propagation. A mistake in an early stage cascades. Garbage extracted in Stage 1 produces a confident, well-formatted, wrong reply in Stage 5. Validate early stages hardest.
Loose contracts. If a stage emits prose where the next expects JSON, the chain silently breaks. Enforce structured output and parse-check between stages.
Over-decomposition. Splitting a two-step task into seven stages adds cost and failure points for no accuracy gain. Decompose only where there is a real, separable decision.
Context loss between stages. Each stage only sees what you pass it. If Stage 4 needs the original tone of the email, you must carry it forward — it is not implicitly remembered.
Debugging without logs. The whole benefit of chaining is observability, which you forfeit if you do not persist intermediate outputs. Log every stage's input and output.
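
One way to keep the observability that the last pitfall warns about is to wrap each stage so its input, output, and any error are recorded. The names `logged_stage` and `dump_jsonl` are hypothetical, and the JSON Lines file is just one reasonable assumption about where logs go; any store that lets you replay a stage's input would do.

```python
import json
import time
from typing import Any, Callable


def logged_stage(name: str, fn: Callable[[Any], Any], log: list) -> Callable[[Any], Any]:
    """Wrap a stage so each call appends one record to `log`."""
    def wrapper(payload: Any) -> Any:
        record = {"stage": name, "input": payload, "started": time.time()}
        try:
            record["output"] = fn(payload)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # the failure still propagates; it is only recorded here
        finally:
            log.append(record)  # runs on success and on failure
    return wrapper


def dump_jsonl(log: list, path: str) -> None:
    """Persist the records, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in log:
            # default=str keeps the dump from failing on non-JSON values.
            fh.write(json.dumps(record, default=str) + "\n")
```

Because each record holds the exact input a stage received, replaying a failing stage is a matter of calling the unwrapped function on `record["input"]`.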

Rule of thumb: if you cannot say which stage of your prompt is responsible for a given failure, you do not have a chain — you have a mega-prompt with line breaks.

Resume screening: mega-prompt vs. chain

Why it's better: The mega-prompt forces one call to extract, judge, decide, and write. When the email's tone is off or the score seems arbitrary, you cannot tell whether extraction missed a skill or the scoring logic misfired. The chain makes the decision rule explicit and code-checkable in Stage 3, so it is applied consistently across candidates rather than on vibes, and it isolates writing tone to Stage 4. If a strong candidate gets rejected, you inspect Stage 1's extracted skills first, because the failure is localizable.
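
To make "explicit and code-checkable" concrete, here is one hypothetical shape the decision stage could take. The rule itself (count how many required skills appear in the extracted list and compare against a threshold) is an assumption for illustration, as are the names `Decision` and `decide`; the point is only that the rule lives in code and returns its own evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    advance: bool
    matched: list  # required skills found in the extracted list
    missing: list  # required skills not found


def decide(extracted_skills: list, required_skills: list, min_matches: int) -> Decision:
    """Deterministic rule: the same inputs always give the same decision."""
    have = {s.strip().lower() for s in extracted_skills}
    matched = sorted(s for s in required_skills if s.lower() in have)
    missing = sorted(s for s in required_skills if s.lower() not in have)
    return Decision(len(matched) >= min_matches, matched, missing)


# Example with made-up data: two of three required skills are present.
result = decide(["Python ", "SQL"], ["python", "sql", "go"], min_matches=2)
print(result.advance, result.matched, result.missing)
```

When a strong candidate is rejected, `missing` shows which skills the rule did not find, which points back at the extraction stage if the resume clearly lists them.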

Document Q&A: carrying context through the chain

Why it's better: The single prompt asks the model to search 40 pages and reason about the answer in one pass, where retrieval errors and reasoning errors are indistinguishable. The chain separates retrieval from answering: you can verify that Stage 1 actually surfaced the right sections before trusting the answer, and Stage 2 is constrained to only the retrieved text, which tends to reduce fabricated citations. Note the deliberate context hand-off: Stage 2 receives the clauses explicitly because it never saw the full document.
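
A rough sketch of that two-stage hand-off follows. `call_model` is a hypothetical function from prompt text to model text, and the prompt wording is illustrative. The verbatim check in Stage 1 (dropping any quote that does not appear in the document) is an added assumption, shown as one cheap way to verify retrieval before answering.

```python
import json
from typing import Callable

# Assumption: any function that takes a prompt string and returns model text.
ModelFn = Callable[[str], str]


def retrieve_clauses(document: str, question: str, call_model: ModelFn) -> list:
    """Stage 1: ask for verbatim quotes, then verify them in code."""
    prompt = (
        "Quote verbatim the passages relevant to the question. "
        "Return a JSON list of strings.\n\n"
        f"Question: {question}\n\nDocument:\n{document}"
    )
    quotes = json.loads(call_model(prompt))
    if not isinstance(quotes, list):
        raise ValueError("expected a JSON list of quotes")
    # Keep only quotes that really occur in the document.
    return [q for q in quotes if isinstance(q, str) and q in document]


def answer(question: str, clauses: list, call_model: ModelFn) -> str:
    """Stage 2: sees only the retrieved clauses, never the full document."""
    if not clauses:
        return "No relevant passage found."
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(clauses, 1))
    prompt = (
        "Answer using only the numbered passages below and cite them "
        "by number.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```

Anything Stage 2 needs, such as the question or a section title, has to be passed in explicitly, since the function signature is the entire context it receives.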