Technique #11 of 15

Prompt Chaining

Decompose a workflow into a pipeline of focused prompts so each stage is reliable, inspectable, and independently fixable.

Why chaining matters

The instinct of most people writing a hard prompt is to keep adding to it: extract the data, then validate it, then summarize it, then format it as JSON, then tailor the tone, all in one giant instruction block. This mega-prompt works in demos and degrades in production. When a model has to juggle many sub-goals in a single pass, it tends to skip steps, blend instructions, and produce output that is wrong in ways you cannot localize. You get a bad answer and no idea which requirement broke it.

Prompt chaining is the alternative: decompose the workflow into discrete stages, give each its own prompt, and feed the output of one stage as the input to the next. Instead of asking one prompt to do five things, you run five prompts that each do one thing well. The Prompt Report (Schulhoff et al.) catalogs decomposition as a family of prompting techniques, and the broader empirical prompt-engineering literature generally finds that decomposition-style techniques improve accuracy on multi-step tasks. Just as importantly, they make failures observable.

The core trade-off

A mega-prompt optimizes for fewest API calls and least orchestration code. A chain optimizes for reliability and debuggability. The two properties chaining buys you:

  • Reliability. Each stage faces a narrower task with a smaller instruction surface, so it follows instructions more faithfully. You can also pick a different model, temperature, or output format per stage.
  • Debuggability. When the final output is wrong, you can inspect every intermediate artifact and pinpoint the failing stage. With a mega-prompt the intermediate reasoning is invisible, so you are reduced to re-rolling and hoping.

How to build a chain

  1. Map the workflow into stages. Write the task as a sequence of verbs: extract → classify → draft → critique → revise. Each verb that has its own success criterion is a candidate stage.
  2. Define the contract between stages. Decide the exact shape of data passing between them — usually structured (JSON, a delimited list, a typed field). A clean contract is what lets you test a stage in isolation.
  3. Write a focused prompt per stage. Each prompt assumes its input is already in the agreed shape and produces output in the next agreed shape. Apply the other techniques (few-shot, role context, explicit output format) within a stage.
  4. Add validation between stages. Cheap deterministic checks — does the JSON parse? is the required field present? — catch a broken stage before it poisons everything downstream.
  5. Orchestrate in code, not in the prompt. The loop, the branching, the retries live in your application. The model handles judgment; your code handles control flow.
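
The five steps above can be sketched as a small runner. This is a minimal illustration, not a reference implementation: `Stage`, `StageError`, `validate`, `run_chain`, and the `fake_model` stub are hypothetical names, and the `model` argument stands in for whatever client you use to call a model. It assumes every stage emits a JSON object.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
ModelFn = Callable[[str], str]


class StageError(Exception):
    """Raised when a stage's output violates its contract."""


@dataclass(frozen=True)
class Stage:
    name: str
    prompt_template: str            # "{input}" marks where the previous artifact goes
    required_keys: tuple[str, ...]  # output contract: fields the next stage relies on


def validate(stage: Stage, raw: str) -> dict:
    """Cheap deterministic checks: does the JSON parse, are required fields present?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise StageError(f"{stage.name}: output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise StageError(f"{stage.name}: expected a JSON object")
    missing = [key for key in stage.required_keys if key not in data]
    if missing:
        raise StageError(f"{stage.name}: missing fields {missing}")
    return data


def run_chain(stages: list[Stage], first_input: dict, model: ModelFn,
              max_attempts: int = 2) -> list[dict]:
    """Run stages in order and return every intermediate artifact, not just the last."""
    artifacts = [first_input]
    for stage in stages:
        # str.replace rather than str.format, so literal JSON braces in a prompt are safe.
        prompt = stage.prompt_template.replace("{input}", json.dumps(artifacts[-1]))
        error = StageError(f"{stage.name}: no attempts made")
        for _ in range(max_attempts):
            try:
                artifacts.append(validate(stage, model(prompt)))
                break
            except StageError as exc:
                error = exc  # retry: the control flow lives in code, not in the prompt
        else:
            raise error  # every attempt failed validation
    return artifacts


# Usage with a canned stub in place of a real model call.
def fake_model(prompt: str) -> str:
    if prompt.startswith("EXTRACT"):
        return '{"issue_summary": "charged twice"}'
    return '{"category": "billing", "priority": "P2"}'


stages = [
    Stage("extract", "EXTRACT fields as JSON from: {input}", ("issue_summary",)),
    Stage("classify", "CLASSIFY this ticket: {input}", ("category", "priority")),
]
artifacts = run_chain(stages, {"email": "I was charged twice this month."}, fake_model)
```

Returning the full list of artifacts, rather than only the final one, is what keeps each stage inspectable after the fact.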

A worked example: a support-ticket triage pipeline

Suppose you want to turn an inbound support email into a routed, drafted reply. As a mega-prompt this is "read this email, decide the category, judge urgency, look up the relevant policy, and write a reply in our brand voice." As a chain:

  • Stage 1 — Extract. Input: raw email. Output: {"product": ..., "issue_summary": ..., "customer_sentiment": ...}. One job: pull structured facts.
  • Stage 2 — Classify & route. Input: the extracted JSON. Output: {"category": "billing|bug|how-to", "priority": "P1|P2|P3"}. This stage never sees the prose, only clean fields, so its decision is consistent.
  • Stage 3 — Retrieve. Your code (not the model) uses category to fetch the right policy snippet from a knowledge base.
  • Stage 4 — Draft. Input: summary + category + retrieved policy. Output: a reply in brand voice.
  • Stage 5 — Critique & revise. A separate prompt checks the draft against a checklist (no promises about refunds, correct customer name) and rewrites if needed.

If replies start citing the wrong policy, the first suspects are Stage 2 (wrong category) and Stage 3 (wrong lookup), and you can log and replay just those. In a mega-prompt the same failure is a black box.
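
As a rough sketch of the code-side pieces of this pipeline, the snippet below checks the Stage 2 contract against the allowed values and implements Stage 3 as a plain lookup. The `POLICY_SNIPPETS` dictionary and its placeholder text are assumptions standing in for a real knowledge base, and the function names are hypothetical.

```python
import json

ALLOWED_CATEGORIES = {"billing", "bug", "how-to"}
ALLOWED_PRIORITIES = {"P1", "P2", "P3"}

# Hypothetical knowledge base; real policy text would come from your own store.
POLICY_SNIPPETS = {
    "billing": "PLACEHOLDER billing policy text",
    "bug": "PLACEHOLDER bug-handling policy text",
    "how-to": "PLACEHOLDER how-to guidance text",
}


def parse_routing(raw: str) -> dict:
    """Validate Stage 2 output against the category/priority contract."""
    routing = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(routing, dict):
        raise ValueError("routing output must be a JSON object")
    if routing.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {routing.get('category')!r}")
    if routing.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {routing.get('priority')!r}")
    return routing


def retrieve_policy(routing: dict) -> str:
    """Stage 3: ordinary code, no model call."""
    return POLICY_SNIPPETS[routing["category"]]


def build_draft_input(extracted: dict, routing: dict, policy: str) -> dict:
    """Assemble exactly what Stage 4 sees: summary + category + retrieved policy."""
    return {
        "issue_summary": extracted["issue_summary"],
        "category": routing["category"],
        "policy": policy,
    }
```

Because the category is validated before the lookup, a wrong policy in the final reply points to a misclassification or a bad knowledge-base entry rather than to a malformed hand-off.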

Chaining vs. a single mega-prompt

Dimension                                 | Mega-prompt                     | Chain
------------------------------------------|---------------------------------|--------------------------------------
Instruction-following on multi-step tasks | Degrades as steps pile up       | Holds up; each stage is narrow
Failure localization                      | Opaque; whole output is suspect | Intermediate artifacts are inspectable
Cost & latency                            | Lower (one call)                | Higher (multiple calls, more tokens)
Per-step tuning                           | One model, one config           | Different model/format per stage

This is a genuine trade-off, not a free win. Chains cost more tokens and add latency and orchestration code. For a simple task a mega-prompt is the right call. Reach for chaining when the task has multiple distinct decisions, when correctness matters enough to justify the overhead, or when you are already firefighting an unreliable single prompt.

Pitfalls

  • Error propagation. A mistake in an early stage cascades. Garbage extracted in Stage 1 produces a confident, well-formatted, wrong reply in Stage 5. Validate early stages hardest.
  • Loose contracts. If a stage emits prose where the next expects JSON, the chain silently breaks. Enforce structured output and parse-check between stages.
  • Over-decomposition. Splitting a two-step task into seven stages adds cost and failure points for no accuracy gain. Decompose only where there is a real, separable decision.
  • Context loss between stages. Each stage only sees what you pass it. If Stage 4 needs the original tone of the email, you must carry it forward — it is not implicitly remembered.
  • Debugging without logs. The whole benefit of chaining is observability, which you forfeit if you do not persist intermediate outputs. Log every stage's input and output.

Rule of thumb: if you cannot say which stage of your prompt is responsible for a given failure, you do not have a chain; you have a mega-prompt with line breaks.
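
One way to keep intermediate outputs around is to append a record per stage to a JSON Lines file and replay a stage from its logged input. The sketch below assumes stage inputs and outputs are JSON-serializable; `run_logged`, `records_for`, `replay`, and the file name are hypothetical, and `extract` is a stand-in for a model-backed stage.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def run_logged(stage_name: str, stage_fn: Callable[[Any], Any], stage_input: Any,
               log_path: Path, run_id: str) -> Any:
    """Run one stage and append its input, output, and any error to a JSON Lines file."""
    record = {"run_id": run_id, "stage": stage_name, "input": stage_input,
              "output": None, "error": None}
    try:
        record["output"] = stage_fn(stage_input)
        return record["output"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        # Written on success and on failure, so a crashed stage still leaves a trace.
        with log_path.open("a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(record) + "\n")


def records_for(log_path: Path, stage_name: str) -> list[dict]:
    """Load every logged record for one stage."""
    with log_path.open(encoding="utf-8") as log_file:
        records = [json.loads(line) for line in log_file if line.strip()]
    return [record for record in records if record["stage"] == stage_name]


def replay(stage_fn: Callable[[Any], Any], record: dict) -> Any:
    """Re-run a single stage in isolation from its logged input."""
    return stage_fn(record["input"])


def extract(email: str) -> dict:
    # Stand-in for a model-backed extraction stage.
    return {"issue_summary": email.strip().lower()}


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "chain_log.jsonl"
    run_logged("extract", extract, "  Charged TWICE  ", log_path, run_id="run-1")
    logged = records_for(log_path, "extract")
    replayed = replay(extract, logged[0])
```

Replaying from the logged input lets you iterate on one stage's prompt without re-running, or paying for, the stages before it.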

Resume screening: mega-prompt vs. chain

✕ Weaker

You are an expert recruiter. Read the resume below and the job description, score the candidate's fit from 1-10, list their top 3 strengths and top 3 gaps, decide whether to advance them, and write a personalized rejection or interview-invite email in a warm tone. Resume: <<<...>>> Job: <<<...>>>

✓ Stronger

STAGE 1 (extract): From the resume below, output JSON only: {"years_experience": int, "skills": [string], "most_recent_role": string}. Resume: <<<...>>>

STAGE 2 (score): Given this candidate JSON and the job requirements JSON (each requirement tagged must-have or nice-to-have), output {"fit_score": 1-10, "matched_requirements": [string], "missing_requirements": [{"requirement": string, "must_have": bool}]}. Do not write prose. Candidate: {stage1_output} Job: {parsed_job}

STAGE 3 (decide): Given {stage2_output}, apply the rule: advance if fit_score >= 7 AND no missing requirement is tagged 'must-have'. Output {"decision": "advance|reject", "reason": string}.

STAGE 4 (write): Given {decision} and {reason}, write the corresponding email in our warm-but-direct voice.

Why it's better: The mega-prompt forces one call to extract, judge, decide, and write — when the email tone is off or the score seems arbitrary, you cannot tell whether extraction missed a skill or the scoring logic misfired. The chain makes the decision rule explicit and code-checkable in Stage 3 (consistent across candidates instead of vibes), and isolates writing tone to Stage 4. If a strong candidate gets rejected, you inspect Stage 1's extracted skills first — the failure is localizable.
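
Because the Stage 3 rule is fully explicit, it can also be written as ordinary code and used either in place of the model call or as a cross-check on it. The sketch below assumes each missing requirement arrives as an object with a `must_have` flag; that shape, and the names `decide` and `agrees_with_rule`, are illustrative assumptions rather than part of the prompts above.

```python
def decide(fit_score: int, missing_requirements: list[dict]) -> dict:
    """Apply the rule: advance if fit_score >= 7 and no must-have requirement is missing."""
    # Assumed shape for each entry: {"requirement": str, "must_have": bool}
    blocking = [r["requirement"] for r in missing_requirements if r.get("must_have")]
    if fit_score < 7:
        return {"decision": "reject", "reason": f"fit_score {fit_score} is below 7"}
    if blocking:
        return {"decision": "reject", "reason": "missing must-have: " + ", ".join(blocking)}
    return {"decision": "advance", "reason": f"fit_score {fit_score} with no must-have gaps"}


def agrees_with_rule(stage2_output: dict, stage3_output: dict) -> bool:
    """Cross-check a model-produced Stage 3 decision against the coded rule."""
    expected = decide(stage2_output["fit_score"], stage2_output["missing_requirements"])
    return stage3_output.get("decision") == expected["decision"]
```

If the model's decision ever disagrees with the coded rule, that disagreement is itself a logged, stage-level failure you can investigate.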

Document Q&A: carrying context through the chain

✕ Weaker

Here is a 40-page contract. Answer the user's question about termination clauses, and make sure your answer is grounded only in the document and cites the relevant section numbers.

✓ Stronger

STAGE 1 (retrieve): From the contract below, return every clause mentioning termination, notice periods, or cancellation, each as {"section": string, "text": string}. Return [] if none. Contract: <<<...>>>

STAGE 2 (answer): Using ONLY the clauses below, answer the user's question. Quote section numbers from the provided clauses. If the clauses do not cover the question, say 'Not specified in the provided clauses.' Clauses: {stage1_output} Question: {user_question}

Why it's better: The single prompt asks the model to search 40 pages and reason about the answer in one pass, where retrieval errors and reasoning errors are indistinguishable. The chain separates retrieval from answering: you can verify Stage 1 actually surfaced the right sections before trusting the answer, and Stage 2 is constrained to only the retrieved text, which sharply reduces fabricated citations. Note the deliberate context hand-off — Stage 2 receives the clauses explicitly because it never saw the full document.
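
The hand-off between the two stages can also be checked in code. The sketch below flags any section number cited in the Stage 2 answer that Stage 1 never returned, and skips the model call entirely when Stage 1 returns an empty list. It assumes answers cite sections in the form "Section 12.3" and that Stage 1 reports section numbers in the same dotted form; the regular expression, the function names, and the `ask_model` callable are all hypothetical.

```python
import re
from typing import Callable

NOT_SPECIFIED = "Not specified in the provided clauses."

# Assumed citation style: "Section 12" or "Section 12.3" (adjust to your documents).
SECTION_REF = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


def unsupported_citations(answer: str, clauses: list[dict]) -> list[str]:
    """Return section numbers cited in the answer that are absent from Stage 1 output."""
    known = {clause["section"] for clause in clauses}
    return sorted({ref for ref in SECTION_REF.findall(answer) if ref not in known})


def answer_from_clauses(clauses: list[dict], question: str,
                        ask_model: Callable[[list[dict], str], str]) -> str:
    """Stage 2 wrapper: fall back when nothing was retrieved, reject unsupported citations."""
    if not clauses:
        return NOT_SPECIFIED  # Stage 1 returned []: no model call needed
    answer = ask_model(clauses, question)
    bad = unsupported_citations(answer, clauses)
    if bad:
        raise ValueError(f"answer cites sections not in Stage 1 output: {bad}")
    return answer
```

A check like this only catches citations to sections that were never retrieved; it does not verify that the quoted section actually supports the claim.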

Key takeaways

  • A mega-prompt fails opaquely; a chain fails at a stage you can name, log, and re-run in isolation.
  • Decompose only where there is a real separable decision — over-decomposition adds cost and failure points without accuracy.
  • Enforce a structured contract between stages and parse-check it; loose hand-offs are where chains silently break.
  • Early-stage errors cascade into confident, well-formatted, wrong final output — validate the front of the pipeline hardest.
  • Chaining trades tokens, latency, and orchestration code for reliability and debuggability; pay it when correctness justifies the overhead.

Further reading