Technique #10 of 15

Multi-Step Reasoning Chains

Chain-of-thought makes non-reasoning models think out loud — but forcing it on reasoning models is often a mistake.


Why reasoning chains matter

A language model that answers immediately commits to the first token of its response before it has "thought" about anything. For a multi-step problem — a word problem, a logic puzzle, a multi-condition eligibility check, a code-tracing question — that first token is a guess. Chain-of-thought (CoT) prompting fixes this by instructing the model to produce its intermediate reasoning before the final answer. Because each token is conditioned on the tokens before it, the written-out reasoning becomes scratch space the model can attend to when it finally commits to an answer.

The empirical prompt-engineering literature — including Schulhoff's Prompt Report — consistently finds that on arithmetic, symbolic, and multi-hop reasoning benchmarks, CoT produces large accuracy gains on standard (non-reasoning) models. The effect is most pronounced on larger models and on tasks where the answer genuinely requires several dependent steps. On simple lookup or classification tasks, the gain is small or absent; there is nothing to reason through.

How to elicit reasoning

There are two reliable ways to get a chain of thought.

Zero-shot CoT

Append an instruction that asks the model to reason step by step. The well-known trigger phrase is "Let's think step by step," but any clear instruction works. The mechanism matters more than the exact words: you are reserving output space for reasoning before the answer.

Question: A team ships 3 features per sprint. Sprints are 2 weeks.
They have 14 weeks and lose 1 sprint to holidays. How many features ship?

Think through this step by step, then give the final number on its own line.

The model now writes: 14 weeks ÷ 2 = 7 sprints; minus 1 holiday sprint = 6 sprints; 6 × 3 = 18 features. A model asked for "just the number" frequently fumbles one of those steps and answers 21 or 15.
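The three dependent steps can be checked mechanically. The snippet below is only a sketch of the arithmetic from the question above; the variable names are illustrative, and the "slip" shown is one plausible way to arrive at 21, not an observed model trace.

```python
# Figures taken from the example question above.
features_per_sprint = 3
weeks_per_sprint = 2
total_weeks = 14
holiday_sprints = 1

total_sprints = total_weeks // weeks_per_sprint            # 14 / 2 = 7
working_sprints = total_sprints - holiday_sprints          # 7 - 1 = 6
features_shipped = working_sprints * features_per_sprint   # 6 * 3 = 18

# One plausible slip (an assumption): forgetting the holiday sprint.
forgot_holiday = total_sprints * features_per_sprint       # 7 * 3 = 21

print(features_shipped)  # 18
print(forgot_holiday)    # 21
```

Each intermediate value depends on the one before it, which is why skipping the written-out steps tends to hurt on this kind of question.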

Few-shot CoT

Provide one or more worked examples that show the reasoning, not just the answer. This is more powerful and more controllable than zero-shot because the exemplars teach the model the format and granularity of reasoning you want — the units to track, the checks to perform, the structure of the final answer. Use it when the reasoning style matters (e.g., you always want the model to state assumptions, or to verify against a constraint before answering).
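As a rough sketch of how worked examples might be assembled into a prompt, the helper below joins exemplars (question, reasoning steps, answer) ahead of the new question. The `Exemplar` class, the function name, and the "Question:/Answer:" layout are all assumptions chosen for illustration, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """A hypothetical container for one worked example."""
    question: str
    reasoning: list[str]
    answer: str


def build_few_shot_cot_prompt(task: str, exemplars: list[Exemplar], question: str) -> str:
    """Assemble task instructions, worked examples, then the new question."""
    parts = [task.strip(), ""]
    for ex in exemplars:
        parts.append(f"Question: {ex.question}")
        # The reasoning comes before the answer so the model imitates that order.
        parts.extend(f"- {step}" for step in ex.reasoning)
        parts.append(f"Answer: {ex.answer}")
        parts.append("")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


example = Exemplar(
    question="A team ships 3 features per sprint. Sprints are 2 weeks. "
             "They have 14 weeks and lose 1 sprint to holidays. How many features ship?",
    reasoning=["14 weeks / 2 = 7 sprints", "7 - 1 holiday sprint = 6 sprints", "6 * 3 = 18 features"],
    answer="18",
)
prompt = build_few_shot_cot_prompt(
    "Show each step as a bullet, then give the answer.",
    [example],
    "A team ships 2 features per sprint for 5 sprints. How many features ship?",
)
print(prompt)
```

Keeping the exemplar's reasoning at the granularity you want back is the point: the model tends to mirror the number and style of steps it is shown.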

Separate the reasoning from the answer

Whichever variant you use, make the final answer machine-extractable. Ask for the answer on its own line, in a tagged block, or as JSON after the prose. Otherwise you will parse the reasoning by accident. A robust pattern: "Reason in a <scratchpad> block, then output the answer as JSON outside it."
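One way to consume the scratchpad-plus-JSON pattern is sketched below. It assumes the response contains at most one well-formed `<scratchpad>...</scratchpad>` block and that the first `{` outside it starts the answer object; the function name and the sample response are made up for illustration.

```python
import json
import re

SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>", re.DOTALL)


def extract_answer(response: str) -> dict:
    """Drop the scratchpad block, then parse the first JSON object left over."""
    outside = SCRATCHPAD_RE.sub("", response)
    start = outside.find("{")
    if start == -1:
        raise ValueError("no JSON object found outside the scratchpad")
    # raw_decode parses one JSON value starting at `start` and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(outside, start)
    return obj


# Hand-written sample response, not real model output.
sample = (
    "<scratchpad>14 / 2 = 7 sprints; 7 - 1 = 6; 6 * 3 = 18. "
    'An earlier draft of {"features": 21} was wrong.</scratchpad>\n'
    '{"features": 18}'
)
print(extract_answer(sample))  # {'features': 18}
```

Stripping the scratchpad first matters here: the sample deliberately includes a JSON-looking fragment inside the reasoning, which a naive "find the first brace" parser would return by accident.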

The critical nuance: reasoning models

This is where most teams get it wrong in 2026. Modern reasoning models (the o3-style family, and their counterparts) already perform an internal chain of thought before they answer — that deliberation is the entire point of the model. Forcing explicit "think step by step" instructions on top of that is, at best, redundant. At worst it is counterproductive:

It can degrade quality. Schulhoff and others have observed that hand-written CoT instructions can interfere with the model's own, more sophisticated internal reasoning process — you are overriding a trained behavior with a cruder one.
It wastes tokens and latency. You pay for visible reasoning the model was already doing internally.
It conflates two controls. On reasoning models the lever you actually want is the reasoning effort / budget setting, not a prompt incantation.

Rule of thumb: prompt CoT into non-reasoning models; get out of the way of reasoning models. With a reasoning model, give it a clear problem and a clear output spec, and let it deliberate. Reach for explicit step-by-step instructions only if you observe it skipping steps.
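The rule of thumb can be sketched as a small routing function. Everything here is hypothetical: the payload keys, the `reasoning_effort` field name, and the `is_reasoning_model` flag are placeholders to adapt to whatever your provider actually exposes.

```python
def build_request(problem: str, output_spec: str, is_reasoning_model: bool) -> dict:
    """Return a hypothetical request payload; adapt the keys to your provider."""
    if is_reasoning_model:
        # Reasoning model: clean problem plus output spec, no step-by-step wrapper.
        # Deliberation is tuned through a setting rather than through prompt text.
        return {
            "prompt": f"{problem}\n\n{output_spec}",
            "reasoning_effort": "high",  # hypothetical field name
        }
    # Standard model: reserve output space for reasoning in the prompt itself.
    return {
        "prompt": f"{problem}\n\nThink through this step by step. {output_spec}",
    }


problem = "A SaaS plan costs $40/user/month with a 15% discount if paid yearly. 25 users, paid annually."
spec = 'Return the yearly cost as JSON: {"yearly_cost_usd": <number>}.'

standard = build_request(problem, spec, is_reasoning_model=False)
reasoning = build_request(problem, spec, is_reasoning_model=True)
print(standard["prompt"])
print(reasoning)
```

Keeping the two controls in separate places (prompt text versus a request setting) makes it harder to conflate them later.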

The honest caveat: the boundary is blurry and model-specific. Some "reasoning" models still benefit from light structural guidance (e.g., "verify your answer satisfies all three constraints"), and behavior changes between model versions. This is exactly the kind of claim you should test on your own task rather than take on faith — which is the empirical philosophy running through this whole curriculum.
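Testing the claim on your own task can be as simple as scoring two prompt variants against the same labeled cases. The harness below is a minimal sketch: `ask` stands for whatever function calls your model, and `fake_ask` is a hard-coded stand-in (not a real model, and not evidence of anything) included only so the harness runs offline.

```python
from typing import Callable


def accuracy(
    ask: Callable[[str], str],
    build_prompt: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> float:
    """Fraction of cases where the last line of the reply equals the expected answer."""
    correct = 0
    for question, expected in cases:
        lines = ask(build_prompt(question)).strip().splitlines()
        final_line = lines[-1].strip() if lines else ""
        correct += final_line == expected
    return correct / len(cases)


def with_cot(question: str) -> str:
    return f"{question}\n\nThink through this step by step, then give the final answer on its own line."


def without_cot(question: str) -> str:
    return f"{question}\n\nAnswer with just the final answer."


def fake_ask(prompt: str) -> str:
    # Stand-in for a model call; replace with a real client in practice.
    return "7 sprints, minus 1, times 3\n18" if "step by step" in prompt else "21"


cases = [("3 features per sprint, 2-week sprints, 14 weeks, 1 sprint lost. How many features?", "18")]
print(accuracy(fake_ask, with_cot, cases))
print(accuracy(fake_ask, without_cot, cases))
```

With a real `ask` function and a few dozen cases from your own workload, the same two calls give a direct comparison for the model version you actually use.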

Pitfalls

Trusting the stated reasoning as a faithful explanation. The written chain is a performance that improves the answer; it is not a guaranteed-accurate account of why the model answered as it did. Do not use it as an audit trail for high-stakes decisions.
Reasoning on trivial tasks. Forcing CoT on simple classification or extraction adds latency and cost for no gain, and occasionally talks the model out of a correct snap answer.
Letting reasoning leak into the output. If a downstream system parses the response, unseparated reasoning breaks it. Always isolate the final answer.
Confusing CoT with self-consistency. CoT generates one reasoning path. If you need higher reliability, sample several independent chains and take a majority vote (self-consistency) — that is a separate, additive technique.
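The self-consistency step mentioned in the last pitfall reduces to a majority vote over the final answers of several sampled chains. A minimal sketch, assuming the answers have already been extracted from each chain (the list below is made up for illustration):

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of chains that agreed with it."""
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Final answers from five hypothetical, independently sampled chains.
sampled = ["18", "18", "21", "18 ", "15"]
print(majority_vote(sampled))  # ('18', 0.6)
```

The agreement share is a useful side product: a low value suggests the chains disagree enough that the case deserves a closer look.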

Multi-step word problem on a standard model

✕ Weaker

A SaaS plan costs $40/user/month with a 15% annual discount if paid yearly. A company has 25 users and pays annually. What's the yearly cost? Answer with just the number.

✓ Stronger

A SaaS plan costs $40/user/month with a 15% annual discount if paid yearly. A company has 25 users and pays annually. Work through this step by step: (1) monthly cost for all users, (2) un-discounted annual cost, (3) apply the 15% discount. Then output the final figure on its own line prefixed with 'ANSWER:'.

Why it's better: The 'just the number' version forces the model to commit before computing the dependent steps, and it frequently applies the discount to the monthly figure or forgets the 12x. Scaffolding the three steps and isolating the answer line gives both higher accuracy and a clean value to parse.
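For reference, the three scaffolded steps and the 'ANSWER:' line can both be checked in a few lines. This is a sketch under stated assumptions: integer arithmetic is used to avoid float rounding, the regular expression assumes a plain dollar figure on the answer line, and the sample response is hand-written rather than real model output.

```python
import re

price_per_user_month = 40
users = 25
discount_percent = 15

monthly_total = price_per_user_month * users                     # (1) 40 * 25 = 1000
annual_before = monthly_total * 12                               # (2) 1000 * 12 = 12000
annual_cost = annual_before * (100 - discount_percent) // 100    # (3) 12000 * 85 / 100 = 10200


def extract_answer_line(response: str) -> float:
    """Read the number from a line of the form 'ANSWER: $10,200'."""
    match = re.search(r"^ANSWER:\s*\$?([\d,]+(?:\.\d+)?)\s*$", response, re.MULTILINE)
    if match is None:
        raise ValueError("no ANSWER: line found")
    return float(match.group(1).replace(",", ""))


sample = (
    "1) 40 * 25 = 1000 per month\n"
    "2) 1000 * 12 = 12000 per year\n"
    "3) 12000 * 0.85 = 10200\n"
    "ANSWER: $10,200"
)
print(annual_cost)                  # 10200
print(extract_answer_line(sample))  # 10200.0
```

Because the answer sits on its own prefixed line, the parser never has to look at the numbers inside the reasoning.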

The same task on a reasoning model

✕ Weaker

You are an expert mathematician. Let's think step by step, very carefully, showing every single intermediate calculation in detail. A SaaS plan costs $40/user/month with a 15% annual discount if paid yearly, 25 users, paid annually — what is the yearly cost?

✓ Stronger

A SaaS plan costs $40/user/month with a 15% annual discount if paid yearly. The company has 25 users and pays annually. Return the yearly cost as JSON: {"yearly_cost_usd": <number>}.
(Reasoning effort: set to high in the request settings, not in the prompt text.)

Why it's better: On a reasoning model, the verbose 'think step by step, show every calculation' wrapper competes with the model's own trained deliberation and adds latency. The stronger version states the problem cleanly, specifies a parseable output, and relies on the actual lever (the reasoning effort setting) instead of a prompt incantation.

Few-shot CoT to control reasoning format

✕ Weaker

Is a customer eligible for the loyalty refund? Rules: account >12 months old, no refunds in last 90 days, lifetime spend >$500. Customer: 18 months, last refund 200 days ago, spent $430.

✓ Stronger

Decide loyalty-refund eligibility. Rules: account >12 months, no refund in last 90 days, lifetime spend >$500. For each rule, state pass/fail with the value, then the verdict.

Example:
Customer: 8 months, last refund 400 days ago, spent $900.
- Age >12mo: FAIL (8)
- No refund in 90d: PASS (400)
- Spend >$500: PASS ($900)
Verdict: NOT ELIGIBLE (failed age)

Now evaluate:
Customer: 18 months, last refund 200 days ago, spent $430.

Why it's better: The weaker prompt invites a one-word yes/no that often ignores one of the three conditions. The worked example asks for a per-rule check with the actual value shown, which makes a skipped condition much less likely and easy to spot, and it keeps the output format consistent across cases. The trace is convenient to review, but it is still model output, not proof of how the decision was reached.
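Since the rules are fully specified, a deterministic reference check is easy to write and can be used to score the model's traces. The sketch below mirrors the pass/fail format of the worked example; it assumes "no refund in last 90 days" means the last refund was more than 90 days ago, and the function name is illustrative.

```python
def check_eligibility(age_months: int, days_since_refund: int, lifetime_spend: int) -> tuple[str, list[str]]:
    """Apply the three loyalty-refund rules and return (verdict, per-rule lines)."""
    checks = [
        ("Age >12mo", age_months > 12, str(age_months)),
        # Assumption: a refund exactly 90 days ago still counts as "in the last 90 days".
        ("No refund in 90d", days_since_refund > 90, str(days_since_refund)),
        ("Spend >$500", lifetime_spend > 500, f"${lifetime_spend}"),
    ]
    lines = [f"- {name}: {'PASS' if ok else 'FAIL'} ({value})" for name, ok, value in checks]
    verdict = "ELIGIBLE" if all(ok for _, ok, _ in checks) else "NOT ELIGIBLE"
    return verdict, lines


verdict, lines = check_eligibility(age_months=18, days_since_refund=200, lifetime_spend=430)
print("\n".join(lines))
print("Verdict:", verdict)
```

For the customer in the prompt, the age and refund rules pass and the spend rule fails, so the expected verdict is NOT ELIGIBLE.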

Key takeaways

Chain-of-thought reserves output space for reasoning before the answer — it gives the model scratch space it can attend to, which is why multi-step accuracy jumps on standard models.
Use zero-shot CoT ('think step by step') for quick wins; use few-shot CoT with worked examples when you need to control the reasoning format and granularity.
On o3-style reasoning models, do not force explicit step-by-step prompting by default: it is redundant at best and can degrade the model's own internal reasoning at worst. Tune reasoning effort instead.
Always separate the reasoning from the final answer (tagged block or JSON) so downstream parsing doesn't choke on the scratch work.
The stated chain improves answers but is not a faithful explanation — never treat it as an audit trail for high-stakes decisions. And test the reasoning-vs-not boundary on your own task; it's model-specific.
