Ensembling

Why ensembling matters

A single LLM generation is a sample, not a verdict. Decoding is stochastic, and on a hard reasoning problem the model will sometimes take a wrong turn — drop a sign, miscount, hallucinate a step — even when it can reach the right answer. If you ask once, you are betting on one roll of the dice. Ensembling changes the bet: you draw several independent samples and aggregate them, so that the consensus reflects the model's central tendency rather than the noise of any single path.

This is one of the most robust ideas in prompting. Across the empirical prompt-engineering literature, including Schulhoff et al.'s The Prompt Report, ensembling consistently improves accuracy on tasks that have a checkable answer (arithmetic, multi-step logic, structured classification). It is not magic and not universal, but on the right problem it is one of the few techniques that buys real, measurable reliability rather than vibes.

The core method: self-consistency

The canonical form is self-consistency, introduced as an improvement on chain-of-thought. The recipe:

1. Take a chain-of-thought prompt (one that elicits step-by-step reasoning).
2. Sample the model multiple times at non-zero temperature so the reasoning paths diverge.
3. Extract the final answer from each path.
4. Take a majority vote over the final answers, discarding the reasoning.

The key insight is that many different reasoning paths lead to the right answer, whereas wrong answers tend to be scattered and idiosyncratic. So when you marginalize over the reasoning and count only the conclusions, the correct answer accumulates votes while the errors split the remainder.
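The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `sample_fn` and `extract_fn` are hypothetical callables standing in for whatever model client and answer parser you use, and `sample_fn` is assumed to sample at a temperature above zero.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistency(
    sample_fn: Callable[[str], str],               # hypothetical: one model call, temperature > 0
    extract_fn: Callable[[str], Optional[str]],    # hypothetical: pulls the final answer out of a completion
    prompt: str,
    n_samples: int = 5,
) -> Optional[str]:
    """Sample several reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        completion = sample_fn(prompt)
        answer = extract_fn(completion)  # keep the conclusion, discard the reasoning
        if answer is not None:           # skip paths with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    # most_common(1) gives the top-voted answer; ties go to the answer seen first
    return Counter(answers).most_common(1)[0][0]
```

Passing the sampler and the extractor in as arguments keeps the voting logic independent of any particular provider, and makes it easy to test with canned completions.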

Self-consistency works only at a temperature above zero. At temperature 0 every sample is (nearly) identical and the vote is meaningless: you are just paying N times for one answer.

A worked example

Suppose you ask a model to solve a word problem: "A store sold 3 boxes of 12 apples on Monday and twice as many on Tuesday. On Wednesday it sold 9 fewer than Tuesday. How many apples total?" You sample five chain-of-thought paths at temperature 0.7. Three paths reason cleanly and land on 171 (36 + 72 + 63). One path misreads "twice as many" as twice the apples per box, takes Tuesday to be 24, and lands on 75 (36 + 24 + 15). One drops the Wednesday subtraction and lands on 180 (36 + 72 + 72). A single call had a real chance of returning 75 or 180. The majority vote returns 171: the two error paths failed in different ways, so their votes split while the correct paths agreed. That is ensembling earning its cost.
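The arithmetic of the word problem is easy to check directly. The sketch below recomputes the total from the problem statement, derives two plausible wrong totals (one particular reading of each error, chosen for illustration), and runs the vote over five hypothetical sampled answers.

```python
from collections import Counter

monday = 3 * 12                    # 3 boxes of 12 apples
tuesday = 2 * monday               # "twice as many" as Monday
wednesday = tuesday - 9            # 9 fewer than Tuesday
correct_total = monday + tuesday + wednesday

# Two hypothetical error paths (illustrative readings of the mistakes)
misread_tuesday = 2 * 12           # reads "twice as many" as twice the per-box count
misread_total = monday + misread_tuesday + (misread_tuesday - 9)
dropped_total = monday + tuesday + tuesday   # forgets to subtract 9 on Wednesday

# Final answers from five sampled paths: three correct, two wrong in different ways
sampled_answers = [correct_total, misread_total, correct_total, dropped_total, correct_total]
winner, votes = Counter(sampled_answers).most_common(1)[0]

print(correct_total, misread_total, dropped_total)  # 171 75 180
print(winner, votes)                                # 171 3
```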

Beyond majority vote: mixtures of reasoning

Self-consistency varies the sampling while holding the prompt fixed. You can also vary the prompt itself and ensemble across variants — sometimes called a mixture-of-reasoning or prompt-ensemble approach:

Different exemplars: run the task with several distinct few-shot example sets, then vote. This reduces sensitivity to any one unlucky exemplar ordering.
Different phrasings or personas: ask "solve this step by step" vs. "verify each constraint before answering" and aggregate. Diverse prompts produce diverse error profiles, which is exactly what makes voting effective.
Different models: ensemble across two or three model families. Their mistakes are less correlated than samples from one model, so the consensus is often stronger per dollar than sampling a single model three times.

Aggregation does not have to be a raw vote. For free-form outputs where answers won't match exactly, you can have a final LLM call read all candidate responses and synthesize or select the best — a "judge" or "fuse" step. This is more flexible than counting but reintroduces a single point of failure, so use it deliberately.
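A judge step of the "select the best" kind might look like the sketch below. Everything here is an assumption for illustration: `judge_fn` is a hypothetical callable wrapping the final LLM call, and the `BEST: <number>` reply format is one arbitrary choice that makes the judge's verdict parseable.

```python
import re
from typing import Callable, Optional, Sequence


def build_judge_prompt(question: str, candidates: Sequence[str]) -> str:
    """Lay out all candidate responses, numbered from 1, for a single judge call."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates, start=1))
    return (
        f"Question:\n{question}\n\n"
        f"Candidate responses:\n{numbered}\n\n"
        "Reply with the number of the best response in the form 'BEST: <number>'."
    )


def select_with_judge(
    question: str,
    candidates: Sequence[str],
    judge_fn: Callable[[str], str],  # hypothetical: one model call returning text
) -> Optional[str]:
    """Return the candidate the judge picked, or None if its reply is unusable."""
    reply = judge_fn(build_judge_prompt(question, candidates))
    match = re.search(r"BEST:\s*(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1)) - 1  # the prompt numbers candidates from 1
    return candidates[index] if 0 <= index < len(candidates) else None
```

Returning `None` on an unparseable or out-of-range verdict makes the single point of failure visible to the caller instead of silently picking a candidate.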

The trade-off: cost vs. reliability

Ensembling's cost is linear in the number of samples and its accuracy gains are sharply diminishing. The jump from 1 to ~5 samples typically captures most of the benefit; going from 5 to 40 buys progressively less while multiplying your bill and latency. There is no fixed right number — it is an empirical knob you tune against your own eval set and your tolerance for cost.

Dimension	Single call	Ensemble (N samples)
Cost / latency	1x	~Nx (parallelizable)
Reliability on checkable tasks	Baseline	Higher, diminishing past ~5
Best fit	Cheap, low-stakes, or subjective	High-stakes, verifiable answer
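A toy model makes the diminishing returns concrete. Assume, purely for illustration, that each sample is independently correct with probability p and that there are only two possible answers; the majority vote is then right whenever more than half of the N samples are right, which is a binomial tail sum. Real samples are not independent, so real curves flatten sooner than this idealization suggests.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct), for odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd n so the vote cannot tie")
    # Sum the binomial probabilities of k correct samples for every k > n/2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


for n in (1, 3, 5, 11, 21, 41):
    print(n, round(majority_accuracy(0.8, n), 4))
```

With p = 0.8 in this model, going from 1 to 5 samples lifts accuracy from 0.80 to about 0.94, which leaves less than 0.06 to gain from every additional sample after that.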

Pitfalls and honest limits

It needs an aggregatable answer. Voting requires comparable outputs. For open-ended generation (essays, code with many valid forms) majority vote is ill-defined; you fall back to a judge step, which is weaker evidence.
Correlated errors defeat it. If every sample makes the same mistake — a systematic misreading of the prompt — the majority will confidently vote for the wrong answer. Ensembling fixes random error, not bias. If your single-call answers are consistently wrong in one direction, fix the prompt, don't ensemble it.
Temperature zero kills self-consistency. Worth repeating because it is the most common implementation bug.
Diminishing returns are real. Don't reach for N=20 reflexively. Measure where your accuracy curve flattens.
Cost can dominate. At production scale, 5x inference on every request is a serious budget line. Reserve ensembling for the slice of traffic where correctness justifies it — high-stakes decisions, ambiguous inputs, eval-time gold-answer generation — and serve a single call elsewhere.
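The difference between random error and bias can be seen in a small simulation. The setup is a deliberate simplification and every parameter is an assumption: on a fraction `shared_error_rate` of questions, all samples make the same systematic mistake; on the rest, each sample is independently right with probability `p_correct`. All wrong answers are collapsed into one label, which is pessimistic compared with the scattered errors described earlier.

```python
import random
from collections import Counter


def simulate(n_samples: int, p_correct: float, shared_error_rate: float,
             trials: int = 20_000, seed: int = 0) -> float:
    """Estimate majority-vote accuracy under a mix of shared and independent errors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if rng.random() < shared_error_rate:
            # Systematic misreading: every sample returns the same wrong answer
            answers = ["wrong"] * n_samples
        else:
            # Independent noise: each sample is right with probability p_correct
            answers = ["right" if rng.random() < p_correct else "wrong"
                       for _ in range(n_samples)]
        if Counter(answers).most_common(1)[0][0] == "right":
            wins += 1
    return wins / trials


for n in (1, 5, 21):
    print(n, simulate(n, 0.8, 0.0), simulate(n, 0.8, 0.2))
```

In this setup, accuracy with no shared errors keeps climbing as N grows, while accuracy with a 20% shared-error rate cannot rise much above 0.8 no matter how many samples are drawn.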

Used with discipline, ensembling is the cleanest lever you have for trading compute against reliability. Used reflexively, it is just a 5x bill for an answer that was already wrong.

Self-consistency on a financial calculation

Why it's better: The weak prompt asks once for a single number on a multi-step calculation where one dropped step yields a confidently wrong total. The strong prompt forces explicit, isolated steps (so error paths diverge rather than share a mistake) and emits a parseable FINAL line, making programmatic majority voting across five samples trivial. The orchestration — five samples at non-zero temperature, vote on FINAL — is what delivers the reliability, not the wording alone.
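The voting half of that orchestration can be sketched as follows. The exact shape of the FINAL line is an assumption here (`FINAL: <amount>`, with an optional dollar sign and thousands separators); adapt the pattern to whatever your prompt actually requests. Amounts are normalized to two decimal places so that `1,234.50` and `1234.5` count as the same vote.

```python
import re
from collections import Counter
from decimal import Decimal, InvalidOperation
from typing import Iterable, Optional

# Assumed output format: a line such as "FINAL: $1,234.50"
FINAL_RE = re.compile(r"^FINAL:\s*\$?\s*(-?[\d,]+(?:\.\d+)?)\s*$", re.MULTILINE)


def parse_final(completion: str) -> Optional[Decimal]:
    """Return the amount on the last FINAL line, or None if there is none."""
    matches = FINAL_RE.findall(completion)
    if not matches:
        return None
    try:
        # Strip thousands separators and round to cents so equal amounts compare equal
        return Decimal(matches[-1].replace(",", "")).quantize(Decimal("0.01"))
    except InvalidOperation:
        return None


def vote_on_final(completions: Iterable[str]) -> Optional[Decimal]:
    """Majority vote over the parsed FINAL amounts, ignoring unparseable samples."""
    votes = Counter(v for v in map(parse_final, completions) if v is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Using `Decimal` rather than `float` avoids two samples that agree on the amount being split into separate votes by binary rounding.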

Prompt-variant ensemble for ambiguous classification

Why it's better: Ambiguous tickets (e.g. 'I was charged for a feature that crashes') are exactly where one framing biases the answer. The weak prompt commits to a single phrasing and inherits its blind spot. The strong prompt ensembles across three deliberately different framings whose errors are uncorrelated, then votes — and uses a three-way disagreement as a useful signal to route to a human rather than guessing.
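A sketch of that vote-or-escalate logic is below. The three framings are hypothetical wordings written for this example, not the prompts the comparison refers to, and `classify_fn` is an assumed callable that sends one prompt to a model and returns a category label.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical framings, each approaching the ticket from a different angle
FRAMINGS = (
    "Classify this support ticket by the customer's primary complaint: {ticket}",
    "Which team should own this ticket? Answer with the team's category: {ticket}",
    "What outcome does the customer want? Map it to a category: {ticket}",
)

NEEDS_HUMAN = "needs_human_review"


def classify_with_variants(
    ticket: str,
    classify_fn: Callable[[str], str],  # hypothetical: one model call returning a label
    framings: Sequence[str] = FRAMINGS,
) -> Tuple[str, List[str]]:
    """Ask once per framing, vote, and escalate when no label wins a strict majority."""
    labels = [classify_fn(f.format(ticket=ticket)).strip().lower() for f in framings]
    label, votes = Counter(labels).most_common(1)[0]
    if votes * 2 <= len(labels):
        # With three framings this means a three-way disagreement
        return NEEDS_HUMAN, labels
    return label, labels
```

Returning the raw labels alongside the decision lets a reviewer see how the framings disagreed when a ticket is escalated.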