From Chat to Production

Why the distinction matters

Most prompting advice quietly assumes a human is in the loop — you type something, read the answer, and fix it if it's wrong. That assumption is doing enormous work. The moment a prompt runs unattended inside a product, the human safety net disappears, and techniques that felt adequate in a chat window become liabilities. Schulhoff's Prompt Report draws this line sharply: the techniques and the rigor you need scale with the cost of a bad output and the absence of someone to catch it.

So before optimizing anything, answer one question: who reads each output, and what happens if it's wrong? Your answer places you in one of three modes.

Mode 1: Casual conversational use

You are the reviewer. A wrong answer costs you a few seconds. Here, heavy machinery — formal eval sets, structured templates — is wasted effort. Use a tight loop:

Ask plainly. State the task, the context, and the format you want.
Read the output critically. Where is it wrong, vague, or off-format?
Refine by adding the missing constraint or an example of what "good" looks like.
Repeat until it's good enough, then stop.

This is empirical prompting in miniature: you are running trial-and-error with yourself as the evaluator. The skill is reading outputs honestly and naming the specific defect rather than re-rolling and hoping. Do not over-invest — the whole point of this mode is that iteration is cheap because you're standing right there.

Mode 2: High-stakes production

No one reads each output, or the cost of a wrong one is real: a customer-facing answer, a database write, a financial extraction. The four-step loop from Mode 1 is now dangerous, because "looks good to me on three tries" is not evidence about the next ten thousand calls. Production prompting has three phases:

1. Establish a baseline

Build a labeled test set before you tune anything: 20–100 representative inputs with known-good outputs, deliberately including the hard and weird cases. Write the simplest prompt that could work, run it across the set, and score it. This number is your baseline. Without it you cannot tell improvement from noise — and the literature is full of changes that felt better but weren't.
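
One way the baseline step might look in code is sketched below. Everything named here is an assumption for illustration: the three-item LABELED_SET stands in for a real set of 20 to 100 cases, exact-match accuracy is only one possible scoring rule, and fake_model is a keyword stub used so the sketch runs without a model provider.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled set: (input, known-good output).
# A real set would hold 20-100 cases, including hard and weird ones.
LABELED_SET: List[Tuple[str, str]] = [
    ("Hi, I'd like a refund for order 1182.", "refund"),
    ("How do I reset my password?", "account"),
    ("Great, another outage. Love it.", "complaint"),  # sarcasm: deliberately hard
]

def simplest_prompt(text: str) -> str:
    """The simplest prompt that could work."""
    return f"Classify the intent of this support email in one word:\n\n{text}"

def score(run_model: Callable[[str], str],
          build_prompt: Callable[[str], str],
          cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases where the normalized output exactly matches the label."""
    correct = 0
    for text, expected in cases:
        output = run_model(build_prompt(text)).strip().lower()
        correct += output == expected
    return correct / len(cases)

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): simple keyword rules."""
    lowered = prompt.lower()
    if "refund" in lowered:
        return "refund"
    if "password" in lowered:
        return "account"
    return "praise"  # misreads the sarcastic case

baseline = score(fake_model, simplest_prompt, LABELED_SET)
print(f"baseline accuracy: {baseline:.2f}")
```

Swapping fake_model for a real client call leaves the scoring loop unchanged, which is the point: the set and the metric stay fixed while the prompt varies.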

2. Improve systematically

Change one thing at a time and re-score against the same set. Add few-shot examples; add a chain-of-thought instruction; tighten the output schema. Keep changes that move the score, discard those that don't, regardless of how clever they felt. Resist stacking five techniques at once — you'll never know which one helped, and some interact badly.
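
The one-change-at-a-time discipline can be sketched as a small comparison harness. This is an illustration under assumptions: each "variant" is the baseline prompt plus exactly one change, wrapped with the model as a single function, and the stub predictors below are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable, Dict, List, Set, Tuple

Case = Tuple[str, str]            # (input, expected output)
Predictor = Callable[[str], str]  # one prompt variant plus the model, as one function

def accuracy(predict: Predictor, cases: List[Case]) -> float:
    """Exact-match accuracy over the fixed eval set."""
    return sum(predict(x) == y for x, y in cases) / len(cases)

def evaluate_changes(baseline: Predictor,
                     variants: Dict[str, Predictor],
                     cases: List[Case]) -> Dict[str, Tuple[float, bool]]:
    """Score each single-change variant on the same set.

    Returns {variant name: (score, keep)}, where keep is True only if the
    variant beats the baseline score.
    """
    base = accuracy(baseline, cases)
    report = {}
    for name, predict in variants.items():
        variant_score = accuracy(predict, cases)
        report[name] = (variant_score, variant_score > base)
    return report

def stub(solved: Set[str]) -> Predictor:
    """Hypothetical predictor that answers correctly only for inputs in `solved`."""
    return lambda x: x.upper() if x in solved else "?"

CASES: List[Case] = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]

report = evaluate_changes(
    baseline=stub({"a", "b"}),
    variants={
        "few_shot": stub({"a", "b", "c"}),     # one change: added examples
        "chain_of_thought": stub({"a", "b"}),  # one change: added reasoning step
    },
    cases=CASES,
)
for name, (variant_score, keep) in report.items():
    print(f"{name}: {variant_score:.2f} -> {'keep' if keep else 'discard'}")
```

On a set this small a one-case difference may be noise; with a real eval set you would likely want a minimum gain threshold or repeated runs before keeping a change.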

3. Monitor and eval in production

Offline scores drift from live behavior. Log inputs and outputs, sample them for human review, and run automated checks — schema validation, an LLM-as-judge grader, regression tests on your golden set in CI. Model providers update models underneath you, so a prompt frozen today can silently degrade. Treat the eval set as a living asset: every production failure becomes a new test case.
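
A minimal sketch of three monitoring pieces mentioned above: sampling live traffic for human review, a regression gate a CI job could call, and folding production failures back into the golden set. The function names, the 5% sampling rate, and the 0.02 tolerance are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Tuple

def sampled_for_review(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for human review.

    Hashing the request id means the same request is always in or out of
    the sample, regardless of which worker handles it.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64

def regression_gate(current_score: float,
                    baseline_score: float,
                    tolerance: float = 0.02) -> bool:
    """True if the golden-set score is within `tolerance` of the baseline."""
    return current_score >= baseline_score - tolerance

def add_failure_to_golden_set(golden_set: List[Tuple[str, str]],
                              failed_input: str,
                              corrected_output: str) -> None:
    """Turn a production failure into a new test case, skipping duplicates."""
    case = (failed_input, corrected_output)
    if case not in golden_set:
        golden_set.append(case)
```

A CI job might compute the current golden-set score and exit non-zero when regression_gate returns False, so that a silently degraded prompt blocks the deploy instead of reaching users.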

A worked example

Suppose you extract {name, email, intent} from support emails. In a chat you'd paste an email, eyeball the JSON, and move on. For production you build 50 labeled emails (including ones with no email address, two people, or sarcasm), score a one-line prompt against them to get your baseline, then test improvements one at a time: a strict JSON schema fixes malformed output; two few-shot examples of the "no email present" case fix hallucinated addresses; an explicit null rule fixes the rest. Each gain is measured, not felt. Once deployed, you validate every output against the schema and alert when intent falls outside your allowed set.
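
The runtime check in this example might look like the sketch below. The contents of ALLOWED_INTENTS are hypothetical (the text does not list the allowed values), and the email check is deliberately crude; a real service would likely use a stricter validator and route the raised error to alerting.

```python
import json

# Hypothetical allowed set; substitute the intents your product actually uses.
ALLOWED_INTENTS = {"refund", "billing", "technical", "other"}

def validate_extraction(raw: str) -> dict:
    """Parse a model response and raise ValueError if it breaks the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc

    if not isinstance(data, dict) or set(data) != {"name", "email", "intent"}:
        raise ValueError("expected exactly the keys name, email, intent")

    name, email, intent = data["name"], data["email"], data["intent"]
    if name is not None and not isinstance(name, str):
        raise ValueError("name must be a string or null")
    # Explicit null rule: a missing address must be null, never invented.
    if email is not None and (not isinstance(email, str) or "@" not in email):
        raise ValueError("email must be null or look like an address")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent outside allowed set: {intent!r}")
    return data

ok = validate_extraction('{"name": "Ana", "email": null, "intent": "refund"}')
print(ok["intent"])
```

Because the check lives in code, a response with an unknown intent fails loudly on every call rather than only when someone happens to read it.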

Mode 3: Migrating between modes

The dangerous moment is promotion — a prompt that worked beautifully in your chat experiments gets pasted into a service. The chat version was implicitly co-authored by you: your follow-ups, your clarifications, your willingness to retry. None of that ships. To migrate safely:

Make implicit context explicit. Everything you "just knew" must now live in the prompt — role, constraints, edge-case rules.
Pin the variables. Fix the model version, temperature, and system prompt. "It worked yesterday" is not a spec.
Build the eval set retroactively. Turn the examples you tried by hand — especially the failures — into your first labeled set.
Add the guardrails the human used to provide: input validation, output validation, a fallback path, and a cap on retries.
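
The pinning and guardrail steps can be sketched together. This is an assumed design, not a prescribed one: PromptConfig, call_with_guardrails, and the model identifier in the usage note are hypothetical names, and call_model stands for whatever client function your provider offers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass(frozen=True)
class PromptConfig:
    """Pinned variables, intended to be tracked in version control."""
    model: str           # a dated model identifier rather than a floating alias
    temperature: float
    system_prompt: str
    prompt_version: str

def call_with_guardrails(config: PromptConfig,
                         call_model: Callable[[PromptConfig, str], str],
                         validate: Callable[[str], Any],
                         user_input: str,
                         max_retries: int = 2,
                         fallback: Optional[Any] = None) -> Any:
    """Wrap a model call with the checks a human reviewer used to provide."""
    # Input validation: do not spend a call on empty input.
    if not user_input.strip():
        return fallback
    # One initial attempt plus a capped number of retries.
    for _ in range(1 + max_retries):
        raw = call_model(config, user_input)
        try:
            return validate(raw)  # output validation
        except ValueError:
            continue              # malformed output: try again
    return fallback               # fallback path once retries are exhausted
```

A caller might construct something like PromptConfig(model="example-model-2024-01-01", temperature=0.0, system_prompt="...", prompt_version="v3") once at startup and pass it to every call, so that a change to any pinned value shows up as a diff.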

The shipping checklist

A labeled test set exists, including hard and adversarial cases.
You have a baseline score and measured each change against it.
Output format is validated programmatically, not by eye.
Model, temperature, and prompt version are pinned and tracked in version control.
A failure path exists for malformed or refused outputs.
Live outputs are logged and sampled; the eval set runs in CI.
There is an owner who gets paged when quality regresses.

Pitfalls

Honest caveats, because the evidence here is mixed. Vibes-based tuning is the big one — without a held-out set you optimize to a handful of inputs and overfit. Technique stacking obscures causality; the prompt-engineering literature repeatedly finds that a method that helps one task or model hurts another, so generic "always do X" rules don't survive contact with your data. LLM-as-judge graders are cheap and scalable but biased and gameable, so calibrate them against human labels before trusting them. And never let prompting substitute for engineering: the most reliable guardrails — schema validation, allow-lists, deterministic post-processing — live in your code, not in your prose.
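
Calibrating a judge against human labels can start with a simple agreement measure. The sketch below uses Cohen's kappa, which is one common choice and not something the text above prescribes; the pass/fail labels are made-up illustration data.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for illustration: the judge passes more items than the humans do.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

print(f"kappa = {cohens_kappa(human_labels, judge_labels):.2f}")
```

Here raw agreement is 75%, but half of that is expected by chance, so kappa comes out at 0.5. The two disagreements both run in the lenient direction, which is the kind of systematic bias worth catching before the judge is trusted in CI.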

Promoting a chat prompt to a production service
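
To make the contrast concrete, here is a hypothetical pair of prompts for the support-email extraction task. Neither prompt comes from a real system: the wording, the ALLOWED_INTENTS values, and the empty-input rule are assumptions chosen to illustrate the difference.

```python
# Hypothetical allowed set, enumerated so code can reject anything else.
ALLOWED_INTENTS = ("refund", "billing", "technical", "other")

# Chat-style prompt: relies on a human to read the prose answer.
CHAT_PROMPT = "Can you pull out who sent this email and what they want?\n\n{email}"

# Production-style prompt: schema, enumerated values, explicit edge case.
PRODUCTION_PROMPT = """\
Extract fields from the support email between the <email> tags.

Return only a JSON object with exactly these keys:
  "name":   string, or null if no sender name is given
  "email":  string, or null if no email address appears in the text
  "intent": one of {intents}

If the email is empty or contains no request, return
{{"name": null, "email": null, "intent": "other"}}.
Never guess or invent a value.

<email>
{email}
</email>"""

def build_production_prompt(email_body: str) -> str:
    """Fill the template; the allowed intents come from the same tuple the code validates against."""
    intents = ", ".join(f'"{intent}"' for intent in ALLOWED_INTENTS)
    return PRODUCTION_PROMPT.format(intents=intents, email=email_body)

print(build_production_prompt("Hi, I was charged twice. Jo"))
```

Generating the intent list from ALLOWED_INTENTS keeps the prompt and the validator from drifting apart when a new intent is added.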

Why it's better: A chat-style prompt relies on a human reading prose and inferring intent, which is fine in chat and unusable in a pipeline. The production version pins the output to a validatable schema, enumerates allowed values so you can reject anything else, and specifies the empty-input edge case the human used to handle implicitly. It is also testable: every field can be scored against a labeled set.

Improving systematically vs. guessing

Why it's better: Stacking changes makes it impossible to know what helped, and the empirical literature shows techniques interact unpredictably across tasks and models. Isolating one variable at a time against a fixed eval set turns "seems better" into a measured, defensible improvement, and it surfaces the changes that quietly hurt.