From Chat to Production
A prompt you tweak by hand in a chat window and a prompt that runs ten thousand times a day are two different engineering problems — know which one you're solving.
Why the distinction matters
Most prompting advice quietly assumes a human is in the loop — you type something, read the answer, and fix it if it's wrong. That assumption is doing enormous work. The moment a prompt runs unattended inside a product, the human safety net disappears, and techniques that felt adequate in a chat window become liabilities. Schulhoff's Prompt Report draws this line sharply: the techniques and the rigor you need scale with the cost of a bad output and the absence of someone to catch it.
So before optimizing anything, answer one question: who reads each output, and what happens if it's wrong? Your answer places you in one of three modes.
Mode 1: Casual conversational use
You are the reviewer. A wrong answer costs you a few seconds. Here, heavy machinery — formal eval sets, structured templates — is wasted effort. Use a tight loop:
- Ask plainly. State the task, the context, and the format you want.
- Read the output critically. Where is it wrong, vague, or off-format?
- Refine by adding the missing constraint or an example of what "good" looks like.
- Repeat until it's good enough, then stop.
This is empirical prompting in miniature: you are running trial-and-error with yourself as the evaluator. The skill is reading outputs honestly and naming the specific defect rather than re-rolling and hoping. Do not over-invest — the whole point of this mode is that iteration is cheap because you're standing right there.
Mode 2: High-stakes production
No one reads each output, or the cost of a wrong one is real: a customer-facing answer, a database write, a financial extraction. The four-step loop from Mode 1 is now dangerous, because "looks good to me on three tries" is not evidence about the next ten thousand calls. Production prompting has three phases:
1. Establish a baseline
Build a labeled test set before you tune anything: 20–100 representative inputs with known-good outputs, deliberately including the hard and weird cases. Write the simplest prompt that could work, run it across the set, and score it. This number is your baseline. Without it you cannot tell improvement from noise — and the literature is full of changes that felt better but weren't.
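A minimal sketch of the baseline step, assuming exact-match scoring is good enough for your task. The names `LabeledCase`, `score_prompt`, and `fake_model` are hypothetical; `fake_model` is a stand-in for a real model call so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledCase:
    input_text: str
    expected: str


def score_prompt(
    prompt: str,
    cases: list[LabeledCase],
    call_model: Callable[[str], str],
) -> float:
    """Return the fraction of cases whose output exactly matches the label."""
    if not cases:
        raise ValueError("the labeled set is empty")
    correct = 0
    for case in cases:
        output = call_model(prompt.format(input=case.input_text))
        if output.strip() == case.expected:
            correct += 1
    return correct / len(cases)


# Stand-in for a real model call (assumption): anything mentioning
# "refund" is labeled billing, everything else is labeled other.
def fake_model(full_prompt: str) -> str:
    return "billing" if "refund" in full_prompt.lower() else "other"


cases = [
    LabeledCase("I want a refund for last month.", "billing"),
    LabeledCase("The app crashes on login.", "bug"),
    LabeledCase("Thanks, all good!", "other"),
    LabeledCase("REFUND please", "billing"),
]
baseline = score_prompt("Classify the intent of this email: {input}", cases, fake_model)
print(f"baseline accuracy: {baseline:.2f}")  # 3 of 4 cases match, so 0.75
```

Exact match is the simplest possible scorer. For free-text outputs you would likely swap in a task-specific comparison, but the shape stays the same: one fixed set, one number.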
2. Improve systematically
Change one thing at a time and re-score against the same set. Add few-shot examples; add a chain-of-thought instruction; tighten the output schema. Keep changes that move the score, discard those that don't, regardless of how clever they felt. Resist stacking five techniques at once — you'll never know which one helped, and some interact badly.
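One way to sketch the one-change-at-a-time loop, under the assumption that you already have a scoring function that runs a prompt over the fixed labeled set. `improve_one_at_a_time` and `toy_score` are hypothetical names; `toy_score` returns made-up numbers purely so the example is runnable.

```python
from typing import Callable


def improve_one_at_a_time(
    base_prompt: str,
    candidate_changes: dict[str, str],
    score: Callable[[str], float],
) -> tuple[str, list[tuple[str, float, bool]]]:
    """Try each change alone on top of the current best prompt.

    A change is kept only if it raises the score on the same eval set.
    """
    best_prompt = base_prompt
    best_score = score(base_prompt)
    log = []
    for name, addition in candidate_changes.items():
        trial_prompt = best_prompt + "\n" + addition
        trial_score = score(trial_prompt)
        kept = trial_score > best_score
        log.append((name, trial_score, kept))
        if kept:
            best_prompt, best_score = trial_prompt, trial_score
    return best_prompt, log


# Toy scorer (assumption): stands in for scoring against the labeled set.
def toy_score(prompt: str) -> float:
    value = 0.60
    if "Return only JSON" in prompt:
        value += 0.20
    if "Be stern" in prompt:
        value -= 0.05
    if "use null" in prompt:
        value += 0.10
    return round(value, 2)


changes = {
    "json_schema": "Return only JSON.",
    "stern_tone": "Be stern.",
    "null_rule": "If a field is missing, use null.",
}
final_prompt, log = improve_one_at_a_time("Extract name, email, intent.", changes, toy_score)
for name, trial_score, kept in log:
    print(f"{name}: {trial_score:.2f} {'kept' if kept else 'discarded'}")
```

With a small test set, a tiny score difference may be noise, so in practice you might require a minimum margin before keeping a change.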
3. Monitor and eval in production
Offline scores drift from live behavior. Log inputs and outputs, sample them for human review, and run automated checks — schema validation, an LLM-as-judge grader, regression tests on your golden set in CI. Model providers update models underneath you, so a prompt frozen today can silently degrade. Treat the eval set as a living asset: every production failure becomes a new test case.
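The monitoring pieces can be sketched as three small helpers. All names here (`sampled_for_review`, `regression_gate`, `failure_to_test_case`) and the default tolerance are assumptions for illustration, not part of any particular tool.

```python
import hashlib


def sampled_for_review(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for human review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def regression_gate(current_score: float, baseline_score: float, tolerance: float = 0.02) -> bool:
    """Return True if the golden-set score is no more than `tolerance` below baseline."""
    return current_score >= baseline_score - tolerance


def failure_to_test_case(logged_input: str, corrected_output: str) -> dict:
    """Turn a reviewed production failure into a new golden-set entry."""
    return {"input": logged_input, "expected": corrected_output, "source": "production_failure"}


golden_set = [{"input": "Refund please", "expected": "billing", "source": "seed"}]
golden_set.append(failure_to_test_case("My card was charged twice?!", "billing"))

print(regression_gate(current_score=0.91, baseline_score=0.92))  # small dip, inside tolerance
print(regression_gate(current_score=0.80, baseline_score=0.92))  # large drop, gate fails
```

Hashing the request ID instead of calling a random number generator means the same request is always in or out of the review sample, which makes sampled logs reproducible. In CI, a failing `regression_gate` would fail the build.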
A worked example
Suppose you extract {name, email, intent} from support emails. In a chat you'd paste an email, eyeball the JSON, and move on. In production you build 50 labeled emails (including ones with no email address, ones mentioning two people, and ones using sarcasm), run a one-line prompt across all of them, and record its score as the baseline. Then you test improvements one at a time: a strict JSON schema fixes malformed output; two few-shot examples of the "no email present" case fix hallucinated addresses; an explicit null rule fixes the rest. Each gain is measured, not felt. Once the prompt is live, you validate every output against the schema and alert when intent falls outside your allowed set.
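To make "measured, not felt" concrete, here is a sketch of per-field scoring for the {name, email, intent} extraction. The function names and the two sample records are invented for illustration; the sketch assumes exact-match comparison and represents a missing address as JSON null.

```python
import json
from typing import Optional

FIELDS = ("name", "email", "intent")


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if it is not JSON with exactly the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(FIELDS):
        return None
    return data


def field_accuracy(outputs: list[str], labels: list[dict]) -> dict[str, float]:
    """Per-field exact-match accuracy; unparseable output counts as wrong for every field."""
    if not labels or len(outputs) != len(labels):
        raise ValueError("outputs and labels must be non-empty and the same length")
    hits = {field: 0 for field in FIELDS}
    for raw, label in zip(outputs, labels):
        parsed = parse_extraction(raw)
        if parsed is None:
            continue
        for field in FIELDS:
            if parsed[field] == label[field]:
                hits[field] += 1
    return {field: hits[field] / len(labels) for field in FIELDS}


labels = [
    {"name": "Ana", "email": "ana@example.com", "intent": "billing"},
    {"name": "Raj", "email": None, "intent": "bug"},  # no address in the email
]
outputs = [
    '{"name": "Ana", "email": "ana@example.com", "intent": "billing"}',
    '{"name": "Raj", "email": "raj@example.com", "intent": "bug"}',  # hallucinated address
]
print(field_accuracy(outputs, labels))
```

Scoring per field rather than per record shows where a change helped: here the email score is the one the "no email present" examples would be expected to move.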
Mode 3: Migrating between modes
The dangerous moment is promotion — a prompt that worked beautifully in your chat experiments gets pasted into a service. The chat version was implicitly co-authored by you: your follow-ups, your clarifications, your willingness to retry. None of that ships. To migrate safely:
- Make implicit context explicit. Everything you "just knew" must now live in the prompt — role, constraints, edge-case rules.
- Pin the variables. Fix the model version, temperature, and system prompt. "It worked yesterday" is not a spec.
- Build the eval set retroactively. Turn the examples you tried by hand — especially the failures — into your first labeled set.
- Add the guardrails the human used to provide: input validation, output validation, a fallback path, and a cap on retries.
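The four guardrails in the last bullet can be sketched as one wrapper around the model call. Everything named here is an assumption for illustration: the `guarded_call` function, the size limit, the attempt cap, and the `needs_human` flag on the fallback. `flaky_model` is a stand-in so the example runs without a real model.

```python
import json
from typing import Callable

MAX_INPUT_CHARS = 20_000  # assumption: choose a limit that fits your model and data
MAX_ATTEMPTS = 2          # cap on total attempts, including the first one
FALLBACK = {"intent": "other", "summary": "", "urgent": False, "needs_human": True}


def is_valid(data: object) -> bool:
    """Output validation: a JSON object with a string intent (stricter checks would go here)."""
    return isinstance(data, dict) and isinstance(data.get("intent"), str)


def guarded_call(email_body: str, call_model: Callable[[str], str]) -> dict:
    """Wrap a model call with input validation, output validation, capped retries, and a fallback."""
    if not email_body.strip() or len(email_body) > MAX_INPUT_CHARS:
        return dict(FALLBACK)  # input validation: skip the model entirely
    for _ in range(MAX_ATTEMPTS):
        raw = call_model(email_body)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: try again, up to the cap
        if is_valid(data):
            return data
    return dict(FALLBACK)  # fallback path: hand off to a human instead of guessing


# Stand-in model (assumption): answers in prose once, then returns valid JSON.
attempts = []


def flaky_model(text: str) -> str:
    attempts.append(text)
    return "Sure, here you go!" if len(attempts) == 1 else '{"intent": "billing"}'


print(guarded_call("I was charged twice.", flaky_model))  # succeeds on the second attempt
print(guarded_call("   ", flaky_model))                   # blank input goes straight to fallback
```

The fallback is a value rather than an exception so the calling service always gets a well-formed record and can route flagged items to a review queue.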
The shipping checklist
- A labeled test set exists, including hard and adversarial cases.
- You have a baseline score and measured each change against it.
- Output format is validated programmatically, not by eye.
- Model, temperature, and prompt version are pinned and tracked in version control.
- A failure path exists for malformed or refused outputs.
- Live outputs are logged and sampled; the eval set runs in CI.
- There is an owner who gets paged when quality regresses.
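For the pinning item, one possible shape is an immutable config object that lives in version control, with a short fingerprint logged next to every output. `PromptConfig` and all its values are hypothetical; in particular, the model string is a placeholder, not a real model identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptConfig:
    model: str
    temperature: float
    prompt_version: str
    system_prompt: str

    def fingerprint(self) -> str:
        """Stable short hash of the full config, suitable for logging with each output."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values (assumption): use the exact identifier your provider documents.
config = PromptConfig(
    model="example-model-2024-01-01",
    temperature=0.0,
    prompt_version="triage-v3",
    system_prompt="You are a support-triage classifier.",
)
print(config.fingerprint())
```

If a logged fingerprint changes between two outputs, something in the model, temperature, or prompt changed, which narrows down where a regression came from.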
Pitfalls
Honest caveats, because the evidence here is mixed. Vibes-based tuning is the big one — without a held-out set you optimize to a handful of inputs and overfit. Technique stacking obscures causality; the prompt-engineering literature repeatedly finds that a method that helps one task or model hurts another, so generic "always do X" rules don't survive contact with your data. LLM-as-judge graders are cheap and scalable but biased and gameable, so calibrate them against human labels before trusting them. And never let prompting substitute for engineering: the most reliable guardrails — schema validation, allow-lists, deterministic post-processing — live in your code, not in your prose.
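Calibrating a judge against human labels can be as simple as measuring agreement on a sample both have graded. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function name and the eight sample labels are invented for illustration.

```python
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return raw agreement and Cohen's kappa between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0 here,
        # and this sketch reports 1.0 by convention (assumption).
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")  # agreement=0.75 kappa=0.50
```

In this toy sample the judge passes more items than the human does, so 75% raw agreement shrinks to a kappa of 0.50. What threshold counts as trustworthy is a judgment call for your task.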
Promoting a chat prompt to a production service
✕ Weaker
Summarize this customer email and tell me what they want.
✓ Stronger
You are a support-triage classifier. Given a customer email, return ONLY valid JSON matching: {"intent": one of ["billing","bug","feature_request","other"], "summary": string (max 200 chars), "urgent": boolean}. If the email is empty or unintelligible, return {"intent":"other","summary":"","urgent":false}. Do not add fields. Examples:
[two worked examples, including one empty email]
Email: {{email_body}}
Why it's better: The weaker prompt relies on a human reading prose and inferring intent, which is fine in chat but unusable in a pipeline. The stronger prompt pins the output to a validatable schema, enumerates allowed values so you can reject anything else, and specifies the empty-input edge case the human used to handle implicitly. It is also testable: every field can be scored against a labeled set.
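Because the stronger prompt names every field and allowed value, the matching check in code is short. This is a hand-rolled sketch of such a validator; the function name `validate_triage` is hypothetical, and a schema-validation library could do the same job.

```python
import json

ALLOWED_INTENTS = {"billing", "bug", "feature_request", "other"}
EXPECTED_KEYS = {"intent", "summary", "urgent"}


def validate_triage(raw: str) -> dict:
    """Parse model output and check it against the schema in the prompt.

    Raises ValueError on any violation so the caller can retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("output must be an object with exactly intent, summary, urgent")
    if not isinstance(data["intent"], str) or data["intent"] not in ALLOWED_INTENTS:
        raise ValueError(f"intent not in allow-list: {data['intent']!r}")
    if not isinstance(data["summary"], str) or len(data["summary"]) > 200:
        raise ValueError("summary must be a string of at most 200 characters")
    if not isinstance(data["urgent"], bool):
        raise ValueError("urgent must be a boolean")
    return data


ok = validate_triage('{"intent": "bug", "summary": "App crashes on login", "urgent": true}')
print(ok["intent"])  # bug

try:
    validate_triage('{"intent": "complaint", "summary": "Angry", "urgent": false}')
except ValueError as err:
    print(f"rejected: {err}")
```

The "Do not add fields" instruction in the prompt is enforced here by the exact key-set comparison, so the rule holds even when the model ignores it.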
Improving systematically vs. guessing
✕ Weaker
This extraction prompt feels unreliable, so I added chain-of-thought, three examples, a JSON schema, and a sterner tone all at once. It seems better now.
✓ Stronger
Baseline the current prompt on the 50-email labeled set and record the score. Then test changes one at a time, re-scoring after each: (1) add the JSON schema; (2) add two few-shot examples of the 'no email present' case; (3) add an explicit null rule. Keep only the changes that raised the score; discard the rest.
Why it's better: Stacking changes makes it impossible to know what helped, and the empirical literature shows techniques interact unpredictably across tasks and models. Isolating one variable at a time against a fixed eval set turns 'seems better' into a measured, defensible improvement — and surfaces the changes that quietly hurt.
Key takeaways
- Pick your mode by asking who reads each output and what a wrong one costs — that single question determines how much rigor you need.
- Casual use is a four-step loop (ask, read, refine, repeat) with you as the evaluator; don't over-engineer it.
- Production prompting is baseline, then one-change-at-a-time improvement measured against a labeled set, then live monitoring and evals.
- Migration is where prompts break: the human's implicit context, retries, and judgment don't ship — make them explicit and put guardrails in code.
- Reliability comes from engineering (schema validation, allow-lists, pinned versions), not from cleverer wording.
Further reading
- Sander Schulhoff et al., 'The Prompt Report: A Systematic Survey of Prompting Techniques' (2024)
- Sander Schulhoff on Lenny's Podcast — prompt engineering, the 15 core techniques, and the empirical/trial-and-error philosophy
- Learn Prompting (learnprompting.org) — documentation on few-shot prompting, chain-of-thought, and prompt evaluation
- Industry practice on LLM evaluation: golden test sets, LLM-as-judge grading, and regression testing in CI