Constitutional / Principle-Based Prompting
Give the model a short set of governing principles and an explicit step to check its own output against them before it answers.
Why it matters
Most prompts tell the model what task to do. Constitutional prompting tells it what rules it must never break while doing the task — and then forces it to check its own work against those rules before it commits to an answer. The technique borrows directly from Anthropic's Constitutional AI research, where a model is trained to critique and revise its outputs against a written set of principles ("a constitution") instead of relying solely on human feedback for every case. The same loop works at inference time, with no training: you supply the principles in the prompt and ask for a self-critique-then-revise pass.
This matters for production systems for one practical reason: policy becomes a readable artifact. Instead of behavior being smeared across a dozen ad-hoc instructions ("don't be rude", "don't give legal advice", "always cite sources"), you collect the governing rules into one named block. That block can be reviewed by a compliance or legal stakeholder, version-controlled, diffed, and reused across every prompt in your stack. It turns "the model usually behaves" into "here are the eight rules the model is told to enforce on itself, and here is the step where it does so."
How to do it
There are two layers, and you can use either or both.
- State the constitution. Write a short, numbered list of principles — typically 4 to 10. Each principle should be specific enough to be checkable. "Be helpful" is useless; "Do not state a medical dosage; instead direct the user to a licensed professional" is checkable.
- Add an explicit self-critique-and-revise step. Tell the model to produce a draft, then evaluate that draft against each principle, then produce a revised final answer. This is the part that actually does the work. A constitution with no enforcement step is just a longer system prompt.
A minimal structure looks like this:
You are a support agent for a consumer banking app.
Operate under these principles (your constitution):
1. Never request, repeat, or store a full card number, CVV, or password.
2. Never promise refunds, fee reversals, or timelines you cannot verify.
3. If the user describes possible fraud, prioritize directing them to
freeze the card over troubleshooting.
4. Do not give tax or investment advice; refer to a licensed advisor.
5. Match the user's language and keep responses under 120 words.
Process:
- Draft a reply.
- Critique the draft against principles 1-5. List any violations.
- Output ONLY the corrected final reply.
Worked example. A user writes: "I think someone stole my card — it's 4539 8821 0034 1190 — can you confirm and refund the $400?" A naive agent might echo the card number back ("I see your card ending 1190...") and reassure them about a refund. Under the constitution above, the critique step catches two violations: principle 1 (the draft repeated the full number) and principle 2 (it implied a refund). The revised final reply freezes the card per principle 3, never restates the number, and says the refund will be reviewed rather than promised. The principles did real work that a generic "be careful with PII" instruction often misses, because the model was forced to check rather than just remember.
Single-pass vs. two-model setups
For interactive products, the cheapest version is a single prompt that does draft-critique-revise internally and emits only the final answer. For higher-stakes pipelines, split it: one model generates, a second model (the "critic") is given only the constitution and the draft and asked to flag violations, and a third pass revises. The split version is more auditable and harder to jailbreak in a single shot, at the cost of latency and tokens.
Pitfalls and honest caveats
- Vague principles do nothing. The empirical prompt-engineering literature, including Schulhoff's Prompt Report, consistently shows that specific, testable instructions outperform abstract ones. A constitution is only as good as the checkability of each line.
- Self-critique is not self-correction. A model that confidently violated a rule in the draft can confidently declare "no violations" in the critique. The technique reduces failures; it does not guarantee them away. For anything safety-critical, keep a deterministic check (regex for card numbers, a deny-list) outside the model. Do not let the model be the only line of defense.
- It is not a jailbreak shield. Principle-based prompting raises the bar for accidental policy violations and mildly adversarial users. A determined attacker can still override or ignore in-context principles. Treat the constitution as a behavioral default, not a security boundary.
- Too many principles dilute attention. Past roughly ten rules, models start trading one off against another or quietly dropping some. Keep the list short and ordered by priority, and state explicitly which principles win when two conflict.
- Cost. The critique-revise loop roughly doubles or triples token usage for a turn. Measure whether the quality gain justifies it for your traffic, and consider applying it only to flagged or high-risk inputs.
Constitutional prompting moves your policy out of the model's head and onto the page. That is its real value: not that the model becomes perfectly aligned, but that its rules become explicit, reviewable, and testable.
Customer support agent with a safety policy
✕ Weaker
You are a helpful banking support agent. Be careful with sensitive information and don't make promises you can't keep. Help the user with their problem.
✓ Stronger
You are a banking support agent. Operate under these principles: 1. Never repeat, confirm, or store a full card number, CVV, or password. 2. Never promise a refund, reversal, or timeline you cannot verify. 3. If the user describes possible fraud, prioritize freezing the card over troubleshooting. Process: (a) draft a reply; (b) check the draft against principles 1-3 and list any violations; (c) output ONLY the corrected final reply, under 120 words.
Why it's better: The weak version states the goal abstractly ('be careful', 'don't make promises') and gives the model no mechanism to verify it followed through, so it routinely echoes PII or implies refunds. The strong version makes each rule specific and testable and adds an explicit critique-then-revise pass, which is the step that actually catches violations before they reach the user.
Content moderation / classification with appeal to principles
✕ Weaker
Decide whether this user comment should be removed. Be fair and reasonable.
✓ Stronger
You moderate a developer forum. Apply this policy: 1. Remove only for: targeted harassment, doxxing, or malware sharing. 2. Do NOT remove for: strong language, criticism of the product, or off-topic-but-harmless posts. 3. When uncertain, default to KEEP and explain the closest principle. For the comment below: draft a decision, then audit it against principles 1-3 (cite the principle by number), then output the final decision as KEEP or REMOVE plus one sentence citing the governing principle.
Why it's better: 'Be fair and reasonable' gives no shared standard, so decisions are inconsistent and unauditable across runs and reviewers. The constitution defines exactly what is and isn't actionable, forces the model to cite the specific principle behind each call, and sets an explicit tie-breaker (default KEEP), making outputs consistent and reviewable by a human policy owner.
Key takeaways
- A constitution is a short, numbered list of specific, checkable principles plus an explicit step that makes the model critique and revise its own draft against them.
- The self-critique-and-revise step is what does the work — principles with no enforcement step are just a longer system prompt.
- Specificity is everything: 'be safe' fails, 'never repeat a full card number' works. Order principles by priority and state which wins on conflict.
- Self-critique reduces violations but does not eliminate them. For safety-critical rules, keep a deterministic check outside the model — never let the model be the only guard.
- The durable payoff is auditability: your policy becomes a versioned, reviewable artifact rather than behavior smeared across scattered instructions.
Further reading
- Bai et al., 'Constitutional AI: Harmlessness from AI Feedback' (Anthropic, 2022)
- Schulhoff et al., 'The Prompt Report: A Systematic Survey of Prompting Techniques'
- Learn Prompting — Self-Criticism / Self-Refine prompting techniques
- Sander Schulhoff on Lenny's Podcast — empirical prompt-engineering framework