Constitutional / Principle-Based Prompting

Why it matters

Most prompts tell the model what task to do. Constitutional prompting tells it what rules it must never break while doing the task — and then forces it to check its own work against those rules before it commits to an answer. The technique borrows directly from Anthropic's Constitutional AI research, where a model is trained to critique and revise its outputs against a written set of principles ("a constitution") instead of relying solely on human feedback for every case. The same loop works at inference time, with no training: you supply the principles in the prompt and ask for a self-critique-then-revise pass.

This matters for production systems for one practical reason: policy becomes a readable artifact. Instead of behavior being smeared across a dozen ad-hoc instructions ("don't be rude", "don't give legal advice", "always cite sources"), you collect the governing rules into one named block. That block can be reviewed by a compliance or legal stakeholder, version-controlled, diffed, and reused across every prompt in your stack. It turns "the model usually behaves" into "here are the eight rules the model is told to enforce on itself, and here is the step where it does so."

How to do it

There are two layers, and you can use either or both.

State the constitution. Write a short, numbered list of principles — typically 4 to 10. Each principle should be specific enough to be checkable. "Be helpful" is useless; "Do not state a medical dosage; instead direct the user to a licensed professional" is checkable.
Add an explicit self-critique-and-revise step. Tell the model to produce a draft, then evaluate that draft against each principle, then produce a revised final answer. This is the part that actually does the work. A constitution with no enforcement step is just a longer system prompt.

A minimal structure looks like this:

You are a support agent for a consumer banking app.

Operate under these principles (your constitution):
1. Never request, repeat, or store a full card number, CVV, or password.
2. Never promise refunds, fee reversals, or timelines you cannot verify.
3. If the user describes possible fraud, prioritize directing them to
   freeze the card over troubleshooting.
4. Do not give tax or investment advice; refer to a licensed advisor.
5. Match the user's language and keep responses under 120 words.

Process:
- Draft a reply.
- Critique the draft against principles 1-5. List any violations.
- Output ONLY the corrected final reply.

Worked example. A user writes: "I think someone stole my card — it's 4539 8821 0034 1190 — can you confirm and refund the $400?" A naive agent might echo the card number back ("I see your card ending 1190...") and reassure them about a refund. Under the constitution above, the critique step catches two violations: principle 1 (the draft repeated the full number) and principle 2 (it implied a refund). The revised final reply freezes the card per principle 3, never restates the number, and says the refund will be reviewed rather than promised. The principles did real work that a generic "be careful with PII" instruction often misses, because the model was forced to check rather than just remember.

Single-pass vs. two-model setups

For interactive products, the cheapest version is a single prompt that does draft-critique-revise internally and emits only the final answer. For higher-stakes pipelines, split it: one model generates, a second model (the "critic") is given only the constitution and the draft and asked to flag violations, and a third pass revises. The split version is more auditable and harder to jailbreak in a single shot, at the cost of latency and tokens.

Pitfalls and honest caveats

Vague principles do nothing. The empirical prompt-engineering literature, including Schulhoff's Prompt Report, consistently shows that specific, testable instructions outperform abstract ones. A constitution is only as good as the checkability of each line.
Self-critique is not self-correction. A model that confidently violated a rule in the draft can confidently declare "no violations" in the critique. The technique reduces failures; it does not guarantee them away. For anything safety-critical, keep a deterministic check (regex for card numbers, a deny-list) outside the model. Do not let the model be the only line of defense.
It is not a jailbreak shield. Principle-based prompting raises the bar for accidental policy violations and mildly adversarial users. A determined attacker can still override or ignore in-context principles. Treat the constitution as a behavioral default, not a security boundary.
Too many principles dilute attention. Past roughly ten rules, models start trading one off against another or quietly dropping some. Keep the list short and ordered by priority, and state explicitly which principles win when two conflict.
Cost. The critique-revise loop roughly doubles or triples token usage for a turn. Measure whether the quality gain justifies it for your traffic, and consider applying it only to flagged or high-risk inputs.

Constitutional prompting moves your policy out of the model's head and onto the page. That is its real value: not that the model becomes perfectly aligned, but that its rules become explicit, reviewable, and testable.

Customer support agent with a safety policy

Why it's better: The weak version states the goal abstractly ('be careful', 'don't make promises') and gives the model no mechanism to verify it followed through, so it routinely echoes PII or implies refunds. The strong version makes each rule specific and testable and adds an explicit critique-then-revise pass, which is the step that actually catches violations before they reach the user.

Content moderation / classification with appeal to principles

Why it's better: 'Be fair and reasonable' gives no shared standard, so decisions are inconsistent and unauditable across runs and reviewers. The constitution defines exactly what is and isn't actionable, forces the model to cite the specific principle behind each call, and sets an explicit tie-breaker (default KEEP), making outputs consistent and reviewable by a human policy owner.

Constitutional / Principle-Based Prompting

Why it matters

How to do it

Single-pass vs. two-model setups

Pitfalls and honest caveats

Customer support agent with a safety policy

✕ Weaker

✓ Stronger

Content moderation / classification with appeal to principles

✕ Weaker

✓ Stronger

Key takeaways

Further reading