Structured Output Formatting

Why format is a first-class concern

The moment an LLM stops talking to a human and starts feeding another program, its output stops being prose and becomes an interface. A summarizer can ramble and still be useful; an extraction step that returns {"amount": 1499} on Monday and The amount is $14.99. on Tuesday will take your pipeline down. Structured output formatting is the set of techniques for making the model emit something your parser can consume every time — not most of the time.

This is one of the highest-leverage techniques in the core set precisely because it is unglamorous. It does not make the model smarter. It makes the model legible to code, which is what you actually need when you are chaining steps, populating a database, or routing on a field. Schulhoff's framing in The Prompt Report is worth internalizing: most production failures are not reasoning failures, they are formatting and parsing failures.

The three levers

You have three escalating tools, and you should reach for the weakest one that works.

1. Delimiters and tags

The cheapest move is to wrap the answer in a fence the model is unlikely to produce by accident. XML-style tags work especially well with Claude-family models, which were trained heavily on tagged structure:

Extract the action items. Put each one inside its own
<item>...</item> tag. Put nothing outside the tags.

Transcript:
"""
{transcript}
"""

You then parse by regex on <item>(.*?)</item>. Tags survive partial output, are robust to the model adding a stray sentence, and are trivial to extract. Triple-quote or triple-backtick delimiters around the input also matter: they stop the model from confusing instructions with data, a small but real defense against prompt injection.

2. A described schema (JSON / markdown table)

When you need multiple typed fields, describe the schema explicitly and show one filled example. Do not just say "return JSON":

Return ONLY a JSON object, no prose, matching this schema:
{
  "sentiment": "positive" | "neutral" | "negative",
  "score": number between 0 and 1,
  "themes": string[]   // at most 3
}
Example output:
{"sentiment":"negative","score":0.18,"themes":["billing","wait time"]}

The example does more work than the prose description — it pins down quoting, key order, and the absence of a markdown fence. Markdown tables are the right format when a human will also read the output; JSON is the right format when only code will.

3. Tool / function schemas (constrained decoding)

The strongest lever is to stop asking nicely and instead bind the output to a machine-readable schema through the model's tool/function-calling API. Look at the tool definitions shipped in this library — for example Manus Agent Tools & Prompt/tools.json, where each tool is declared as a JSON Schema object with properties, types, enum constraints, and a required array:

{ "type": "function", "function": {
    "name": "message_ask_user",
    "parameters": { "type": "object",
      "properties": {
        "text": { "type": "string" },
        "suggest_user_takeover": {
          "type": "string", "enum": ["none", "browser"] }
      },
      "required": ["text"] } } }

When you pass a schema like this, providers can enforce it during decoding so the returned arguments are valid JSON conforming to the schema — the enum genuinely constrains the field to one of its allowed values. This is categorically more reliable than asking for JSON in a prompt, because validity is a property of the decoder, not a behavior you are hoping for. The Augment Code/claude-4-sonnet-tools.json and gpt-5-tools.json files in this repo are real, production examples of the same pattern across vendors. If your output feeds code and the schema is fixed, use this path.

A worked example

Say you are extracting line items from a pasted invoice. A naive prompt — "extract the line items as JSON" — will, across a few hundred calls, return: a markdown-fenced block (```json ... ```), a leading "Here is the JSON:", a trailing explanation, a number written as "1,499.00" with a thousands separator, and occasionally a perfectly valid object. Your json.loads throws on the first three and your downstream math breaks on the fourth.

The fix is layered: (a) define a tool record_line_items whose schema types unit_price as a number and marks description, quantity, unit_price as required; (b) in the instruction, state "prices are plain numbers without currency symbols or separators"; (c) on the parse side, still wrap json.loads in a try/except and reprompt on failure. The schema kills the structural errors; the instruction kills the semantic ones; the try/except catches the long tail.

Pitfalls

Markdown fences around JSON. The single most common parse break. Either strip a leading/trailing fence defensively, or use tool calling, which sidesteps it.
Format demands degrade reasoning. The empirical prompt-engineering literature reports mixed and task-dependent results here: forcing rigid output (especially strict JSON) can sometimes lower answer quality versus letting the model reason in prose first. The mitigation is to let it think, then format — have it reason inside a <scratchpad> and emit the structured object only at the end, or do reasoning and formatting in two separate calls.
Trusting "ONLY" without validating. "Return only JSON" reduces stray prose; it does not eliminate it. Always validate against the schema and reprompt or repair on failure. Treat the model as an unreliable narrator of its own format.
Over-nesting. Deeply nested schemas have more places to go wrong. Flatten where you can; a list of flat objects parses more reliably than a tree.
Long enums and free-text where an enum belongs. If a field has a fixed set of valid values, encode it as an enum in the schema rather than describing it in prose — constrained decoding can then guarantee it.

Rule of thumb: tags and delimiters for cheap, human-adjacent output; a described JSON schema with one worked example for typed multi-field output; tool/function schemas when validity must be guaranteed and the structure is fixed. Always validate, always have a repair path.

Extracting typed fields from support tickets

Why it's better: The bad prompt leaves field names, allowed values, and fencing entirely to the model — you will get inconsistent keys, free-text priorities like 'high', and a markdown-fenced block that breaks json.loads. The good prompt pins key names and order, constrains priority/product_area to enums, shows a filled example that demonstrates the no-fence format, and delimits the input so ticket text can't be read as an instruction. In production you'd promote this to a tool schema so the enums are decoder-enforced.

Reasoning plus structure without sacrificing quality

Why it's better: Cramming reasoning into a JSON string field forces the model to escape newlines and quotes mid-thought, which both corrupts the JSON and degrades the reasoning — the mixed evidence in the literature on format-induced quality loss shows up exactly here. Separating a free-form <scratchpad> from a minimal final JSON object lets the model reason naturally, then emit a tiny, easy-to-parse result. The parser only ever has to read the part after </scratchpad>.