# Schema best practices
Schemas drive every extraction. A few small habits make the difference between a schema that almost works and one that reliably returns clean data.
## Let an LLM draft it for you
Writing a schema from scratch is slow. Paste a sample document into Claude or ChatGPT and ask it to produce a JSON schema for the fields you care about. You'll get a working first draft in seconds, and can refine from there.
```text
You are helping me build a JSON schema for an AI document extraction API.

Here is a sample document:
[paste or attach the document text]

Produce a JSON schema (as used by Vindonissa) that extracts:
- invoice number, date, vendor, total amount, currency
- a line_items array with description, quantity, unit_price

For every field, include a clear "description" that tells the extractor
what to look for (synonyms, units, formats). Prefer strict types
(string, number, array) and add enum or format where the value is constrained.
```

> **Tip:** Attach two or three varied samples to the prompt. The LLM will generalise across layouts and catch optional fields you would have missed.
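Before adopting the draft, it helps to lint it for fields the model left untyped or thinly described. A minimal sketch in Python — the flat field-map shape mirrors the examples on this page, and the `lint_schema` helper is ours, not part of any API:

```python
def lint_schema(schema: dict) -> list[str]:
    """Return warnings for fields that are likely to extract poorly."""
    warnings = []
    for name, spec in schema.items():
        if "type" not in spec:
            warnings.append(f"{name}: missing 'type'")
        desc = spec.get("description", "")
        # A description that merely repeats the field name adds nothing.
        if len(desc.split()) < 3 or desc.lower() == name.lower():
            warnings.append(f"{name}: description too thin ('{desc}')")
    return warnings

draft = {
    "total": {"type": "number", "description": "total"},
    "issue_date": {
        "type": "string",
        "format": "date",
        "description": "Date the invoice was issued, in YYYY-MM-DD format.",
    },
}
for warning in lint_schema(draft):
    print(warning)
```

Run it on each LLM draft before registering the schema; anything it flags is worth rewriting by hand.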
## Write meaningful field descriptions
The description on each field is read by the extraction model and directly shapes results. Vague descriptions produce vague extractions. Include units, formats, and the terms a document might actually use.
Weak — nothing tells the model where to look or what format to return:

```json
{
  "total": { "type": "number", "description": "total" },
  "date": { "type": "string", "description": "date" }
}
```

Strong — names the units, the format, and the synonyms found on real documents:
```json
{
  "total_amount": {
    "type": "number",
    "description": "Total amount due in CHF, including VAT. Look for 'Total', 'Grand Total', 'Montant dû'."
  },
  "issue_date": {
    "type": "string",
    "format": "date",
    "description": "Date the invoice was issued, in YYYY-MM-DD format. Not the due date."
  }
}
```

## Iterate with the schema request override
`POST /process_file` accepts an inline `schema` field that overrides the stored schema for that single request. Use it to test a tweak against a real document without re-registering the document type.
```shell
curl -X POST https://api.helvetii.ai/process_file \
  -H "Authorization: Bearer vnd_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file_base64": "'$(base64 -i invoice.pdf)'",
    "project": "my_project",
    "document_type": "invoice",
    "schema": {
      "invoice_number": {
        "type": "string",
        "description": "Unique invoice identifier printed near the top of the page"
      },
      "total_amount": {
        "type": "number",
        "description": "Total amount due in CHF, including VAT"
      }
    }
  }'
```

Iterate freely until the output is right, then promote the schema to the stored document type so every future call uses it by default.
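The same override loop can be driven from a script. A sketch using only the Python standard library, assuming the request shape shown in the curl example above (the `build_payload` helper is ours):

```python
import base64
import json
import urllib.request

def build_payload(pdf_bytes: bytes, schema: dict) -> dict:
    """Assemble a /process_file body with an inline schema override."""
    return {
        "file_base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "project": "my_project",
        "document_type": "invoice",
        "schema": schema,
    }

schema = {
    "invoice_number": {
        "type": "string",
        "description": "Unique invoice identifier printed near the top of the page",
    },
}
pdf_bytes = b"%PDF-1.4 ..."  # in practice: open("invoice.pdf", "rb").read()
payload = build_payload(pdf_bytes, schema)
request = urllib.request.Request(
    "https://api.helvetii.ai/process_file",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer vnd_your_api_key",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would send it; edit `schema` and re-run.
```

Keeping the schema as a plain dict in the script makes each tweak a one-line change between runs.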
## Start minimal, expand gradually
Fewer, well-described fields extract more reliably than dozens of optional ones. Begin with the fields that matter most for your downstream system, verify them, then add the rest one at a time.
```json
{
  "invoice_number": {
    "type": "string",
    "description": "Unique invoice identifier printed near the top of the page"
  },
  "total_amount": {
    "type": "number",
    "description": "Total amount due, including VAT"
  },
  "currency": {
    "type": "string",
    "enum": ["CHF", "EUR", "USD"],
    "description": "ISO 4217 currency code"
  }
}
```

Prefer explicit types (string, number, array). Use `enum` when the value is drawn from a known set (currencies, statuses) and `format` for dates, emails, and URLs. Reserve optional fields for values that are genuinely absent from some documents.
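Strict types and enums also make extracted output easy to sanity-check downstream. A minimal local validator for the flat field map used on this page — the `check_output` helper is our own sketch, not part of the API:

```python
# Map schema type names to Python types for a quick local check.
TYPE_MAP = {"string": str, "number": (int, float), "array": list}

def check_output(schema: dict, output: dict) -> list[str]:
    """Return problems found when comparing extracted output to the schema."""
    problems = []
    for name, spec in schema.items():
        if name not in output:
            problems.append(f"{name}: missing")
            continue
        value = output[name]
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} not in {spec['enum']}")
    return problems

schema = {
    "total_amount": {"type": "number", "description": "Total amount due, including VAT"},
    "currency": {"type": "string", "enum": ["CHF", "EUR", "USD"]},
}
print(check_output(schema, {"total_amount": "1200", "currency": "GBP"}))
```

A check like this catches a number returned as a string, or a currency outside the enum, before the value reaches your downstream system.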
> **Note:** Once a schema is stable and you see repeated patterns, register it on the document type so every request uses it by default. Overrides stay useful for experiments and one-off variations.