# Schema best practices
Schemas drive every extraction. A few small habits make the difference between a schema that almost works and one that reliably returns clean data.
## Let an LLM draft it for you
Writing a schema from scratch is slow. Paste a sample document into Claude or ChatGPT and ask it to produce a JSON schema for the fields you care about. You'll get a working first draft in seconds, and can refine from there.
```text
You are helping me build a JSON schema for an AI document extraction API.

Here is a sample document:
[paste or attach the document text]

Produce a JSON schema (as used by Vindonissa) that extracts:
- invoice number, date, vendor, total amount, currency
- a line_items array with description, quantity, unit_price

For every field, include a clear "description" that tells the extractor
what to look for (synonyms, units, formats). Prefer strict types
(string, number, array) and add enum or format where the value is constrained.
```

> **Tip:** Attach two or three varied samples to the prompt. The LLM will generalise across layouts and catch optional fields you would have missed.
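Before adopting the draft, it helps to lint it for fields the model left untyped or thinly described. A minimal sketch in Python — the flat field-map shape mirrors the examples on this page, and the `lint_schema` helper is ours, not part of any API:

```python
def lint_schema(schema: dict) -> list[str]:
    """Return warnings for fields that are likely to extract poorly."""
    warnings = []
    for name, spec in schema.items():
        if "type" not in spec:
            warnings.append(f"{name}: missing 'type'")
        desc = spec.get("description", "")
        # A description that merely repeats the field name adds nothing.
        if len(desc.split()) < 3 or desc.lower() == name.lower():
            warnings.append(f"{name}: description too thin ('{desc}')")
    return warnings

draft = {
    "total": {"type": "number", "description": "total"},
    "issue_date": {
        "type": "string",
        "format": "date",
        "description": "Date the invoice was issued, in YYYY-MM-DD format.",
    },
}
for warning in lint_schema(draft):
    print(warning)
```

Run it on each LLM draft before registering the schema; anything it flags is worth rewriting by hand.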
## Write meaningful field descriptions
The description on each field is read by the extraction model and directly shapes results. Vague descriptions produce vague extractions. Include units, formats, and the terms a document might actually use.
Weak — nothing tells the model where to look or what format to return:

```json
{
  "total": { "type": "number", "description": "total" },
  "date": { "type": "string", "description": "date" }
}
```

Strong — names the units, the format, and the synonyms found on real documents:
```json
{
  "total_amount": {
    "type": "number",
    "description": "Total amount due in CHF, including VAT. Look for 'Total', 'Grand Total', 'Montant dû'."
  },
  "issue_date": {
    "type": "string",
    "format": "date",
    "description": "Date the invoice was issued, in YYYY-MM-DD format. Not the due date."
  }
}
```

## Iterate with the schema request override
`POST /process_file` accepts an inline `schema` field that overrides the stored schema for that single request. Use it to test a tweak against a real document without re-registering the document type.
```shell
curl -X POST https://api.helvetii.ai/process_file \
  -H "Authorization: Bearer vnd_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "file_base64": "'$(base64 -i invoice.pdf)'",
    "project": "my_project",
    "document_type": "invoice",
    "schema": {
      "invoice_number": {
        "type": "string",
        "description": "Unique invoice identifier printed near the top of the page"
      },
      "total_amount": {
        "type": "number",
        "description": "Total amount due in CHF, including VAT"
      }
    }
  }'
```

Iterate freely until the output is right, then promote the schema to the stored document type so every future call uses it by default.
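The same override loop can be driven from a script. A sketch using only the Python standard library, assuming the request shape shown in the curl example above (the `build_payload` helper is ours):

```python
import base64
import json
import urllib.request

def build_payload(pdf_bytes: bytes, schema: dict) -> dict:
    """Assemble a /process_file body with an inline schema override."""
    return {
        "file_base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "project": "my_project",
        "document_type": "invoice",
        "schema": schema,
    }

schema = {
    "invoice_number": {
        "type": "string",
        "description": "Unique invoice identifier printed near the top of the page",
    },
}
pdf_bytes = b"%PDF-1.4 ..."  # in practice: open("invoice.pdf", "rb").read()
payload = build_payload(pdf_bytes, schema)
request = urllib.request.Request(
    "https://api.helvetii.ai/process_file",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer vnd_your_api_key",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would send it; edit `schema` and re-run.
```

Keeping the schema as a plain dict in the script makes each tweak a one-line change between runs.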
## Start minimal, expand gradually
Fewer, well-described fields extract more reliably than dozens of optional ones. Begin with the fields that matter most for your downstream system, verify them, then add the rest one at a time.
```json
{
  "invoice_number": {
    "type": "string",
    "description": "Unique invoice identifier printed near the top of the page"
  },
  "total_amount": {
    "type": "number",
    "description": "Total amount due, including VAT"
  },
  "currency": {
    "type": "string",
    "enum": ["CHF", "EUR", "USD"],
    "description": "ISO 4217 currency code"
  }
}
```

Prefer explicit types (string, number, array). Use `enum` when the value is drawn from a known set (currencies, statuses) and `format` for dates, emails, and URLs. Reserve optional fields for values that are genuinely absent from some documents.
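Strict types and enums also make extracted output easy to sanity-check downstream. A minimal local validator for the flat field map used on this page — the `check_output` helper is our own sketch, not part of the API:

```python
# Map schema type names to Python types for a quick local check.
TYPE_MAP = {"string": str, "number": (int, float), "array": list}

def check_output(schema: dict, output: dict) -> list[str]:
    """Return problems found when comparing extracted output to the schema."""
    problems = []
    for name, spec in schema.items():
        if name not in output:
            problems.append(f"{name}: missing")
            continue
        value = output[name]
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} not in {spec['enum']}")
    return problems

schema = {
    "total_amount": {"type": "number", "description": "Total amount due, including VAT"},
    "currency": {"type": "string", "enum": ["CHF", "EUR", "USD"]},
}
print(check_output(schema, {"total_amount": "1200", "currency": "GBP"}))
```

A check like this catches a number returned as a string, or a currency outside the enum, before the value reaches your downstream system.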
> **Note:** Once a schema is stable and you see repeated patterns, register it on the document type so every request uses it by default. Overrides stay useful for experiments and one-off variations.