Schema Design for Accurate Extraction

The schema you pass to /extract/json and /generate/json is the most important factor in extraction quality. It’s not just a shape definition; it’s a set of instructions for the AI doing the extraction.

This guide covers the patterns that produce reliable results and the common mistakes that produce noise.


Descriptions are instructions, not documentation

Every property description you add tells the extraction AI what to look for and how to interpret it. Without descriptions, the AI has only the property name to go on, and names are ambiguous.

// Ambiguous: the AI has to guess what 'price' means
json_schema: {
  type: 'object',
  properties: {
    price: { type: 'number' }
  }
}

// Specific: the AI knows exactly what to extract
json_schema: {
  type: 'object',
  properties: {
    price: {
      type: ['number', 'null'],
      description: 'Monthly price in USD as a number. Null if pricing requires contacting sales.'
    }
  }
}

The second version extracts correctly whether the price is presented as "$49/mo" or is missing entirely, because the description covers both cases. If a page may show a price range, say in the description which value to return.
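
Since a missing description leaves the AI with only the property name, it can help to check a schema for undescribed properties before sending it. The following is a minimal sketch, not part of the API: `findUndescribed` is a hypothetical helper, and the `plan_name` field is invented for illustration.

```javascript
// Hypothetical helper: list the paths of properties that lack a description.
function findUndescribed(schema, path = '') {
  const missing = [];
  for (const [name, prop] of Object.entries(schema.properties ?? {})) {
    const here = path ? `${path}.${name}` : name;
    if (typeof prop.description !== 'string' || prop.description.trim() === '') {
      missing.push(here);
    }
    // Recurse into nested objects and into array item schemas.
    if (prop.type === 'object') {
      missing.push(...findUndescribed(prop, here));
    }
    if (prop.type === 'array' && prop.items?.type === 'object') {
      missing.push(...findUndescribed(prop.items, `${here}[]`));
    }
  }
  return missing;
}

const schema = {
  type: 'object',
  properties: {
    price: { type: 'number' },
    // plan_name is an invented field, used only to show a described property.
    plan_name: { type: 'string', description: 'Plan name exactly as shown on the page' }
  }
};

console.log(findUndescribed(schema)); // [ 'price' ]
```

Running a check like this in a test suite would flag the ambiguous `price` property from the first example above.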


Pages don’t always have every field. Tell the AI what to return when data is absent.

properties: {
  annual_discount: {
    type: ['number', 'null'],
    description: 'Percentage discount for annual billing (e.g. 20 for 20% off). Null if no annual option is offered.'
  },
  trial_days: {
    type: ['number', 'null'],
    description: 'Length of free trial in days. Null if no trial is available.'
  }
}

This produces consistent null values instead of missing fields, empty strings, or invented values.
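
On the consuming side, one way to confirm the nulls arrive as expected is to compare a result against the declared types. This is a sketch under assumptions: `checkNullable` is a hypothetical helper, it covers only primitive types (number, string, boolean) plus null, and the sample results are made up.

```javascript
// Hypothetical check: every declared field must be present, and each value
// must be one of the declared primitive types or an allowed explicit null.
function checkNullable(result, properties) {
  const problems = [];
  for (const [name, prop] of Object.entries(properties)) {
    const types = Array.isArray(prop.type) ? prop.type : [prop.type];
    const value = result[name];
    if (value === undefined) {
      problems.push(`${name}: missing`);
    } else if (value === null) {
      if (!types.includes('null')) problems.push(`${name}: null not allowed`);
    } else if (!types.includes(typeof value)) {
      problems.push(`${name}: unexpected ${typeof value}`);
    }
  }
  return problems;
}

const properties = {
  annual_discount: { type: ['number', 'null'] },
  trial_days: { type: ['number', 'null'] }
};

// A well-formed result: explicit null where no trial exists.
console.log(checkNullable({ annual_discount: 20, trial_days: null }, properties));

// A malformed result: an empty string and a missing field.
console.log(checkNullable({ annual_discount: '' }, properties));
```

The first call reports no problems; the second reports the empty string and the missing `trial_days` field.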


Match your schema depth to the page structure

If the page has a two-level hierarchy (categories containing products), your schema should reflect that:

json_schema: {
  type: 'object',
  properties: {
    categories: {
      type: 'array',
      description: 'Top-level product categories on the page',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string', description: 'Category heading' },
          products: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                name: { type: 'string', description: 'Product name' },
                price: { type: 'number', description: 'Price in USD' }
              }
            }
          }
        }
      }
    }
  }
}

A flat schema trying to capture a nested page will produce unreliable results: items get merged and hierarchy is lost.
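
If flat rows are what you need downstream, a nested result can be flattened after extraction without losing the hierarchy. The sketch below assumes a result shaped like the schema above; the sample values and the `toRows` helper are illustrative, not output from the API.

```javascript
// Sample result shaped like the nested schema (values are made up).
const result = {
  categories: [
    {
      name: 'Laptops',
      products: [
        { name: 'Model A', price: 999 },
        { name: 'Model B', price: 1299 }
      ]
    },
    {
      name: 'Tablets',
      products: [{ name: 'Model C', price: 499 }]
    }
  ]
};

// Produce one row per product, carrying the parent category along.
function toRows(data) {
  return data.categories.flatMap((category) =>
    category.products.map((product) => ({
      category: category.name,
      product: product.name,
      price: product.price
    }))
  );
}

console.log(toRows(result));
```

Extracting nested and flattening in code keeps each product attached to its category, which a flat schema cannot guarantee.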


For fields with a known set of values, enum dramatically improves consistency:

properties: {
  billing_period: {
    type: 'string',
    enum: ['monthly', 'annual', 'one-time', 'unknown'],
    description: 'How often the plan is billed. Use unknown if unclear.'
  },
  tier: {
    type: 'string',
    enum: ['free', 'starter', 'pro', 'enterprise'],
    description: 'Plan tier. Use enterprise for anything requiring sales contact.'
  }
}

Without enum, you’ll get "monthly", "Monthly", "per month", and "billed monthly". These all mean the same thing, but each variant needs separate handling downstream.
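
A downstream guard can confirm that returned values stay inside the declared set. This is a minimal sketch: `findEnumViolations` is a hypothetical helper and the sample result is invented to show a value that falls outside the enum.

```javascript
// Hypothetical guard: report fields whose value is not in the declared enum.
function findEnumViolations(result, properties) {
  const violations = [];
  for (const [name, prop] of Object.entries(properties)) {
    if (Array.isArray(prop.enum) && !prop.enum.includes(result[name])) {
      violations.push({ field: name, value: result[name] });
    }
  }
  return violations;
}

const properties = {
  billing_period: { type: 'string', enum: ['monthly', 'annual', 'one-time', 'unknown'] },
  tier: { type: 'string', enum: ['free', 'starter', 'pro', 'enterprise'] }
};

// 'Monthly' differs from 'monthly' by case, so it is reported.
console.log(findEnumViolations({ billing_period: 'Monthly', tier: 'pro' }, properties));
```

With the enum in place, a guard like this should rarely fire; when it does, it points at the exact field to tighten.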


Broad array item schemas produce noisy results. The AI will include anything that loosely matches.

// Too broad: captures navigation links, footer links, ads
links: {
  type: 'array',
  items: { type: 'string' }
}

// Tight: captures only what you want
documentation_links: {
  type: 'array',
  description: 'Links to API documentation pages only, not navigation, marketing, or footer links',
  items: {
    type: 'object',
    properties: {
      title: { type: 'string', description: 'Link text' },
      url: { type: 'string', description: 'Absolute URL' }
    }
  }
}
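
Even with a tight item schema, it can be worth verifying the items after extraction. The sketch below keeps only entries with a non-empty title and an absolute http(s) URL. Both helpers are hypothetical and the sample links are made up; the only library call is the standard `URL` constructor, which throws on relative URLs when no base is given.

```javascript
// Returns true only for absolute http(s) URLs.
function isAbsoluteHttpUrl(value) {
  try {
    const parsed = new URL(value); // throws on relative or malformed input
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false;
  }
}

// Keep items that have a non-empty title and an absolute http(s) URL.
function keepValidLinks(links) {
  return links.filter(
    (link) =>
      typeof link.title === 'string' &&
      link.title.trim() !== '' &&
      isAbsoluteHttpUrl(link.url)
  );
}

const documentationLinks = [
  { title: 'Authentication', url: 'https://example.com/docs/auth' },
  { title: 'Relative link', url: '/docs/errors' },
  { title: '', url: 'https://example.com/docs/empty-title' }
];

console.log(keepValidLinks(documentationLinks));
```

Only the first sample entry passes both checks.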

Add context about the page in a top-level description

You can add a description at the top level of your schema to give the AI page-level context:

json_schema: {
  type: 'object',
  description: 'Pricing information from a SaaS product pricing page. Focus on subscription plans only; ignore one-time add-ons and enterprise custom pricing blocks.',
  properties: {
    plans: { /* ... */ }
  }
}

This is especially useful when a page has multiple sections and you want to scope extraction to a specific one.


If results are incomplete or inaccurate:

  1. Add more specific descriptions. This is the most common fix.
  2. Set effort to max. The content may be rendered by JavaScript and absent from the initial HTML.
  3. Simplify the schema. Remove fields you don’t need; fewer fields means less for the AI to get wrong.
  4. Add nocache: true. This confirms the issue isn’t a stale cached result.
  5. Check for JS rendering. Open the URL and disable JavaScript in your browser to see what the extractor sees at min/standard.
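
Steps 2 and 4 can be combined into a single diagnostic request. The sketch below only builds a request body and makes no network call. `json_schema`, `effort`, and `nocache` are named in this guide; the `url` field name, the string form of the `effort` value, and the `buildDiagnosticBody` helper are assumptions, so check them against the API reference before use.

```javascript
// Hypothetical builder for a troubleshooting request body.
// Assumption: the page address is sent in a field named `url`.
function buildDiagnosticBody(url, jsonSchema) {
  return {
    url,
    json_schema: jsonSchema,
    effort: 'max', // step 2: capture content missing from the initial HTML
    nocache: true // step 4: rule out a stale cached result
  };
}

const body = buildDiagnosticBody('https://example.com/pricing', {
  type: 'object',
  properties: {
    price: {
      type: ['number', 'null'],
      description: 'Monthly price in USD as a number. Null if pricing requires contacting sales.'
    }
  }
});

console.log(JSON.stringify(body, null, 2));
```

If the diagnostic request returns the expected data, the original problem was likely effort level or caching rather than the schema itself.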