Developer's Guide

How to Extract JSON Schemas

Introduction

Extracting structured data from the web is a common engineering task, but it comes with a significant bottleneck: defining the data's structure. Manually writing and maintaining JSON schemas to match complex web pages is tedious, error-prone, and scales poorly.

The TABS API /v1/extract/json/schema endpoint solves this problem. Instead of requiring you to define a schema, it analyzes a web page and automatically generates a high-quality, standards-compliant JSON Schema for you.

This generated schema can then be fed directly into the /v1/extract/json endpoint, allowing you to create a robust, automated data extraction pipeline in minutes, not days.

Key Capabilities:

  • Automatic schema generation from any public URL
  • AI-powered structure analysis to identify lists, objects, and data types
  • Custom instructions to guide and refine the generated schema
  • Built-in caching for improved performance on repeated requests
  • Standards-compliant JSON Schema output

Prerequisites & Authentication

Before you begin, you'll need a TABS API key. You can get yours by signing up at https://tabstack.ai.

The API uses Bearer token authentication. We strongly recommend storing your key as an environment variable rather than hardcoding it in your application.

First, set the variable in your terminal session.

export TABS_API_KEY="your-api-key-here"
  • export TABS_API_KEY=...: This Bash command sets an environment variable named TABS_API_KEY for your current session. Your application code (e.g., in Python or Node.js) can then access this variable, keeping your secret key out of your source code.
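In Node.js you might wrap this lookup in a small helper so a missing key fails fast with a clear message. This is a sketch of our own (`getApiKey` is a hypothetical name, not part of any SDK):

```javascript
// Hypothetical helper: read the TABS API key from the environment.
// Failing fast here gives a clearer error than a 401 deep inside a request.
function getApiKey(env = process.env) {
  const key = env.TABS_API_KEY;
  if (!key) {
    throw new Error('TABS_API_KEY environment variable is not set.');
  }
  return key;
}
```

Calling `getApiKey()` at startup surfaces configuration problems before any request is made.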

Understanding JSON Schema

While the API generates the schema for you, it's helpful to understand what it's creating.

What is JSON Schema?

JSON Schema is a specification for defining the structure of JSON data. It gives you a declarative way to describe and validate the shape of your documents. You can specify:

  • Data types (string, number, object, array)
  • Which fields are required
  • Format constraints (e.g., "email", "uri")
  • Nested objects and arrays

Example Schema

Here is a simple JSON Schema that defines an "article" object.

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Article title"
    },
    "author": {
      "type": "string",
      "description": "Author name"
    },
    "published": {
      "type": "string",
      "description": "Publication date"
    }
  },
  "required": ["title", "author"],
  "additionalProperties": false
}

Let's walk through this schema.

  • "type": "object": The root of our data must be a JSON object.
  • "properties": { ... }: This block defines the keys (fields) allowed within the object.
  • "title": { "type": "string", ... }: Defines a field named title. It must be a string and has an optional description.
  • "required": ["title", "author"]: Specifies that the title and author fields must be present for the JSON to be valid. The published field is optional.
  • "additionalProperties": false: Prohibits any fields that are not explicitly defined in the properties block. This helps enforce a strict data structure.
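To make these rules concrete, here is a minimal sketch of what checking data against the article schema above involves. Real applications should use a validator library such as ajv instead of hand-rolling this; the sketch only covers required fields, declared types, and the additionalProperties rule:

```javascript
// The article schema from the example above.
const articleSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    author: { type: 'string' },
    published: { type: 'string' },
  },
  required: ['title', 'author'],
  additionalProperties: false,
};

// Minimal illustrative check: NOT a full JSON Schema validator.
function checkArticle(data, schema = articleSchema) {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) return false;
  // Every required field must be present.
  for (const field of schema.required) {
    if (!(field in data)) return false;
  }
  // Every present field must be declared and have the declared type.
  for (const [key, value] of Object.entries(data)) {
    const spec = schema.properties[key];
    if (!spec) return false;            // additionalProperties: false
    if (typeof value !== spec.type) return false;
  }
  return true;
}

checkArticle({ title: 'Hello', author: 'Ada' });       // true
checkArticle({ title: 'Hello' });                      // false: author is required
checkArticle({ title: 'Hi', author: 'Ada', x: 1 });    // false: extra field
```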
Learn More

For a complete guide, we recommend visiting the official documentation at json-schema.org.

Basic Usage

Let's generate our first schema. The process is simple: send a POST request with the URL you want to analyze.

Endpoint Details

  • URL: https://api.tabstack.ai/v1/extract/json/schema
  • Method: POST
  • Authentication: Bearer <your-api-key> (required)
  • Content-Type: application/json

Minimal Request Example

This example demonstrates the most basic request: generating a schema for the Hacker News homepage.

curl -X POST https://api.tabstack.ai/v1/extract/json/schema \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com"
  }'

This command executes curl with the POST HTTP method to the endpoint URL. The -H "Authorization: Bearer $TABS_API_KEY" flag sets the required Authorization header, reading the API key from the $TABS_API_KEY environment variable you set in the prerequisites. The -H "Content-Type: application/json" header tells the server that the data we are sending (-d) is in JSON format. Finally, the -d '{...}' flag provides the data payload (body) of the request, which is a JSON object containing the single required parameter, "url".

How to Run:

  • Ensure you have set the TABS_API_KEY environment variable.

  • Paste this command directly into your terminal and press Enter.
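The same request can be composed from Node.js. The sketch below only builds the `fetch` arguments (endpoint, headers, body) so the request shape is easy to inspect; passing them to `fetch(endpoint, options)` would perform the actual call. `buildSchemaRequest` is our own helper name:

```javascript
// Build the arguments for the schema request without sending it.
// Endpoint URL, headers, and body shape match the curl example above.
function buildSchemaRequest(url, apiKey) {
  return {
    endpoint: 'https://api.tabstack.ai/v1/extract/json/schema',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ url }),
    },
  };
}

// Usage (requires Node.js 18+ for built-in fetch):
// const { endpoint, options } = buildSchemaRequest('https://news.ycombinator.com', process.env.TABS_API_KEY);
// const schema = await (await fetch(endpoint, options)).json();
```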

Basic Response Structure

A successful request will return a JSON Schema object. For the Hacker News example, the API will identify the list of stories and generate a schema similar to this:

{
  "type": "object",
  "properties": {
    "stories": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "description": "Story title"
          },
          "points": {
            "type": "number",
            "description": "Story points/score"
          },
          "author": {
            "type": "string",
            "description": "Story author username"
          },
          "comments": {
            "type": "number",
            "description": "Number of comments"
          }
        },
        "required": ["title", "points", "author", "comments"],
        "additionalProperties": false
      }
    }
  },
  "required": ["stories"],
  "additionalProperties": false
}

Response Explanation:

The API analyzed the page structure and generated a complete schema. It identified a stories array where each story has four fields with appropriate types. The required array indicates which fields are always present. This schema is now ready to use with /v1/extract/json to extract actual data from similar pages.

Request Parameters

You can control the schema generation with these body parameters:

url (required)

  • Type: string (URI format)
  • Description: The full URL of the web page you want to analyze.
  • Validation:
    • Must be a valid, absolute URL (e.g., https://example.com).
    • Must be publicly accessible.
    • Cannot be localhost or a private/internal IP address.

Example:

{
  "url": "https://news.ycombinator.com"
}
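Most of these validation failures can be caught client-side before spending an API call. Here is a rough pre-check; the private-address list is illustrative, not the API's exact rule set, and the server still performs its own validation:

```javascript
// Rough client-side pre-check mirroring the documented URL rules.
function isLikelyValidTarget(raw) {
  let url;
  try {
    url = new URL(raw);                // rejects relative or malformed URLs
  } catch {
    return false;
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  const host = url.hostname;
  if (host === 'localhost' || host === '127.0.0.1' || host === '::1') return false;
  // Common private IPv4 ranges (illustrative, not exhaustive).
  if (/^(10\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) return false;
  return true;
}

isLikelyValidTarget('https://news.ycombinator.com'); // true
isLikelyValidTarget('http://localhost:3000');        // false
isLikelyValidTarget('not-a-real-url');               // false
```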

instructions (optional)

  • Type: string (max 1000 characters)
  • Description: Plain-text instructions to guide the AI. Use this to focus the schema on specific data, name fields, or clarify ambiguous structures.

Example:

{
  "url": "https://news.ycombinator.com",
  "instructions": "extract only the top stories, for each story include the title, points, author, and comment count"
}

When to use instructions:

  • To focus on a specific part of a page (e.g., "extract the product details, ignore reviews").
  • To exclude unwanted data (e.g., "do not include advertisements").
  • To provide explicit field names (e.g., "name the list of articles 'posts'").
  • To clarify relationships (e.g., "the author and date are part of each post").

nocache (optional)

  • Type: boolean
  • Default: false
  • Description: Bypasses the API's internal cache and forces a fresh analysis of the URL.

Example:

{
  "url": "https://example.com/data",
  "nocache": true
}

When to use nocache:

  • When a web page's structure has just changed and you need to regenerate the schema.
  • When you are iterating on new instructions and want to ensure your new prompt is being used for a fresh analysis.
  • When debugging schema generation for dynamic content.

Note: Using nocache: true will result in slower response times, as the content must be fetched and analyzed from scratch.
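Conceptually, the cache/nocache behavior works like memoization keyed on the URL. The sketch below mirrors that pattern client-side; the fetching function is injected as a parameter (an assumption of this sketch, so the caching logic stays testable without network access):

```javascript
// Client-side mirror of the cache/nocache behavior.
// `fetchSchema` is any async function (url) => schema.
function makeCachedFetcher(fetchSchema) {
  const cache = new Map();
  return async function get(url, { nocache = false } = {}) {
    if (!nocache && cache.has(url)) {
      return cache.get(url);                 // cache hit: no fresh analysis
    }
    const schema = await fetchSchema(url);   // cache miss or nocache: fresh call
    cache.set(url, schema);
    return schema;
  };
}
```

With `nocache: true` the wrapper always calls through and overwrites the cached entry, just as the API re-analyzes the page and refreshes its cache.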

Working with Responses

Generating a schema is the first step. Here's how to build a complete workflow.

1. Saving and Using Generated Schemas

This is the most common pattern: generate a schema, then immediately use it with the /v1/extract/json endpoint to get the data.

// Requires Node.js 18+ (built-in fetch)
const fs = require('fs').promises;

async function fetchAndUseSchema() {
  const apiKey = process.env.TABS_API_KEY;
  const targetUrl = 'https://news.ycombinator.com';

  // Step 1: Generate the schema
  console.log('Generating schema...');
  const schemaResponse = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: targetUrl,
      instructions: 'extract the top stories with title, points, and author'
    })
  });
  const schema = await schemaResponse.json();

  // Step 2: Save the schema to a file (optional but recommended)
  await fs.writeFile('news-schema.json', JSON.stringify(schema, null, 2));
  console.log('Schema saved to news-schema.json');

  // Step 3: Use the schema to extract data
  console.log('Extracting data using the new schema...');
  const extractResponse = await fetch('https://api.tabstack.ai/v1/extract/json', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: targetUrl,
      json_schema: schema // Pass the generated schema object here
    })
  });

  const data = await extractResponse.json();
  console.log('Extracted Data:', JSON.stringify(data, null, 2));
  return data;
}

fetchAndUseSchema();

In Step 1, the first fetch call hits the /v1/extract/json/schema endpoint with a URL and instructions, storing the result in the schema variable. Step 2 uses fs.writeFile to save the schema object to a local file, news-schema.json, which is useful for caching, debugging, or reusing the schema later. In Step 3, a second fetch call is made, this time to the /v1/extract/json endpoint, passing the same url and, critically, the generated schema object in the json_schema field of the request body. The API uses this schema to extract and return the structured data.

How to Run:

  1. Save this code as runWorkflow.js.
  2. Run it from your terminal: node runWorkflow.js.
  3. Check your directory for news-schema.json and see the "Extracted Data" output in your console.

2. Validating Schema Structure

The API always returns a valid schema, but if you are modifying it or loading it from a file, you may want to validate it. You can use standard JSON Schema validators for this.

// Requires: npm install ajv
const Ajv = require('ajv');
const ajv = new Ajv();

// Synchronous: ajv.compile() either succeeds or throws.
function validateSchema(schema) {
  try {
    ajv.compile(schema);
    console.log('Schema is valid JSON Schema');
    return true;
  } catch (error) {
    console.error('Invalid schema:', error.message);
    return false;
  }
}

// Example usage with a generated schema
// (Assumes you have a function `generateSchema` from a previous example)
/*
const mySchema = await generateSchema('https://example.com/data');
validateSchema(mySchema);
*/

This function uses the ajv library to validate whether a schema is properly structured. The compile() method checks for syntax errors or invalid JSON Schema keywords. Catching compilation errors prevents your app from crashing when working with generated schemas.

How to Run:

  1. Install ajv: npm install ajv
  2. Integrate this validateSchema function into your Node.js application. You can pass any schema object to it (e.g., one loaded from a file or received from the API).

3. Iterating on Schema Generation

You won't always get the perfect schema on the first try. The key to refinement is using instructions and nocache: true.

async function iterateOnSchema() {
  const apiKey = process.env.TABS_API_KEY;
  const targetUrl = 'https://example.com/products';

  // First attempt - generate a basic schema
  console.log('Generating basic schema...');
  let response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: targetUrl })
  });
  const schema1 = await response.json();
  console.log('Basic schema:', JSON.stringify(schema1, null, 2));

  // Refined attempt - add specific instructions and bypass cache
  console.log('Generating refined schema...');
  const instructions = 'extract only product name, price, and availability. exclude reviews.';

  response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: targetUrl,
      instructions: instructions,
      nocache: true // Force fresh analysis
    })
  });

  const schema2 = await response.json();
  console.log('Refined schema:', JSON.stringify(schema2, null, 2));
}

iterateOnSchema();

In the first attempt, a basic request is made with only the url. The API returns a general-purpose schema, which might include data you don't want (like reviews). For the refined attempt, a second request is made for the same URL. This time, we provide specific instructions to focus the AI. The nocache: true parameter is critical here—it tells the API to ignore its cached response from the first request and perform a new analysis using our instructions.

How to Run:

  1. Save as iterate.js and run node iterate.js.
  2. Observe the console output to see how the "Basic schema" and "Refined schema" differ.

Error Handling

Building a robust integration requires handling potential errors. The API uses standard HTTP status codes to indicate success or failure.

Common Error Status Codes

Status Code | Error Message                  | Description
400         | url is required                | The request body was missing the url parameter.
400         | invalid json schema format     | An internal error occurred where the generated schema was invalid.
401         | Unauthorized - Invalid token   | Your API key is missing, invalid, or expired.
422         | url is invalid                 | The provided URL is malformed, private, or could not be processed.
500         | failed to fetch url            | The target server for the url could not be reached or returned an error.
500         | web page is too large          | The target page's content exceeds the processing limit.
500         | failed to generate JSON schema | An unexpected server error occurred during schema generation.

All error responses return a JSON object with an error key:

{
  "error": "failed to generate JSON schema"
}

Error Handling Examples

Here is a more robust function that includes comprehensive error handling.

async function generateSchemaRobust(url, instructions = null) {
  const apiKey = process.env.TABS_API_KEY;

  if (!apiKey) {
    throw new Error('TABS_API_KEY environment variable is not set.');
  }

  try {
    const payload = { url };
    if (instructions) {
      payload.instructions = instructions;
    }

    const response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(payload)
    });

    const data = await response.json();

    if (!response.ok) {
      // Use the error message from the API response
      const errorMsg = data.error || `Request failed with status ${response.status}`;

      switch (response.status) {
        case 400:
          throw new Error(`Bad Request: ${errorMsg}`);
        case 401:
          throw new Error('Authentication Failed. Check your TABS_API_KEY.');
        case 422:
          throw new Error(`Invalid URL: ${errorMsg}`);
        case 500:
          throw new Error(`Server Error: ${errorMsg}`);
        default:
          throw new Error(errorMsg);
      }
    }

    return data; // This is the generated schema

  } catch (error) {
    // Catches network errors or errors thrown from the block above
    console.error('Error generating schema:', error.message);
    throw error;
  }
}

// --- Example Usage ---
(async () => {
  try {
    // Test with a valid URL
    const schema = await generateSchemaRobust('https://news.ycombinator.com');
    console.log('Generated schema:', JSON.stringify(schema, null, 2));

    // Test with an invalid URL
    // await generateSchemaRobust('not-a-real-url');

  } catch (error) {
    console.error('--- Operation Failed ---');
    // Error is already logged by the function, but we catch it here
    // to prevent the script from crashing.
  }
})();

A top-level try...catch block catches network errors (like DNS failure) or errors we throw. The line const data = await response.json(); is important because we always parse the JSON, as even error responses contain a JSON body with an error message. The if (!response.ok) statement is the primary check for HTTP errors (4xx, 5xx). Within this check, we use switch (response.status) to examine specific status codes (400, 401, etc.) to provide more specific, helpful error messages to the user. When we encounter an error, throw new Error(...) creates and throws a new Error object, which is then caught by the outer catch block.

How to Run:

  1. Save this as handleErrors.js.
  2. Run node handleErrors.js.
  3. To test the error paths, uncomment the line with 'not-a-real-url' to see the 422 error, or change your API key to see a 401.

Best Practices

Follow these principles to get the most out of the schema endpoint.

1. Start Simple, Then Refine

Always generate a schema with no instructions first. This "base schema" shows you everything the AI can see on the page. From there, you can add instructions to remove fields, rename keys, or focus on specific sections. This iterative process is far more effective than trying to write perfect instructions from scratch.

2. Be Specific in Your Instructions

Vague instructions lead to vague schemas. Be as specific as possible.

  • Vague: "get the articles"
  • Better: "extract articles as an array, each with title, summary, and url"
  • Best: "Extract blog posts as an array called 'posts'. Each post must have: title (string, required), author (object, optional), publishedDate (string, optional). Exclude all comments, sidebars, and footer links."

3. Test Generated Schemas Immediately

The schema you generate is only useful if it correctly extracts the data you want. As shown in the "Saving and Using Generated Schemas" section, you should immediately pipe your new schema into the /v1/extract/json endpoint to verify the results. If the data is missing or malformed, refine your instructions and try again.

4. Use nocache When Iterating

The API's cache is aggressive. If you are refining your instructions to get a better schema, you must include "nocache": true in your request. Otherwise, the API will just return the old, cached schema generated from your previous (or non-existent) instructions.

5. Document and Store Your Schemas

Don't generate a new schema every single time you want to extract data. The best practice is to:

  1. Use the schema endpoint to generate a high-quality schema.
  2. Test and refine it until it's perfect.
  3. Save this final schema as a JSON file (e.g., product-page-schema.json) in your application's codebase or in a database.
  4. In your production code, read the schema from that file and send it to the /v1/extract/json endpoint.

This makes your application faster (one fewer API call per extraction), more stable (not dependent on schema generation at runtime), and allows you to version-control your schemas.