Developer's Guide

How to Extract JSON Schemas

Introduction

Extracting structured data from the web is a common engineering task, but it comes with a significant bottleneck: defining the data's structure. Manually writing and maintaining JSON schemas to match complex web pages is tedious, error-prone, and scales poorly.

The TABS API /v1/extract/json/schema endpoint solves this problem. Instead of requiring you to define a schema, it analyzes a web page and automatically generates a high-quality, standards-compliant JSON Schema for you.

This generated schema can then be fed directly into the /v1/extract/json endpoint, allowing you to create a robust, automated data extraction pipeline in minutes, not days.

Key Capabilities:

  • Automatic schema generation from any public URL
  • AI-powered structure analysis to identify lists, objects, and data types
  • Custom instructions to guide and refine the generated schema
  • Built-in caching for improved performance on repeated requests
  • Standards-compliant JSON Schema output

Prerequisites & Authentication

Before you begin, you'll need a TABS API key. You can get yours by signing up at https://tabstack.ai.

The API uses Bearer token authentication. We strongly recommend storing your key as an environment variable rather than hardcoding it in your application.

First, set the variable in your terminal session.

export TABS_API_KEY="your-api-key-here"
  • export TABS_API_KEY=...: This Bash command sets an environment variable named TABS_API_KEY for your current session. Your application code (e.g., in Python or Node.js) can then access this variable, keeping your secret key out of your source code.
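In Node.js you might wrap this lookup in a small helper so a missing key fails fast with a clear message. This is a sketch of our own (`getApiKey` is a hypothetical name, not part of any SDK):

```javascript
// Hypothetical helper: read the TABS API key from the environment.
// Failing fast here gives a clearer error than a 401 deep inside a request.
function getApiKey(env = process.env) {
  const key = env.TABS_API_KEY;
  if (!key) {
    throw new Error('TABS_API_KEY environment variable is not set.');
  }
  return key;
}
```

Calling `getApiKey()` at startup surfaces configuration problems before any request is made.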

Understanding JSON Schema

While the API generates the schema for you, it's helpful to understand what it's creating.

What is JSON Schema?

JSON Schema is a specification for defining the structure of JSON data. It gives you a declarative way to describe and validate the shape of your documents. You can specify:

  • Data types (string, number, object, array)
  • Which fields are required
  • Format constraints (e.g., "email", "uri")
  • Nested objects and arrays

Example Schema

Here is a simple JSON Schema that defines an "article" object.

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Article title"
    },
    "author": {
      "type": "string",
      "description": "Author name"
    },
    "published": {
      "type": "string",
      "description": "Publication date"
    }
  },
  "required": ["title", "author"],
  "additionalProperties": false
}

Let's walk through this schema.

  • "type": "object": The root of our data must be a JSON object.
  • "properties": { ... }: This block defines the keys (fields) allowed within the object.
  • "title": { "type": "string", ... }: Defines a field named title. It must be a string and has an optional description.
  • "required": ["title", "author"]: Specifies that the title and author fields must be present for the JSON to be valid. The published field is optional.
  • "additionalProperties": false: Prohibits any fields that are not explicitly defined in the properties block. This helps enforce a strict data structure.
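To make these rules concrete, here is a minimal sketch of what checking data against the article schema above involves. Real applications should use a validator library such as ajv instead of hand-rolling this; the sketch only covers required fields, declared types, and the additionalProperties rule:

```javascript
// The article schema from the example above.
const articleSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    author: { type: 'string' },
    published: { type: 'string' },
  },
  required: ['title', 'author'],
  additionalProperties: false,
};

// Minimal illustrative check: NOT a full JSON Schema validator.
function checkArticle(data, schema = articleSchema) {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) return false;
  // Every required field must be present.
  for (const field of schema.required) {
    if (!(field in data)) return false;
  }
  // Every present field must be declared and have the declared type.
  for (const [key, value] of Object.entries(data)) {
    const spec = schema.properties[key];
    if (!spec) return false;            // additionalProperties: false
    if (typeof value !== spec.type) return false;
  }
  return true;
}

checkArticle({ title: 'Hello', author: 'Ada' });       // true
checkArticle({ title: 'Hello' });                      // false: author is required
checkArticle({ title: 'Hi', author: 'Ada', x: 1 });    // false: extra field
```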
Learn More

For a complete guide, we recommend visiting the official documentation at json-schema.org.

Basic Usage

Let's generate our first schema. The process is simple: send a POST request with the URL you want to analyze.

Endpoint Details

  • URL: https://api.tabstack.ai/v1/extract/json/schema
  • Method: POST
  • Authentication: Bearer <your-api-key> (required)
  • Content-Type: application/json

Minimal Request Example

This example demonstrates the most basic request: generating a schema for the Hacker News homepage.

curl -X POST https://api.tabstack.ai/v1/extract/json/schema \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com"
  }'

This command executes curl with the POST HTTP method to the endpoint URL. The -H "Authorization: Bearer $TABS_API_KEY" flag sets the required Authorization header, reading the API key from the $TABS_API_KEY environment variable you set in the prerequisites. The -H "Content-Type: application/json" header tells the server that the data we are sending (-d) is in JSON format. Finally, the -d '{...}' flag provides the data payload (body) of the request, which is a JSON object containing the single required parameter, "url".

How to Run:

  • Ensure you have set the TABS_API_KEY environment variable.

  • Paste this command directly into your terminal and press Enter.
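The same request can be composed from Node.js. The sketch below only builds the `fetch` arguments (endpoint, headers, body) so the request shape is easy to inspect; passing them to `fetch(endpoint, options)` would perform the actual call. `buildSchemaRequest` is our own helper name:

```javascript
// Build the arguments for the schema request without sending it.
// Endpoint URL, headers, and body shape match the curl example above.
function buildSchemaRequest(url, apiKey) {
  return {
    endpoint: 'https://api.tabstack.ai/v1/extract/json/schema',
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ url }),
    },
  };
}

// Usage (requires Node.js 18+ for built-in fetch):
// const { endpoint, options } = buildSchemaRequest('https://news.ycombinator.com', process.env.TABS_API_KEY);
// const schema = await (await fetch(endpoint, options)).json();
```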

Basic Response Structure

A successful request will return a JSON Schema object. For the Hacker News example, the API will identify the list of stories and generate a schema similar to this:

{
  "type": "object",
  "properties": {
    "stories": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "description": "Story title"
          },
          "points": {
            "type": "number",
            "description": "Story points/score"
          },
          "author": {
            "type": "string",
            "description": "Story author username"
          },
          "comments": {
            "type": "number",
            "description": "Number of comments"
          }
        },
        "required": ["title", "points", "author", "comments"],
        "additionalProperties": false
      }
    }
  },
  "required": ["stories"],
  "additionalProperties": false
}

Response Explanation:

The API analyzed the page structure and generated a complete schema. It identified a stories array where each story has four fields with appropriate types. The required array indicates which fields are always present. This schema is now ready to use with /v1/extract/json to extract actual data from similar pages.

Request Parameters

You can control the schema generation with these body parameters:

url (required)

  • Type: string (URI format)
  • Description: The full URL of the web page you want to analyze.
  • Validation:
    • Must be a valid, absolute URL (e.g., https://example.com).
    • Must be publicly accessible.
    • Cannot be localhost or a private/internal IP address.

Example:

{
  "url": "https://news.ycombinator.com"
}
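Most of these validation failures can be caught client-side before spending an API call. Here is a rough pre-check; the private-address list is illustrative, not the API's exact rule set, and the server still performs its own validation:

```javascript
// Rough client-side pre-check mirroring the documented URL rules.
function isLikelyValidTarget(raw) {
  let url;
  try {
    url = new URL(raw);                // rejects relative or malformed URLs
  } catch {
    return false;
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  const host = url.hostname;
  if (host === 'localhost' || host === '127.0.0.1' || host === '::1') return false;
  // Common private IPv4 ranges (illustrative, not exhaustive).
  if (/^(10\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) return false;
  return true;
}

isLikelyValidTarget('https://news.ycombinator.com'); // true
isLikelyValidTarget('http://localhost:3000');        // false
isLikelyValidTarget('not-a-real-url');               // false
```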

instructions (optional)

  • Type: string (max 1000 characters)
  • Description: Plain-text instructions to guide the AI. Use this to focus the schema on specific data, name fields, or clarify ambiguous structures.

Example:

{
  "url": "https://news.ycombinator.com",
  "instructions": "extract only the top stories, for each story include the title, points, author, and comment count"
}

When to use instructions:

  • To focus on a specific part of a page (e.g., "extract the product details, ignore reviews").
  • To exclude unwanted data (e.g., "do not include advertisements").
  • To provide explicit field names (e.g., "name the list of articles 'posts'").
  • To clarify relationships (e.g., "the author and date are part of each post").

nocache (optional)

  • Type: boolean
  • Default: false
  • Description: Bypasses the API's internal cache and forces a fresh analysis of the URL.

Example:

{
  "url": "https://example.com/data",
  "nocache": true
}

When to use nocache:

  • When a web page's structure has just changed and you need to regenerate the schema.
  • When you are iterating on new instructions and want to ensure your new prompt is being used for a fresh analysis.
  • When debugging schema generation for dynamic content.

Note: Using nocache: true will result in slower response times, as the content must be fetched and analyzed from scratch.
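Conceptually, the cache/nocache behavior works like memoization keyed on the URL. The sketch below mirrors that pattern client-side; the fetching function is injected as a parameter (an assumption of this sketch, so the caching logic stays testable without network access):

```javascript
// Client-side mirror of the cache/nocache behavior.
// `fetchSchema` is any async function (url) => schema.
function makeCachedFetcher(fetchSchema) {
  const cache = new Map();
  return async function get(url, { nocache = false } = {}) {
    if (!nocache && cache.has(url)) {
      return cache.get(url);                 // cache hit: no fresh analysis
    }
    const schema = await fetchSchema(url);   // cache miss or nocache: fresh call
    cache.set(url, schema);
    return schema;
  };
}
```

With `nocache: true` the wrapper always calls through and overwrites the cached entry, just as the API re-analyzes the page and refreshes its cache.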

Working with Responses

Generating a schema is the first step. Here's how to build a complete workflow.

1. Saving and Using Generated Schemas

This is the most common pattern: generate a schema, then immediately use it with the /v1/extract/json endpoint to get the data.

// Requires Node.js 18+ (built-in fetch)
const fs = require('fs').promises;

async function fetchAndUseSchema() {
  const apiKey = process.env.TABS_API_KEY;
  const targetUrl = 'https://news.ycombinator.com';

  // Step 1: Generate the schema
  console.log('Generating schema...');
  const schemaResponse = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: targetUrl,
      instructions: 'extract the top stories with title, points, and author'
    })
  });
  const schema = await schemaResponse.json();

  // Step 2: Save the schema to a file (optional but recommended)
  await fs.writeFile('news-schema.json', JSON.stringify(schema, null, 2));
  console.log('Schema saved to news-schema.json');

  // Step 3: Use the schema to extract data
  console.log('Extracting data using the new schema...');
  const extractResponse = await fetch('https://api.tabstack.ai/v1/extract/json', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: targetUrl,
      json_schema: schema // Pass the generated schema object here
    })
  });

  const data = await extractResponse.json();
  console.log('Extracted Data:', JSON.stringify(data, null, 2));
  return data;
}

fetchAndUseSchema();

In Step 1, the first fetch call hits the /v1/extract/json/schema endpoint with a URL and instructions, storing the result in the schema variable. Step 2 uses fs.writeFile to save the schema object to a local file, news-schema.json, which is useful for caching, debugging, or reusing the schema later. In Step 3, a second fetch call is made, this time to the /v1/extract/json endpoint, passing the same url and, critically, the generated schema object in the json_schema field of the request body. The API uses this schema to extract and return the structured data.

How to Run:

  1. Save this code as runWorkflow.js.
  2. Run it from your terminal: node runWorkflow.js.
  3. Check your directory for news-schema.json and see the "Extracted Data" output in your console.

2. Validating Schema Structure

The API always returns a valid schema, but if you are modifying it or loading it from a file, you may want to validate it. You can use standard JSON Schema validators for this.

// Requires: npm install ajv
const Ajv = require('ajv');
const ajv = new Ajv();

// Synchronous: ajv.compile() either succeeds or throws.
function validateSchema(schema) {
  try {
    ajv.compile(schema);
    console.log('Schema is valid JSON Schema');
    return true;
  } catch (error) {
    console.error('Invalid schema:', error.message);
    return false;
  }
}

// Example usage with a generated schema
// (Assumes you have a function `generateSchema` from a previous example)
/*
const mySchema = await generateSchema('https://example.com/data');
validateSchema(mySchema);
*/

This function uses the ajv library to validate whether a schema is properly structured. The compile() method checks for syntax errors or invalid JSON Schema keywords. Catching compilation errors prevents your app from crashing when working with generated schemas.

How to Run:

  1. Install ajv: npm install ajv
  2. Integrate this validateSchema function into your Node.js application. You can pass any schema object to it (e.g., one loaded from a file or received from the API).

3. Iterating on Schema Generation

You won't always get the perfect schema on the first try. The key to refinement is using instructions and nocache: true.

async function iterateOnSchema() {
  const apiKey = process.env.TABS_API_KEY;
  const targetUrl = 'https://example.com/products';

  // First attempt - generate a basic schema
  console.log('Generating basic schema...');
  let response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: targetUrl })
  });
  const schema1 = await response.json();
  console.log('Basic schema:', JSON.stringify(schema1, null, 2));

  // Refined attempt - add specific instructions and bypass cache
  console.log('Generating refined schema...');
  const instructions = 'extract only product name, price, and availability. exclude reviews.';

  response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: targetUrl,
      instructions: instructions,
      nocache: true // Force fresh analysis
    })
  });

  const schema2 = await response.json();
  console.log('Refined schema:', JSON.stringify(schema2, null, 2));
}

iterateOnSchema();

In the first attempt, a basic request is made with only the url. The API returns a general-purpose schema, which might include data you don't want (like reviews). For the refined attempt, a second request is made for the same URL. This time, we provide specific instructions to focus the AI. The nocache: true parameter is critical here—it tells the API to ignore its cached response from the first request and perform a new analysis using our instructions.

How to Run:

  1. Save as iterate.js and run node iterate.js.
  2. Observe the console output to see how the "Basic schema" and "Refined schema" differ.

Error Handling

Building a robust integration requires handling potential errors. The API uses standard HTTP status codes to indicate success or failure.

Common Error Status Codes

Status Code | Error Message                  | Description
400         | url is required                | The request body was missing the url parameter.
400         | invalid json schema format     | An internal error occurred where the generated schema was invalid.
401         | Unauthorized - Invalid token   | Your API key is missing, invalid, or expired.
422         | url is invalid                 | The provided URL is malformed, private, or could not be processed.
500         | failed to fetch url            | The target server for the url could not be reached or returned an error.
500         | web page is too large          | The target page's content exceeds the processing limit.
500         | failed to generate JSON schema | An unexpected server error occurred during schema generation.

All error responses return a JSON object with an error key:

{
  "error": "failed to generate JSON schema"
}

Error Handling Examples

Here is a more robust function that includes comprehensive error handling.

async function generateSchemaRobust(url, instructions = null) {
  const apiKey = process.env.TABS_API_KEY;

  if (!apiKey) {
    throw new Error('TABS_API_KEY environment variable is not set.');
  }

  try {
    const payload = { url };
    if (instructions) {
      payload.instructions = instructions;
    }

    const response = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(payload)
    });

    const data = await response.json();

    if (!response.ok) {
      // Use the error message from the API response
      const errorMsg = data.error || `Request failed with status ${response.status}`;

      switch (response.status) {
        case 400:
          throw new Error(`Bad Request: ${errorMsg}`);
        case 401:
          throw new Error('Authentication Failed. Check your TABS_API_KEY.');
        case 422:
          throw new Error(`Invalid URL: ${errorMsg}`);
        case 500:
          throw new Error(`Server Error: ${errorMsg}`);
        default:
          throw new Error(errorMsg);
      }
    }

    return data; // This is the generated schema

  } catch (error) {
    // Catches network errors or errors thrown from the block above
    console.error('Error generating schema:', error.message);
    throw error;
  }
}

// --- Example Usage ---
(async () => {
  try {
    // Test with a valid URL
    const schema = await generateSchemaRobust('https://news.ycombinator.com');
    console.log('Generated schema:', JSON.stringify(schema, null, 2));

    // Test with an invalid URL
    // await generateSchemaRobust('not-a-real-url');

  } catch (error) {
    console.error('--- Operation Failed ---');
    // Error is already logged by the function, but we catch it here
    // to prevent the script from crashing.
  }
})();

A top-level try...catch block catches network errors (like DNS failure) or errors we throw. The line const data = await response.json(); is important because we always parse the JSON, as even error responses contain a JSON body with an error message. The if (!response.ok) statement is the primary check for HTTP errors (4xx, 5xx). Within this check, we use switch (response.status) to examine specific status codes (400, 401, etc.) to provide more specific, helpful error messages to the user. When we encounter an error, throw new Error(...) creates and throws a new Error object, which is then caught by the outer catch block.

How to Run:

  1. Save this as handleErrors.js.
  2. Run node handleErrors.js.
  3. To test the error paths, uncomment the line with 'not-a-real-url' to see the 422 error, or change your API key to see a 401.

Best Practices

Follow these principles to get the most out of the schema endpoint.

1. Start Simple, Then Refine

Always generate a schema with no instructions first. This "base schema" shows you everything the AI can see on the page. From there, you can add instructions to remove fields, rename keys, or focus on specific sections. This iterative process is far more effective than trying to write perfect instructions from scratch.

2. Be Specific in Your Instructions

Vague instructions lead to vague schemas. Be as specific as possible.

  • Vague: "get the articles"
  • Better: "extract articles as an array, each with title, summary, and url"
  • Best: "Extract blog posts as an array called 'posts'. Each post must have: title (string, required), author (object, optional), publishedDate (string, optional). Exclude all comments, sidebars, and footer links."

3. Test Generated Schemas Immediately

The schema you generate is only useful if it correctly extracts the data you want. As shown in the "Saving and Using Generated Schemas" section, you should immediately pipe your new schema into the /v1/extract/json endpoint to verify the results. If the data is missing or malformed, refine your instructions and try again.

4. Use nocache When Iterating

The API's cache is aggressive. If you are refining your instructions to get a better schema, you must include "nocache": true in your request. Otherwise, the API will just return the old, cached schema generated from your previous (or non-existent) instructions.

5. Document and Store Your Schemas

Don't generate a new schema every single time you want to extract data. The best practice is to:

  1. Use the schema endpoint to generate a high-quality schema.
  2. Test and refine it until it's perfect.
  3. Save this final schema as a JSON file (e.g., product-page-schema.json) in your application's codebase or in a database.
  4. In your production code, read the schema from that file and send it to the /v1/extract/json endpoint.

This makes your application faster (one fewer API call per extraction), more stable (not dependent on schema generation at runtime), and allows you to version-control your schemas.