Developer's Guide

How to Extract JSON Data

The Challenge: Unstructured Data

Extracting structured data from the web is often a brittle, time-consuming process. You build a custom scraper, only for it to break the moment a website changes its layout. You're left maintaining complex CSS selectors and parsing messy HTML, all for what you really need: clean, predictable JSON.

The TABS API JSON extraction endpoint solves this. It's an intelligent service that accepts a URL and a JSON schema you provide. It then fetches the page, analyzes its content, and returns clean, structured JSON that perfectly matches your schema.

This endpoint turns any web page into a structured API, making it ideal for:

Web scraping with a consistent, reliable data structure
Extracting product information from e-commerce sites
Gathering news articles and blog posts
Monitoring competitor pricing and product changes
Building data aggregation pipelines
Collecting structured data for analysis or AI model training

Core Features:

Schema-Based Extraction: You define the "what," and the API handles the "how."
Consistent Output: The returned data is validated against your schema.
Intelligent Parsing: Works even with complex, dynamic, or JavaScript-heavy pages.
Built-in Caching: Improves performance for frequently accessed pages (and can be bypassed when needed).

Prerequisites

Before you can start, you'll need a few things:

A valid TABS API key: Sign up at https://tabstack.ai to get your free key.
Authentication: The endpoint uses Bearer token authentication.
A JSON schema: This defines the data structure you want to extract. You can write one by hand or generate one automatically using our JSON Schema endpoint.

We strongly recommend storing your API key as an environment variable to avoid hard-coding it in your scripts.

First, set your API key as an environment variable in your terminal.

export TABS_API_KEY="your-api-key-here"

Explanation:
- export TABS_API_KEY=...: This command creates an environment variable named TABS_API_KEY and assigns your key to it. Our code examples (in Python, JavaScript, and curl) are configured to read this variable.
How to Run:
- Copy this command, replace "your-api-key-here" with your actual API key, and run it in the terminal session where you'll be executing your scripts.

Your First Extraction: A Step-by-Step Guide

Let's walk through a practical example: extracting the top stories from Hacker News.

Our goal is to get a list of stories, and for each story, we want its title and points.

To do this, we will send a POST request to the https://api.tabstack.ai/v1/extract/json endpoint. The body of our request will be a JSON object containing two required properties:

"url": The page we want to scrape (https://news.ycombinator.com).
"json_schema": The data structure we want back.

Here is the complete request using curl, JavaScript, and Python.

curl
JavaScript
Python

This example uses curl to make a direct request from your terminal.

curl -X POST https://api.tabstack.ai/v1/extract/json \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "json_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "points": {"type": "number"}
            }
          }
        }
      }
    }
  }'

This sends a POST request to the extraction endpoint with authentication and a JSON payload. The key part is the json_schema — it defines exactly what data structure you want back. Here, we're asking for an object with a stories array, where each story has a title (string) and points (number). The API will find matching data on the page and return it in this format.

How to Run:
- Make sure you have set your TABS_API_KEY environment variable in the same terminal session.
- Copy and paste the entire command into your terminal and press Enter.

This example uses Node.js and the native fetch API.

async function extractJson() {
  const apiKey = process.env.TABS_API_KEY;
  const targetUrl = 'https://news.ycombinator.com';

  const schema = {
    type: 'object',
    properties: {
      stories: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            points: { type: 'number' }
          }
        }
      }
    }
  };

  const response = await fetch('https://api.tabstack.ai/v1/extract/json', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: targetUrl,
      json_schema: schema
    })
  });

  const data = await response.json();
  console.log(JSON.stringify(data, null, 2));
  return data;
}

// Call the function
extractJson();

The code makes an authenticated POST request with your schema and target URL. The important part is the schema definition—it's the same structure we used in the curl example, just as a JavaScript object. The fetch call handles authentication with your API key and serializes everything to JSON. When the response comes back, you get clean, structured data matching your schema.

How to Run:
- Save this code as extract.js.
- Make sure you have set your TABS_API_KEY environment variable.
- Run the script from your terminal: node extract.js.

This example uses the popular requests library in Python.

import requests
import os
import json

def extract_json():
    api_key = os.environ.get("TABS_API_KEY")
    api_url = 'https://api.tabstack.ai/v1/extract/json'
    target_url = 'https://news.ycombinator.com'
    
    schema = {
        'type': 'object',
        'properties': {
            'stories': {
                'type': 'array',
                'items': {
                    'type': 'object',
                    'properties': {
                        'title': {'type': 'string'},
                        'points': {'type': 'number'}
                    }
                }
            }
        }
    }

    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    
    payload = {
        'url': target_url,
        'json_schema': schema
    }

    response = requests.post(api_url, headers=headers, json=payload)

    data = response.json()
    print(json.dumps(data, indent=2))
    return data

# Call the function
if __name__ == "__main__":
    extract_json()

The Python version follows the same pattern: define your schema, make an authenticated POST request, and parse the response. The requests library makes this straightforward—pass your schema dictionary to the json parameter and it handles serialization automatically. The response comes back as clean, structured data matching your schema.

How to Run:
- If you don't have the requests library, install it: pip install requests.
- Save this code as extract.py.
- Make sure you have set your TABS_API_KEY environment variable.
- Run the script from your terminal: python extract.py.

Understanding the Response

A successful request (after a few moments of processing) will return a 200 OK status. The response body will contain the clean, structured data you asked for.

Here’s the high-level purpose of this code block: It shows a sample successful response from the API, based on the schema we provided in the request examples.

{
  "stories": [
    {
      "title": "New AI Model Released",
      "points": 342
    },
    {
      "title": "Database Performance Tips",
      "points": 156
    },
    {
      "title": "Understanding Distributed Systems",
      "points": 89
    }
  ]
}

The response structure exactly matches your schema—that's the power of this endpoint. Instead of parsing HTML yourself, you get clean JSON with proper types (strings and numbers, not everything as text). This data is ready to use immediately in your application.

API Parameters Reference

The request body is a JSON object with the following properties:

`url` (required)

Type: string (URI format)
Description: The fully qualified, publicly accessible URL of the web page you want to fetch and extract data from.
Validation:
- Must be a valid URL format (e.numeric_leading_zeros_tp, https://example.com).
- Cannot access internal/private resources (e.g., localhost, 127.0.0.1, or private IPs).

`json_schema` (required)

Type: object
Description: A JSON schema definition that describes the structure of the data you want to extract. The API will use this schema to guide its extraction and parsing.
Tips for creating schemas:
- Best Practice: Use the JSON Schema endpoint to generate a schema automatically. It's much faster and more accurate than writing one manually.
- Be specific about required vs. optional fields using the required keyword.
- Include description fields in your properties to give the AI extractor hints about what data to find.

Here is a more complex schema example for a blog post:

{
  "json_schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The main title of the blog post"
      },
      "articles": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "headline": {"type": "string"},
            "author": {"type": "string"},
            "date": {"type": "string"}
          },
          "required": ["headline"]
        }
      }
    },
    "required": ["title", "articles"]
  }
}

This schema demonstrates two useful techniques. First, the description field helps the API distinguish between similar elements—here, finding the main title versus other headings. Second, the required array at different levels controls data quality: setting required: ["headline"] in items filters out incomplete entries, while the root-level required ensures critical fields are present or the extraction fails entirely.

`nocache` (optional)

Type: boolean
Default: false
Description: Bypasses the cache and forces a fresh fetch and extraction of the URL.

By default, the API caches responses for a short period to improve performance and reduce redundant fetches. Setting nocache to true is useful for:

Getting real-time data from frequently updated pages.
Debugging extraction issues.
Forcing a re-scrape after a page's structure has changed.

This payload demonstrates how to use the nocache parameter.

{
  "url": "https://example.com/products",
  "json_schema": { ... },
  "nocache": true
}

Setting nocache: true forces a fresh extraction, bypassing the cache. This is useful for real-time data but will be slower since nothing can be reused from previous requests.

Real-World Examples

Example 1: E-commerce Product Extraction

Here, we'll extract product data from a hypothetical e-commerce category page.

This is the request payload. We are defining a schema to capture a list of products with their name, price, stock status, and rating.

{
  "url": "https://shop.example.com/category/laptops",
  "json_schema": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"},
            "inStock": {"type": "boolean"},
            "rating": {"type": "number"}
          },
          "required": ["name", "price"]
        }
      }
    }
  }
}

The schema defines a products array with properties of different types—strings, numbers, and booleans. Making name and price required prevents partial entries; products without these critical fields won't be included in the response.

This is a potential response from the API.

{
  "products": [
    {
      "name": "Pro Laptop 15\"",
      "price": 1299.99,
      "currency": "USD",
      "inStock": true,
      "rating": 4.5
    },
    {
      "name": "Business Ultrabook",
      "price": 899.99,
      "currency": "USD",
      "inStock": false,
      "rating": 4.2
    }
  ]
}

Notice the API handles type conversion automatically—prices become numbers (not strings like "$899.99"), and stock status becomes a proper boolean. Optional fields like rating are included when found but omitted when missing, keeping your data clean.

Example 2: News Article Extraction

This example shows how to gather a list of articles from a news homepage.

This is the request payload. We want a list of articles, each with a title, summary, URL, and publication date.

{
  "url": "https://news.example.com",
  "json_schema": {
    "type": "object",
    "properties": {
      "articles": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
            "url": {"type": "string"},
            "publishedAt": {"type": "string"},
            "category": {"type": "string"}
          },
          "required": ["title", "url"]
        }
      }
    }
  }
}

This schema extracts article metadata including URLs. The API intelligently identifies article links on the page. Making title and url required ensures you only get complete article data.

This is a potential response from the API.

{
  "articles": [
    {
      "title": "Global Climate Summit Reaches Agreement",
      "summary": "World leaders agree on new emissions targets",
      "url": "https://news.example.com/climate-summit",
      "publishedAt": "2024-01-15T10:30:00Z",
      "category": "Environment"
    }
  ]
}

The API extracted the complete article data, including properly formatted dates in ISO 8601 format when available on the source page.

Putting It to Work: Processing and Saving Data

Getting the JSON is just the first step. Here’s how you can immediately process or save that data.

Processing Extracted Data

Once you have the JSON response, you can use standard programming-language features to filter, sort, and analyze it. This example takes the product data from our e-commerce example and finds only the in-stock products, sorted by price.

JavaScript
Python

This code shows how to filter and sort the extracted data in JavaScript.

async function processExtractedData() {
  // ... [API fetch logic from "Your First Extraction" example] ...
  // Assume 'data' is the JSON response:
  // const data = await response.json();
  
  const data = {
    "products": [
      {"name": "Pro Laptop 15\"", "price": 1299.99, "inStock": true},
      {"name": "Business Ultrabook", "price": 899.99, "inStock": false},
      {"name": "Gamer Rig", "price": 1799.99, "inStock": true}
    ]
  };

  // Filter and process the data
  const availableProducts = data.products
    .filter(p => p.inStock)
    .sort((a, b) => a.price - b.price);

  console.log('Available products (lowest price first):');
  availableProducts.forEach(product => {
    console.log(`${product.name}: $${product.price}`);
  });

  return availableProducts;
}

// Call the function
processExtractedData();

Once you have the extracted data (here we're using mock data for demonstration), processing it is straightforward. The example chains .filter() to get only in-stock products, then .sort() to order by price. This is standard JavaScript array manipulation—the API gives you clean data, you process it however you need.

How to Run:
- You can add this logic to the extract.js script you created earlier, right after const data = await response.json().

This code shows how to filter and sort the extracted data in Python.

import json

def process_extracted_data():
    # ... [API requests logic from "Your First Extraction" example] ...
    # Assume 'data' is the JSON response:
    # data = response.json()
    
    data = {
        "products": [
            {"name": "Pro Laptop 15\"", "price": 1299.99, "inStock": True},
            {"name": "Business Ultrabook", "price": 899.99, "inStock": False},
            {"name": "Gamer Rig", "price": 1799.99, "inStock": True}
        ]
    }

    # Filter and process the data
    # Use .get('inStock') to avoid errors if the key is missing
    available_products = [
        p for p in data['products'] if p.get('inStock')
    ]
    
    # Sort the filtered list
    available_products.sort(key=lambda x: x['price'])

    print('Available products (lowest price first):')
    for product in available_products:
        print(f"{product['name']}: ${product['price']}")

    return available_products

# Call the function
if __name__ == "__main__":
    process_extracted_data()

The Python version uses list comprehension to filter and sort() with a key function to order by price. Note the use of .get('inStock') instead of direct key access—this prevents errors if a product is missing that field. Once filtered and sorted, the data is ready to use.

How to Run:
- You can add this logic to the extract.py script you created earlier, right after data = response.json().

Saving Data to Files

You can also easily save your extracted data to a file, like a JSON file for later use or a CSV file for analysis in a spreadsheet.

JavaScript
Python

This script fetches data and saves it to both extracted-data.json and extracted-data.csv.

const fs = require('fs').promises; // Using the promises API of the 'fs' module

async function saveExtractedData() {
  // ... [API fetch logic] ...
  // const data = await response.json();
  
  // Using Hacker News example data
  const data = {
    "stories": [
      {"title": "New AI Model Released", "points": 342},
      {"title": "Database Performance Tips", "points": 156},
      {"title": "Understanding Distributed Systems", "points": 89}
    ]
  };

  try {
    // Save as JSON
    await fs.writeFile('extracted-data.json', JSON.stringify(data, null, 2));
    console.log('Data saved to extracted-data.json');

    // Save as CSV
    const csvHeader = 'Title,Points\n';
    const csvRows = data.stories.map(s => {
      // Ensure title is CSV-safe (e.g., contains a comma)
      const title = `"${s.title.replace(/"/g, '""')}"`;
      return `${title},${s.points}`;
    });
    
    const csv = [csvHeader, ...csvRows].join('\n');
    await fs.writeFile('extracted-data.csv', csv);
    console.log('Data saved to extracted-data.csv');

  } catch (error) {
    console.error('Error saving data:', error);
  }
}

// Call the function
saveExtractedData();

This example shows saving data in two formats. For JSON, JSON.stringify() with formatting creates a readable file. For CSV, the code maps each story to a CSV row, properly escaping quotes with "" to handle titles containing commas or quotes. The key technique here is .map() to transform objects into CSV rows, then .join('\n') to create the final file content.

How to Run:
- Save this as save.js.
- Run from your terminal: node save.js.
- Check your directory for two new files: extracted-data.json and extracted-data.csv.

This script fetches data and saves it using Python's built-in json and csv modules.

import json
import csv
import os
# import requests # Not used in this snippet, but would be in real code

def save_extracted_data():
    # ... [API requests logic] ...
    # data = response.json()
    
    # Using Hacker News example data
    data = {
        "stories": [
            {"title": "New AI Model Released", "points": 342},
            {"title": "Database Performance Tips", "points": 156},
            {"title": "Understanding \"Distributed\" Systems", "points": 89}
        ]
    }

    try:
        # Save as JSON
        with open('extracted-data.json', 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2)
        print('Data saved to extracted-data.json')

        # Save as CSV
        with open('extracted-data.csv', 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            # Write the header
            writer.writerow(['Title', 'Points'])
            
            # Write the data rows
            for story in data['stories']:
                writer.writerow([story['title'], story['points']])
        
        print('Data saved to extracted-data.csv')

    except Exception as e:
        print(f'Error saving data: {e}')

# Call the function
if __name__ == "__main__":
    save_extracted_data()

Python makes file saving straightforward with built-in modules. The json.dump() function handles JSON serialization with proper formatting. For CSV, the csv.writer object handles all the formatting details—quoting, escaping, and special characters—automatically. The with blocks ensure files are properly closed even if errors occur.

How to Run:
- Save this as save.py.
- Run from your terminal: python save.py.
- Check your directory for extracted-data.json and extracted-data.csv.

Error Handling

A robust application must handle potential failures. The API uses standard HTTP status codes to indicate the success or failure of a request.

Common Error Status Codes

Status Code	Error Message	Description
400	`url is required`	The request body is missing the required `url` parameter.
400	`json schema is required`	The request body is missing the required `json_schema` parameter.
400	`json schema must be a valid object`	The `json_schema` provided is not valid.
400	`invalid JSON request body`	The request body itself is malformed JSON.
401	`Unauthorized - Invalid token`	Your `Authorization` header is missing or your API key is invalid.
422	`url is invalid`	The provided URL is malformed or cannot be processed.
500	`failed to fetch URL`	The server encountered an error trying to access the target URL.
500	`web page is too large`	The target page's content exceeds the processing size limit.
500	`failed to generate JSON`	The server failed to extract data matching your schema. This can happen if the page structure is vastly different from what the schema implies.

Error Response Format

All error responses return a JSON object with a single error field.

{
  "error": "json schema is required"
}

Error Handling Examples

Here’s how to build robust error handling into your application.

JavaScript
Python
curl

This function wraps the API call in a try/catch block and checks the response.ok status.

async function extractWithErrorHandling(url, jsonSchema) {
  try {
    const response = await fetch('https://api.tabstack.ai/v1/extract/json', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.TABS_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        url: url,
        json_schema: jsonSchema
      })
    });

    // We must parse the body to get the error message, even for non-200 responses
    const data = await response.json();

    if (!response.ok) {
      // The request failed
      const statusCode = response.status;
      const errorMessage = data.error || `HTTP error ${statusCode}`;
      
      // Handle specific error codes
      switch (statusCode) {
        case 400:
          throw new Error(`Bad request: ${errorMessage}`);
        case 401:
          throw new Error('Authentication failed. Check your API key.');
        case 422:
          throw new Error(`Invalid URL: ${errorMessage}`);
        case 500:
          throw new Error(`Server error: ${errorMessage}`);
        default:
          throw new Error(`Request failed: ${errorMessage}`);
      }
    }

    // If response.ok is true, we have data
    return data;

  } catch (error) {
    // This catches network errors or errors thrown from our check
    console.error('Error extracting JSON:', error.message);
    throw error;
  }
}

// --- Usage Example ---
const schema = {type: 'object', properties: {title: {type: 'string'}}};

// Test success
extractWithErrorHandling('https://example.com', schema)
  .then(data => console.log('Success:', data))
  .catch(error => console.error('Failed:', error.message));

// Test failure (missing schema)
extractWithErrorHandling('https://example.com', null)
  .then(data => console.log('Success:', data))
  .catch(error => console.error('Failed:', error.message));

This error handling pattern checks response.ok to detect HTTP errors before processing data. The switch statement maps status codes to meaningful error messages. Always parse the response body first—even error responses contain useful information in their error field. The try/catch wrapper catches both network failures and the errors we throw for bad status codes.

How to Run:
- Save this as error_handling.js.
- Run node error_handling.js to see both the "Success" and "Failed" example logs.

The requests library can be configured to automatically raise exceptions for bad responses.

import requests
import os
import json

def extract_with_error_handling(url, json_schema):
    try:
        response = requests.post(
            'https://api.tabstack.ai/v1/extract/json',
            headers={
                'Authorization': f'Bearer {os.environ.get("TABS_API_KEY")}',
                'Content-Type': 'application/json'
            },
            json={
                'url': url,
                'json_schema': json_schema
            },
            timeout=30 # Set a 30-second timeout
        )

        # This method raises an HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()

        # If no error was raised, we have data
        return response.json()

    except requests.exceptions.HTTPError as http_err:
        # Handle specific HTTP errors
        status_code = http_err.response.status_code
        try:
            # Try to get the JSON error message from the API
            error_msg = http_err.response.json().get('error', 'Unknown HTTP error')
        except json.JSONDecodeError:
            error_msg = http_err.response.text

        if status_code == 400:
            print(f'Error: Bad request: {error_msg}')
        elif status_code == 401:
            print('Error: Authentication failed. Check your API key.')
        elif status_code == 422:
            print(f'Error: Invalid URL: {error_msg}')
        elif status_code >= 500:
            print(f'Error: Server error: {error_msg}')
        else:
            print(f'Error: {http_err}')
            
    except requests.exceptions.Timeout:
        print('Error: Request timed out')
    except requests.exceptions.RequestException as e:
        # For other network-related errors (DNS, connection, etc.)
        print(f'Error: Network error: {e}')
    
    return None

# --- Usage Example ---
schema = {'type': 'object', 'properties': {'title': {'type': 'string'}}}

# Test success
print("--- Testing success ---")
data = extract_with_error_handling('https://example.com', schema)
if data:
    print('Success:', json.dumps(data, indent=2))

# Test failure (invalid schema)
print("\n--- Testing failure ---")
data = extract_with_error_handling('https://example.com', "not-a-schema")
if data is None:
    print('Failure handled correctly.')

The Python version uses raise_for_status() to automatically convert HTTP errors into exceptions. This triggers the appropriate except block based on error type—HTTP errors, timeouts, or general network issues. Setting a timeout prevents indefinitely hanging requests. The code extracts error messages from the API response when available, providing helpful debugging information.

How to Run:
- Save this as error_handling.py.
- Run python error_handling.py to see both success and failure cases.

This bash script demonstrates how to check the HTTP status code when using curl.

#!/bin/bash

# -s: silent mode
# -w "\n%{http_code}": appends the http_code to the output
response=$(curl -s -w "\n%{http_code}" -X POST https://api.tabstack.ai/v1/extract/json \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"}
      }
    }
  }')

# Split response body and status code
http_code=$(echo "$response" | tail -n1)
response_body=$(echo "$response" | sed '$d')

if [ "$http_code" -eq 200 ]; then
  echo "Success:"
  # Use 'jq' to pretty-print the JSON
  echo "$response_body" | jq .
else
  echo "Error (HTTP $http_code):"
  # Try to parse the error with 'jq'
  echo "$response_body" | jq .error
  exit 1
fi

This bash script captures both the response body and HTTP status code by using curl's -w flag to append the status code. The script then splits them apart using tail and sed. It checks the status code and uses jq to format the output—either pretty-printing success responses or extracting error messages from failures.

How to Run:
- You may need to install jq: sudo apt-get install jq (Linux) or brew install jq (macOS).
- Save this as error_handling.sh.
- Make it executable: chmod +x error_handling.sh.
- Run it: ./error_handling.sh.

Advanced Usage Patterns

Using with Schema Generation

The most powerful TABS workflow is to combine schema generation with schema extraction. This creates a "one-shot" scraper.

Step 1: Call the /extract/json/schema endpoint with a URL and plain-text instructions.
Step 2: Use the json_schema from that response to call the /extract/json endpoint.

JavaScript
Python

This function chains the two API calls together.

async function completeWorkflow(url, instructions) {
  const apiKey = process.env.TABS_API_KEY;

  // Step 1: Generate schema
  console.log('Step 1: Generating schema...');
  const schemaResponse = await fetch('https://api.tabstack.ai/v1/extract/json/schema', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url,
      instructions
    })
  });
  
  if (!schemaResponse.ok) throw new Error('Failed to generate schema');
  const schema = await schemaResponse.json();
  console.log('Generated schema:', JSON.stringify(schema, null, 2));

  // Step 2: Use schema to extract data
  console.log('Step 2: Extracting data with schema...');
  const extractResponse = await fetch('https://api.tabstack.ai/v1/extract/json', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url,
      json_schema: schema
    })
  });

  if (!extractResponse.ok) throw new Error('Failed to extract data');
  const data = await extractResponse.json();
  console.log('Extracted data:', JSON.stringify(data, null, 2));

  return { schema, data };
}

// Usage
completeWorkflow(
  'https://news.ycombinator.com',
  'extract top 5 stories with title, points, and author'
).catch(e => console.error(e.message));

This workflow combines schema generation and extraction into a single function. First, it generates a schema from your natural language instructions. Then it immediately uses that schema to extract data. This "one-shot" approach means you don't have to manually write schemas—just describe what you want in plain English.

How to Run:
- Save this as workflow.js.
- Run node workflow.js. You will see the two-step process log to your console.

This function chains the two API calls together using the requests library.

import requests
import os
import json

def complete_workflow(url, instructions):
    api_key = os.environ.get("TABS_API_KEY")
    schema_url = 'https://api.tabstack.ai/v1/extract/json/schema'
    extract_url = 'https://api.tabstack.ai/v1/extract/json'
    
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }

    try:
        # Step 1: Generate schema
        print('Step 1: Generating schema...')
        schema_payload = {'url': url, 'instructions': instructions}
        schema_response = requests.post(schema_url, headers=headers, json=schema_payload)
        schema_response.raise_for_status()
        
        schema = schema_response.json()
        print('Generated schema:', json.dumps(schema, indent=2))

        # Step 2: Use schema to extract data
        print('Step 2: Extracting data with schema...')
        extract_payload = {'url': url, 'json_schema': schema}
        extract_response = requests.post(extract_url, headers=headers, json=extract_payload)
        extract_response.raise_for_status()

        data = extract_response.json()
        print('Extracted data:', json.dumps(data, indent=2))

        return {'schema': schema, 'data': data}

    except requests.exceptions.HTTPError as e:
        print(f"API Error: {e.response.status_code} - {e.response.text}")
        return None

# Usage
if __name__ == "__main__":
    result = complete_workflow(
        'https://news.ycombinator.com',
        'extract top 5 stories with title, points, and author'
    )

The Python implementation follows the same two-step pattern: generate a schema from instructions, then use it to extract data. The requests library handles the HTTP details, and raise_for_status() ensures errors are caught appropriately. This workflow lets you build extractors without writing schemas manually.

How to Run:
- Save this as workflow.py.
- Run python workflow.py.

Batch Processing Multiple URLs

To extract data from multiple pages (like a list of product pages), you can loop through your URLs and call the API for each one.

Note: Please be a good web citizen. When running batch jobs, we recommend adding a small delay between requests to avoid overwhelming the API or the target server.

JavaScript
Python

This function loops through a list of URLs and aggregates the results.

// Helper function for a respectful delay
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function batchExtract(urls, schema) {
  const results = [];
  const apiKey = process.env.TABS_API_KEY;

  for (const url of urls) {
    try {
      console.log(`Extracting ${url}...`);
      const response = await fetch('https://api.tabstack.ai/v1/extract/json', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          url,
          json_schema: schema
        })
      });

      if (!response.ok) {
        throw new Error(`HTTP error ${response.status}`);
      }
      
      const data = await response.json();
      results.push({ url, success: true, data });

    } catch (error) {
      console.error(`Failed to extract ${url}: ${error.message}`);
      results.push({ url, success: false, error: error.message });
    }
    
    // Respectful rate limiting: wait 500ms between requests
    await sleep(500);
  }

  return results;
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    content: { type: 'string' }
  }
};

batchExtract(urls, schema).then(results => {
  const successful = results.filter(r => r.success);
  console.log(`\n--- Batch Complete ---`);
  console.log(`Successfully extracted ${successful.length}/${urls.length} pages.`);
});

This function loops through multiple URLs, extracting data from each with the same schema. The key detail is the rate limiting—await sleep(500) adds a 500ms delay between requests to avoid overwhelming servers. Each result (success or failure) is tracked in the results array, letting you see which extractions worked and which didn't.

How to Run:
- Save as batch.js.
- Run node batch.js.

This function loops through a list of URLs and aggregates the results.

import requests
import os
import time

def batch_extract(urls, schema):
    results = []
    api_key = os.environ.get("TABS_API_KEY")
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }

    for url in urls:
        try:
            print(f'Extracting {url}...')
            response = requests.post(
                'https://api.tabstack.ai/v1/extract/json',
                headers=headers,
                json={
                    'url': url,
                    'json_schema': schema
                },
                timeout=30
            )
            response.raise_for_status()
            
            data = response.json()
            results.append({'url': url, 'success': True, 'data': data})

        except Exception as error:
            print(f'Failed to extract {url}: {error}')
            results.append({'url': url, 'success': False, 'error': str(error)})
        
        # Respectful rate limiting: wait 0.5 seconds between requests
        time.sleep(0.5)

    return results

# Usage
if __name__ == "__main__":
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    schema = {
        'type': 'object',
        'properties': {
            'title': {'type': 'string'},
            'content': {'type': 'string'}
        }
    }

    results = batch_extract(urls, schema)
    successful = [r for r in results if r['success']]
    print(f'\n--- Batch Complete ---')
    print(f'Successfully extracted {len(successful)}/{len(urls)} pages.')

The Python version follows the same pattern: loop through URLs, extract data, and add a delay. Using time.sleep(0.5) between requests is good citizenship—it prevents hitting rate limits and reduces load on target servers. Error handling ensures one failure doesn't stop the entire batch.

How to Run:
- Save as batch.py.
- Run python batch.py.

Best Practices

1. Generate Schemas, Don't Write Them

Manually writing complex JSON schemas is tedious and error-prone. Always start by using the JSON Schema endpoint to automatically generate a schema. You can then fine-tune that schema if needed.

2. Test Schemas on Representative Pages

A schema that works for one product page might fail on another (e.g., a "product bundle" page). Before deploying to production, test your schema against a handful of representative URLs to ensure it's robust.

This script shows a simple testing harness for a schema.

# This example is in Python
def test_schema(urls, schema):
    print('--- Testing schema ---')
    success_count = 0
    
    for url in urls:
        try:
            # Re-using the 'extract_with_error_handling' function from earlier
            data = extract_with_error_handling(url, schema)
            if data:
                print(f'✓ {url}: Success')
                success_count += 1
            else:
                print(f'✗ {url}: Failed extraction')
        except Exception as e:
            print(f'✗ {url}: {e}')
    
    print(f'--- Test Complete: {success_count}/{len(urls)} successful ---')

# Test on multiple representative pages
test_urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/on-sale'
]
# your_schema = ... (the schema you want to test)
# test_schema(test_urls, your_schema)

This simple test harness validates your schema against multiple URLs. It gives you a quick pass/fail report, helping you identify edge cases before production. Testing against varied page types (regular products, sale items, bundles) reveals schema weaknesses early.

3. Handle Missing or Null Data

Web pages are unreliable. A "rating" field might not exist for a new product. To prevent your application from crashing, design your schemas and code to handle missing data.

This schema demonstrates how to define optional and nullable fields.

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product title"
    },
    "price": {
      "type": ["number", "null"],
      "description": "Price (may be null if 'Call for Price')"
    },
    "rating": {
      "type": "number",
      "description": "Customer rating (optional)"
    }
  },
  "required": ["title"]
}

Using "type": ["number", "null"] allows a field to be either a number or null—useful for prices that might show "Call for quote." Fields not in the required array are optional; they'll be omitted if not found. Only fields in required must exist, or the extraction fails for that item.

4. Use Caching Strategically

Default (Cached): For most use cases, like scraping articles or products that don't change every second, our default caching is ideal. It's fast and reduces load.
nocache: true: Only use this when you absolutely need real-time data, such as for monitoring stock prices, or when you are actively debugging a schema.

5. Validate Extracted Data

Don't trust, verify. Even if the API successfully returns data, add a layer of validation in your own application before using it.

This JavaScript snippet shows a basic post-extraction validation check.

// ... after you get the 'data' from the API ...
// const data = await extractAndValidate(url, schema);

if (!data.products || data.products.length === 0) {
  throw new Error('Validation failed: No products array found');
}

// Check data quality
const invalidProducts = data.products.filter(p => !p.name || !p.price);
if (invalidProducts.length > 0) {
  console.warn(`Warning: Found ${invalidProducts.length} products with missing name or price`);
}

// If it passed, the data is good to use
// processProducts(data.products);

This validation checks that you got the expected structure (a products array exists) and that individual items have required fields. Filtering for incomplete items helps you monitor data quality—you might get a successful API response, but some products could be missing critical information.

The Challenge: Unstructured Data​

Prerequisites​

Your First Extraction: A Step-by-Step Guide​

Understanding the Response​

API Parameters Reference​

url (required)​

json_schema (required)​

nocache (optional)​

Real-World Examples​

Example 1: E-commerce Product Extraction​

Example 2: News Article Extraction​

Putting It to Work: Processing and Saving Data​

Processing Extracted Data​

Saving Data to Files​

Error Handling​

Common Error Status Codes​

Error Response Format​

Error Handling Examples​

Advanced Usage Patterns​

Using with Schema Generation​

Batch Processing Multiple URLs​

Best Practices​

1. Generate Schemas, Don't Write Them​

2. Test Schemas on Representative Pages​

3. Handle Missing or Null Data​

4. Use Caching Strategically​

5. Validate Extracted Data​

Related Resources​