How to Extract JSON Data

Use the Tabstack SDK to extract structured JSON from any web page by defining a schema and letting the API handle the rest.

The Challenge: Unstructured Data

Extracting structured data from the web is often a brittle, time-consuming process. You build a custom scraper, only for it to break the moment a website changes its layout. You’re left maintaining complex CSS selectors and parsing messy HTML, all for what you really need: clean, predictable JSON.

The Tabstack API’s JSON extraction endpoint addresses this. It accepts a URL and a JSON schema you provide, fetches the page, analyzes its content, and returns structured JSON that conforms to your schema.

This endpoint turns any web page into a structured API, making it ideal for:

- Extracting data into a consistent, reliable structure
- Extracting product information from e-commerce sites
- Gathering news articles and blog posts
- Monitoring competitor pricing and product changes
- Building data aggregation pipelines
- Collecting structured data for analysis or AI model training

Core Features:

- Schema-Based Extraction: You define the “what,” and the API handles the “how.”
- Consistent Output: The returned data is validated against your schema.
- Intelligent Parsing: Works even with complex, dynamic, or JavaScript-heavy pages.
- Built-in Caching: Improves performance for frequently accessed pages (and can be bypassed when needed).

Prerequisites

Before you can start, you’ll need a few things:

- A valid Tabstack API key: Sign up at https://tabstack.ai to get your free key.
- The SDK installed: npm install @tabstack/sdk (TypeScript) or pip install tabstack (Python).
- A JSON schema: This defines the data structure you want to extract. See the examples throughout this guide, and use description fields to help the AI understand what data to find.

Store your API key as an environment variable:

export TABSTACK_API_KEY="your-api-key-here"

The SDK reads TABSTACK_API_KEY from the environment automatically — no configuration needed.
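
Because the key is picked up from the environment, a missing variable may only surface later as an authentication error. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper written for this guide, not part of the SDK, and it relies only on the variable name shown above.

```python
import os


def require_api_key(env=None):
    """Return the Tabstack API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise in tests. (Hypothetical helper, not an SDK function.)
    """
    env = os.environ if env is None else env
    key = env.get("TABSTACK_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TABSTACK_API_KEY is not set. Export it before creating the client."
        )
    return key
```

Calling `require_api_key()` once at startup, before constructing the client, turns a confusing downstream failure into an immediate, readable one.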

Your First Extraction: A Step-by-Step Guide

Let’s walk through a practical example: extracting the top stories from Hacker News.

Our goal is to get a list of stories, and for each story, we want its title and points.

import Tabstack from '@tabstack/sdk'

const client = new Tabstack()
// TABSTACK_API_KEY is read from env automatically

const result = await client.extract.json({
  url: 'https://news.ycombinator.com',
  json_schema: {
    type: 'object',
    properties: {
      stories: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            points: { type: 'number' },
          },
        },
      },
    },
  },
})

console.log(JSON.stringify(result, null, 2))

The key part is the json_schema — it defines exactly what data structure you want back. Here, we’re asking for an object with a stories array, where each story has a title (string) and points (number). The SDK handles authentication and serialization; you get clean, structured data matching your schema.

import json

from tabstack import Tabstack

client = Tabstack()
# TABSTACK_API_KEY is read from env automatically

result = client.extract.json(
    url="https://news.ycombinator.com",
    json_schema={
        "type": "object",
        "properties": {
            "stories": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "points": {"type": "number"},
                    },
                },
            },
        },
    },
)

print(json.dumps(result, indent=2))

The Python SDK mirrors the TypeScript API surface. Pass url and json_schema as keyword arguments; the client handles authentication and request serialization.

curl -X POST https://api.tabstack.ai/v1/extract/json \
  -H "Authorization: Bearer $TABSTACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "json_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "points": {"type": "number"}
            }
          }
        }
      }
    }
  }'

This sends a POST request to the extraction endpoint with authentication and a JSON payload. The json_schema field defines exactly what data structure you want back.

Note: The examples below use placeholder URLs like https://example.com/products. Replace them with the real URL of the page you want to extract from.

Understanding the Response

A successful request (after a few moments of processing) will return a 200 OK status. The response body will contain the clean, structured data you asked for.

{
  "stories": [
    {
      "title": "New AI Model Released",
      "points": 342
    },
    {
      "title": "Database Performance Tips",
      "points": 156
    },
    {
      "title": "Understanding Distributed Systems",
      "points": 89
    }
  ]
}

The response structure exactly matches your schema — that’s the power of this endpoint. Instead of parsing HTML yourself, you get clean JSON with proper types (strings and numbers, not everything as text). This data is ready to use immediately in your application.
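
As a rough illustration of “ready to use,” the sketch below works on a dictionary shaped like the sample response. The `top_stories` helper and its `min_points` threshold are illustrative names, not part of the API; in practice `result` would be the value returned by the extraction call.

```python
# Sample data shaped like the response above.
result = {
    "stories": [
        {"title": "New AI Model Released", "points": 342},
        {"title": "Database Performance Tips", "points": 156},
        {"title": "Understanding Distributed Systems", "points": 89},
    ]
}


def points_of(story):
    """Treat a missing or null `points` value as 0 (neither field is required)."""
    return story.get("points") or 0


def top_stories(data, min_points=100):
    """Return titles of stories with at least `min_points`, highest first."""
    stories = data.get("stories", [])
    popular = [s for s in stories if points_of(s) >= min_points]
    popular.sort(key=points_of, reverse=True)
    return [s.get("title", "(untitled)") for s in popular]


print(top_stories(result))
```

Because `points` arrives as a number, the comparison and the sort need no string parsing.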

API Parameters Reference

The request body is a JSON object with the following properties:

`url` (required)

Type: string (URI format)
Description: The fully qualified, publicly accessible URL of the web page you want to fetch and extract data from.
Validation:
- Must be a valid URL format (e.g., https://example.com).
- Cannot access internal/private resources (e.g., localhost, 127.0.0.1, or private IPs).
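
A cheap client-side pre-check can catch obviously unreachable targets before a request is spent on them. The following is a sketch only: `looks_publicly_reachable` is a hypothetical helper, the restriction to http/https is an assumption, and the API’s own validation remains authoritative.

```python
import ipaddress
from urllib.parse import urlparse


def looks_publicly_reachable(url):
    """Approximate the URL rules listed above using only the URL text.

    Hostnames are not resolved, so a public name that points at a private
    address is not detected here.
    """
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal
    # Assumption: only http and https URLs are accepted.
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; treat it as an ordinary hostname
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```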

`json_schema` (required)

Type: object
Description: A JSON schema definition that describes the structure of the data you want to extract. The API will use this schema to guide its extraction and parsing.
Tips for creating schemas:
- Include description fields in your schema properties to give the AI extractor context about what data to find.
- Be specific about required vs. optional fields using the required keyword.
- Start with a simple schema and refine it based on results.

Here is a more complex schema example for a blog post:

{
  "json_schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The main title of the blog post"
      },
      "articles": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "headline": { "type": "string" },
            "author": { "type": "string" },
            "date": { "type": "string" }
          },
          "required": ["headline"]
        }
      }
    },
    "required": ["title", "articles"]
  }
}

This schema demonstrates two useful techniques. First, the description field helps the API distinguish between similar elements — here, finding the main title versus other headings. Second, the required array at different levels controls data quality: setting required: ["headline"] in items filters out incomplete entries, while the root-level required ensures critical fields are present or the extraction fails entirely.
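
To see what the required arrays imply for a given result, a small local check can help while iterating on a schema. The sketch below is not how the API validates; `missing_required` is a hypothetical helper that understands only the subset of JSON Schema used in this guide.

```python
def missing_required(schema, data):
    """List required property paths from `schema` that are absent in `data`.

    Handles objects with `properties`/`required` and arrays whose `items`
    is an object schema; other JSON Schema keywords are ignored.
    """
    missing = []

    def walk(node_schema, node, path):
        if node_schema.get("type") == "object" and isinstance(node, dict):
            for key in node_schema.get("required", []):
                if key not in node:
                    missing.append(f"{path}{key}")
            for key, sub in node_schema.get("properties", {}).items():
                if key in node:
                    walk(sub, node[key], f"{path}{key}.")
        elif node_schema.get("type") == "array" and isinstance(node, list):
            for i, item in enumerate(node):
                # path ends with "."; drop it before appending the index
                walk(node_schema.get("items", {}), item, f"{path[:-1]}[{i}].")

    walk(schema, data, "")
    return missing
```

For the blog schema above, a result whose second article lacks a headline would be reported as `articles[1].headline`.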

`effort` (optional)

Type: 'min' | 'standard' | 'max'
Default: 'standard'
Description: Controls the speed vs. capability tradeoff. Use min for static pages when speed matters, standard for most cases, and max for JS-heavy SPAs or complex rendering.
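
One way to use this tradeoff is to start cheap and escalate only when the result looks incomplete. This is a sketch under assumptions: `extract_with_escalation` and `is_good_enough` are hypothetical names, and the Python client is assumed to accept `effort` as a keyword argument in the same way the request body does.

```python
EFFORT_LEVELS = ("min", "standard", "max")


def extract_with_escalation(extract, url, json_schema, is_good_enough):
    """Try each effort level in order and stop at the first usable result.

    `extract` is any callable taking url, json_schema and effort keyword
    arguments, for example `client.extract.json` (assumed to accept `effort`).
    `is_good_enough` is a caller-supplied check on the returned data.
    Returns a (effort_used, result) tuple.
    """
    result = None
    for effort in EFFORT_LEVELS:
        result = extract(url=url, json_schema=json_schema, effort=effort)
        if is_good_enough(result):
            return effort, result
    # Nothing passed the check: hand back the last (highest-effort) attempt.
    return EFFORT_LEVELS[-1], result
```

A typical `is_good_enough` might be `lambda r: bool(r.get("stories"))`. Each escalation is an additional request, so this suits pages where `min` usually succeeds.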

`nocache` (optional)

Type: boolean
Default: false
Description: Bypasses the cache and forces a fresh fetch and extraction of the URL.

By default, the API caches responses for a short period to improve performance and reduce redundant fetches. Setting nocache to true is useful for:

- Getting real-time data from frequently updated pages.
- Debugging extraction issues.
- Forcing a re-extraction after a page’s structure has changed.

const result = await client.extract.json({
  url: 'https://example.com/products',
  json_schema: { /* ... */ },
  nocache: true,
})

Setting nocache: true forces a fresh extraction, bypassing the cache. This is useful for real-time data but will be slower since nothing can be reused from previous requests.

`geo_target` (optional)

Type: { country: string } (ISO 3166-1 alpha-2)
Description: Fetches the page from a specific geographic location. Useful for region-locked content or localized pricing.

const result = await client.extract.json({
  url: 'https://example.com',
  json_schema: { /* ... */ },
  geo_target: { country: 'GB' },
})
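
The optional parameters above can be assembled and sanity-checked in one place before a request is made. The helper below is a hypothetical sketch, not an SDK function; it assumes the Python client accepts `effort`, `nocache`, and `geo_target` as keyword arguments mirroring the request body.

```python
ALLOWED_EFFORT = {"min", "standard", "max"}


def build_extract_kwargs(url, json_schema, effort=None, nocache=False, country=None):
    """Assemble keyword arguments for an extraction call.

    Optional parameters are checked locally and left out when unset, so the
    API defaults (effort 'standard', caching enabled) still apply.
    """
    kwargs = {"url": url, "json_schema": json_schema}
    if effort is not None:
        if effort not in ALLOWED_EFFORT:
            raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORT)}")
        kwargs["effort"] = effort
    if nocache:
        kwargs["nocache"] = True
    if country is not None:
        code = country.strip().upper()
        # Shape check only; this does not verify the code is actually assigned.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError("country must be a two-letter ISO 3166-1 alpha-2 code")
        kwargs["geo_target"] = {"country": code}
    return kwargs
```

Usage would look like `client.extract.json(**build_extract_kwargs(url, schema, nocache=True, country="GB"))`.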

Real-World Examples

Example 1: E-commerce Product Extraction

Here, we’ll extract product data from a hypothetical e-commerce category page.

This is the request payload. We are defining a schema to capture a list of products with their name, price, currency, stock status, and rating.

{
  "url": "https://shop.example.com/category/laptops",
  "json_schema": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "number" },
            "currency": { "type": "string" },
            "inStock": { "type": "boolean" },
            "rating": { "type": "number" }
          },
          "required": ["name", "price"]
        }
      }
    }
  }
}

The schema defines a products array with properties of different types — strings, numbers, and booleans. Making name and price required prevents partial entries; products without these critical fields won’t be included in the response.

This is a potential response from the API.

{
  "products": [
    {
      "name": "Pro Laptop 15\"",
      "price": 1299.99,
      "currency": "USD",
      "inStock": true,
      "rating": 4.5
    },
    {
      "name": "Business Ultrabook",
      "price": 899.99,
      "currency": "USD",
      "inStock": false,
      "rating": 4.2
    }
  ]
}

Notice the API handles type conversion automatically — prices become numbers (not strings like “$899.99”), and stock status becomes a proper boolean. Optional fields like rating are included when found but omitted when missing, keeping your data clean.

Example 2: News Article Extraction

This example shows how to gather a list of articles from a news homepage.

This is the request payload. We want a list of articles, each with a title, summary, URL, publication date, and category.

{
  "url": "https://news.example.com",
  "json_schema": {
    "type": "object",
    "properties": {
      "articles": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "summary": { "type": "string" },
            "url": { "type": "string" },
            "publishedAt": { "type": "string" },
            "category": { "type": "string" }
          },
          "required": ["title", "url"]
        }
      }
    }
  }
}

This schema extracts article metadata including URLs. The API intelligently identifies article links on the page. Making title and url required ensures you only get complete article data.

This is a potential response from the API.

{
  "articles": [
    {
      "title": "Global Climate Summit Reaches Agreement",
      "summary": "World leaders agree on new emissions targets",
      "url": "https://news.example.com/climate-summit",
      "publishedAt": "2024-01-15T10:30:00Z",
      "category": "Environment"
    }
  ]
}

The API extracted the complete article data, including properly formatted dates in ISO 8601 format when available on the source page.
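
Since the schema declares publishedAt as a plain string, it is worth parsing it defensively before sorting or filtering by date. A minimal sketch, assuming the value is ISO 8601 when present (`parse_published_at` is a hypothetical helper):

```python
from datetime import datetime, timezone


def parse_published_at(value):
    """Parse an ISO 8601 timestamp such as "2024-01-15T10:30:00Z".

    Returns an aware datetime in UTC, or None when the value is missing or
    not parseable (the source page may not expose a machine-readable date).
    """
    if not isinstance(value, str) or not value:
        return None
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
    normalized = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        parsed = datetime.fromisoformat(normalized)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```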

Putting It to Work: Processing and Saving Data

Getting the JSON is just the first step. Here’s how you can immediately process or save that data.

Processing Extracted Data

Once you have the JSON response, you can use standard programming-language features to filter, sort, and analyze it. This example takes the product data from our e-commerce example and finds only the in-stock products, sorted by price.

import Tabstack from '@tabstack/sdk'

const client = new Tabstack()

const result = await client.extract.json({
  url: 'https://shop.example.com/category/laptops',
  json_schema: {
    type: 'object',
    properties: {
      products: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            name: { type: 'string' },
            price: { type: 'number' },
            inStock: { type: 'boolean' },
          },
          required: ['name', 'price'],
        },
      },
    },
  },
})

// `extract.json` returns Record<string, unknown> -- TS strict mode needs
// the cast (or a typed shape) before you can navigate the response.
const products = (result as any).products ?? []

const availableProducts = products
  .filter((p: any) => p.inStock)
  .sort((a: any, b: any) => a.price - b.price)

console.log('Available products (lowest price first):')
availableProducts.forEach((product: any) => {
  console.log(`${product.name}: $${product.price}`)
})

Once you have the extracted data, processing it is straightforward. The example chains .filter() to get only in-stock products, then .sort() to order by price. The API gives you clean data; you process it however you need.

from tabstack import Tabstack

client = Tabstack()

result = client.extract.json(
    url="https://shop.example.com/category/laptops",
    json_schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "inStock": {"type": "boolean"},
                    },
                    "required": ["name", "price"],
                },
            },
        },
    },
)

products = result.get("products", [])

available_products = [p for p in products if p.get("inStock")]
available_products.sort(key=lambda x: x["price"])

print("Available products (lowest price first):")
for product in available_products:
    print(f"{product['name']}: ${product['price']}")

The Python version uses a list comprehension to filter and sort() with a key function to order by price. Note the use of .get("inStock") instead of direct key access: this prevents a KeyError if a product is missing that field.

Saving Data to Files

You can also easily save your extracted data to a file, like a JSON file for later use or a CSV file for analysis in a spreadsheet.

import Tabstack from '@tabstack/sdk'
import { promises as fs } from 'fs'

const client = new Tabstack()

const result = await client.extract.json({
  url: 'https://news.ycombinator.com',
  json_schema: {
    type: 'object',
    properties: {
      stories: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            points: { type: 'number' },
          },
        },
      },
    },
  },
})

// Save as JSON
await fs.writeFile('extracted-data.json', JSON.stringify(result, null, 2))
console.log('Data saved to extracted-data.json')

// Save as CSV
const stories = (result as any).stories ?? []
const csvHeader = 'Title,Points'
const csvRows = stories.map((s: any) => {
  // Neither field is required by the schema, so guard against missing values
  const title = `"${String(s.title ?? '').replace(/"/g, '""')}"`
  return `${title},${s.points ?? ''}`
})
await fs.writeFile('extracted-data.csv', [csvHeader, ...csvRows].join('\n'))
console.log('Data saved to extracted-data.csv')

For JSON, JSON.stringify() with formatting creates a readable file. For CSV, the code maps each story to a CSV row, properly escaping quotes with "" to handle titles containing commas or quotes.

import json
import csv
from tabstack import Tabstack

client = Tabstack()

result = client.extract.json(
    url="https://news.ycombinator.com",
    json_schema={
        "type": "object",
        "properties": {
            "stories": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "points": {"type": "number"},
                    },
                },
            },
        },
    },
)

# Save as JSON
with open("extracted-data.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2)
print("Data saved to extracted-data.json")

# Save as CSV
stories = result.get("stories", [])
with open("extracted-data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Points"])
    for story in stories:
        writer.writerow([story["title"], story["points"]])
print("Data saved to extracted-data.csv")

Python makes file saving straightforward with built-in modules. The json.dump() function handles JSON serialization with proper formatting. For CSV, the csv.writer object handles all the formatting details — quoting, escaping, and special characters — automatically.

Error Handling

A robust application must handle potential failures. The API uses standard HTTP status codes to indicate the success or failure of a request.

Common Error Status Codes

Status Code	Error Message	Description
400	`url is required`	The request body is missing the required `url` parameter.
400	`json schema is required`	The request body is missing the required `json_schema` parameter.
400	`json schema must be a valid object`	The `json_schema` provided is not valid.
400	`invalid JSON request body`	The request body itself is malformed JSON.
401	`Unauthorized - Invalid token`	Your `Authorization` header is missing or your API key is invalid.
422	`url is invalid`	The provided URL is malformed or cannot be processed.
500	`failed to fetch URL`	The server encountered an error trying to access the target URL.
500	`web page is too large`	The target page’s content exceeds the processing size limit.
500	`failed to generate JSON`	The server failed to extract data matching your schema. This can happen if the page structure is vastly different from what the schema implies.

Error Response Format

All error responses return a JSON object with a single error field.

{
  "error": "json schema is required"
}
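
When calling the endpoint over raw HTTP rather than through the SDK, the message can be pulled out of this body with a few lines. The sketch below relies only on the single-field format shown above; `error_message` is a hypothetical helper.

```python
import json


def error_message(body_text, default="unknown error"):
    """Pull the `error` string out of an error response body.

    Falls back to `default` when the body is not JSON or lacks the field,
    which can happen if something other than the API produced the response.
    """
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return default
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return payload["error"]
    return default
```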

Error Handling Examples

Here’s how to build robust error handling into your application.

import Tabstack, {
  BadRequestError,
  AuthenticationError,
  UnprocessableEntityError,
  RateLimitError,
  InternalServerError,
} from '@tabstack/sdk'

const client = new Tabstack()

async function extractWithErrorHandling(url: string, jsonSchema: Record<string, unknown>) {
  try {
    const result = await client.extract.json({
      url,
      json_schema: jsonSchema,
    })
    return result
  } catch (err) {
    if (err instanceof BadRequestError) {
      throw new Error(`Bad request: ${err.message}`)
    }
    if (err instanceof AuthenticationError) {
      throw new Error('Authentication failed. Check your API key.')
    }
    if (err instanceof UnprocessableEntityError) {
      throw new Error(`Invalid URL: ${err.message}`)
    }
    if (err instanceof RateLimitError) {
      throw new Error('Rate limit hit. Back off and retry.')
    }
    if (err instanceof InternalServerError) {
      throw new Error(`Server error: ${err.message}`)
    }
    throw err
  }
}

// Usage
const schema = { type: 'object', properties: { title: { type: 'string' } } }

const data = await extractWithErrorHandling('https://example.com', schema)
console.log('Success:', data)

The SDK throws typed error classes rather than requiring manual status code checks. Catching specific error types lets you respond appropriately to each failure mode — authentication failures need a different response than rate limits or server errors.

from tabstack import Tabstack
from tabstack import BadRequestError, AuthenticationError, UnprocessableEntityError, RateLimitError, InternalServerError

client = Tabstack()

def extract_with_error_handling(url: str, json_schema: dict):
    try:
        return client.extract.json(url=url, json_schema=json_schema)
    except BadRequestError as e:
        print(f"Bad request: {e.message}")
    except AuthenticationError:
        print("Authentication failed. Check your API key.")
    except UnprocessableEntityError as e:
        print(f"Invalid URL: {e.message}")
    except RateLimitError:
        print("Rate limit hit. Back off and retry.")
    except InternalServerError as e:
        print(f"Server error: {e.message}")
    return None

# Usage
schema = {"type": "object", "properties": {"title": {"type": "string"}}}

data = extract_with_error_handling("https://example.com", schema)
if data:
    print("Success:", data)

The Python SDK raises the same typed error classes as the TypeScript SDK. Catching specific exception types gives you clear, actionable error handling for each failure mode.

#!/bin/bash

# -s: silent mode
# -w "\n%{http_code}": appends the http_code to the output
response=$(curl -s -w "\n%{http_code}" -X POST https://api.tabstack.ai/v1/extract/json \
  -H "Authorization: Bearer $TABSTACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"}
      }
    }
  }')

# Split response body and status code
http_code=$(echo "$response" | tail -n1)
response_body=$(echo "$response" | sed '$d')

if [ "$http_code" -eq 200 ]; then
  echo "Success:"
  echo "$response_body" | jq .
else
  echo "Error (HTTP $http_code):"
  echo "$response_body" | jq .error
  exit 1
fi

This bash script captures both the response body and HTTP status code by using curl’s -w flag to append the status code. It checks the status code and uses jq to format the output.
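
The handlers above suggest backing off and retrying on rate limits. One possible shape for that is a small exponential-backoff wrapper. This is a sketch rather than SDK behavior: `call_with_backoff` is a hypothetical helper, and the attempt count and delays are illustrative defaults.

```python
import time


def call_with_backoff(fn, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry on the given exception types with growing delays.

    `retry_on` is an exception class or tuple of classes, for example
    (RateLimitError,). Waits base_delay, then 2x, 4x, ... between attempts
    and re-raises the last error once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

With the SDK this might be used as `call_with_backoff(lambda: client.extract.json(url=url, json_schema=schema), retry_on=(RateLimitError,))`. Errors such as a bad request or an invalid API key are left out of `retry_on` because repeating the same call is unlikely to change the outcome.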

Advanced Usage Patterns

Batch Processing Multiple URLs

To extract data from multiple pages (like a list of product pages), you can loop through your URLs and call the API for each one.

Note: Please be a good web citizen. When running batch jobs, we recommend adding a small delay between requests to avoid overwhelming the API or the target server.

import Tabstack from '@tabstack/sdk'

const client = new Tabstack()

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function batchExtract(urls: string[], jsonSchema: Record<string, unknown>) {
  const results = []

  for (const url of urls) {
    try {
      console.log(`Extracting ${url}...`)
      const data = await client.extract.json({ url, json_schema: jsonSchema })
      results.push({ url, success: true, data })
    } catch (error: any) {
      console.error(`Failed to extract ${url}: ${error.message}`)
      results.push({ url, success: false, error: error.message })
    }

    // Respectful rate limiting: wait 500ms between requests
    await sleep(500)
  }

  return results
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3',
]

const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    content: { type: 'string' },
  },
}

const results = await batchExtract(urls, schema)
const successful = results.filter((r) => r.success)
console.log(`\n--- Batch Complete ---`)
console.log(`Successfully extracted ${successful.length}/${urls.length} pages.`)

This function loops through multiple URLs, extracting data from each with the same schema. The key detail is the rate limiting — await sleep(500) adds a 500ms delay between requests to avoid overwhelming servers. Each result (success or failure) is tracked in the results array, letting you see which extractions worked and which didn’t.

import time
from tabstack import Tabstack

client = Tabstack()

def batch_extract(urls: list, json_schema: dict) -> list:
    results = []

    for url in urls:
        try:
            print(f"Extracting {url}...")
            data = client.extract.json(url=url, json_schema=json_schema)
            results.append({"url": url, "success": True, "data": data})
        except Exception as error:
            print(f"Failed to extract {url}: {error}")
            results.append({"url": url, "success": False, "error": str(error)})

        # Respectful rate limiting: wait 0.5 seconds between requests
        time.sleep(0.5)

    return results

# Usage
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
    },
}

results = batch_extract(urls, schema)
successful = [r for r in results if r["success"]]
print("\n--- Batch Complete ---")
print(f"Successfully extracted {len(successful)}/{len(urls)} pages.")

The Python version follows the same pattern: loop through URLs, extract data, and add a delay. Using time.sleep(0.5) between requests is good citizenship — it prevents hitting rate limits and reduces load on target servers. Error handling ensures one failure doesn’t stop the entire batch.

Best Practices

1. Start Simple and Iterate

Writing complex JSON schemas from scratch can be error-prone. Start with a minimal schema capturing just the essential fields, test it against your target page, then gradually add more properties. Use description fields liberally to help the AI extractor understand what data you’re looking for.

2. Test Schemas on Representative Pages

A schema that works for one product page might fail on another (e.g., a “product bundle” page). Before deploying to production, test your schema against a handful of representative URLs to ensure it’s robust.

This script shows a simple testing harness for a schema.

# This example is in Python
def test_schema(urls, schema):
    print('--- Testing schema ---')
    success_count = 0

    for url in urls:
        try:
            # Re-using the 'extract_with_error_handling' function from earlier
            data = extract_with_error_handling(url, schema)
            if data:
                print(f'✓ {url}: Success')
                success_count += 1
            else:
                print(f'✗ {url}: Failed extraction')
        except Exception as e:
            print(f'✗ {url}: {e}')

    print(f'--- Test Complete: {success_count}/{len(urls)} successful ---')

# Test on multiple representative pages
test_urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/on-sale'
]
# your_schema = ... (the schema you want to test)
# test_schema(test_urls, your_schema)

This simple test harness validates your schema against multiple URLs. It gives you a quick pass/fail report, helping you identify edge cases before production. Testing against varied page types (regular products, sale items, bundles) reveals schema weaknesses early.

3. Handle Missing or Null Data

Web pages are unreliable. A “rating” field might not exist for a new product. To prevent your application from crashing, design your schemas and code to handle missing data.

This schema demonstrates how to define optional and nullable fields.

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product title"
    },
    "price": {
      "type": ["number", "null"],
      "description": "Price (may be null if 'Call for Price')"
    },
    "rating": {
      "type": "number",
      "description": "Customer rating (optional)"
    }
  },
  "required": ["title"]
}

Using "type": ["number", "null"] allows a field to be either a number or null, which is useful for prices that might show “Call for Price.” Fields not in the required array are optional; they’ll be omitted if not found. Only fields in required must exist, or the extraction fails for that item.
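
On the consuming side, the same three cases (required, nullable, optional) call for slightly different access patterns. A minimal sketch, assuming a product dictionary that follows the schema above; `describe_product` and its display strings are illustrative only.

```python
def describe_product(product):
    """Render one extracted product, tolerating a null price and a missing rating."""
    title = product["title"]        # listed in `required`, so expected to be present
    price = product.get("price")    # may be absent or explicitly null
    rating = product.get("rating")  # optional: omitted when not found

    # Compare against None rather than truthiness so a price of 0 still prints.
    price_text = f"${price:,.2f}" if price is not None else "price on request"
    rating_text = f"rated {rating:.1f}" if rating is not None else "no rating yet"
    return f"{title}: {price_text} ({rating_text})"
```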

4. Use Caching Strategically

- Default (Cached): For most use cases, like extracting articles or products that don’t change every second, our default caching is ideal. It’s fast and reduces load.
- nocache: true: Only use this when you absolutely need real-time data, such as for monitoring stock prices, or when you are actively debugging a schema.

5. Validate Extracted Data

Don’t trust, verify. Even if the API successfully returns data, add a layer of validation in your own application before using it.

This TypeScript snippet shows a basic post-extraction validation check.

// result is the return value from client.extract.json(...)
const products = (result as any).products

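if (!Array.isArray(products) || products.length === 0) {
  throw new Error('Validation failed: products array is missing or empty')
}

// Check data quality. Compare price against null/undefined rather than
// using !p.price, so a legitimate price of 0 is not flagged as missing.
const invalidProducts = products.filter((p: any) => !p.name || p.price == null)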
if (invalidProducts.length > 0) {
  console.warn(
    `Warning: Found ${invalidProducts.length} products with missing name or price`
  )
}

// If it passed, the data is good to use
// processProducts(products)

This validation checks that you got the expected structure (a products array exists) and that individual items have required fields. Filtering for incomplete items helps you monitor data quality — you might get a successful API response, but some products could be missing critical information.
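
A stricter variant can also check value types, not just presence. The sketch below is one possible approach rather than a prescribed method; `split_valid_products` is a hypothetical helper, and the rule that prices must be non-negative is an assumption about the data.

```python
def split_valid_products(result):
    """Separate products with a usable name and price from incomplete ones.

    Returns a (valid, invalid) pair of lists; raises ValueError when the
    result has no non-empty `products` list at all.
    """
    products = result.get("products")
    if not isinstance(products, list) or not products:
        raise ValueError("Validation failed: no products were returned")

    def is_valid(p):
        if not isinstance(p, dict):
            return False
        name, price = p.get("name"), p.get("price")
        has_name = isinstance(name, str) and name.strip() != ""
        # bool is a subclass of int in Python, so exclude it explicitly.
        has_price = isinstance(price, (int, float)) and not isinstance(price, bool)
        return has_name and has_price and price >= 0  # assumption: no negative prices

    valid = [p for p in products if is_valid(p)]
    invalid = [p for p in products if not is_valid(p)]
    return valid, invalid
```

Logging the size of `invalid` over time gives an early signal that a page layout has drifted.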

- How to Generate JSON: for when you want AI to transform a page’s content into a schema rather than extract what’s already there
- How to Use the Markdown Endpoint
- TypeScript SDK Quickstart
- Python SDK Quickstart
- API Reference