Skip to main content

Extract Features

The Extract operator converts web pages into structured, usable data. It provides three powerful methods for working with web content: converting to Markdown, generating JSON schemas, and extracting structured JSON data.

Overview

The extract operator is accessed through your TABStack client instance:

import { TABStack } from '@tabstack/sdk';

const tabs = new TABStack({
apiKey: process.env.TABSTACK_API_KEY!
});

// Access extract methods
tabs.extract.markdown(url, options);
tabs.extract.schema(url, options);
tabs.extract.json(url, schema, options);

Extract Markdown

Convert any web page to clean, readable Markdown format. This is perfect for:

  • Creating readable content from web pages
  • Feeding content to LLMs
  • Archiving web content
  • Building documentation from web sources

Basic Usage

const result = await tabs.extract.markdown('https://example.com/blog/article');
console.log(result.content);

With Metadata

Extract additional page metadata (title, description, author, etc.) alongside the content:

const result = await tabs.extract.markdown('https://example.com/blog/article', {
metadata: true
});

console.log('Content:', result.content);
console.log('Title:', result.metadata?.title);
console.log('Description:', result.metadata?.description);
console.log('Author:', result.metadata?.author);
console.log('Image:', result.metadata?.image);

Bypass Cache

Force a fresh fetch when you need the latest content:

const result = await tabs.extract.markdown('https://example.com/live-prices', {
nocache: true
});

Response Type

The markdown method returns a MarkdownResponse object:

interface MarkdownResponse {
url: string; // Source URL
content: string; // Markdown content
metadata?: Metadata; // Optional page metadata
}

interface Metadata {
title?: string;
description?: string;
author?: string;
publisher?: string;
image?: string;
siteName?: string;
url?: string;
type?: string;
}

Generate Schema

Automatically generate a JSON schema from any web page using AI. This is the fastest way to create schemas without writing them manually.

Basic Usage

const schema = await tabs.extract.schema('https://news.ycombinator.com', {
instructions: 'extract top stories with title, points, and author'
});

console.log(JSON.stringify(schema, null, 2));

Generated Schema:

{
"type": "object",
"properties": {
"stories": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"points": { "type": "number" },
"author": { "type": "string" }
},
"required": ["title", "points", "author"]
}
}
},
"required": ["stories"]
}

Use Generated Schema Immediately

Chain schema generation with extraction:

// Step 1: Generate the schema
const schema = await tabs.extract.schema('https://news.ycombinator.com', {
instructions: 'extract top 5 stories with title, points, and author'
});

// Step 2: Use it to extract data
const result = await tabs.extract.json('https://news.ycombinator.com', schema);
console.log(result.data);

Extract JSON

Extract structured data from web pages that matches your JSON schema. This is the most powerful extraction method.

Basic Usage

const schema = {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
inStock: { type: 'boolean' }
},
required: ['name', 'price', 'inStock']
};

const result = await tabs.extract.json('https://shop.com/product/123', schema);
console.log(result.data);

Type-Safe Extraction (TypeScript)

Use TypeScript generics for type-safe data access:

interface Product {
name: string;
price: number;
inStock: boolean;
features?: string[];
}

const schema = {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
inStock: { type: 'boolean' },
features: { type: 'array', items: { type: 'string' } }
},
required: ['name', 'price', 'inStock']
};

// Type-safe response
const result = await tabs.extract.json<Product>('https://shop.com/product', schema);

// TypeScript knows the structure
console.log(result.data.name); // string
console.log(result.data.price); // number
console.log(result.data.inStock); // boolean
console.log(result.data.features); // string[] | undefined

Real-World Examples

Example 1: News Scraping

Extract articles from a news site:

interface Article {
title: string;
summary: string;
url: string;
publishedAt: string;
category?: string;
}

interface NewsData {
articles: Article[];
}

const schema = {
type: 'object',
properties: {
articles: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string' },
summary: { type: 'string' },
url: { type: 'string' },
publishedAt: { type: 'string' },
category: { type: 'string' }
},
required: ['title', 'url']
}
}
}
};

const result = await tabs.extract.json<NewsData>('https://news.example.com', schema);

result.data.articles.forEach(article => {
console.log(`${article.title} (${article.category})`);
console.log(` ${article.url}`);
console.log(` Published: ${article.publishedAt}`);
});

Example 2: E-commerce Product Extraction

Extract product information from an online store:

interface ProductCatalog {
products: Array<{
name: string;
price: number;
currency: string;
inStock: boolean;
rating: number;
}>;
}

const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
inStock: { type: 'boolean' },
rating: { type: 'number' }
},
required: ['name', 'price']
}
}
}
};

const result = await tabs.extract.json<ProductCatalog>(
'https://shop.example.com/category/laptops',
schema
);

// Filter in-stock products sorted by rating
const topProducts = result.data.products
.filter(p => p.inStock)
.sort((a, b) => (b.rating || 0) - (a.rating || 0))
.slice(0, 5);

console.log('Top 5 In-Stock Products:');
topProducts.forEach((product, i) => {
console.log(`${i + 1}. ${product.name} - ${product.currency}${product.price}`);
console.log(` Rating: ${product.rating} stars`);
});

Example 3: Multi-Page Data Extraction

Extract data from multiple pages:

const urls = [
'https://example.com/products/page-1',
'https://example.com/products/page-2',
'https://example.com/products/page-3'
];

const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' }
}
}
}
}
};

// Extract from all pages
const allProducts = [];
for (const url of urls) {
const result = await tabs.extract.json(url, schema);
allProducts.push(...result.data.products);

// Be respectful - add a small delay between requests
await new Promise(resolve => setTimeout(resolve, 500));
}

console.log(`Extracted ${allProducts.length} total products`);

Options Reference

ExtractMarkdownOptions

OptionTypeDefaultDescription
metadatabooleanfalseInclude page metadata (title, description, author, etc.)
nocachebooleanfalseBypass cache and force fresh fetch

ExtractSchemaOptions

OptionTypeDefaultDescription
instructionsstring-Natural language description of what data to extract
nocachebooleanfalseBypass cache and force fresh fetch

ExtractJsonOptions

OptionTypeDefaultDescription
nocachebooleanfalseBypass cache and force fresh fetch

Best Practices

1. Generate Schemas First

Instead of writing complex schemas manually, use extract.schema() to generate them automatically:

// ✅ Good: Generate schema automatically
const schema = await tabs.extract.schema(url, {
instructions: 'extract product details'
});

// ❌ Tedious: Writing complex schemas by hand
const schema = { /* hundreds of lines */ };

2. Use Type Safety in TypeScript

Define interfaces for your data and use generic types:

// ✅ Good: Type-safe with interfaces
interface MyData {
title: string;
items: Array<{ name: string; value: number }>;
}
const result = await tabs.extract.json<MyData>(url, schema);
// result.data is typed as MyData

// ❌ Loses type safety
const result = await tabs.extract.json(url, schema);
// result.data is typed as unknown

3. Handle Optional Fields

Make schemas flexible for real-world data:

const schema = {
type: 'object',
properties: {
title: { type: 'string' },
price: { type: ['number', 'null'] }, // Can be null
rating: { type: 'number' } // Optional (not in required)
},
required: ['title'] // Only title is required
};

4. Use Cache Strategically

  • Default (cached): Fast and efficient for most use cases
  • nocache: true: Only when you need real-time data or debugging
// ✅ Good: Cache for static content
const blog = await tabs.extract.markdown('https://blog.com/article');

// ✅ Good: No cache for live data
const prices = await tabs.extract.json('https://stocks.com/live', schema, {
nocache: true
});

Next Steps