Extract Features
The Extract operator converts web pages into structured, usable data. It provides three methods: converting a page to Markdown, generating a JSON schema from a page, and extracting structured JSON data that matches a schema.
Overview
The extract operator is accessed through your TABStack client instance:
- TypeScript
- JavaScript
import { TABStack } from '@tabstack/sdk';
const tabs = new TABStack({
apiKey: process.env.TABSTACK_API_KEY!
});
// Access extract methods
tabs.extract.markdown(url, options);
tabs.extract.schema(url, options);
tabs.extract.json(url, schema, options);
const { TABStack } = require('@tabstack/sdk');
const tabs = new TABStack({
apiKey: process.env.TABSTACK_API_KEY
});
// Access extract methods
tabs.extract.markdown(url, options);
tabs.extract.schema(url, options);
tabs.extract.json(url, schema, options);
Extract Markdown
Convert a web page to clean, readable Markdown. This is useful for:
- Creating readable content from web pages
- Feeding content to LLMs
- Archiving web content
- Building documentation from web sources
Basic Usage
- TypeScript
- JavaScript
const result = await tabs.extract.markdown('https://example.com/blog/article');
console.log(result.content);
With Metadata
Extract additional page metadata (title, description, author, etc.) alongside the content:
- TypeScript
- JavaScript
const result = await tabs.extract.markdown('https://example.com/blog/article', {
metadata: true
});
console.log('Content:', result.content);
console.log('Title:', result.metadata?.title);
console.log('Description:', result.metadata?.description);
console.log('Author:', result.metadata?.author);
console.log('Image:', result.metadata?.image);
Bypass Cache
Force a fresh fetch when you need the latest content:
- TypeScript
- JavaScript
const result = await tabs.extract.markdown('https://example.com/live-prices', {
nocache: true
});
Response Type
The markdown method returns a MarkdownResponse object:
interface MarkdownResponse {
url: string; // Source URL
content: string; // Markdown content
metadata?: Metadata; // Optional page metadata
}
interface Metadata {
title?: string;
description?: string;
author?: string;
publisher?: string;
image?: string;
siteName?: string;
url?: string;
type?: string;
}
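For the archiving use case, the response can be turned into a single Markdown document with a front-matter header. The following is a minimal sketch that assumes the `MarkdownResponse` shape shown above; `toArchiveDocument` is a hypothetical helper, not part of the SDK, and the front-matter layout is just one possible convention.

```typescript
// Local copies of the response types documented above.
interface Metadata {
  title?: string;
  description?: string;
  author?: string;
  publisher?: string;
  image?: string;
  siteName?: string;
  url?: string;
  type?: string;
}

interface MarkdownResponse {
  url: string;
  content: string;
  metadata?: Metadata;
}

// Fixed key order so the generated header is deterministic.
const METADATA_KEYS: Array<keyof Metadata> = [
  'title', 'description', 'author', 'publisher',
  'image', 'siteName', 'url', 'type'
];

// Hypothetical helper (not part of the SDK): prepend a front-matter
// header built from whichever metadata fields are present and non-empty.
function toArchiveDocument(response: MarkdownResponse): string {
  const lines: string[] = [`source: ${JSON.stringify(response.url)}`];
  const metadata: Metadata = response.metadata ?? {};
  for (const key of METADATA_KEYS) {
    const value = metadata[key];
    if (typeof value === 'string' && value.length > 0) {
      // JSON.stringify quotes and escapes the value so it stays on one line.
      lines.push(`${key}: ${JSON.stringify(value)}`);
    }
  }
  return ['---', ...lines, '---', '', response.content].join('\n');
}
```

The returned string can be written to disk as a `.md` file. Because `metadata` is optional, the helper also works for responses requested without `metadata: true`.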
Generate Schema
Generate a JSON schema from a web page using AI, guided by natural-language instructions. This saves you from writing the schema by hand.
Basic Usage
- TypeScript
- JavaScript
const schema = await tabs.extract.schema('https://news.ycombinator.com', {
instructions: 'extract top stories with title, points, and author'
});
console.log(JSON.stringify(schema, null, 2));
Generated Schema:
{
"type": "object",
"properties": {
"stories": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"points": { "type": "number" },
"author": { "type": "string" }
},
"required": ["title", "points", "author"]
}
}
},
"required": ["stories"]
}
Use Generated Schema Immediately
Chain schema generation with extraction:
- TypeScript
- JavaScript
// Step 1: Generate the schema
const schema = await tabs.extract.schema('https://news.ycombinator.com', {
instructions: 'extract top 5 stories with title, points, and author'
});
// Step 2: Use it to extract data
const result = await tabs.extract.json('https://news.ycombinator.com', schema);
console.log(result.data);
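When several pages share the same layout, the two-step chain above can be arranged so that the schema is generated once and then reused. The sketch below is illustrative only: `ExtractLike` is a hypothetical structural type covering just the two calls used here, and it assumes the SDK's `extract` object matches the signatures shown in this guide. `extractAllWithOneSchema` is not part of the SDK.

```typescript
// Hypothetical structural type: only the two calls used below. The real
// SDK client is assumed (not confirmed) to be compatible with it.
interface ExtractLike {
  schema(url: string, options?: { instructions?: string }): Promise<object>;
  json(url: string, schema: object): Promise<{ data: unknown }>;
}

// Hypothetical helper (not part of the SDK): generate a schema from the
// first URL, then reuse it for every URL so the schema step runs once.
async function extractAllWithOneSchema(
  extract: ExtractLike,
  urls: string[],
  instructions: string
): Promise<unknown[]> {
  const [firstUrl] = urls;
  if (firstUrl === undefined) return [];

  const schema = await extract.schema(firstUrl, { instructions });

  const results: unknown[] = [];
  for (const url of urls) {
    const result = await extract.json(url, schema);
    results.push(result.data);
  }
  return results;
}
```

Under those assumptions, a call might look like `extractAllWithOneSchema(tabs.extract, urls, 'extract top 5 stories with title, points, and author')`. Passing the extractor in as a parameter also makes the helper easy to exercise with a stub.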
Extract JSON
Extract structured data from a web page in the shape defined by your JSON schema. The schema can be generated with extract.schema() or written by hand.
Basic Usage
- TypeScript
- JavaScript
const schema = {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
inStock: { type: 'boolean' }
},
required: ['name', 'price', 'inStock']
};
const result = await tabs.extract.json('https://shop.com/product/123', schema);
console.log(result.data);
Type-Safe Extraction (TypeScript)
Use TypeScript generics for type-safe data access:
interface Product {
name: string;
price: number;
inStock: boolean;
features?: string[];
}
const schema = {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
inStock: { type: 'boolean' },
features: { type: 'array', items: { type: 'string' } }
},
required: ['name', 'price', 'inStock']
};
// Type-safe response
const result = await tabs.extract.json<Product>('https://shop.com/product', schema);
// TypeScript knows the structure
console.log(result.data.name); // string
console.log(result.data.price); // number
console.log(result.data.inStock); // boolean
console.log(result.data.features); // string[] | undefined
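The generic parameter describes the expected shape at compile time. Whether the SDK also validates the returned data at runtime is not stated here, so a runtime check may be worth adding where correctness matters. One option is a type guard. This is a minimal sketch; `isProduct` is a hypothetical helper, not part of the SDK.

```typescript
interface Product {
  name: string;
  price: number;
  inStock: boolean;
  features?: string[];
}

// Hypothetical runtime check (not part of the SDK): narrows an unknown
// value to Product by verifying the required fields and, when present,
// that `features` is an array of strings.
function isProduct(value: unknown): value is Product {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const features = candidate.features;
  const featuresOk =
    features === undefined ||
    (Array.isArray(features) && features.every(f => typeof f === 'string'));
  return (
    typeof candidate.name === 'string' &&
    typeof candidate.price === 'number' &&
    typeof candidate.inStock === 'boolean' &&
    featuresOk
  );
}

// Usage with a plain object standing in for `result.data`:
const sample: unknown = { name: 'Laptop', price: 999, inStock: true };
if (isProduct(sample)) {
  console.log(sample.name); // typed as string inside this branch
}
```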
Real-World Examples
Example 1: News Scraping
Extract articles from a news site:
- TypeScript
- JavaScript
interface Article {
title: string;
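summary?: string; // optional: not listed in the schema's required fields
url: string;
publishedAt?: string; // optional: not listed in the schema's required fields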
category?: string;
}
interface NewsData {
articles: Article[];
}
const schema = {
type: 'object',
properties: {
articles: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string' },
summary: { type: 'string' },
url: { type: 'string' },
publishedAt: { type: 'string' },
category: { type: 'string' }
},
required: ['title', 'url']
}
}
}
};
const result = await tabs.extract.json<NewsData>('https://news.example.com', schema);
result.data.articles.forEach(article => {
console.log(`${article.title} (${article.category})`);
console.log(` ${article.url}`);
console.log(` Published: ${article.publishedAt}`);
});
const schema = {
type: 'object',
properties: {
articles: {
type: 'array',
items: {
type: 'object',
properties: {
title: { type: 'string' },
summary: { type: 'string' },
url: { type: 'string' },
publishedAt: { type: 'string' },
category: { type: 'string' }
},
required: ['title', 'url']
}
}
}
};
const result = await tabs.extract.json('https://news.example.com', schema);
result.data.articles.forEach(article => {
console.log(`${article.title} (${article.category})`);
console.log(` ${article.url}`);
console.log(` Published: ${article.publishedAt}`);
});
Example 2: E-commerce Product Extraction
Extract product information from an online store:
- TypeScript
- JavaScript
interface ProductCatalog {
products: Array<{
name: string;
price: number;
currency?: string; // optional: only name and price are required by the schema
inStock?: boolean;
rating?: number;
}>;
}
const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
inStock: { type: 'boolean' },
rating: { type: 'number' }
},
required: ['name', 'price']
}
}
}
};
const result = await tabs.extract.json<ProductCatalog>(
'https://shop.example.com/category/laptops',
schema
);
// Filter in-stock products sorted by rating
const topProducts = result.data.products
.filter(p => p.inStock)
.sort((a, b) => (b.rating || 0) - (a.rating || 0))
.slice(0, 5);
console.log('Top 5 In-Stock Products:');
topProducts.forEach((product, i) => {
console.log(`${i + 1}. ${product.name} - ${product.currency}${product.price}`);
console.log(` Rating: ${product.rating} stars`);
});
const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' },
currency: { type: 'string' },
inStock: { type: 'boolean' },
rating: { type: 'number' }
},
required: ['name', 'price']
}
}
}
};
const result = await tabs.extract.json(
'https://shop.example.com/category/laptops',
schema
);
// Filter in-stock products sorted by rating
const topProducts = result.data.products
.filter(p => p.inStock)
.sort((a, b) => (b.rating || 0) - (a.rating || 0))
.slice(0, 5);
console.log('Top 5 In-Stock Products:');
topProducts.forEach((product, i) => {
console.log(`${i + 1}. ${product.name} - ${product.currency}${product.price}`);
console.log(` Rating: ${product.rating} stars`);
});
Example 3: Multi-Page Data Extraction
Extract data from multiple pages:
- TypeScript
- JavaScript
const urls = [
'https://example.com/products/page-1',
'https://example.com/products/page-2',
'https://example.com/products/page-3'
];
const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' }
}
}
}
}
};
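// Type the response so the compiler knows the shape of result.data
interface ProductPage {
products: Array<{ name: string; price: number }>;
}
// Extract from all pages
const allProducts: ProductPage['products'] = [];
for (const url of urls) {
const result = await tabs.extract.json<ProductPage>(url, schema);
allProducts.push(...result.data.products);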
// Be respectful - add a small delay between requests
await new Promise(resolve => setTimeout(resolve, 500));
}
console.log(`Extracted ${allProducts.length} total products`);
const urls = [
'https://example.com/products/page-1',
'https://example.com/products/page-2',
'https://example.com/products/page-3'
];
const schema = {
type: 'object',
properties: {
products: {
type: 'array',
items: {
type: 'object',
properties: {
name: { type: 'string' },
price: { type: 'number' }
}
}
}
}
};
// Extract from all pages
const allProducts = [];
for (const url of urls) {
const result = await tabs.extract.json(url, schema);
allProducts.push(...result.data.products);
// Be respectful - add a small delay between requests
await new Promise(resolve => setTimeout(resolve, 500));
}
console.log(`Extracted ${allProducts.length} total products`);
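The loop above stops at the first failed request. A variation that records failures and keeps going is sketched below. The extraction call is passed in as a function so the sketch does not depend on a live client; `extractPages`, `PageExtractor`, `ProductItem`, and `CrawlResult` are hypothetical names introduced for illustration.

```typescript
// One product as described by the schema above; both fields are optional
// because the schema does not list them as required.
type ProductItem = { name?: string; price?: number };

// Stand-in for a call such as tabs.extract.json(url, schema), passed in
// as a parameter so the loop does not depend on a live client.
type PageExtractor = (url: string) => Promise<{ data: { products?: ProductItem[] } }>;

interface CrawlResult {
  products: ProductItem[];
  failedUrls: string[];
}

// Hypothetical helper: extract pages one at a time, pause between
// requests, and record failed URLs instead of aborting the whole run.
async function extractPages(
  urls: string[],
  extractPage: PageExtractor,
  delayMs = 500
): Promise<CrawlResult> {
  const products: ProductItem[] = [];
  const failedUrls: string[] = [];
  let remaining = urls.length;

  for (const url of urls) {
    try {
      const result = await extractPage(url);
      // `products` is not required by the schema, so guard against absence.
      products.push(...(result.data.products ?? []));
    } catch {
      failedUrls.push(url);
    }
    remaining--;
    // Pause between requests, but not after the last one.
    if (remaining > 0) {
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    }
  }
  return { products, failedUrls };
}
```

Assuming the client from the examples above, the extractor argument could be `url => tabs.extract.json<{ products?: ProductItem[] }>(url, schema)`. The `failedUrls` list can then be retried or logged separately.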
Options Reference
ExtractMarkdownOptions
| Option | Type | Default | Description |
|---|---|---|---|
| metadata | boolean | false | Include page metadata (title, description, author, etc.) |
| nocache | boolean | false | Bypass cache and force fresh fetch |
ExtractSchemaOptions
| Option | Type | Default | Description |
|---|---|---|---|
| instructions | string | - | Natural language description of what data to extract |
| nocache | boolean | false | Bypass cache and force fresh fetch |
ExtractJsonOptions
| Option | Type | Default | Description |
|---|---|---|---|
| nocache | boolean | false | Bypass cache and force fresh fetch |
Best Practices
1. Generate Schemas First
Instead of writing complex schemas manually, use extract.schema() to generate them automatically:
// ✅ Good: Generate schema automatically
const generatedSchema = await tabs.extract.schema(url, {
  instructions: 'extract product details'
});
// ❌ Tedious: Writing complex schemas by hand
const handWrittenSchema = { /* hundreds of lines */ };
2. Use Type Safety in TypeScript
Define interfaces for your data and use generic types:
// ✅ Good: Type-safe with interfaces
interface MyData {
  title: string;
  items: Array<{ name: string; value: number }>;
}
const typedResult = await tabs.extract.json<MyData>(url, schema);
// typedResult.data is typed as MyData
// ❌ Loses type safety
const untypedResult = await tabs.extract.json(url, schema);
// untypedResult.data is typed as unknown
3. Handle Optional Fields
Make schemas flexible for real-world data:
const schema = {
type: 'object',
properties: {
title: { type: 'string' },
price: { type: ['number', 'null'] }, // Can be null
rating: { type: 'number' } // Optional (not in required)
},
required: ['title'] // Only title is required
};
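Code that consumes data extracted with a schema like this needs to handle both `null` and missing values. The sketch below shows one way to do that; `Listing` and `describeListing` are hypothetical names, and the placeholder strings are arbitrary.

```typescript
// Type matching the schema above: `price` may be null or absent,
// `rating` may be absent, and only `title` is guaranteed.
interface Listing {
  title: string;
  price?: number | null;
  rating?: number;
}

// Hypothetical formatter: uses explicit null/undefined checks so that a
// legitimate value of 0 is not mistaken for a missing field.
function describeListing(item: Listing): string {
  const price = item.price == null ? 'price unavailable' : item.price.toFixed(2);
  const rating = item.rating === undefined ? 'unrated' : `${item.rating} stars`;
  return `${item.title}: ${price}, ${rating}`;
}

console.log(describeListing({ title: 'Widget', price: null }));
// prints "Widget: price unavailable, unrated"
```

Checking with `== null` and `=== undefined` rather than `||` keeps a price or rating of `0` intact.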
4. Use Cache Strategically
- Default (cached): Fast and efficient for most use cases
- nocache: true: Use only when you need real-time data or are debugging
// ✅ Good: Cache for static content
const blog = await tabs.extract.markdown('https://blog.com/article');
// ✅ Good: No cache for live data
const prices = await tabs.extract.json('https://stocks.com/live', schema, {
nocache: true
});
Next Steps
- Generate Features: Transform and analyze extracted data with AI
- Automate Features: Execute complex browser automation tasks
- Error Handling: Build robust applications with proper error handling
- REST API Reference: See the underlying REST API endpoints