Extract vs. Generate

Build

When to use /extract/json vs /generate/json: extract returns data already on the page, generate has a model produce new data from it.

Extract returns data that already exists on the page. Generate produces data that does not exist on the page until the model creates it. That one distinction decides which endpoint you reach for: if the value is already sitting in the HTML, extract it. If a model has to read the content and produce something new, generate it.

Same URL, different answer

Point both endpoints at the same article and the difference shows up in what comes back.

Extract

/extract/json lifts fields that are already on the page.

{
  "title": "Understanding Microservices",
  "author": "Jane Doe",
  "published_date": "2026-05-12"
}

Every field here is present in the page HTML. Extract reads it and hands it back.

Generate

/generate/json returns fields the model produced from the content.

{
  "title": "Understanding Microservices",
  "category": "engineering",
  "sentiment": "neutral",
  "relevance_score": 8
}

category, sentiment, and relevance_score appear nowhere in the page. The model read the article and created them.

relevance_score is the proof. Nothing in the page contains an 8, so it can only have come from generate.

Does the data exist on the page?

That is the whole decision. If the field is already in the HTML, use extract. If the model has to read, reason, and produce it, use generate.

The Hacker News homepage makes the split obvious. Each story’s title is text on the page, so extract handles it. A category and a one-line summary are not on the page at all, so generate has to create them:

{
  "summaries": [
    {
      "title": "New AI Model Released",
      "category": "tech",
      "summary": "A research lab announced a new language model that performs better on reasoning tasks."
    }
  ]
}

title was extracted. category and summary did not exist until the model wrote them. The same response carries both kinds of field, which is exactly what generate is for: a schema where some values are read and others are produced.

A field that cannot be read

Here is one generate call built around a value the page does not contain. The model reads each article and assigns a relevance_score, then returns it inside your schema.

import Tabstack from "@tabstack/sdk";

const client = new Tabstack();

const result = await client.generate.json({
  url: "https://competitor.example.com/blog",
  json_schema: {
    type: "object",
    properties: {
      articles: {
        type: "array",
        items: {
          type: "object",
          properties: {
            title: { type: "string" },
            target_audience: { type: "string" },
            relevance_score: {
              type: "number",
              description: "1-10 score for a developer focused on AI agents",
            },
          },
        },
      },
    },
  },
  instructions:
    "For each article, identify the target audience and assign a relevance_score from 1-10 for a developer focused on AI agents.",
});

console.log(JSON.stringify(result, null, 2));

import json
from tabstack import Tabstack

client = Tabstack()

result = client.generate.json(
    url="https://competitor.example.com/blog",
    json_schema={
        "type": "object",
        "properties": {
            "articles": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "target_audience": {"type": "string"},
                        "relevance_score": {
                            "type": "number",
                            "description": "1-10 score for a developer focused on AI agents",
                        },
                    },
                },
            },
        },
    },
    instructions="For each article, identify the target audience and assign a relevance_score from 1-10 for a developer focused on AI agents.",
)

print(json.dumps(result, indent=2))

tabstack generate json "https://competitor.example.com/blog" \
  --instructions "For each article, identify the target audience and assign a relevance_score from 1-10 for a developer focused on AI agents." \
  --schema @schema.json

The response carries the generated fields back inside the shape you asked for:

{
  "articles": [
    {
      "title": "Shipping faster with feature flags",
      "target_audience": "engineering leads",
      "relevance_score": 6
    },
    {
      "title": "Building autonomous agents with tool use",
      "target_audience": "AI engineers",
      "relevance_score": 9
    }
  ]
}

title is on the page. target_audience and relevance_score are not. The model produced both.

When not to use generate

If extract can get it, use extract. Both endpoints are AI-guided, but generate adds an explicit transformation step: you hand it instructions, and it produces content that is not on the page. That extra step usually makes generate slower and more expensive than extract, so reach for it only when the value has to be created.

Generate earns that cost when the value has to be reasoned into existence: scoring, summarizing, categorizing, sentiment, rewriting. It is not for pulling data that is already structured on the page. A product price, a published date, a table of specs: that is extract. A relevance score, a one-line summary, a sentiment label: that is generate.

This still matters when read and produced fields could share one schema. A generate.json call is billed as a generate request no matter how many of its fields are read-only, so folding easy-to-read fields into it saves nothing on those fields. When cost matters, split the work: pull the read-only fields with extract.json and use generate.json only for the derived outputs.

For the parameters that tune a generate call, see the how-to and the SDK reference rather than this page.

Next steps

How to generate JSON for the step-by-step mechanics: parameters, instructions, and error handling.
Generate Features (SDK reference) for the generate.json surface in the SDKs.
How to extract JSON for the other side of the contrast.