# Using Tabstack with LangChain
Replace WebBaseLoader and PlaywrightURLLoader with schema-enforced extraction. How to integrate Tabstack as a LangChain tool, in LCEL chains, and in RAG pipelines.
LangChain’s built-in browser tools (WebBaseLoader, PlaywrightURLLoader) are the standard starting point for giving LangChain agents web access. They work for prototypes. In production, they break.
This guide shows how to replace them with Tabstack and what you get in return: schema-enforced structured output, managed infrastructure, and reliable extraction that doesn’t depend on your LangChain version or a locally running Playwright binary.
## The core swap: WebBaseLoader → extract.json
The most common pattern is `WebBaseLoader` fetching a URL and passing raw text to a chain or agent. Here's the before and after.
### Before: WebBaseLoader
```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Fetch raw HTML/text
loader = WebBaseLoader("https://example.com/pricing")
docs = loader.load()
raw_text = docs[0].page_content  # Messy, unpredictable

# Now you have to prompt-engineer structured output from messy text
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template(
    "Extract pricing plans from this text as JSON: {text}"
)
chain = prompt | llm
result = chain.invoke({"text": raw_text})  # Inconsistent, unpredictable
```

### After: Tabstack
```python
import os

from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])

# Schema-enforced extraction — no prompt engineering, no parsing
result = client.extract.json(
    url="https://example.com/pricing",
    json_schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Plan name"},
                        "price": {"type": "number", "description": "Monthly price in USD"},
                        "features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Included features"
                        }
                    }
                }
            }
        }
    }
)

# result is already structured, typed, schema-validated
# No downstream LLM call needed for extraction
print(result["plans"])
```

The difference: `WebBaseLoader` returns text you then need to parse. Tabstack returns the shape you defined.
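Because the result already matches the schema, downstream code can consume it directly, with no defensive parsing. A minimal sketch, using made-up sample data shaped like the pricing schema above (the helper and the plan values are illustrative, not real pricing):

```python
def cheapest_plan(extracted: dict) -> tuple[str, float]:
    """Return (name, price) of the lowest-priced plan from an extract.json result."""
    plans = extracted["plans"]  # present and list-shaped, per the schema
    best = min(plans, key=lambda p: p["price"])
    return best["name"], best["price"]


# Sample data in the schema's shape (illustrative values only)
sample = {
    "plans": [
        {"name": "Hobby", "price": 0.0, "features": ["1 project"]},
        {"name": "Pro", "price": 20.0, "features": ["Unlimited projects", "Support"]},
    ]
}
print(cheapest_plan(sample))  # → ('Hobby', 0.0)
```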
## Use Tabstack as a LangChain Tool
The cleanest integration pattern is wrapping Tabstack calls as `@tool`-decorated functions inside a LangChain agent. The agent then decides when to call them.
```python
import os
import json

from tabstack import Tabstack
from langchain_core.tools import tool
from langchain.agents import create_agent

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


@tool
def extract_structured_data(url: str, json_schema_json: str) -> str:
    """Extract structured JSON data from a URL.

    Use when you need specific fields from a web page.
    'json_schema_json' must be a JSON-encoded JSON Schema object with
    concrete properties and descriptions for each field to extract.
    See the schema-design guide for patterns that produce reliable results.

    Returns the extracted data as JSON.
    """
    result = client.extract.json(
        url=url,
        json_schema=json.loads(json_schema_json),
        effort="standard",
    )
    return json.dumps(result)


@tool
def extract_page_content(url: str) -> str:
    """Fetch a URL and return its content as clean markdown.

    Use when you need to read a page's full content for summarization
    or when you don't know what specific fields to extract.

    Returns clean markdown text.
    """
    result = client.extract.markdown(url=url)
    return result.content


@tool
def research_question(query: str) -> str:
    """Research a question using multiple web sources.

    Use when you need a synthesized answer from multiple sources,
    not just data from a single known page.

    Returns an answer with cited sources.
    """
    # Iterate the stream directly. event.data is a typed model — access
    # fields as attributes. The complete event carries the synthesized
    # report plus metadata.cited_pages with source title/url entries.
    for event in client.agent.research(query=query, mode="balanced"):
        if event.event == "error":
            msg = getattr(event.data.error, "message", None) or "unknown error"
            raise RuntimeError(f"Research failed: {msg}")
        if event.event == "complete":
            cited = event.data.metadata.cited_pages or []
            return json.dumps({
                "answer": event.data.report,
                "sources": [
                    {"title": p.title or "", "url": p.url} for p in cited
                ],
            })
    return json.dumps({"answer": "Research did not complete", "sources": []})


# Build the agent (LangChain 1.x API)
agent = create_agent(
    "openai:gpt-4o",
    tools=[extract_structured_data, extract_page_content, research_question],
    system_prompt=(
        "You are a research assistant with access to web intelligence tools. "
        "Use extract_structured_data when you need specific fields from a URL — "
        "author a concrete JSON Schema with descriptions, per the schema-design guide. "
        "Use extract_page_content for full page text. "
        "Use research_question for multi-source research."
    ),
)

# Run it
result = agent.invoke({
    "messages": [{
        "role": "user",
        "content": "What are the current pricing plans for Vercel and how do they compare?",
    }],
})
print(result["messages"][-1].content)
```

Using LangChain 0.x? Replace the import with `from langchain.agents import AgentExecutor, create_tool_calling_agent` and use the `AgentExecutor` + `ChatPromptTemplate` pattern instead. The new `create_agent` API is the canonical LangChain 1.x replacement.
## Replace PlaywrightURLLoader
If you're using `PlaywrightURLLoader` for JS-heavy pages, replace it with `effort="max"`:
### Before: PlaywrightURLLoader
```python
from langchain_community.document_loaders import PlaywrightURLLoader

# Requires Playwright installed, browser binaries, async handling
loader = PlaywrightURLLoader(
    urls=["https://spa-site.com/data"],
    remove_selectors=["header", "footer"]
)
docs = loader.load()  # Frequently breaks in prod
```

### After: Tabstack with effort="max"

```python
result = client.extract.markdown(
    url="https://spa-site.com/data",
    effort="max"  # Full browser rendering for JS-heavy pages
)
content = result.content  # Clean markdown, no install required
```

No Playwright binary. No version dependency. No async handling issues. The `effort="max"` flag tells Tabstack to use full headless browser rendering server-side. You get the same rendered content without managing the browser yourself.
## Use Tabstack with LCEL (LangChain Expression Language)
For LCEL chains, wrap Tabstack as a plain callable:
```python
import os

from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def fetch_page_content(url: str) -> str:
    """Fetch clean markdown from a URL via Tabstack."""
    result = client.extract.markdown(url=url)
    return result.content


def fetch_structured(inputs: dict) -> dict:
    """Fetch structured data from URL using schema from inputs."""
    result = client.extract.json(
        url=inputs["url"],
        json_schema=inputs["schema"]
    )
    return {**inputs, "extracted": result}


# Chain: URL → clean markdown → summarize
# The dict step coerces to a RunnableParallel, so the prompt receives
# {"text": ...} rather than a bare string (which would raise a TypeError).
summarize_chain = (
    {"text": RunnableLambda(fetch_page_content)}
    | ChatPromptTemplate.from_template("Summarize this in 3 bullet points: {text}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

summary = summarize_chain.invoke("https://example.com/blog/article")
print(summary)
```

## Use Tabstack with LangChain RAG pipelines
Tabstack's `/extract/markdown` endpoint returns content with optional metadata, useful for enriching documents before embedding:
```python
import os

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def tabstack_to_document(url: str) -> Document:
    """Fetch a URL via Tabstack and return a LangChain Document."""
    result = client.extract.markdown(url=url, metadata=True)
    return Document(
        page_content=result.content,
        metadata={
            "source": url,
            "title": result.metadata.title if result.metadata else None,
            "author": result.metadata.author if result.metadata else None,
            "published": result.metadata.created_at if result.metadata else None,
        }
    )


# Build a vector store from multiple URLs
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]

docs = [tabstack_to_document(url) for url in urls]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()
```
```python
# Use in a RAG chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

rag_prompt = ChatPromptTemplate.from_template(
    "Answer based on this context: {context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I authenticate with the API?")
print(answer)
```

## Installation
```shell
pip install tabstack langchain langchain-openai langchain-community
export TABSTACK_API_KEY="your-key-here"
export OPENAI_API_KEY="your-openai-key"
```

Tabstack has no framework dependency. It works alongside any LangChain version without creating additional version conflicts in your dependency tree.
## Why this matters in production
`WebBaseLoader` and `PlaywrightURLLoader` break in production for predictable reasons:
- `PlaywrightURLLoader` depends on your Playwright version, browser binary availability, and async handling that changes across LangChain minor releases
- `WebBaseLoader` returns raw BeautifulSoup-parsed text: what you get varies by page, no schema enforcement, prompt-dependent extraction that drifts at scale
- LangChain releases frequently; browser loader APIs have changed across minor versions
Tabstack removes all of that:
- Managed infrastructure: no browser to install or maintain
- Schema-enforced output: consistent structure every call
- No LangChain version dependency: it’s an HTTP API call
- `effort="max"` handles JS-heavy pages server-side
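To make the last point concrete: a request can be built with nothing but the standard library. This is an illustrative sketch only; the base URL `https://api.tabstack.ai` and the bearer-token header are assumptions, not documented values, and the SDK handles all of this for you:

```python
import json
import urllib.request


def build_markdown_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the /extract/markdown endpoint.

    The base URL and auth scheme here are assumptions for illustration.
    """
    body = json.dumps({"url": url, "effort": "max"}).encode("utf-8")
    return urllib.request.Request(
        "https://api.tabstack.ai/extract/markdown",  # hypothetical base URL
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_markdown_request("https://spa-site.com/data", api_key="your-key-here")
# with urllib.request.urlopen(req) as resp:   # send only with a real key
#     payload = json.loads(resp.read())
```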
## Choosing the right tool
| Situation | Tool |
|---|---|
| Need specific structured fields from a known URL | `client.extract.json()` |
| Need full page content for summarization or embedding | `client.extract.markdown()` |
| Need AI transformation of page content | `client.generate.json()` |
| Need multi-source research with citations | `client.agent.research()` |
| Quick prototype, raw text is fine | `WebBaseLoader` (LangChain) |
| Need offline / local LLM support | `WebBaseLoader` (LangChain) |
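Inside an agent, that decision logic can be sketched as a small router. The helper and its boolean flags are hypothetical and cover only the main rows of the table; real routing will depend on how your agent classifies requests:

```python
def choose_tool(*, known_url: bool, wants_fields: bool, needs_synthesis: bool) -> str:
    """Map a request shape to a Tabstack call, per the table above."""
    if needs_synthesis:
        return "client.agent.research()"    # multi-source research with citations
    if known_url and wants_fields:
        return "client.extract.json()"      # specific structured fields
    if known_url:
        return "client.extract.markdown()"  # full page content
    return "WebBaseLoader"                  # quick prototype / offline fallback


# Example: specific fields from a known URL → schema-enforced extraction
print(choose_tool(known_url=True, wants_fields=True, needs_synthesis=False))
# → client.extract.json()
```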