---
title: Using Tabstack with LangChain | Tabstack
description: Replace WebBaseLoader and PlaywrightURLLoader with schema-enforced extraction. How to integrate Tabstack as a LangChain tool, in LCEL chains, and in RAG pipelines.
---

LangChain’s built-in browser tools (`WebBaseLoader`, `PlaywrightURLLoader`) are the standard starting point for giving LangChain agents web access. They work for prototypes. In production, they break.

This guide shows how to replace them with Tabstack and what you get in return: schema-enforced structured output, managed infrastructure, and reliable extraction that doesn’t depend on your LangChain version or a locally running Playwright binary.

---

## The core swap: WebBaseLoader → extract.json

The most common pattern is `WebBaseLoader` fetching a URL and passing raw text to a chain or agent. Here’s the before and after.

**Before: WebBaseLoader**

```
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Fetch raw HTML/text
loader = WebBaseLoader("https://example.com/pricing")
docs = loader.load()
raw_text = docs[0].page_content  # Messy, unpredictable

# Now you have to prompt-engineer structured output from messy text
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template(
    "Extract pricing plans from this text as JSON: {text}"
)
chain = prompt | llm
result = chain.invoke({"text": raw_text})  # Inconsistent, unpredictable
```

**After: Tabstack**

```
import os

from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])

# Schema-enforced extraction — no prompt engineering, no parsing
result = client.extract.json(
    url="https://example.com/pricing",
    json_schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Plan name"},
                        "price": {"type": "number", "description": "Monthly price in USD"},
                        "features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Included features"
                        }
                    }
                }
            }
        }
    }
)

# result is already structured, typed, schema-validated
# No downstream LLM call needed for extraction
print(result["plans"])
```

The difference: WebBaseLoader returns text you then need to parse. Tabstack returns the shape you defined.

---

## Use Tabstack as a LangChain Tool

The cleanest integration pattern is wrapping Tabstack calls as `@tool`-decorated functions inside a LangChain agent. The agent then decides when to call them.

```
import os
import json

from tabstack import Tabstack
from langchain_core.tools import tool
from langchain.agents import create_agent

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


@tool
def extract_structured_data(url: str, json_schema_json: str) -> str:
    """Extract structured JSON data from a URL.

    Use when you need specific fields from a web page.
    'json_schema_json' must be a JSON-encoded JSON Schema object with
    concrete properties and descriptions for each field to extract.
    See the schema-design guide for patterns that produce reliable results.

    Returns the extracted data as JSON.
    """
    result = client.extract.json(
        url=url,
        json_schema=json.loads(json_schema_json),
        effort="standard",
    )
    return json.dumps(result)


@tool
def extract_page_content(url: str) -> str:
    """Fetch a URL and return its content as clean markdown.

    Use when you need to read a page's full content for summarization
    or when you don't know what specific fields to extract.

    Returns clean markdown text.
    """
    result = client.extract.markdown(url=url)
    return result.content


@tool
def research_question(query: str) -> str:
    """Research a question using multiple web sources.

    Use when you need a synthesized answer from multiple sources,
    not just data from a single known page.

    Returns an answer with cited sources.
    """
    # Iterate the stream directly. event.data is a typed model — access
    # fields as attributes. The complete event carries the synthesized
    # report plus metadata.cited_pages with source title/url entries.
    for event in client.agent.research(query=query, mode="balanced"):
        if event.event == "error":
            msg = getattr(event.data.error, "message", None) or "unknown error"
            raise RuntimeError(f"Research failed: {msg}")
        if event.event == "complete":
            cited = event.data.metadata.cited_pages or []
            return json.dumps({
                "answer": event.data.report,
                "sources": [
                    {"title": p.title or "", "url": p.url} for p in cited
                ],
            })
    return json.dumps({"answer": "Research did not complete", "sources": []})


# Build the agent (LangChain 1.x API)
agent = create_agent(
    "openai:gpt-4o",
    tools=[extract_structured_data, extract_page_content, research_question],
    system_prompt=(
        "You are a research assistant with access to web intelligence tools. "
        "Use extract_structured_data when you need specific fields from a URL — "
        "author a concrete JSON Schema with descriptions, per the schema-design guide. "
        "Use extract_page_content for full page text. "
        "Use research_question for multi-source research."
    ),
)

# Run it
result = agent.invoke({
    "messages": [{
        "role": "user",
        "content": "What are the current pricing plans for Vercel and how do they compare?",
    }],
})
print(result["messages"][-1].content)
```

> Using LangChain 0.x? Replace the import with `from langchain_classic.agents import AgentExecutor, create_tool_calling_agent` and use the `AgentExecutor` + `ChatPromptTemplate` pattern instead. The new `create_agent` API is the canonical LangChain 1.x replacement.
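When the agent calls `extract_structured_data`, the schema crosses the tool boundary as a JSON string. Here is a minimal sketch of that round trip using only the standard library; the pricing fields are illustrative, not anything Tabstack requires:

```python
import json

# A concrete schema an agent might author for a pricing page.
# Field names and descriptions are illustrative.
pricing_schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Plan name"},
                    "price": {"type": "number", "description": "Monthly price in USD"},
                },
            },
        }
    },
}

# The tool takes a JSON-encoded string, so the caller serializes first...
json_schema_json = json.dumps(pricing_schema)

# ...and the tool body decodes it back into a dict before calling the API.
decoded = json.loads(json_schema_json)
print(decoded["properties"]["plans"]["type"])  # array
```

Running a schema string through `json.loads` yourself before invoking the agent catches malformed schemas locally instead of inside a failed tool call.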
---

## Replace PlaywrightURLLoader

If you’re using `PlaywrightURLLoader` for JS-heavy pages, replace it with `effort: 'max'`:

**Before: PlaywrightURLLoader**

```
from langchain_community.document_loaders import PlaywrightURLLoader

# Requires Playwright installed, browser binaries, async handling
loader = PlaywrightURLLoader(
    urls=["https://spa-site.com/data"],
    remove_selectors=["header", "footer"]
)
docs = loader.load()  # Frequently breaks in prod
```

**After: Tabstack with effort: max**

```
result = client.extract.markdown(
    url="https://spa-site.com/data",
    effort="max"  # Full browser rendering for JS-heavy pages
)
content = result.content  # Clean markdown, no install required
```

No Playwright binary. No version dependency. No async handling issues. The `effort: 'max'` flag tells Tabstack to use full headless browser rendering server-side. You get the same rendered content without managing the browser yourself.

---

## Use Tabstack with LCEL (LangChain Expression Language)

For LCEL chains, wrap Tabstack as a plain callable:

```
import os

from langchain_core.runnables import RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def fetch_page_content(url: str) -> str:
    """Fetch clean markdown from a URL via Tabstack."""
    result = client.extract.markdown(url=url)
    return result.content


def fetch_structured(inputs: dict) -> dict:
    """Fetch structured data from a URL using the schema from inputs."""
    result = client.extract.json(
        url=inputs["url"],
        json_schema=inputs["schema"]
    )
    return {**inputs, "extracted": result}


# Chain: URL → clean markdown → summarize
# Note: the prompt template expects a dict, so the lambda wraps the
# fetched text under the "text" key.
summarize_chain = (
    RunnableLambda(lambda url: {"text": fetch_page_content(url)})
    | ChatPromptTemplate.from_template("Summarize this in 3 bullet points: {text}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

summary = summarize_chain.invoke("https://example.com/blog/article")
print(summary)
```

---

## Use Tabstack with LangChain RAG pipelines

Tabstack’s `/extract/markdown` returns content with optional metadata, useful for enriching documents before embedding:

```
import os

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def tabstack_to_document(url: str) -> Document:
    """Fetch a URL via Tabstack and return a LangChain Document."""
    result = client.extract.markdown(url=url, metadata=True)
    return Document(
        page_content=result.content,
        metadata={
            "source": url,
            "title": result.metadata.title if result.metadata else None,
            "author": result.metadata.author if result.metadata else None,
            "published": result.metadata.created_at if result.metadata else None,
        }
    )


# Build a vector store from multiple URLs
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]
docs = [tabstack_to_document(url) for url in urls]

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

# Use in a RAG chain
rag_prompt = ChatPromptTemplate.from_template(
    "Answer based on this context: {context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I authenticate with the API?")
print(answer)
```

---

## Installation

```
pip install tabstack langchain langchain-openai langchain-community faiss-cpu

export TABSTACK_API_KEY="your-key-here"
export OPENAI_API_KEY="your-openai-key"
```

`faiss-cpu` backs the FAISS vector store used in the RAG example. Tabstack has no framework dependency.
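Both keys are read from the environment when the clients are constructed. A small preflight check (a hypothetical helper, not part of either SDK) fails fast with a readable message instead of a `KeyError` somewhere inside a chain:

```python
import os

REQUIRED_KEYS = ("TABSTACK_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None):
    """Return the names of required API keys that are absent or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# With both keys present, nothing is missing.
print(missing_keys({"TABSTACK_API_KEY": "tk-1", "OPENAI_API_KEY": "sk-1"}))  # []

# With one key absent, the check names it.
print(missing_keys({"TABSTACK_API_KEY": "tk-1"}))  # ['OPENAI_API_KEY']
```

Calling `missing_keys()` with no argument checks the real environment, so it can sit at the top of a script before any client is built.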
It works alongside any LangChain version without creating additional version conflicts in your dependency tree.

---

## Why this matters in production

`WebBaseLoader` and `PlaywrightURLLoader` break in production for predictable reasons:

- `PlaywrightURLLoader` depends on your Playwright version, browser binary availability, and async handling that changes across LangChain minor releases
- `WebBaseLoader` returns raw BeautifulSoup-parsed text: what you get varies by page, with no schema enforcement, and prompt-dependent extraction drifts at scale
- LangChain releases frequently; browser loader APIs have changed across minor versions

Tabstack removes all of that:

- Managed infrastructure: no browser to install or maintain
- Schema-enforced output: consistent structure every call
- No LangChain version dependency: it’s an HTTP API call
- `effort: 'max'` handles JS-heavy pages server-side

---

## Choosing the right tool

| Situation | Tool |
| --- | --- |
| Need specific structured fields from a known URL | `client.extract.json()` |
| Need full page content for summarization or embedding | `client.extract.markdown()` |
| Need AI transformation of page content | `client.generate.json()` |
| Need multi-source research with citations | `client.agent.research()` |
| Quick prototype, raw text is fine | `WebBaseLoader` (LangChain) |
| Need offline / local LLM support | `WebBaseLoader` (LangChain) |
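If several of these calls coexist in one codebase, the table can be encoded as data so routing logic lives in one place. A sketch under stated assumptions: the situation labels are illustrative, and only the method paths come from the table above.

```python
# Situation labels (keys) are illustrative; the values are the Tabstack
# calls from the table above, stored as dotted paths.
TOOL_FOR = {
    "structured_fields": "client.extract.json",
    "full_page_content": "client.extract.markdown",
    "ai_transformation": "client.generate.json",
    "multi_source_research": "client.agent.research",
}


def choose_tool(situation: str) -> str:
    """Pick the Tabstack call for a situation, defaulting to markdown extraction."""
    return TOOL_FOR.get(situation, "client.extract.markdown")


print(choose_tool("multi_source_research"))  # client.agent.research
print(choose_tool("unknown"))                # client.extract.markdown
```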