# Using Tabstack with LangChain
Replace WebBaseLoader and PlaywrightURLLoader with schema-enforced extraction. How to integrate Tabstack as a LangChain tool, in LCEL chains, and in RAG pipelines.
LangChain’s built-in browser tools (WebBaseLoader, PlaywrightURLLoader) are the standard starting point for giving LangChain agents web access. They work for prototypes. In production, they break.
This guide shows how to replace them with Tabstack and what you get in return: schema-enforced structured output, managed infrastructure, and reliable extraction that doesn’t depend on your LangChain version or a locally running Playwright binary.
## The core swap: WebBaseLoader → extract.json
The most common pattern is `WebBaseLoader` fetching a URL and passing raw text to a chain or agent. Here's the before and after.
### Before: WebBaseLoader
```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Fetch raw HTML/text
loader = WebBaseLoader("https://example.com/pricing")
docs = loader.load()
raw_text = docs[0].page_content  # Messy, unpredictable

# Now you have to prompt-engineer structured output from messy text
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template(
    "Extract pricing plans from this text as JSON: {text}"
)
chain = prompt | llm
result = chain.invoke({"text": raw_text})  # Inconsistent, unpredictable
```

### After: Tabstack
```python
import os

from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])

# Schema-enforced extraction — no prompt engineering, no parsing
result = client.extract.json(
    url="https://example.com/pricing",
    json_schema={
        "type": "object",
        "properties": {
            "plans": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Plan name"},
                        "price": {"type": "number", "description": "Monthly price in USD"},
                        "features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Included features"
                        }
                    }
                }
            }
        }
    }
)

# result is already structured, typed, schema-validated
# No downstream LLM call needed for extraction
print(result["plans"])
```

The difference: `WebBaseLoader` returns text you then need to parse. Tabstack returns the shape you defined.
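Because the result already matches the schema, downstream code can consume it directly, with no defensive parsing. A minimal sketch, using made-up sample data shaped like the pricing schema above (the helper and the plan values are illustrative, not real pricing):

```python
def cheapest_plan(extracted: dict) -> tuple[str, float]:
    """Return (name, price) of the lowest-priced plan from an extract.json result."""
    plans = extracted["plans"]  # present and list-shaped, per the schema
    best = min(plans, key=lambda p: p["price"])
    return best["name"], best["price"]


# Sample data in the schema's shape (illustrative values only)
sample = {
    "plans": [
        {"name": "Hobby", "price": 0.0, "features": ["1 project"]},
        {"name": "Pro", "price": 20.0, "features": ["Unlimited projects", "Support"]},
    ]
}
print(cheapest_plan(sample))  # → ('Hobby', 0.0)
```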
## Use Tabstack as a LangChain Tool
The cleanest integration pattern is wrapping Tabstack calls as `@tool`-decorated functions inside a LangChain agent. The agent then decides when to call them.
```python
import os
import json

from tabstack import Tabstack
from langchain_core.tools import tool
from langchain.agents import create_agent

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


@tool
def extract_structured_data(url: str, json_schema_json: str) -> str:
    """Extract structured JSON data from a URL.

    Use when you need specific fields from a web page.
    'json_schema_json' must be a JSON-encoded JSON Schema object with
    concrete properties and descriptions for each field to extract.
    See the schema-design guide for patterns that produce reliable results.

    Returns the extracted data as JSON.
    """
    result = client.extract.json(
        url=url,
        json_schema=json.loads(json_schema_json),
        effort="standard",
    )
    return json.dumps(result)


@tool
def extract_page_content(url: str) -> str:
    """Fetch a URL and return its content as clean markdown.

    Use when you need to read a page's full content for summarization
    or when you don't know what specific fields to extract.

    Returns clean markdown text.
    """
    result = client.extract.markdown(url=url)
    return result.content


@tool
def research_question(query: str) -> str:
    """Research a question using multiple web sources.

    Use when you need a synthesized answer from multiple sources,
    not just data from a single known page.

    Returns an answer with cited sources.
    """
    # Iterate the stream directly. event.data is a typed model — access
    # fields as attributes. The complete event carries the synthesized
    # report plus metadata.cited_pages with source title/url entries.
    for event in client.agent.research(query=query, mode="balanced"):
        if event.event == "error":
            msg = getattr(event.data.error, "message", None) or "unknown error"
            raise RuntimeError(f"Research failed: {msg}")
        if event.event == "complete":
            cited = event.data.metadata.cited_pages or []
            return json.dumps({
                "answer": event.data.report,
                "sources": [
                    {"title": p.title or "", "url": p.url} for p in cited
                ],
            })
    return json.dumps({"answer": "Research did not complete", "sources": []})


# Build the agent (LangChain 1.x API)
agent = create_agent(
    "openai:gpt-4o",
    tools=[extract_structured_data, extract_page_content, research_question],
    system_prompt=(
        "You are a research assistant with access to web intelligence tools. "
        "Use extract_structured_data when you need specific fields from a URL — "
        "author a concrete JSON Schema with descriptions, per the schema-design guide. "
        "Use extract_page_content for full page text. "
        "Use research_question for multi-source research."
    ),
)

# Run it
result = agent.invoke({
    "messages": [{
        "role": "user",
        "content": "What are the current pricing plans for Vercel and how do they compare?",
    }],
})
print(result["messages"][-1].content)
```

Using LangChain 0.x? Replace the import with `from langchain.agents import AgentExecutor, create_tool_calling_agent` and use the `AgentExecutor` + `ChatPromptTemplate` pattern instead. The new `create_agent` API is the canonical LangChain 1.x replacement.
## Replace PlaywrightURLLoader
If you're using `PlaywrightURLLoader` for JS-heavy pages, replace it with `effort="max"`:
### Before: PlaywrightURLLoader
```python
from langchain_community.document_loaders import PlaywrightURLLoader

# Requires Playwright installed, browser binaries, async handling
loader = PlaywrightURLLoader(
    urls=["https://spa-site.com/data"],
    remove_selectors=["header", "footer"]
)
docs = loader.load()  # Frequently breaks in prod
```

### After: Tabstack with effort="max"

```python
result = client.extract.markdown(
    url="https://spa-site.com/data",
    effort="max"  # Full browser rendering for JS-heavy pages
)
content = result.content  # Clean markdown, no install required
```

No Playwright binary. No version dependency. No async handling issues. The `effort="max"` flag tells Tabstack to use full headless browser rendering server-side. You get the same rendered content without managing the browser yourself.
## Use Tabstack with LCEL (LangChain Expression Language)
For LCEL chains, wrap Tabstack as a plain callable:
```python
import os

from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def fetch_page_content(url: str) -> str:
    """Fetch clean markdown from a URL via Tabstack."""
    result = client.extract.markdown(url=url)
    return result.content


def fetch_structured(inputs: dict) -> dict:
    """Fetch structured data from URL using schema from inputs."""
    result = client.extract.json(
        url=inputs["url"],
        json_schema=inputs["schema"]
    )
    return {**inputs, "extracted": result}


# Chain: URL → clean markdown → summarize
# The dict step coerces to a RunnableParallel, so the prompt receives
# {"text": ...} rather than a bare string (which would raise a TypeError).
summarize_chain = (
    {"text": RunnableLambda(fetch_page_content)}
    | ChatPromptTemplate.from_template("Summarize this in 3 bullet points: {text}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

summary = summarize_chain.invoke("https://example.com/blog/article")
print(summary)
```

## Use Tabstack with LangChain RAG pipelines
Tabstack's `/extract/markdown` endpoint returns content with optional metadata, useful for enriching documents before embedding:
```python
import os

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from tabstack import Tabstack

client = Tabstack(api_key=os.environ["TABSTACK_API_KEY"])


def tabstack_to_document(url: str) -> Document:
    """Fetch a URL via Tabstack and return a LangChain Document."""
    result = client.extract.markdown(url=url, metadata=True)
    return Document(
        page_content=result.content,
        metadata={
            "source": url,
            "title": result.metadata.title if result.metadata else None,
            "author": result.metadata.author if result.metadata else None,
            "published": result.metadata.created_at if result.metadata else None,
        }
    )


# Build a vector store from multiple URLs
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]

docs = [tabstack_to_document(url) for url in urls]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()
```
```python
# Use in a RAG chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

rag_prompt = ChatPromptTemplate.from_template(
    "Answer based on this context: {context}\n\nQuestion: {question}"
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I authenticate with the API?")
print(answer)
```

## Installation
```shell
pip install tabstack langchain langchain-openai langchain-community
export TABSTACK_API_KEY="your-key-here"
export OPENAI_API_KEY="your-openai-key"
```

Tabstack has no framework dependency. It works alongside any LangChain version without creating additional version conflicts in your dependency tree.
## Why this matters in production
`WebBaseLoader` and `PlaywrightURLLoader` break in production for predictable reasons:
- `PlaywrightURLLoader` depends on your Playwright version, browser binary availability, and async handling that changes across LangChain minor releases
- `WebBaseLoader` returns raw BeautifulSoup-parsed text: what you get varies by page, no schema enforcement, prompt-dependent extraction that drifts at scale
- LangChain releases frequently; browser loader APIs have changed across minor versions
Tabstack removes all of that:
- Managed infrastructure: no browser to install or maintain
- Schema-enforced output: consistent structure every call
- No LangChain version dependency: it’s an HTTP API call
- `effort="max"` handles JS-heavy pages server-side
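To make the last point concrete: a request can be built with nothing but the standard library. This is an illustrative sketch only; the base URL `https://api.tabstack.ai` and the bearer-token header are assumptions, not documented values, and the SDK handles all of this for you:

```python
import json
import urllib.request


def build_markdown_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the /extract/markdown endpoint.

    The base URL and auth scheme here are assumptions for illustration.
    """
    body = json.dumps({"url": url, "effort": "max"}).encode("utf-8")
    return urllib.request.Request(
        "https://api.tabstack.ai/extract/markdown",  # hypothetical base URL
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_markdown_request("https://spa-site.com/data", api_key="your-key-here")
# with urllib.request.urlopen(req) as resp:   # send only with a real key
#     payload = json.loads(resp.read())
```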
## Choosing the right tool
| Situation | Tool |
|---|---|
| Need specific structured fields from a known URL | `client.extract.json()` |
| Need full page content for summarization or embedding | `client.extract.markdown()` |
| Need AI transformation of page content | `client.generate.json()` |
| Need multi-source research with citations | `client.agent.research()` |
| Quick prototype, raw text is fine | `WebBaseLoader` (LangChain) |
| Need offline / local LLM support | `WebBaseLoader` (LangChain) |
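Inside an agent, that decision logic can be sketched as a small router. The helper and its boolean flags are hypothetical and cover only the main rows of the table; real routing will depend on how your agent classifies requests:

```python
def choose_tool(*, known_url: bool, wants_fields: bool, needs_synthesis: bool) -> str:
    """Map a request shape to a Tabstack call, per the table above."""
    if needs_synthesis:
        return "client.agent.research()"    # multi-source research with citations
    if known_url and wants_fields:
        return "client.extract.json()"      # specific structured fields
    if known_url:
        return "client.extract.markdown()"  # full page content
    return "WebBaseLoader"                  # quick prototype / offline fallback


# Example: specific fields from a known URL → schema-enforced extraction
print(choose_tool(known_url=True, wants_fields=True, needs_synthesis=False))
# → client.extract.json()
```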