Developer's Guide
Mastering the Markdown Endpoint
Scraping web content is often messy. You're left with a complex tangle of HTML, boilerplate, and ads, when all you really want is the clean, structured content.
The TABS API Markdown Endpoint solves this. It's a single POST request that fetches any public URL, intelligently parses the HTML, and returns clean, well-formatted Markdown. It's the perfect tool for:
- Building content aggregation or "read-it-later" apps.
- Preparing web content for AI/LLM processing and RAG pipelines.
- Converting blog posts or articles into a stable, storable format.
- Powering documentation and content management systems.
This guide will walk you through setting up your environment, making your first request, and building a production-ready function to handle content conversion robustly.
Prerequisites & Authentication
Before you begin, you'll need a TABS API key. You can get yours by signing up at https://tabstack.ai.
The API uses Bearer token authentication. We strongly recommend storing your key as an environment variable rather than hardcoding it in your application.
First, set the variable in your terminal session.
export TABS_API_KEY="your-api-key-here"
The export TABS_API_KEY=... Bash command sets an environment variable named TABS_API_KEY for your current session. Your application code (e.g., in Python or Node.js) can then access this variable, keeping your secret key out of your source code.
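As a minimal sketch of the pattern described above, the Python helpers below read the key from the environment and fail fast when it is missing. The names load_api_key and auth_headers are illustrative only and are not part of any TABS SDK.

```python
import os


def load_api_key(env_var: str = "TABS_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or empty."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"{env_var} environment variable not set.")
    return api_key


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the headers for Bearer token authentication with a JSON body."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Raising early keeps a missing key from surfacing later as a confusing 401 response.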
The Basic Request
Let's start by converting a URL with the simplest possible request. The endpoint lives at https://api.tabstack.ai/v1/extract/markdown and expects a POST request with a JSON body.
- curl
- JavaScript
- Python
curl -X POST https://api.tabstack.ai/v1/extract/markdown \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/article"
  }'

const response = await fetch('https://api.tabstack.ai/v1/extract/markdown', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TABS_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});
const data = await response.json();
console.log(data);

import requests
import os

api_key = os.environ.get("TABS_API_KEY")
endpoint_url = "https://api.tabstack.ai/v1/extract/markdown"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
payload = {
    "url": "https://example.com/blog/article"
}
response = requests.post(endpoint_url, headers=headers, json=payload)
data = response.json()
print(data)
All three examples make an authenticated POST request with a JSON body containing the target URL. The API fetches the page, extracts the main content, and returns it as clean Markdown with metadata.
Default Response: Content with Frontmatter
A successful request returns a JSON object. By default, the API embeds all extracted metadata (title, author, and so on) as YAML frontmatter at the top of the content.
{
"url": "https://example.com/blog/article",
"content": "---\ntitle: Example Article Title\ndescription: This is an example article...\nauthor: Example Author\nimage: https://example.com/images/article.jpg\n---\n\n# Example Article Title\n\nThis is the article content converted to markdown..."
}
The response includes the processed URL and the content with YAML frontmatter embedded—perfect for static site generators like Hugo or Jekyll that expect this format.
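If you consume the default format in your own code, a naive splitter such as the sketch below can separate the frontmatter from the body. It assumes flat key: value lines like those in the sample response and is not a full YAML parser; the function name split_frontmatter is hypothetical.

```python
def split_frontmatter(content: str) -> tuple[dict[str, str], str]:
    """Split '---'-delimited frontmatter from the Markdown body.

    Naive by design: handles flat 'key: value' lines only, not full YAML.
    Returns ({}, content) when no frontmatter block is found.
    """
    if not content.startswith("---\n"):
        return {}, content
    # Look for the closing delimiter after the opening '---'.
    end = content.find("\n---\n", 3)
    if end == -1:
        return {}, content
    header = content[4:end]
    body = content[end + len("\n---\n"):].lstrip("\n")
    metadata: dict[str, str] = {}
    for line in header.splitlines():
        # Split on the first colon only, so URL values keep their 'https:' part.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata, body
```

Values containing quotes, multi-line strings, or nested structures need a real YAML library.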
Getting Separate Metadata
YAML frontmatter is great, but sometimes you want metadata as a clean, parsable JSON object, separate from the content. This is essential for populating databases or feeding structured data to other systems.
To do this, simply add the metadata: true parameter to your request.
- curl
- JavaScript
- Python
curl -X POST https://api.tabstack.ai/v1/extract/markdown \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/article",
    "metadata": true
  }'

const response = await fetch('https://api.tabstack.ai/v1/extract/markdown', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TABS_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article',
    metadata: true // Request separate metadata
  })
});
const data = await response.json();
console.log(data);

import requests
import os

api_key = os.environ.get("TABS_API_KEY")
endpoint_url = "https://api.tabstack.ai/v1/extract/markdown"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
payload = {
    "url": "https://example.com/blog/article",
    "metadata": True  # Request separate metadata
}
response = requests.post(endpoint_url, headers=headers, json=payload)
data = response.json()
print(data)
Adding metadata: true to the request changes the response format to separate content and metadata into distinct fields.
New Response: Clean Content + Metadata Object
With metadata: true set, the response includes a separate metadata object:
{
"url": "https://example.com/blog/article",
"content": "# Example Article Title\n\nThis is the article content converted to markdown...",
"metadata": {
"title": "Example Article Title",
"description": "This is an example article description",
"author": "Example Author",
"publisher": "Example Publisher",
"image": "https://example.com/images/article.jpg",
"site_name": "Example Blog",
"url": "https://example.com/blog/article",
"type": "article"
}
}
Now content contains pure Markdown without frontmatter, and metadata is a structured JSON object. This format is easier to work with programmatically—no YAML parsing needed.
✨ Pro Tip: We recommend using metadata: true for most programmatic use cases. It's more reliable and easier than parsing YAML, which can be brittle if descriptions or titles contain special characters.
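To illustrate why the separate object is convenient, here is a sketch that maps a metadata: true response onto a typed record, for example before writing it to a database. The Article class and article_from_response function are hypothetical names; every metadata field is treated as optional because not every page provides all of them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Article:
    """Illustrative record type; not part of the API."""
    url: str
    content: str
    title: Optional[str] = None
    author: Optional[str] = None
    description: Optional[str] = None
    image: Optional[str] = None


def article_from_response(data: dict[str, Any]) -> Article:
    """Build an Article from a parsed response that used metadata: true."""
    # Fall back to an empty dict so a missing metadata object does not raise.
    metadata = data.get("metadata") or {}
    return Article(
        url=data["url"],
        content=data["content"],
        title=metadata.get("title"),
        author=metadata.get("author"),
        description=metadata.get("description"),
        image=metadata.get("image"),
    )
```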
Forcing a Fresh Fetch
For performance, the TABS API caches results for a short period. This is perfect for static content like blog posts. However, if you're scraping a breaking news site or a live feed, you'll want to bypass the cache.
You can do this using the nocache: true parameter.
- curl
- JavaScript
- Python
curl -X POST https://api.tabstack.ai/v1/extract/markdown \
  -H "Authorization: Bearer $TABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news-site.com/breaking-news",
    "nocache": true
  }'

const response = await fetch('https://api.tabstack.ai/v1/extract/markdown', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TABS_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://news-site.com/breaking-news',
    nocache: true // Force a fresh fetch
  })
});
const data = await response.json();
console.log(data.content);

import requests
import os

api_key = os.environ.get("TABS_API_KEY")
endpoint_url = "https://api.tabstack.ai/v1/extract/markdown"
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
payload = {
    "url": "https://news-site.com/breaking-news",
    "nocache": True  # Force a fresh fetch
}
response = requests.post(endpoint_url, headers=headers, json=payload)
data = response.json()
print(data['content'])
Setting nocache: true bypasses the cache and fetches fresh content. Use it for real-time data, and use it judiciously: a forced fetch is slightly slower because the API cannot serve the request from its cache.
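One way to keep nocache limited to the pages that need it is to decide per host when building the request body. The sketch below assumes a hand-maintained set of fast-changing hosts; LIVE_HOSTS and build_payload are hypothetical names, and the host list is only an example.

```python
from urllib.parse import urlparse

# Hypothetical set of hosts whose pages change too quickly to serve from cache.
LIVE_HOSTS = frozenset({"news-site.com"})


def build_payload(url: str, live_hosts: frozenset = LIVE_HOSTS) -> dict:
    """Build the JSON body, adding nocache only for hosts listed as live."""
    host = urlparse(url).hostname or ""
    payload = {"url": url, "metadata": True}
    if host in live_hosts:
        payload["nocache"] = True  # Force a fresh fetch for this host only
    return payload
```

Every other URL keeps the default behavior and can be served from the cache.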
Production-Ready Error Handling
In a real-world application, you can't assume every request will succeed. URLs may be invalid, sites may be down, or your API key might be wrong. A robust application must handle these failures gracefully.
The API uses standard HTTP status codes to indicate errors.
| Status Code | Error Message | Description |
|---|---|---|
| 400 | url is required | The JSON body is missing the url parameter. |
| 401 | Unauthorized - Invalid token | Your API key is missing, invalid, or expired. |
| 422 | url is invalid | The provided URL is malformed. |
| 422 | access to internal resources... | You tried to access localhost or a private IP. |
| 500 | failed to fetch URL | The target server is down or blocked our request. |
| 500 | failed to convert HTML... | An internal error occurred during conversion. |
All error responses return a simple JSON object:
{
"error": "url is invalid"
}
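The status codes and error body above can be wrapped in a small exception type so that callers can tell failures apart. This is a sketch: TabsApiError and error_from_response are hypothetical names, and treating only 5xx responses as retryable is an assumption rather than documented API behavior.

```python
class TabsApiError(Exception):
    """Hypothetical exception type wrapping an API error response."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code
        self.message = message

    @property
    def retryable(self) -> bool:
        # Assumption: only 5xx responses are worth retrying. A 4xx means the
        # request itself must change (missing url, bad key, invalid URL).
        return self.status_code >= 500


def error_from_response(status_code: int, body: dict) -> TabsApiError:
    """Build a TabsApiError from a status code and the parsed error body."""
    return TabsApiError(status_code, body.get("error", "Unknown error"))
```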
Robust Error Handling Examples
Here are production-ready examples that encapsulate the logic, set timeouts, and handle potential errors correctly.
- JavaScript
- Python
- curl (Bash)
import "dotenv/config"; // To load .env file
// AbortController and fetch are globals in Node.js 18+ and modern browsers, so no import is needed.

async function getMarkdownFromUrl(url, forceFresh = false) {
  const apiKey = process.env.TABS_API_KEY;
  if (!apiKey) {
    console.error("TABS_API_KEY environment variable not set.");
    return null;
  }

  const endpoint = 'https://api.tabstack.ai/v1/extract/markdown';

  // Set a 30-second timeout
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), 30000);

  try {
    const response = await fetch(endpoint, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        url: url,
        metadata: true, // Always use the reliable metadata object
        nocache: forceFresh
      }),
      signal: controller.signal // Pass the AbortSignal
    });
    clearTimeout(timeoutId); // Clear the timeout if fetch succeeds

    const data = await response.json();
    if (!response.ok) {
      // Handle API errors (4xx, 5xx)
      console.warn(
        `API Error (HTTP ${response.status}) for ${url}: ${data.error || 'Unknown error'}`
      );
      return null;
    }
    return data;
  } catch (error) {
    clearTimeout(timeoutId); // Clear timeout on error
    if (error.name === 'AbortError') {
      console.error(`Request timed out for ${url}`);
    } else {
      console.error(`Network error for ${url}: ${error.message}`);
    }
    return null;
  }
}

// --- Usage ---
// (async () => {
//   const data = await getMarkdownFromUrl("https://example.com/article");
//   if (data) {
//     console.log(`Title: ${data.metadata?.title}`);
//     // console.log(data.content);
//   }
// })();
import requests
import os
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')


def get_markdown_from_url(url: str, force_fresh: bool = False) -> dict | None:
    """
    Fetches clean markdown from a URL using the TABS API.
    Returns the parsed JSON data or None on failure.
    """
    api_key = os.environ.get("TABS_API_KEY")
    if not api_key:
        logging.error("TABS_API_KEY environment variable not set.")
        return None

    endpoint_url = "https://api.tabstack.ai/v1/extract/markdown"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "metadata": True,  # Always use the reliable metadata object
        "nocache": force_fresh
    }

    try:
        response = requests.post(
            endpoint_url,
            headers=headers,
            json=payload,
            timeout=30  # Always set a 30-second timeout!
        )
        # Check for HTTP errors (4xx or 5xx)
        if not response.ok:
            error_data = response.json()
            logging.warning(
                f"API Error (HTTP {response.status_code}) for {url}: {error_data.get('error')}"
            )
            return None
        return response.json()
    except requests.exceptions.Timeout:
        logging.error(f"Request timed out for {url}")
        return None
    except requests.exceptions.JSONDecodeError:
        # Response was not valid JSON. This clause must come before
        # RequestException, which is its parent class and would otherwise catch it.
        logging.error(f"Failed to decode JSON response from API. Status: {response.status_code}")
        return None
    except requests.exceptions.RequestException as e:
        # Catch-all for network/connection errors
        logging.error(f"Network error for {url}: {e}")
        return None


# --- Usage ---
# good_url = "https://your-blog.com/some-article"
# data = get_markdown_from_url(good_url)
#
# if data:
#     logging.info(f"Title: {data['metadata'].get('title')}")
#     # print(data['content'])
#
# bad_url = "not-a-real-url"
# get_markdown_from_url(bad_url)
#!/bin/bash
# A robust bash script for error handling with curl
# Requires: curl, jq

API_KEY="$TABS_API_KEY"
URL_TO_FETCH="$1"

if [ -z "$API_KEY" ]; then
  echo "Error: TABS_API_KEY environment variable not set." >&2
  exit 1
fi

if [ -z "$URL_TO_FETCH" ]; then
  echo "Usage: $0 <url-to-fetch>" >&2
  exit 1
fi

# -s: silent
# -w "\n%{http_code}": write the http code on a new line
response=$(curl -s -w "\n%{http_code}" \
  -X POST https://api.tabstack.ai/v1/extract/markdown \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  --connect-timeout 10 \
  --max-time 30 \
  -d '{
    "url": "'"$URL_TO_FETCH"'",
    "metadata": true,
    "nocache": false
  }')

# Split response body and status code
http_code=$(echo "$response" | tail -n1)
response_body=$(echo "$response" | sed '$d')

if [ "$http_code" -eq 200 ]; then
  echo "Success:"
  echo "$response_body" | jq .
else
  echo "Error (HTTP $http_code):" >&2
  # Try to parse error with jq, fall back to plain echo
  echo "$response_body" | jq .error 2>/dev/null || echo "$response_body" >&2
  exit 1
fi
These examples add three key safety features: a guard clause that checks for the API key upfront, a 30-second timeout to prevent hanging requests, and error handling for both HTTP errors and network failures. On errors, the JavaScript and Python functions log a helpful message and return null (None in Python) rather than crashing, and the Bash script exits with a non-zero status.
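A common next step is to retry transient failures with exponential backoff. The sketch below wraps any fetch function that returns a dict on success and None on failure, as the functions above do. The name fetch_with_retries and its defaults are assumptions; because None hides the failure reason, it retries every failure, including 4xx errors that a retry cannot fix.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[dict]],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch(url) up to `attempts` times, doubling the delay between tries."""
    for attempt in range(attempts):
        data = fetch(url)
        if data is not None:
            return data
        # Wait only if another attempt will follow: 1s, 2s, 4s, ...
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Usage would look like fetch_with_retries(get_markdown_from_url, "https://example.com/article"). The sleep parameter exists so the delay can be replaced in tests.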
Quick Reference
Endpoint
- URL: https://api.tabstack.ai/v1/extract/markdown
- Method: POST
- Authentication: Authorization: Bearer YOUR_API_KEY
Request Parameters (JSON Body)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | | The publicly accessible URL to convert. |
| metadata | boolean | No | false | If true, returns metadata as a separate metadata object. If false, embeds metadata as YAML frontmatter in the content. |
| nocache | boolean | No | false | If true, bypasses the cache and forces a fresh fetch of the URL. |
Metadata Object Fields
When metadata: true is used (or in frontmatter), these are the common fields you can expect.
Note: Not all fields will be present for every URL. Availability depends entirely on the metadata provided by the source website.
| Field | Type | Description |
|---|---|---|
| title | string | Page title from Open Graph or HTML <title>. |
| description | string | Page description from Open Graph or HTML meta tags. |
| author | string | Author information from HTML metadata. |
| publisher | string | Publisher name from Open Graph. |
| image | string | Featured image URL from Open Graph. |
| site_name | string | Website name from Open Graph. |
| url | string | Canonical URL from Open Graph. |
| type | string | Content type from Open Graph (e.g., "article"). |
Best Practices Review
To recap, follow these rules for a smooth integration:
- Secure Your Key: Never hardcode API keys. Use environment variables.
  - JS: process.env.TABS_API_KEY
  - Python: os.environ.get("TABS_API_KEY")
- Use metadata: true: Prefer the separate metadata object for programmatic access. It's more reliable than parsing YAML.
- Set Timeouts: Always set a reasonable timeout on your HTTP requests.

  JavaScript:

  // Use AbortController for fetch timeouts
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), 30000);
  fetch(url, {
    signal: controller.signal
  }).finally(() => {
    clearTimeout(timeoutId);
  });

  Python:

  # The 'requests' library makes this easy
  try:
      response = requests.post(url, json=data, timeout=30)
  except requests.exceptions.Timeout:
      print("Request timed out")

- Handle Errors: Check for non-2xx HTTP status codes (!response.ok) and wrap your network calls in try...catch / try...except blocks.
- Validate URLs: If possible, validate that a string is a valid http/https URL on your end before sending it to the API to save a request.
- Use Caching: Don't use nocache: true unless you absolutely need real-time data. Let the API's cache work for you to get faster responses.
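The URL validation advice above can be sketched with the Python standard library. The check below accepts only http/https URLs and rejects localhost and private or loopback IP literals, mirroring the 422 errors in the table earlier. The function name is hypothetical, and no DNS lookup is performed, so a hostname that resolves to a private address still passes.

```python
import ipaddress
from urllib.parse import urlparse


def is_public_http_url(value: str) -> bool:
    """Return True if value looks like an http/https URL to a non-internal host."""
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse raises for some malformed inputs, e.g. a broken IPv6 literal.
        return False
    host = parsed.hostname
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a regular hostname.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

Running this before each request avoids spending an API call on input the API would reject anyway.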