Self-hosted HTML-to-Markdown proxy with progressive enhancement, Playwright fallback, and token estimation for AI agent context windows
AI agents need to consume web content to answer questions, research topics, and verify information. However, web pages are delivered as HTML with visual styling, navigation chrome, ads, and JavaScript-rendered content. This creates multiple problems:
- `<div>` soup instead of readable content

Existing solutions like Jina AI's r.jina.ai proxy are cloud-based (privacy concerns), rate-limited (unreliable for production), and opaque (no control over conversion quality or fallback behavior). Remote proxies like r.jina.ai also create an external dependency and send every fetched URL to a third-party server.

A self-hosted FastAPI proxy converts any URL to clean markdown with progressive enhancement and graceful degradation. It is designed for AI agent consumption with token estimation, YAML frontmatter extraction, and three deployment modes.
Key capabilities:

- `x-markdown-tokens` response header for context budget planning
| Component | Technology | Responsibility |
|---|---|---|
| Proxy Server | FastAPI (Python 3.12) | HTTP request handling, routing, conversion orchestration |
| HTML Fetcher | httpx (async) | Download HTML content with user-agent spoofing, timeout enforcement |
| Content Extractor | trafilatura | Extract main content from HTML, remove boilerplate, convert to markdown |
| JS Renderer | Playwright | Execute JavaScript, wait for content load, screenshot capability |
| Token Counter | tiktoken (cl100k_base) | Estimate token count for context budget planning |
| Frontmatter Parser | python-frontmatter | Extract YAML metadata from markdown documents |
Conversion Pipeline (Progressive Enhancement):

1. Native markdown passthrough: triggered by a `.md` extension or `Content-Type: text/markdown`
2. Trafilatura extraction: `x-markdown-source: trafilatura`
3. Playwright rendering: `x-markdown-source: playwright`
4. Raw HTML fallback: `x-markdown-source: fallback`, `content-signal: low`

URL embedded in the request path. Simplest for agent tool integration.
GET http://localhost:8090/https://example.com/article
→ Returns markdown conversion of example.com/article
# Agent tool call
curl http://localhost:8090/https://docs.docker.com/compose/
Original URL in path, proxy behavior triggered by Accept: text/markdown header. Enables nginx content negotiation.
GET https://example.com/article
Accept: text/markdown
→ Nginx proxies to localhost:8090, returns markdown
# Nginx config
location / {
if ($http_accept ~ "text/markdown") {
proxy_pass http://127.0.0.1:8090;
}
}
POST request with URL in JSON body. Used by local applications that need markdown conversion.
POST http://localhost:8090/convert
Content-Type: application/json
{"url": "https://example.com/article", "timeout": 15}
→ Returns JSON: {markdown, tokens, source, signal}
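The request and response bodies for this mode can be built and parsed with the standard library alone (a sketch; the helper names are illustrative, not part of the proxy):

```python
import json

def build_convert_request(url: str, timeout: int = 15) -> bytes:
    # Body shape accepted by POST /convert (wait_for omitted for brevity)
    return json.dumps({"url": url, "timeout": timeout}).encode("utf-8")

def parse_convert_response(raw: bytes) -> tuple[str, int]:
    # The endpoint returns JSON: {markdown, tokens, source, signal}
    data = json.loads(raw)
    return data["markdown"], data["tokens"]
```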
Every successful response includes metadata headers for agent decision-making:
| Header | Values | Purpose |
|---|---|---|
| `x-markdown-tokens` | Integer (e.g., 1247) | Estimated token count for context budget planning |
| `x-markdown-source` | `native` \| `trafilatura` \| `playwright` \| `fallback` | Which conversion method succeeded |
| `content-signal` | `high` \| `medium` \| `low` | Content quality indicator (high = native/trafilatura, medium = playwright, low = fallback) |
| `x-original-url` | URL string | Original URL (useful when following redirects) |
| `x-conversion-time-ms` | Integer (milliseconds) | Time taken for conversion (for performance monitoring) |
Agents can read the `x-markdown-tokens` header before consuming a page. If the token count exceeds the remaining context budget, the agent can skip the content or request a summary instead.
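For example, an agent-side guard might look like this (a sketch; `fits_in_budget` and the `reserve` margin are illustrative, not part of the proxy):

```python
def fits_in_budget(headers: dict, remaining_tokens: int, reserve: int = 2000) -> bool:
    """Return True when the page fits the agent's remaining context window.

    reserve is a hypothetical safety margin kept free for the agent's own
    reasoning and response.
    """
    estimated = int(headers.get("x-markdown-tokens", "0"))
    return estimated <= max(remaining_tokens - reserve, 0)
```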
Behavior controlled via environment variables:
# Size limits
MAX_HTML_BYTES=5242880 # 5MB max HTML download
MIN_CONTENT_TOKENS=30 # Minimum tokens to consider conversion successful
# Timeouts
REQUEST_TIMEOUT=15 # HTTP fetch timeout (seconds)
PLAYWRIGHT_TIMEOUT=20 # Browser rendering timeout (seconds)
PLAYWRIGHT_WAIT_FOR=networkidle # Wait condition: networkidle | load | domcontentloaded
# Behavior
USER_AGENT=Mozilla/5.0... # Browser user-agent string
ENABLE_PLAYWRIGHT=true # Disable to skip Playwright fallback
CACHE_TTL=3600 # Redis cache TTL for converted content (seconds)
BIND_HOST=127.0.0.1 # Localhost-only binding
BIND_PORT=8090 # Proxy listen port
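A minimal sketch of how these variables might be read at startup (`env_int` is an illustrative helper, not part of the proxy):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting, falling back to the default when unset or blank."""
    raw = os.getenv(name)
    return int(raw) if raw is not None and raw.strip() else default

# Defaults mirror the table above
MAX_HTML_BYTES = env_int("MAX_HTML_BYTES", 5 * 1024 * 1024)
MIN_CONTENT_TOKENS = env_int("MIN_CONTENT_TOKENS", 30)
REQUEST_TIMEOUT = env_int("REQUEST_TIMEOUT", 15)
ENABLE_PLAYWRIGHT = os.getenv("ENABLE_PLAYWRIGHT", "true").lower() == "true"
```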
`GET /{url:path}`: main endpoint for agent consumption. The URL is embedded in the path after the leading slash.
@app.get("/{url:path}")
async def proxy_url(url: str):
# Extract URL from path (e.g., "/https://example.com" → "https://example.com")
if not url.startswith(("http://", "https://")):
url = "https://" + url
# Check cache (Redis)
cached = await get_cached_markdown(url)
if cached:
return Response(cached["markdown"], headers=cached["headers"])
# Run conversion pipeline
result = await convert_to_markdown(url)
# Cache result
await cache_markdown(url, result)
return Response(
content=result["markdown"],
media_type="text/markdown",
headers={
"x-markdown-tokens": str(result["tokens"]),
"x-markdown-source": result["source"],
"content-signal": result["signal"],
"x-original-url": url,
"x-conversion-time-ms": str(result["time_ms"])
}
)
Example Usage:
# Agent tool integration
import httpx
async def fetch_as_markdown(url: str) -> dict:
proxy_url = f"http://localhost:8090/{url}"
async with httpx.AsyncClient() as client:
response = await client.get(proxy_url)
return {
"content": response.text,
"tokens": int(response.headers["x-markdown-tokens"]),
"source": response.headers["x-markdown-source"],
"signal": response.headers["content-signal"]
}
result = await fetch_as_markdown("https://docs.python.org/3/library/asyncio.html")
print(f"Fetched {result['tokens']} tokens via {result['source']}")
`convert_to_markdown()`: core logic implementing progressive enhancement with a fallback chain.
import time

async def convert_to_markdown(url: str) -> dict:
start_time = time.time()
# Step 1: Try native markdown
if url.endswith(".md") or url.endswith(".markdown"):
result = await fetch_native_markdown(url)
if result:
return finalize_result(result, "native", "high", start_time)
# Step 2: Fetch HTML
html = await fetch_html(url)
if not html:
return error_result("Failed to fetch URL", start_time)
# Step 3: Try trafilatura extraction
markdown = extract_with_trafilatura(html)
tokens = count_tokens(markdown)
if tokens >= MIN_CONTENT_TOKENS:
return finalize_result(markdown, "trafilatura", "high", start_time)
# Step 4: Try Playwright rendering (if enabled)
if ENABLE_PLAYWRIGHT:
markdown = await render_with_playwright(url)
tokens = count_tokens(markdown)
if tokens >= MIN_CONTENT_TOKENS:
return finalize_result(markdown, "playwright", "medium", start_time)
# Step 5: Fallback to raw HTML
markdown = clean_raw_html(html)
return finalize_result(markdown, "fallback", "low", start_time)
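The helpers referenced in the pipeline (`finalize_result`, `error_result`) are not shown above; a minimal sketch of what they might look like, with an illustrative stand-in for the tiktoken counter:

```python
import time

def count_tokens(text: str) -> int:
    # Stand-in for the tiktoken counter from the token-estimation section;
    # roughly 4 characters per token
    return max(len(text) // 4, 1) if text else 0

def finalize_result(markdown: str, source: str, signal: str, start_time: float) -> dict:
    # Package a successful conversion with everything the response headers need
    return {
        "markdown": markdown,
        "tokens": count_tokens(markdown),
        "source": source,
        "signal": signal,
        "time_ms": int((time.time() - start_time) * 1000),
    }

def error_result(message: str, start_time: float) -> dict:
    # Failed fetches still report timing and a low-quality signal
    return {
        "markdown": f"Error: {message}",
        "tokens": 0,
        "source": "fallback",
        "signal": "low",
        "time_ms": int((time.time() - start_time) * 1000),
    }
```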
Uses trafilatura library to extract main content from HTML, removing boilerplate, ads, and navigation.
import re

import trafilatura
def extract_with_trafilatura(html: str) -> str:
# Extract main content
markdown = trafilatura.extract(
html,
output_format="markdown",
include_comments=False,
include_tables=True,
include_images=True,
include_links=True,
no_fallback=False
)
if not markdown:
return ""
# Clean up common artifacts
markdown = markdown.strip()
markdown = re.sub(r'\n{3,}', '\n\n', markdown) # Collapse excessive newlines
    markdown = re.sub(r'\[!\[.*?\]\(.*?\)\]\(.*?\)', '', markdown)  # Remove linked images (badges)
return markdown
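The two cleanup substitutions can be exercised on their own (a standalone sketch; `tidy_markdown` is an illustrative name):

```python
import re

def tidy_markdown(markdown: str) -> str:
    # Same cleanup steps applied after trafilatura extraction
    markdown = markdown.strip()
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)  # Collapse 3+ newlines
    markdown = re.sub(r'\[!\[.*?\]\(.*?\)\]\(.*?\)', '', markdown)  # Remove linked images (badges)
    return markdown
```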
Why Trafilatura? It consistently ranks at or near the top of content-extraction benchmarks, needs no browser, and emits markdown directly (headings, tables, links, images) without a separate HTML-to-markdown pass.
Launches headless Chromium browser to execute JavaScript and wait for content to load. Handles single-page apps and lazy-loaded content.
from playwright.async_api import async_playwright
import trafilatura
async def render_with_playwright(url: str) -> str:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page(
user_agent=USER_AGENT,
viewport={"width": 1920, "height": 1080}
)
try:
# Navigate and wait for content
await page.goto(url, timeout=PLAYWRIGHT_TIMEOUT * 1000, wait_until=PLAYWRIGHT_WAIT_FOR)
# Optional: Wait for specific selectors (common content containers)
try:
await page.wait_for_selector("article, main, .content, #content", timeout=5000)
            except Exception:
                pass  # Continue if selector not found
# Get rendered HTML
html = await page.content()
# Extract markdown from rendered HTML
markdown = extract_with_trafilatura(html)
return markdown
finally:
await browser.close()
Performance Note: Playwright adds 2-5 seconds overhead. Only used when trafilatura extraction fails to meet minimum token threshold.
Uses OpenAI's tiktoken library with cl100k_base encoding (same as GPT-4) for accurate context budget planning.
import tiktoken
# Initialize encoder once at startup
ENCODER = tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str) -> int:
"""Count tokens using cl100k_base encoding (GPT-4 compatible)"""
return len(ENCODER.encode(text, disallowed_special=()))
Extracts YAML frontmatter from markdown documents and article metadata from HTML.
import frontmatter
from bs4 import BeautifulSoup
def extract_frontmatter(markdown: str, html: str | None = None) -> tuple[str, dict]:
"""
Returns (markdown_without_frontmatter, metadata_dict)
"""
# Try parsing existing frontmatter
post = frontmatter.loads(markdown)
if post.metadata:
return post.content, post.metadata
# Extract from HTML metadata if available
if html:
soup = BeautifulSoup(html, 'html.parser')
metadata = {}
# OpenGraph tags
for prop in ['og:title', 'og:description', 'og:author', 'article:published_time']:
tag = soup.find('meta', property=prop)
if tag and tag.get('content'):
key = prop.split(':')[-1]
metadata[key] = tag.get('content')
# Standard meta tags
for name in ['description', 'author', 'keywords']:
tag = soup.find('meta', attrs={'name': name})
if tag and tag.get('content'):
metadata[name] = tag.get('content')
# Title
title_tag = soup.find('title')
if title_tag and 'title' not in metadata:
metadata['title'] = title_tag.get_text().strip()
        if metadata:
            return markdown, metadata
    return markdown, {}
Caches converted markdown with 1-hour TTL to avoid redundant conversions.
import hashlib
import json
import os

import redis.asyncio as redis

# Create the client once at startup; calling from_url() per request would
# open a new connection pool for every lookup
REDIS = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))

def _cache_key(url: str) -> str:
    # Hash the URL so arbitrarily long URLs map to fixed-length Redis keys
    return f"mdproxy:{hashlib.sha256(url.encode()).hexdigest()}"

async def get_cached_markdown(url: str) -> dict | None:
    cached = await REDIS.get(_cache_key(url))
    if cached:
        return json.loads(cached)
    return None

async def cache_markdown(url: str, result: dict):
    await REDIS.setex(
        _cache_key(url),
        CACHE_TTL,  # 3600 seconds default
json.dumps({
"markdown": result["markdown"],
"headers": {
"x-markdown-tokens": str(result["tokens"]),
"x-markdown-source": result["source"],
"content-signal": result["signal"]
}
})
)
Tune freshness via `CACHE_TTL`, or bypass the cache per request with a `?nocache=true` query parameter.
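The bypass check itself is a one-liner (a sketch; how the query parameters reach it depends on the FastAPI handler):

```python
def should_bypass_cache(query_params: dict) -> bool:
    # ?nocache=true skips the Redis read; the fresh result can still be cached
    return str(query_params.get("nocache", "")).lower() == "true"
```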
Reports proxy health and dependency status.
@app.get("/health")
async def health_check():
checks = {
"status": "ok",
"playwright": False,
"redis": False
}
# Test Playwright
try:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
await browser.close()
checks["playwright"] = True
    except Exception:
        pass
    # Test Redis
    try:
        r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
        await r.ping()
        checks["redis"] = True
    except Exception:
        pass
return checks
- Native markdown detection MUST check both the URL extension (`.md`, `.markdown`) and the `Content-Type` header (`text/markdown`).
- Playwright MUST wait for `networkidle` (default) or a configurable wait condition (`load`, `domcontentloaded`).
- Token counting MUST use the tiktoken library with `cl100k_base` encoding for GPT-4 compatibility.
- The minimum-content threshold MUST be configurable via the `MIN_CONTENT_TOKENS` environment variable (default 30 tokens).
- The raw HTML fallback MUST strip `<script>`, `<style>`, and `<noscript>` tags before returning.
- Frontmatter extraction MUST fall back to HTML `<meta>` tags (OpenGraph, standard meta).
- `GET /{url:path}` MUST extract the URL from the path and support both `/https://example.com` and `/example.com` formats.
- `GET /` with an `Accept: text/markdown` header MUST proxy to the conversion pipeline (nginx integration mode).
- `POST /convert` MUST accept a JSON body with `url` (required), `timeout` (optional), and `wait_for` (optional).
- `GET /health` MUST return JSON with `status`, `playwright` (bool), and `redis` (bool) fields.
- Successful responses MUST include the `x-markdown-tokens`, `x-markdown-source`, `content-signal`, `x-original-url`, and `x-conversion-time-ms` headers.
- `Content-Type` MUST be `text/markdown; charset=utf-8` for successful conversions.
- HTTP fetches MUST respect `REQUEST_TIMEOUT` (default 15 seconds).
- Playwright rendering MUST respect `PLAYWRIGHT_TIMEOUT` (default 20 seconds).
- Downloads MUST be capped at `MAX_HTML_BYTES` (default 5MB); abort the fetch if exceeded.
- Conversion time MUST be reported in the `x-conversion-time-ms` header.
- The server MUST bind to `127.0.0.1` by default (configurable via `BIND_HOST`), not `0.0.0.0`.
- The proxy MUST reject `file://`, `ftp://`, and other non-HTTP(S) schemes.
- The outbound user-agent MUST be configurable via the `USER_AGENT` environment variable (default: modern browser UA).
- Redirects MUST be followed, with the final URL reported in the `x-original-url` header.
- `content-signal: high` MUST be set for native markdown and trafilatura conversions with ≥100 tokens.
- `content-signal: medium` MUST be set for Playwright conversions meeting the minimum token threshold.
- `content-signal: low` MUST be set for the raw HTML fallback or any conversion below the minimum token threshold.
- When markdown is produced by HTML extraction, `x-markdown-source` MUST be `trafilatura` (not `native`).
- If Playwright output falls below `MIN_CONTENT_TOKENS`, fall back to raw HTML (do not return empty markdown).
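The signal rules above can be collapsed into one helper (a sketch; the function name and the `medium` default for short native/trafilatura output are assumptions):

```python
def content_signal(source: str, tokens: int, min_tokens: int = 30) -> str:
    """Derive the content-signal header value from conversion method and token count."""
    if tokens < min_tokens or source == "fallback":
        return "low"
    if source == "playwright":
        return "medium"
    if source in ("native", "trafilatura") and tokens >= 100:
        return "high"
    return "medium"  # native/trafilatura above the minimum but under 100 tokens
```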
- The Docker build MUST run `playwright install chromium` in the build stage.
- The browser cache paths `/tmp` and `~/.cache/ms-playwright` MUST be writable at runtime.
- docker-compose MUST define a `proxy` service (port 8090) and a `redis` service (port 6379).
- The image MUST poll the `GET /health` endpoint with a 30-second interval via the Docker HEALTHCHECK directive.
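A sketch of the corresponding Dockerfile directives, assuming `curl` is available in the final image and that Playwright is already pip-installed when the browser download runs:

```dockerfile
# Download the Chromium browser during the build stage
RUN playwright install chromium

# Poll the proxy's own health endpoint every 30 seconds
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -fsS http://127.0.0.1:8090/health || exit 1
```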
Build a self-hosted HTML-to-Markdown proxy for AI agents with progressive enhancement and graceful degradation.
## Tech Stack
- Backend: FastAPI (Python 3.12+), async/await
- HTML Fetcher: httpx (async HTTP client)
- Content Extractor: trafilatura (HTML → markdown)
- JS Renderer: Playwright (headless Chromium)
- Token Counter: tiktoken (cl100k_base encoding)
- Cache: Redis (1-hour TTL)
- Deployment: Docker, localhost-only (127.0.0.1:8090)
## Conversion Pipeline (Progressive Enhancement)
1. **Native Markdown Detection**
- Check URL extension (.md, .markdown)
- Check Content-Type header (text/markdown)
- If found → Return as-is with frontmatter extraction
2. **Trafilatura Extraction**
- Fetch HTML via httpx (15s timeout, 5MB max)
- Extract main content with trafilatura
- Convert to markdown (preserve headings, lists, tables, code, links, images)
- Count tokens with tiktoken
- If ≥30 tokens → Return with x-markdown-source: trafilatura
3. **Playwright Rendering** (if trafilatura failed)
- Launch headless Chromium
- Navigate to URL (20s timeout, wait for networkidle)
- Wait for content selectors (article, main, .content, #content)
- Get rendered HTML
- Extract with trafilatura
- If ≥30 tokens → Return with x-markdown-source: playwright
4. **Raw HTML Fallback**
- Strip