
By the end of this guide, you'll have a working pipeline that takes any URL, renders it in a real browser, and outputs clean markdown or structured JSON ready for LLM training, RAG ingestion, or fine-tuning datasets. No HTML parsing. No regex. No manual cleanup step.
We'll build it piece by piece: convert a single page to markdown, extract structured fields with a declarative schema, batch-process entire sites with scroll-collect, and wire it all into an async pipeline that handles hundreds of pages per hour. Every code example runs against real websites. Copy, paste, run.
If you've spent time stripping <script> tags, decoding HTML entities, and fighting with BeautifulSoup just to get clean text from a webpage, this guide replaces all of that with three API calls.
In this guide, you'll build:
- A single-URL converter that turns any webpage into clean markdown or plain text
- A schema-based extractor that pulls structured JSON fields from any page layout
- A scroll-collect pipeline for infinite-scroll and lazy-loaded content
- An async multi-site crawler that processes hundreds of URLs in parallel
- RAG-ready and fine-tuning-ready data formatters with export to JSONL
- A cost comparison showing the token savings of structured output vs raw HTML
- A migration path from ScrapingBee and Firecrawl to Browserbeam
TL;DR: Build an LLM training data pipeline using Browserbeam's cloud browser API. Create a session with a URL, get back clean markdown instead of raw HTML, and extract structured JSON with a declarative schema. One API call replaces the fetch-render-parse-clean chain. Works for RAG ingestion, fine-tuning datasets, and AI data collection at scale.
Why LLMs Need Clean Web Data (Not Raw HTML)
Raw HTML is not training data. It's a delivery format for browsers, packed with markup that carries zero information for a language model. When you feed raw HTML to an LLM, most of the input tokens go to <div> wrappers, inline styles, tracking scripts, and SVG icons. The actual content, the text and data your model needs, is buried under layers of presentation markup.
This matters because LLM context windows have hard limits, and every token costs money. A product page that contains 50 words of useful content can easily produce 20,000+ tokens of raw HTML. Your model reads all of it, pays for all of it, and learns nothing from 99% of it.
The Token Cost Problem
Here's what a typical product page looks like in token counts:
| Format | Tokens | Useful Content | Waste |
|---|---|---|---|
| Raw HTML | ~20,000 | ~500 | 97.5% |
| Cleaned markdown | ~800 | ~500 | 37.5% |
| Structured JSON (extract) | ~120 | ~120 | 0% |
When you're processing 10,000 pages for a training dataset, the figures above work out to roughly 200 million input tokens for raw HTML versus 1.2 million for structured output. At an assumed $2.50 per 1M input tokens, that is about $500 versus a few dollars in LLM API costs. That's not a rounding error. It's the reason most AI data pipelines need a dedicated cleaning step that's often more complex than the scraping itself.
What "Clean" Means for LLM Training
Clean web data for LLM consumption has four properties:
- Text-only content. No HTML tags, CSS, JavaScript, or inline styles. Just the words and structure a human would read.
- Preserved structure. Headings, lists, tables, and paragraphs maintain their hierarchy. A markdown representation keeps this structure without the markup overhead.
- No boilerplate. Navigation menus, footers, cookie banners, ad blocks, and sidebar widgets are stripped. Only the main content remains.
- Consistent format. Every page in your dataset follows the same output structure, whether it came from a news site, a product catalog, or a documentation portal.
Converting a website to text or a website to markdown that meets these criteria is the core challenge of any AI data pipeline. Traditional approaches require a chain of tools. Browserbeam handles it in one step.
For a deeper look at why raw HTML fails for AI agents, see the raw HTML vs structured output comparison.
The Traditional AI Data Pipeline: From URL to Training Data
Before we build the Browserbeam pipeline, let's map the traditional approach. Understanding where the old pipeline breaks helps you appreciate what we're replacing.
The 10-Step Manual Pipeline
Most teams building an LLM training data pipeline from web sources end up with something like this:
1. URL collection. Build a list of target pages (sitemap parsing, search API, manual curation).
2. HTTP fetch. Download raw HTML with requests or urllib.
3. JavaScript rendering. Fire up Playwright or Selenium for pages that need JS execution.
4. Wait for stability. Add time.sleep() or explicit wait conditions so dynamic content loads.
5. HTML parsing. Load into BeautifulSoup or lxml. Select the main content area.
6. Boilerplate removal. Strip nav, footer, sidebar, ads, cookie banners manually.
7. Text extraction. Pull .text from parsed elements. Handle whitespace, newlines, encoding.
8. Format conversion. Convert to markdown or plain text. Reconstruct heading hierarchy.
9. Data cleaning. Remove duplicates, fix encoding issues, normalize Unicode, strip leftover HTML entities.
10. Export. Write to JSONL, CSV, or a vector database for RAG.
Each step is its own failure point. Each step needs its own library. And when a target site changes its layout, steps 5 through 8 break.
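To make the manual burden concrete, here is a minimal stdlib-only sketch of steps 5 through 7. It is an illustration, not a production extractor: the SKIP_TAGS set and the sample HTML are assumptions chosen for the example, and real pages need far more rules than this.

```python
from html.parser import HTMLParser

# Assumption for this sketch: tags whose contents we treat as non-content.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping script, style, and boilerplate tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = """
<html><head><style>p { color: red; }</style></head>
<body>
  <nav><a href="/">Home</a></nav>
  <h1>Tom &amp; Jerry</h1>
  <p>A short   description.</p>
  <script>trackPageView();</script>
  <footer>Copyright</footer>
</body></html>
"""
print(html_to_text(sample))
```

Even this toy version hard-codes which tags count as boilerplate, which is exactly the kind of rule that breaks when a site changes its layout.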
Where Traditional Pipelines Break
The pain concentrates in three places:
JavaScript rendering. Half the web runs on React, Vue, or Angular. Static HTTP requests return empty <div id="root"></div> shells. You need a real browser, which means managing Playwright, handling crashes, and burning compute on browser instances.
Content isolation. Figuring out which part of the HTML is "the article" versus "the navigation" is a research problem. Libraries like Readability.js and Trafilatura get it right 80% of the time. The other 20% silently corrupts your training data.
Scale. Running 10,000 pages through a local Playwright instance takes hours. Managing a pool of browser instances, handling timeouts, and retrying failures adds hundreds of lines of infrastructure code that has nothing to do with your actual goal: collecting clean data for AI.
The Browserbeam Pipeline: URL to Clean Markdown in 3 Steps
Browserbeam collapses the 10-step pipeline into three API calls. The cloud browser handles rendering, stability detection, and content isolation. You get back clean markdown or structured JSON.
How It Works
The pipeline is simple:
- Create a session with a URL. Browserbeam spins up a cloud browser session, navigates to the page, waits for network idle and DOM stability, dismisses cookie banners, and returns the page as clean markdown.
- Extract structured fields. Define a schema describing the data you want. Browserbeam runs it against the rendered DOM and returns typed JSON.
- Batch-process with scroll-collect. For pages with infinite scroll or lazy-loaded content, one call scrolls through the entire page and returns everything.
No local browser to manage. No parsing libraries. No sleep timers. The stability detection (network idle for 300ms + DOM mutations quiet for 200ms) replaces all of your explicit wait logic.
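The two quiet windows can be pictured with a small calculation. The sketch below is not Browserbeam's implementation; it is a hypothetical helper (stable_at_ms and the event timestamps are invented for illustration) that shows when a page would count as stable if both windows must elapse after the last observed event.

```python
def stable_at_ms(network_events_ms, dom_mutations_ms,
                 network_quiet_ms=300, dom_quiet_ms=200):
    """Return the earliest time (ms) at which both quiet windows have elapsed.

    Each argument is a list of event timestamps in milliseconds since
    navigation started. An empty list means no events of that kind occurred.
    """
    last_network = max(network_events_ms, default=0)
    last_dom = max(dom_mutations_ms, default=0)
    return max(last_network + network_quiet_ms, last_dom + dom_quiet_ms)


# Hypothetical timeline: the last request fires at 450 ms,
# the last DOM mutation lands at 620 ms.
network = [0, 120, 450]
dom = [100, 500, 620]
print(stable_at_ms(network, dom))  # 820
```

A fixed time.sleep() has to guess this value in advance; a signal-based check waits exactly as long as the page needs.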
When to Use Each Output Format
| Format | Best For | How to Get It |
|---|---|---|
| Markdown (session.page.markdown.content) | RAG ingestion, fine-tuning text, content analysis | Default on create and observe |
| Structured JSON (session.extract()) | Training data with labeled fields, comparison datasets | extract with a schema |
| Full markdown (observe(mode="full")) | Complete page including nav/sidebar/footer | observe with mode="full" |
For most LLM training data collection, you'll use markdown for unstructured text (articles, documentation) and extract for structured data (product listings, job boards, directories).
Step 1: Create a Session and Convert Any URL to Markdown
Let's start with the simplest case: take a URL and get back clean markdown. This is the "URL to markdown" conversion that replaces your entire fetch-render-parse-clean chain.
Basic URL to Markdown
curl -X POST https://api.browserbeam.com/v1/sessions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://quotes.toscrape.com"}'
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")
print(session.page.markdown.content)
# "The world as we have created it is a process of our thinking.
# It cannot be changed without changing our thinking."
# - Albert Einstein
# Tags: change, deep-thoughts, thinking, world
# ...
session.close()
import Browserbeam from "@browserbeam/sdk";
const client = new Browserbeam();
const session = await client.sessions.create({ url: "https://quotes.toscrape.com" });
console.log(session.page.markdown.content);
await session.close();
require "browserbeam"
client = Browserbeam::Client.new
session = client.sessions.create(url: "https://quotes.toscrape.com")
puts session.page.markdown.content
session.close
That single call does four things: spins up a cloud browser, navigates to the URL, waits for the page to stabilize, and converts the main content area to clean markdown. The output is ready for LLM consumption. No tags, no scripts, no boilerplate.
Controlling Output Format and Length
For large pages, you can control how much content comes back and which sections to include:
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")
# Default: main content area, up to 12,000 characters
print(len(session.page.markdown.content))
# Full page including nav, sidebar, footer
full = session.observe(mode="full", max_text_length=50000)
print(full.page.markdown.content[:500])
# Scoped to a specific section
scoped = session.observe(scope=".quote")
print(scoped.page.markdown.content)
session.close()
The mode="full" flag includes navigation, sidebars, and footers. For training data, you usually want the default (mode="main") which strips boilerplate automatically. The scope parameter lets you target a specific CSS selector if you only need content from one part of the page.
Website to Text vs Website to Markdown
Browserbeam returns markdown by default because it preserves document structure (headings, lists, bold text, links) without HTML overhead. If you need plain text with no formatting at all, extract the text content directly:
# Markdown (preserves structure, good for RAG)
markdown = session.page.markdown.content
# "## Featured Quotes\n\n> \"The world as we have created it...\"\n\n- Albert Einstein"
# Plain text (extract text from website, no formatting)
result = session.extract(all_text="body >> text")
plain_text = result.extraction["all_text"]
# "Featured Quotes The world as we have created it... Albert Einstein"
For RAG pipelines, markdown is the better choice. LLMs understand markdown structure, and chunking algorithms can split on heading boundaries. For fine-tuning datasets where you need raw text without formatting, extract with >> text.
If you're new to the SDK, the Python SDK getting started guide covers installation and configuration in detail.
Step 2: Extract Structured Fields with Schemas
Markdown works for unstructured content like articles and documentation. But for training datasets with labeled fields (product name, price, description, category), you need structured extraction.
Flat Field Extraction
Define a schema as keyword arguments. Each key is a field name, each value is a CSS selector with an extraction function:
curl -X POST https://api.browserbeam.com/v1/sessions/SESSION_ID/act \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"steps": [{
"extract": {
"page_title": "h1 >> text",
"first_quote": ".text >> text",
"first_author": ".author >> text"
}
}]
}'
session = client.sessions.create(url="https://quotes.toscrape.com")
result = session.extract(
page_title="h1 >> text",
first_quote=".text >> text",
first_author=".author >> text"
)
print(result.extraction)
# {'page_title': 'Quotes to Scrape', 'first_quote': '"The world as...', 'first_author': 'Albert Einstein'}
session.close()
const session = await client.sessions.create({ url: "https://quotes.toscrape.com" });
const result = await session.extract({
page_title: "h1 >> text",
first_quote: ".text >> text",
first_author: ".author >> text"
});
console.log(result.extraction);
await session.close();
session = client.sessions.create(url: "https://quotes.toscrape.com")
result = session.extract(
page_title: "h1 >> text",
first_quote: ".text >> text",
first_author: ".author >> text"
)
puts result.extraction
session.close
The >> text suffix extracts visible text content. Other extraction functions include >> href for URLs, >> content for meta tag values, and >> src for image sources.
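Because a typo in a schema only surfaces after an API round trip, it can help to check the selector strings locally first. The following is a hypothetical client-side validator, not part of the Browserbeam SDK; parse_field, FieldSpec, and KNOWN_FUNCTIONS are names invented for this sketch, and the function list covers only the four functions named above.

```python
from typing import NamedTuple

# Only the extraction functions mentioned in this guide; the real list may be longer.
KNOWN_FUNCTIONS = {"text", "href", "content", "src"}


class FieldSpec(NamedTuple):
    selector: str
    function: str


def parse_field(spec: str) -> FieldSpec:
    """Split 'selector >> function' and reject malformed or unknown specs."""
    selector, separator, function = spec.rpartition(">>")
    selector, function = selector.strip(), function.strip()
    if not separator or not selector:
        raise ValueError(f"expected 'selector >> function', got {spec!r}")
    if function not in KNOWN_FUNCTIONS:
        raise ValueError(f"unknown extraction function {function!r}")
    return FieldSpec(selector, function)


def validate_schema(schema: dict) -> dict:
    """Parse every field of a flat schema; raises on the first bad entry."""
    return {name: parse_field(spec) for name, spec in schema.items()}


parsed = validate_schema({
    "page_title": "h1 >> text",
    "first_link": "h3 a >> href",
})
print(parsed["first_link"])
```

Running this before creating a session catches mistakes such as a missing separator or a misspelled function name without spending any credits.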
List Extraction with _parent
For repeating structures (product listings, search results, table rows), wrap a schema in an array and use _parent to set the repeating container:
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(
books=[{
"_parent": "article.product_pod",
"title": "h3 a >> text",
"price": ".price_color >> text",
"in_stock": ".instock.availability >> text",
"url": "h3 a >> href"
}]
)
for book in result.extraction["books"]:
    print(f"{book['title']}: {book['price']}")
# A Light in the Attic: £51.77
# Tipping the Velvet: £53.74
# Soumission: £50.10
# ...
session.close()
Each _parent match becomes one object in the array. This single call replaces the BeautifulSoup pattern of soup.select(".product_pod") followed by manual field extraction from each element. For the full schema design guide, see the data extraction deep-dive.
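Extracted prices arrive as display strings such as £51.77. For training data you usually want a numeric value, so a small post-processing step is worth adding. This is a sketch under assumptions: parse_price and its regular expression are illustrative helpers, and they handle only simple "symbol plus decimal number" strings.

```python
import re
from decimal import Decimal

# Optional currency symbol(s), optional whitespace, then a decimal number.
PRICE_RE = re.compile(r"(?P<symbol>[^\d\s.,-]*)\s*(?P<amount>\d+(?:\.\d+)?)")


def parse_price(raw: str):
    """Return (currency_symbol, Decimal amount), or None if no number is found."""
    match = PRICE_RE.search(raw.strip())
    if match is None:
        return None
    return match["symbol"], Decimal(match["amount"])


books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
    {"title": "Unknown", "price": "N/A"},
]
for book in books:
    print(book["title"], parse_price(book["price"]))
```

Decimal avoids float rounding surprises in prices; convert to str or float before serializing to JSON.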
AI-Powered Selectors for Unpredictable Layouts
When you're scraping websites for LLM training data, you'll hit sites where CSS selectors aren't obvious. The ai >> prefix lets you describe what you want in plain English:
result = session.extract(
jobs=[{
"_parent": ".card",
"title": "ai >> the job title",
"company": "ai >> the company name",
"location": "ai >> the job location"
}]
)
for job in result.extraction["jobs"]:
    print(f"{job['title']} at {job['company']} ({job['location']})")
AI selectors use an LLM to resolve the natural language description to a CSS selector, then cache the result. The first call costs AI tokens. Subsequent calls against the same site structure reuse the cached selector at zero cost. This is valuable when building training data pipelines across many different websites where writing CSS selectors for each layout isn't practical.
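The resolve-once-then-cache behaviour can be illustrated with a few lines of Python. This sketch does not show how Browserbeam implements its cache; SelectorCache and fake_resolver are hypothetical stand-ins, with the resolver playing the role of the LLM call.

```python
from typing import Callable, Dict, Tuple


class SelectorCache:
    """Caches resolved selectors per (site, description) pair."""

    def __init__(self, resolve: Callable[[str, str], str]):
        self._resolve = resolve
        self._cache: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # number of times the expensive resolver ran

    def get(self, site: str, description: str) -> str:
        key = (site, description)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._resolve(site, description)
        return self._cache[key]


def fake_resolver(site: str, description: str) -> str:
    # Stand-in for an LLM call; the mapping is invented for this example.
    return {"the job title": "h2.title", "the company name": "h3.company"}[description]


cache = SelectorCache(fake_resolver)
for _ in range(3):
    selector = cache.get("jobs.example.com", "the job title")
print(selector, cache.misses)
```

Three lookups trigger one resolver call, which mirrors why only the first extraction against a given layout incurs AI cost.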
For more schema patterns, see the structured web scraping guide.
Step 3: Batch Processing with Scroll-Collect
Many data sources load content dynamically. Social feeds, product catalogs, and search results use infinite scroll or lazy loading. The scroll_collect method handles this in one call.
Scraping Infinite Scroll Pages
curl -X POST https://api.browserbeam.com/v1/sessions/SESSION_ID/act \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"steps": [{
"scroll_collect": {
"max_scrolls": 30,
"wait_ms": 1000,
"max_text_length": 100000
}
}]
}'
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")
result = session.scroll_collect(
max_scrolls=30,
wait_ms=1000,
max_text_length=100000
)
print(f"Collected {len(result.page.markdown.content)} characters")
print(f"Scroll position: {result.page.scroll.percent}%")
session.close()
const session = await client.sessions.create({ url: "https://quotes.toscrape.com/scroll" });
const result = await session.scrollCollect({
max_scrolls: 30,
wait_ms: 1000,
max_text_length: 100000
});
console.log(`Collected ${result.page.markdown.content.length} characters`);
await session.close();
session = client.sessions.create(url: "https://quotes.toscrape.com/scroll")
result = session.scroll_collect(
max_scrolls: 30,
wait_ms: 1000,
max_text_length: 100000
)
puts "Collected #{result.page.markdown.content.length} characters"
session.close
The method scrolls through the page one viewport at a time, waits for lazy-loaded content at each position, and stops when it reaches the bottom or hits the scroll limit. The result is a single observation containing the full page content. One call replaces the scroll-wait-check loop you'd build with Playwright.
Processing Multi-Page Sites
For paginated sites (not infinite scroll), use goto to navigate between pages within the same session:
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []
for page_num in range(5):
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text",
            "in_stock": ".instock.availability >> text"
        }]
    )
    all_books.extend(result.extraction["books"])
    print(f"Page {page_num + 1}: {len(result.extraction['books'])} books")
    # Navigate to the next page within the same session (skip after the last page)
    if page_num < 4:
        next_url = f"https://books.toscrape.com/catalogue/page-{page_num + 2}.html"
        session.goto(url=next_url)
session.close()
print(f"Total: {len(all_books)} books collected")
One session, five pages, shared cookies and state. Each extract call pulls structured data from the current page. The session handles navigation, stability detection, and cookie management. Your code focuses on the data.
Building a Multi-Site AI Data Collection Pipeline
Real training datasets don't come from one website. You need to scrape multiple sources, handle different layouts, and export the results in a format your training pipeline can consume. Let's build the full thing.
Async Batch Processing
For processing dozens or hundreds of URLs, async cuts total runtime by 60-80%. Each session runs independently in its own cloud browser:
import asyncio
import json
from browserbeam import AsyncBrowserbeam
client = AsyncBrowserbeam()
async def collect_page(url, schema):
    session = await client.sessions.create(url=url)
    try:
        result = await session.extract(**schema)
        markdown = session.page.markdown.content
        return {
            "url": url,
            "markdown": markdown,
            "structured": result.extraction
        }
    finally:
        await session.close()

async def build_dataset(urls, schema, concurrency=5):
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_collect(url):
        async with semaphore:
            return await collect_page(url, schema)

    results = await asyncio.gather(
        *[limited_collect(url) for url in urls],
        return_exceptions=True
    )
    # Filter out failures
    return [r for r in results if isinstance(r, dict)]
urls = [
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
"https://books.toscrape.com/catalogue/soumission_998/index.html",
"https://books.toscrape.com/catalogue/sharp-objects_997/index.html",
"https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html",
]
schema = {
"title": "h1 >> text",
"price": ".price_color >> text",
"description": "#product_description ~ p >> text",
"category": ".breadcrumb li:nth-child(3) a >> text"
}
dataset = asyncio.run(build_dataset(urls, schema))
with open("training_data.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")
print(f"Collected {len(dataset)} pages")
The semaphore limits concurrency to match your plan's session limit. Start with 5 concurrent sessions on Starter, 50 on Pro. For large-scale collection patterns, see the scaling web automation guide.
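The semaphore pattern can be tried offline before pointing it at real sessions. The sketch below uses only asyncio; fake_fetch is a stand-in worker invented for this example (it sleeps instead of opening a browser), and run_bounded keeps the failed URLs instead of silently dropping them so they can be retried.

```python
import asyncio


async def run_bounded(items, worker, concurrency=5):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    results = await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )
    succeeded = [r for r in results if not isinstance(r, BaseException)]
    failed = [(item, r) for item, r in zip(items, results)
              if isinstance(r, BaseException)]
    return succeeded, failed


stats = {"active": 0, "peak": 0}


async def fake_fetch(url):
    # Stand-in for a real session: tracks how many calls overlap.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    try:
        await asyncio.sleep(0.01)
        if url.endswith("/3"):
            raise RuntimeError("simulated timeout")
        return {"url": url}
    finally:
        stats["active"] -= 1


urls = [f"https://example.com/{n}" for n in range(10)]
succeeded, failed = asyncio.run(run_bounded(urls, fake_fetch, concurrency=3))
print(len(succeeded), len(failed), stats["peak"])
```

The peak never exceeds the configured limit, and the failed list pairs each URL with its exception, which makes a retry pass straightforward.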
Handling Different Site Structures
When collecting data from multiple sources, each site has different selectors. Define per-site schemas and let the pipeline handle the rest:
SITE_SCHEMAS = {
"books.toscrape.com": {
"items": [{
"_parent": "article.product_pod",
"title": "h3 a >> text",
"price": ".price_color >> text"
}]
},
"quotes.toscrape.com": {
"items": [{
"_parent": ".quote",
"text": ".text >> text",
"author": ".author >> text",
"tags": ".keywords >> content"
}]
},
"realpython.github.io": {
"items": [{
"_parent": ".card",
"title": "h2.title >> text",
"company": "h3.company >> text",
"location": ".location >> text"
}]
}
}
from urllib.parse import urlparse
async def collect_from_site(url):
    domain = urlparse(url).netloc
    schema = SITE_SCHEMAS.get(domain, {"content": "main >> text"})
    session = await client.sessions.create(url=url)
    try:
        result = await session.extract(**schema)
        return {"source": domain, "url": url, "data": result.extraction}
    finally:
        await session.close()
For sites not in your schema map, the fallback ("main >> text") extracts the main content area as plain text. This works as a reasonable default for any page.
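One practical wrinkle is that netloc includes any port and keeps a leading www., so a URL for a known site can miss the schema map and fall through to the default. A minimal sketch of a more forgiving lookup follows; schema_for, SCHEMAS, and FALLBACK are illustrative names, and stripping www. is an assumption that may not suit every site.

```python
from urllib.parse import urlparse

SCHEMAS = {
    "books.toscrape.com": {"title": "h1 >> text"},
}
FALLBACK = {"content": "main >> text"}


def schema_for(url: str, schemas: dict, fallback: dict) -> dict:
    """Look up a schema by hostname, ignoring case, port, and a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return schemas.get(host, fallback)


print(schema_for("https://WWW.books.toscrape.com:443/catalogue/", SCHEMAS, FALLBACK))
print(schema_for("https://unknown.example.org/page", SCHEMAS, FALLBACK))
```

The first lookup resolves to the books schema despite the port and www. prefix; the second returns the fallback.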
Export Formats for LLM Consumption
Different training workflows need different output formats:
import json
def export_jsonl(dataset, filename):
    """JSONL format for fine-tuning (one JSON object per line)."""
    # ensure_ascii=False writes characters such as £ as-is, so pin the encoding.
    with open(filename, "w", encoding="utf-8") as f:
        for item in dataset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def export_for_rag(dataset, filename):
    """Page-level records with metadata for RAG ingestion (chunk them downstream)."""
    chunks = []
    for item in dataset:
        chunks.append({
            "text": item["markdown"],
            "metadata": {
                "source_url": item["url"],
                "title": item.get("structured", {}).get("title", ""),
            }
        })
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(chunks, f, indent=2, ensure_ascii=False)

def export_for_fine_tuning(dataset, filename):
    """Instruction-response pairs for supervised fine-tuning."""
    pairs = []
    for item in dataset:
        structured = item.get("structured", {})
        if "title" in structured and "description" in structured:
            pairs.append({
                "instruction": f"Summarize the following product: {structured['title']}",
                "response": structured.get("description", ""),
                "source": item["url"]
            })
    with open(filename, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
JSONL is the standard for fine-tuning datasets. For RAG, chunk the markdown by heading boundaries and attach source metadata for retrieval attribution.
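Before handing a JSONL file to a training job, it is worth confirming that it reads back exactly as written. The sketch below does a round trip through an in-memory buffer; write_jsonl and read_jsonl are small helpers written for this example, and the records are made up.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line; newlines inside strings are escaped by json."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Parse every non-blank line as a JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


records = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Multi-line\ndescription", "price": "£1.00"},
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored == records)
```

The second record contains a newline inside a string, which is the usual way a hand-rolled writer breaks the one-object-per-line rule; json.dumps escapes it so the file stays valid.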
RAG vs Fine-Tuning: Different Data Needs, Same API
RAG and fine-tuning both consume web data, but they need it in different shapes. The Browserbeam pipeline handles both; you only change the output step.
RAG Data Preparation Patterns
RAG pipelines need text chunks with metadata. The ideal input is clean markdown split on heading boundaries, with source URL and title attached for citation:
from browserbeam import Browserbeam
client = Browserbeam()
def collect_for_rag(url, max_chunk_size=1500):
    session = client.sessions.create(url=url)
    markdown = session.page.markdown.content
    title = session.page.title
    session.close()

    # Split on H2 headings for natural chunk boundaries
    sections = markdown.split("\n## ")
    chunks = []
    for i, section in enumerate(sections):
        if i > 0:
            section = "## " + section  # restore heading
        if len(section) > max_chunk_size:
            # Split long sections on paragraphs
            paragraphs = section.split("\n\n")
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) > max_chunk_size and current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para if current_chunk else para
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        elif section.strip():
            chunks.append(section.strip())

    return [{
        "text": chunk,
        "metadata": {"source": url, "title": title, "chunk_index": i}
    } for i, chunk in enumerate(chunks)]
rag_data = collect_for_rag("https://quotes.toscrape.com")
print(f"Generated {len(rag_data)} chunks for RAG ingestion")
The markdown output preserves heading structure, so chunking on ## boundaries gives you semantically coherent chunks rather than arbitrary character splits.
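A variation worth considering is to keep the heading text as chunk metadata, which helps retrieval attribution. The sketch below is one way to do that with a regular expression on a local string; split_by_heading and the sample document are illustrative, and the pattern deliberately matches only level-two headings so that ### subsections stay inside their parent chunk.

```python
import re

# Matches lines that start with exactly two '#' followed by a space.
HEADING_RE = re.compile(r"^## +(.+)$", re.MULTILINE)


def split_by_heading(markdown: str):
    """Return one section per H2 heading, each with its heading text attached."""
    sections = []
    matches = list(HEADING_RE.finditer(markdown))
    preamble = markdown[: matches[0].start()] if matches else markdown
    if preamble.strip():
        sections.append({"heading": None, "text": preamble.strip()})
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        sections.append({
            "heading": match.group(1).strip(),
            "text": markdown[match.start():end].strip(),
        })
    return sections


doc = (
    "## Intro\n\nFirst paragraph.\n\n"
    "### Detail\n\nNested subsection.\n\n"
    "## Usage\n\nSecond paragraph."
)
for section in split_by_heading(doc):
    print(section["heading"], len(section["text"]))
```

Unlike a plain split on "\n## ", this also recognizes a heading on the very first line of the document.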
Fine-Tuning Data Preparation Patterns
Fine-tuning needs structured input-output pairs. Use extract to pull labeled fields and format them as training examples:
import json

def collect_for_fine_tuning(urls, schema):
    training_pairs = []
    for url in urls:
        session = client.sessions.create(url=url)
        result = session.extract(**schema)
        session.close()
        # List schemas return their rows under "items"; flat schemas are treated as one row.
        for item in result.extraction.get("items", [result.extraction]):
            training_pairs.append({
                "messages": [
                    {"role": "user", "content": f"Extract the product details from this page: {url}"},
                    {"role": "assistant", "content": json.dumps(item)}
                ]
            })
    return training_pairs
pairs = collect_for_fine_tuning(
urls=["https://books.toscrape.com"],
schema={"items": [{"_parent": "article.product_pod", "title": "h3 a >> text", "price": ".price_color >> text"}]}
)
print(f"Generated {len(pairs)} training pairs")
Decision Framework: RAG vs Fine-Tuning Data Needs
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data format | Markdown chunks with metadata | Structured input-output pairs (JSONL) |
| Volume needed | Hundreds to thousands of pages | Thousands to tens of thousands of examples |
| Quality bar | Good enough (retrieval compensates) | High (garbage in, garbage out) |
| Update frequency | Continuous (re-crawl weekly/daily) | Batch (retrain periodically) |
| Browserbeam method | observe() for markdown | extract() for structured JSON |
| Best output | session.page.markdown.content | result.extraction |
Most teams start with RAG because it's faster to set up and doesn't require model training. When you hit RAG's accuracy ceiling, the same data pipeline feeds a fine-tuning workflow by changing the export format.
Token Budget Math: Raw HTML vs Structured Markdown
Let's put real numbers on the difference. We scraped five pages from the demo sites used in this guide (three from books.toscrape.com, plus a quotes page and a job board) and measured the token count for each output format.
Real Benchmark Data
| Page | Raw HTML Tokens | Markdown Tokens | Extract JSON Tokens | HTML Waste |
|---|---|---|---|---|
| Homepage (20 products) | 18,420 | 1,240 | 380 | 93.3% |
| Product detail | 12,680 | 520 | 85 | 95.9% |
| Category listing | 16,310 | 980 | 340 | 94.0% |
| Quote page (10 quotes) | 8,750 | 680 | 210 | 92.2% |
| Job board (5 jobs) | 14,200 | 890 | 175 | 93.7% |
On average, raw HTML carries 93.8% waste tokens. Markdown cuts that to roughly 37% overhead (headings, formatting markers). Structured JSON extraction eliminates overhead entirely, returning only the data fields you asked for.
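The HTML Waste column can be reproduced from the other columns: it is the share of raw HTML tokens that disappears when the same page is delivered as markdown. A short check, using the token counts from the table above:

```python
# (raw HTML tokens, markdown tokens) per page, copied from the benchmark table
PAGES = {
    "Homepage (20 products)": (18420, 1240),
    "Product detail": (12680, 520),
    "Category listing": (16310, 980),
    "Quote page (10 quotes)": (8750, 680),
    "Job board (5 jobs)": (14200, 890),
}


def waste_percent(html_tokens: int, markdown_tokens: int) -> float:
    """Percentage of HTML tokens removed by switching to markdown."""
    return 100 * (1 - markdown_tokens / html_tokens)


waste = {page: waste_percent(*counts) for page, counts in PAGES.items()}
for page, value in waste.items():
    print(f"{page}: {value:.1f}%")

average = sum(waste.values()) / len(waste)
print(f"Average: {average:.1f}%")
```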
Cost Comparison at Scale
| Scale | Raw HTML Cost | Markdown Cost | JSON Extract Cost | Savings vs HTML |
|---|---|---|---|---|
| 1,000 pages | $7.50 | $0.50 | $0.15 | 93-98% |
| 10,000 pages | $75.00 | $5.00 | $1.50 | 93-98% |
| 100,000 pages | $750.00 | $50.00 | $15.00 | 93-98% |
Costs estimated at $0.50 per 1M input tokens (GPT-4o-mini pricing, 2026).
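The table can be reproduced with a few lines of arithmetic. The per-page token figures below (15,000 for raw HTML, 1,000 for markdown, 300 for extracted JSON) are assumptions: they are round numbers that match the table's dollar amounts, slightly above the benchmark averages.

```python
PRICE_PER_MILLION_TOKENS = 0.50  # USD, the input price stated in the note above

# Assumed round per-page token counts that reproduce the cost table.
TOKENS_PER_PAGE = {"raw_html": 15_000, "markdown": 1_000, "json_extract": 300}


def input_cost(pages: int, tokens_per_page: int) -> float:
    """Input-token cost in USD for sending `pages` pages to the model."""
    return pages * tokens_per_page * PRICE_PER_MILLION_TOKENS / 1_000_000


for pages in (1_000, 10_000, 100_000):
    costs = {fmt: input_cost(pages, tokens) for fmt, tokens in TOKENS_PER_PAGE.items()}
    print(pages, {fmt: f"${value:.2f}" for fmt, value in costs.items()})


def savings_percent(format_name: str) -> float:
    """Savings relative to raw HTML; independent of page count."""
    return 100 * (1 - TOKENS_PER_PAGE[format_name] / TOKENS_PER_PAGE["raw_html"])


print(round(savings_percent("markdown"), 1), round(savings_percent("json_extract"), 1))
```

Swap in your own model's input price and measured token counts to get numbers for your pipeline; the ratio between formats stays the same regardless of price.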
The savings compound when you factor in context window limits. With raw HTML, a 128K context window fits roughly 6 pages. With markdown, it fits 100+ pages. With structured JSON, you can batch 500+ items in a single LLM call for classification, summarization, or embedding generation.
Common Mistakes When Collecting Web Data for AI
Five patterns that corrupt training datasets or waste money. We've seen all of these in real pipelines.
1. Sending Raw HTML to Your LLM
The most expensive mistake. Developers fetch a page, call .text or .content on the response, and pipe it straight to the LLM. The model spends 95% of its input tokens reading <div class="css-1a2b3c"> wrappers.
Fix: Use Browserbeam's markdown output or extract. Convert the website to markdown before sending it to any LLM. The API does this in the same call that renders the page.
2. Not Cleaning Navigation and Footer Content
A page's main article might be 500 tokens, but the navigation menu, footer links, sidebar ads, and cookie consent text add another 2,000. When you multiply this across thousands of pages, the noise overwhelms your training signal.
Fix: Browserbeam's default observation mode (mode="main") automatically strips nav, footer, and sidebar content. It returns only the main content area. Use mode="full" only when you specifically need the full page.
3. Ignoring Page Stability
Extracting data before a page finishes rendering produces partial or empty results. This is especially common with React and Vue apps that render content asynchronously after the initial page load.
Fix: Browserbeam's stability detection waits for network idle (300ms) and DOM quiet (200ms) before returning. Every create, goto, and click call includes this check automatically. The session.page.stable flag confirms the page was ready when data was collected.
4. Processing Pages Sequentially
A pipeline that processes one page at a time wastes 80% of its runtime waiting for network I/O. A 10,000-page collection job that takes 8 hours sequentially can finish in under 2 hours with concurrent sessions.
Fix: Use the async client (AsyncBrowserbeam in Python) with asyncio.gather and a semaphore to limit concurrency. Each session is isolated, so there are no shared-state conflicts.
5. Skipping Data Validation Before Training
Web scraping always produces some garbage: empty strings from failed extractions, duplicate entries from pagination bugs, encoding artifacts from misconfigured servers. Training on this data degrades model quality.
Fix: Add a validation step between extraction and export. Check for empty fields, duplicate URLs, and minimum content length. Drop items that don't meet your quality bar:
def validate_item(item):
    required_fields = ["title", "price"]
    for field in required_fields:
        if not item.get(field) or item[field].strip() == "":
            return False
    return True

clean_dataset = [item for item in raw_dataset if validate_item(item)]
print(f"Kept {len(clean_dataset)}/{len(raw_dataset)} items after validation")
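The other two checks named in the fix, duplicate URLs and minimum content length, can be sketched in the same spirit. The helper names, the 20-character threshold, and the sample records below are assumptions for illustration; tune them to your dataset.

```python
from urllib.parse import urldefrag


def normalize_url(url: str) -> str:
    """Drop the #fragment and any trailing slash so near-identical URLs compare equal."""
    return urldefrag(url).url.rstrip("/")


def clean_items(items, min_chars=20, text_field="description"):
    """Keep the first item per URL whose text field has at least min_chars characters."""
    seen = set()
    kept = []
    for item in items:
        key = normalize_url(item.get("url", ""))
        text = (item.get(text_field) or "").strip()
        if not key or key in seen or len(text) < min_chars:
            continue
        seen.add(key)
        kept.append(item)
    return kept


raw_items = [
    {"url": "https://example.com/a", "description": "A long enough description of item A."},
    {"url": "https://example.com/a/", "description": "A long enough description of item A."},
    {"url": "https://example.com/a#reviews", "description": "A long enough description of item A."},
    {"url": "https://example.com/b", "description": "Too short"},
    {"url": "https://example.com/c", "description": None},
]
kept_items = clean_items(raw_items)
print(f"Kept {len(kept_items)}/{len(raw_items)} items")
```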
Browserbeam vs ScrapingBee vs Firecrawl for LLM Data Pipelines
Three tools dominate the "scraping API for AI" space in 2026. Here's how Browserbeam, ScrapingBee, and Firecrawl compare for building LLM training data pipelines.
Feature Comparison
| Feature | Browserbeam | ScrapingBee | Firecrawl |
|---|---|---|---|
| Output format | Markdown + structured JSON | HTML (raw) | Markdown |
| JavaScript rendering | Full browser, auto-stability | Headless Chrome | Headless Chrome |
| Structured extraction | Declarative schemas with _parent | Manual (parse HTML yourself) | LLM-based extraction |
| Stability detection | Automatic (network + DOM signals) | Manual wait parameters | Basic (fixed timeout) |
| Infinite scroll | scroll_collect (one call) | Not built-in | Not built-in |
| AI selectors | ai >> prefix, cached | Not available | LLM-based, per-call cost |
| Element refs | Yes (stable interaction targeting) | No | No |
| Diff tracking | Yes (token-efficient multi-step) | No | No |
| Session reuse | Yes (navigate within session) | No (one-shot) | No (one-shot) |
| SDK languages | Python, TypeScript, Ruby, cURL | Python, Node.js, Ruby, Go | Python, Node.js |
| Pricing model | Credits (usage-based) | Credits (request-based) | Credits (page-based) |
When to Choose Each Tool
Choose Browserbeam when you need structured output for LLMs, are building multi-step data collection workflows (login, navigate, extract), or want automatic stability detection and diff tracking. Best fit for AI data pipelines that need clean markdown and structured JSON from JavaScript-heavy sites.
Choose ScrapingBee when you primarily need raw HTML and will do your own parsing, or when you need maximum anti-bot capabilities for heavily protected sites. ScrapingBee is a good scraping proxy with browser rendering, but it returns HTML that you'll need to clean yourself for LLM consumption.
Choose Firecrawl when you need simple URL-to-markdown conversion without structured extraction, and you don't need interactive browser sessions (clicking, filling forms, scrolling). Firecrawl's markdown output is solid but it lacks the schema-based extraction and session reuse that data pipeline workflows need.
If you're evaluating ScrapingBee alternatives for AI agent workflows, the key differentiator is output format. ScrapingBee returns HTML. Browserbeam returns markdown and structured JSON. For LLM training data, that difference eliminates an entire parsing and cleaning step from your pipeline.
Migration from ScrapingBee
If you're currently using ScrapingBee and want to switch:
# ScrapingBee (returns raw HTML, needs parsing)
# response = requests.get(
#     "https://app.scrapingbee.com/api/v1",
#     params={"api_key": "...", "url": "https://books.toscrape.com", "render_js": "true"}
# )
# html = response.text  # Raw HTML, needs BeautifulSoup

# Browserbeam (returns clean markdown + structured JSON)
from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")

# Clean markdown, ready for LLM
markdown = session.page.markdown.content

# Or structured JSON, ready for training data
result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text"
    }]
)

session.close()
The migration replaces the fetch + parse + clean chain with a single API call that returns the format you actually need.
Frequently Asked Questions
How do I scrape a website for LLM training data?
Create a Browserbeam session with your target URL. The response includes the page content as clean markdown, ready for LLM consumption. For structured datasets, use the extract method with a schema to pull specific fields as JSON. The cloud browser handles JavaScript rendering and page stability automatically, so you get complete page content without managing Playwright or Selenium.
What's the best way to convert a website to markdown for RAG?
Use Browserbeam's default observation mode. Call client.sessions.create(url="...") and read session.page.markdown.content. This returns the main content area as clean markdown with preserved heading structure, which chunking algorithms can split on ## boundaries for semantically coherent RAG chunks. Set max_text_length on the observe call to control output size.
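To make the heading-based chunking concrete, here is a minimal sketch in plain Python. It assumes you already have the page markdown as a string. The function name split_on_h2 and the chunk shape (heading plus text) are illustrative choices, not part of the Browserbeam SDK.

```python
import re


def split_on_h2(markdown):
    """Split markdown into chunks that each start at a '## ' heading.

    Text before the first '## ' heading becomes a chunk with heading=None.
    Deeper headings ('###', '####') stay inside their parent chunk.
    """
    # Zero-width split at the start of every line that begins with '## '.
    parts = re.split(r"(?m)^(?=## )", markdown)
    chunks = []
    for part in parts:
        text = part.strip()
        if not text:
            continue  # skip empty pieces, e.g. when the document starts with '## '
        first_line = text.splitlines()[0]
        heading = first_line[3:].strip() if first_line.startswith("## ") else None
        chunks.append({"heading": heading, "text": text})
    return chunks


# Hypothetical sample input, standing in for real page markdown.
sample = "Intro text\n\n## First\nBody one\n\n### Sub\nMore\n\n## Second\nBody two\n"
chunks = split_on_h2(sample)
for chunk in chunks:
    print(chunk["heading"], len(chunk["text"]))
```

This sketch does not treat fenced code blocks specially, so a line starting with "## " inside a code sample would also start a new chunk. A production chunker would likely also cap chunk length.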
Can I extract structured data from web pages without writing CSS selectors?
Yes. Use the ai >> prefix in your extraction schema. Instead of "h3 a >> text", write "ai >> the product title". Browserbeam's engine resolves the description to a CSS selector using AI and caches the result. The first call costs AI tokens, but subsequent calls against the same page structure reuse the cached selector for free.
How does web scraping for RAG differ from web scraping for fine-tuning?
RAG needs text chunks with source metadata for retrieval. Use session.page.markdown.content and split on heading boundaries. Fine-tuning needs structured input-output pairs. Use session.extract() with a schema to pull labeled fields, then format as JSONL instruction-response pairs. Same scraping API, different export step.
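As a rough illustration of the fine-tuning side, the sketch below turns extracted records into instruction-response pairs. It assumes the extraction result is available as a list of dicts with title and price keys. The records here are placeholders, and the instruction/response/source field names and the to_pair helper are hypothetical, so match them to whatever your fine-tuning framework expects.

```python
def to_pair(record, source_url):
    """Turn one extracted record into an instruction-response pair."""
    title = record["title"]
    return {
        "instruction": f'What is the listed price of the book "{title}"?',
        "response": record["price"],
        "source": source_url,
    }


# Placeholder records, standing in for what a title/price schema might return.
records = [
    {"title": "Placeholder Book A", "price": "£10.00"},
    {"title": "Placeholder Book B", "price": "£20.00"},
]

pairs = [to_pair(r, "https://books.toscrape.com") for r in records]
```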
Is Browserbeam a good ScrapingBee alternative for AI data collection?
For AI-specific use cases, yes. The key difference is output format. ScrapingBee returns raw HTML that you need to parse and clean before feeding to an LLM. Browserbeam returns clean markdown and structured JSON directly, eliminating the parsing step. Browserbeam also supports session reuse for multi-page workflows, schema-based extraction, and automatic stability detection, all of which ScrapingBee lacks.
How do I handle infinite scroll pages when collecting training data?
Call session.scroll_collect(max_scrolls=30, wait_ms=1000). This scrolls through the page one viewport at a time, up to max_scrolls times, waits for lazy-loaded content at each position, and returns a single observation with the collected page content. One call replaces the manual scroll-wait-check loop. Then call extract on the loaded page to pull structured data from every item that was loaded.
What export format should I use for LLM training data?
JSONL (one JSON object per line) is the most common format for fine-tuning datasets. For RAG, use a JSON array of objects with text and metadata fields. Both formats are easy to generate from Browserbeam's output. Use session.page.markdown.content for text-based formats and session.extract() for structured field-based formats.
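A minimal sketch of both export shapes, using only the Python standard library. The helper names dump_jsonl and dump_json_array are illustrative, and the sample records are placeholders. The functions write to any text file object, so the same code works with open(path, "w", encoding="utf-8") or with an in-memory buffer as shown here.

```python
import io
import json


def dump_jsonl(records, fp):
    """Write one JSON object per line (the fine-tuning layout)."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def dump_json_array(records, fp):
    """Write all records as a single JSON array (the RAG layout)."""
    json.dump(list(records), fp, ensure_ascii=False, indent=2)


# Placeholder data: field names are assumptions, not a Browserbeam format.
pairs = [
    {"instruction": "Example question?", "response": "Example answer."},
    {"instruction": "Another question?", "response": "Another answer."},
]
rag_chunks = [
    {"text": "## Intro\nExample chunk text.",
     "metadata": {"source": "https://example.com/page", "heading": "Intro"}},
]

jsonl_buf = io.StringIO()
dump_jsonl(pairs, jsonl_buf)

array_buf = io.StringIO()
dump_json_array(rag_chunks, array_buf)
```

Passing ensure_ascii=False keeps characters such as £ readable in the output file instead of escaping them, which is why the file should be opened with UTF-8 encoding.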
Start Building Your Pipeline
You now have every piece of an LLM training data pipeline: a URL-to-markdown converter, a structured extraction engine, a scroll-collect batch processor, an async multi-site crawler, and export formatters for both RAG and fine-tuning workflows.
The core pattern is always the same: create a session, get markdown or extract structured JSON, close the session. The cloud browser handles rendering, stability, and content isolation. Your code focuses on the data shape and the export format.
Try pointing the pipeline at a site you actually need data from. Change the extraction schema to match your target fields. Swap the export function between RAG chunks and fine-tuning pairs. The SDK handles the browser layer the same way regardless of the output format.
Start with the API docs for the full method reference, or grab the SDK and run the first example:
pip install browserbeam
Sign up for a free account and convert your first URL to clean markdown. You'll have LLM-ready data in under 5 minutes.