How to Build an LLM Training Data Pipeline with Browserbeam

LLM training data pipeline architecture showing URL input, browser rendering, markdown conversion, and structured JSON output

By the end of this guide, you'll have a working pipeline that takes any URL, renders it in a real browser, and outputs clean markdown or structured JSON ready for LLM training, RAG ingestion, or fine-tuning datasets. No HTML parsing. No regex. No manual cleanup step.

We'll build it piece by piece: convert a single page to markdown, extract structured fields with a declarative schema, batch-process entire sites with scroll-collect, and wire it all into an async pipeline that handles hundreds of pages per hour. Every code example runs against real websites. Copy, paste, run.

If you've spent time stripping <script> tags, decoding HTML entities, and fighting with BeautifulSoup just to get clean text from a webpage, this guide replaces all of that with three API calls.

In this guide, you'll build:

A single-URL converter that turns any webpage into clean markdown or plain text
A schema-based extractor that pulls structured JSON fields from any page layout
A scroll-collect pipeline for infinite-scroll and lazy-loaded content
An async multi-site crawler that processes hundreds of URLs in parallel
RAG-ready and fine-tuning-ready data formatters with export to JSONL
A cost comparison showing the token savings of structured output vs raw HTML
A migration path from ScrapingBee and Firecrawl to Browserbeam

TL;DR: Build an LLM training data pipeline using Browserbeam's cloud browser API. Create a session with a URL, get back clean markdown instead of raw HTML, and extract structured JSON with a declarative schema. One API call replaces the fetch-render-parse-clean chain. Works for RAG ingestion, fine-tuning datasets, and AI data collection at scale.

Why LLMs Need Clean Web Data (Not Raw HTML)

Raw HTML is not training data. It's a delivery format for browsers, packed with markup that carries zero information for a language model. When you feed raw HTML to an LLM, most of the input tokens go to <div> wrappers, inline styles, tracking scripts, and SVG icons. The actual content, the text and data your model needs, is buried under layers of presentation markup.

This matters because LLM context windows have hard limits, and every token costs money. A product page that contains roughly 500 tokens of useful content can easily produce 20,000+ tokens of raw HTML. Your model reads all of it, pays for all of it, and learns nothing from more than 97% of it.

The Token Cost Problem

Here's what a typical product page looks like in token counts:

Format	Tokens	Useful Content	Waste
Raw HTML	~20,000	~500	97.5%
Cleaned markdown	~800	~500	37.5%
Structured JSON (extract)	~120	~120	0%

When you're processing 10,000 pages for a training dataset, those per-page counts add up to roughly 200 million tokens of raw HTML versus about 1.2 million tokens of structured output. At a price of $0.50 per 1M input tokens, that is the difference between about $100 and well under $1 in LLM API costs. That's not a rounding error. It's the reason most AI data pipelines need a dedicated cleaning step that's often more complex than the scraping itself.

What "Clean" Means for LLM Training

Clean web data for LLM consumption has four properties:

Text-only content. No HTML tags, CSS, JavaScript, or inline styles. Just the words and structure a human would read.
Preserved structure. Headings, lists, tables, and paragraphs maintain their hierarchy. A markdown representation keeps this structure without the markup overhead.
No boilerplate. Navigation menus, footers, cookie banners, ad blocks, and sidebar widgets are stripped. Only the main content remains.
Consistent format. Every page in your dataset follows the same output structure, whether it came from a news site, a product catalog, or a documentation portal.

Converting a website to text or a website to markdown that meets these criteria is the core challenge of any AI data pipeline. Traditional approaches require a chain of tools. Browserbeam handles it in one step.
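As a rough illustration of the first property, the sketch below flags text that still carries HTML tags or entities. The function name `looks_clean` and both regular expressions are assumptions made for this example, not part of Browserbeam. Treat it as a quick sanity check to run over a dataset, not a complete cleanliness test.

```python
import re

# Heuristic patterns (assumptions for this sketch, not a full HTML grammar).
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")                       # <div>, </p>, <br/>
ENTITY_RE = re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);")  # &amp;, &#39;, &#x27;


def looks_clean(text: str) -> bool:
    """Return True when no tag-like or entity-like residue is found."""
    return not (TAG_RE.search(text) or ENTITY_RE.search(text))


print(looks_clean("## Pricing\n\n- Starter\n- Pro"))  # markdown only
print(looks_clean("<div class='x'>Pricing</div>"))    # leftover tag
print(looks_clean("Fish &amp; chips"))                # leftover entity
```

A check like this is cheap enough to run on every record before export. Note that it will also flag markdown autolinks written as `<https://...>`, so tune the patterns to your data.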

For a deeper look at why raw HTML fails for AI agents, see the raw HTML vs structured output comparison.

The Traditional AI Data Pipeline: From URL to Training Data

Before we build the Browserbeam pipeline, let's map the traditional approach. Understanding where the old pipeline breaks helps you appreciate what we're replacing.

The 10-Step Manual Pipeline

Most teams building an LLM training data pipeline from web sources end up with something like this:

1. URL collection. Build a list of target pages (sitemap parsing, search API, manual curation).
2. HTTP fetch. Download raw HTML with requests or urllib.
3. JavaScript rendering. Fire up Playwright or Selenium for pages that need JS execution.
4. Wait for stability. Add time.sleep() or explicit wait conditions so dynamic content loads.
5. HTML parsing. Load into BeautifulSoup or lxml. Select the main content area.
6. Boilerplate removal. Strip nav, footer, sidebar, ads, cookie banners manually.
7. Text extraction. Pull .text from parsed elements. Handle whitespace, newlines, encoding.
8. Format conversion. Convert to markdown or plain text. Reconstruct heading hierarchy.
9. Data cleaning. Remove duplicates, fix encoding issues, normalize Unicode, strip leftover HTML entities.
10. Export. Write to JSONL, CSV, or a vector database for RAG.

Each step is its own failure point. Each step needs its own library. And when a target site changes its layout, steps 5 through 8 break.

Where Traditional Pipelines Break

The pain concentrates in three places:

JavaScript rendering. Half the web runs on React, Vue, or Angular. Static HTTP requests return empty <div id="root"></div> shells. You need a real browser, which means managing Playwright, handling crashes, and burning compute on browser instances.

Content isolation. Figuring out which part of the HTML is "the article" versus "the navigation" is a research problem. Libraries like Readability.js and Trafilatura get it right 80% of the time. The other 20% silently corrupts your training data.

Scale. Running 10,000 pages through a local Playwright instance takes hours. Managing a pool of browser instances, handling timeouts, and retrying failures adds hundreds of lines of infrastructure code that has nothing to do with your actual goal: collecting clean data for AI.

The Browserbeam Pipeline: URL to Clean Markdown in 3 Steps

Browserbeam collapses the 10-step pipeline into three API calls. The cloud browser handles rendering, stability detection, and content isolation. You get back clean markdown or structured JSON.

How It Works

The pipeline is simple:

Create a session with a URL. Browserbeam spins up a cloud browser session, navigates to the page, waits for network idle and DOM stability, dismisses cookie banners, and returns the page as clean markdown.
Extract structured fields. Define a schema describing the data you want. Browserbeam runs it against the rendered DOM and returns typed JSON.
Batch-process with scroll-collect. For pages with infinite scroll or lazy-loaded content, one call scrolls through the entire page and returns everything.

No local browser to manage. No parsing libraries. No sleep timers. The stability detection (network idle for 300ms + DOM mutations quiet for 200ms) replaces all of your explicit wait logic.

When to Use Each Output Format

Format	Best For	How to Get It
Markdown (`session.page.markdown.content`)	RAG ingestion, fine-tuning text, content analysis	Default on `create` and `observe`
Structured JSON (`session.extract()`)	Training data with labeled fields, comparison datasets	`extract` with a schema
Full markdown (`observe(mode="full")`)	Complete page including nav/sidebar/footer	`observe` with `mode="full"`

For most LLM training data collection, you'll use markdown for unstructured text (articles, documentation) and extract for structured data (product listings, job boards, directories).

Step 1: Create a Session and Convert Any URL to Markdown

Let's start with the simplest case: take a URL and get back clean markdown. This is the "URL to markdown" conversion that replaces your entire fetch-render-parse-clean chain.

Basic URL to Markdown

curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://quotes.toscrape.com"}'

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")

print(session.page.markdown.content)
# "The world as we have created it is a process of our thinking.
#  It cannot be changed without changing our thinking."
# - Albert Einstein
# Tags: change, deep-thoughts, thinking, world
# ...

session.close()

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam();
const session = await client.sessions.create({ url: "https://quotes.toscrape.com" });

console.log(session.page.markdown.content);
await session.close();

require "browserbeam"

client = Browserbeam::Client.new
session = client.sessions.create(url: "https://quotes.toscrape.com")

puts session.page.markdown.content
session.close

That single call does four things: spins up a cloud browser, navigates to the URL, waits for the page to stabilize, and converts the main content area to clean markdown. The output is ready for LLM consumption. No tags, no scripts, no boilerplate.

Controlling Output Format and Length

For large pages, you can control how much content comes back and which sections to include:

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")

# Default: main content area, up to 12,000 characters
print(len(session.page.markdown.content))

# Full page including nav, sidebar, footer
full = session.observe(mode="full", max_text_length=50000)
print(full.page.markdown.content[:500])

# Scoped to a specific section
scoped = session.observe(scope=".quote")
print(scoped.page.markdown.content)

session.close()

The mode="full" flag includes navigation, sidebars, and footers. For training data, you usually want the default (mode="main") which strips boilerplate automatically. The scope parameter lets you target a specific CSS selector if you only need content from one part of the page.

Website to Text vs Website to Markdown

Browserbeam returns markdown by default because it preserves document structure (headings, lists, bold text, links) without HTML overhead. If you need plain text with no formatting at all, extract the text content directly:

# Markdown (preserves structure, good for RAG)
markdown = session.page.markdown.content
# "## Featured Quotes\n\n> \"The world as we have created it...\"\n\n- Albert Einstein"

# Plain text (extract text from website, no formatting)
result = session.extract(all_text="body >> text")
plain_text = result.extraction["all_text"]
# "Featured Quotes The world as we have created it... Albert Einstein"

For RAG pipelines, markdown is the better choice. LLMs understand markdown structure, and chunking algorithms can split on heading boundaries. For fine-tuning datasets where you need raw text without formatting, extract with >> text.
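If you have already stored markdown and later need plain text without another API call, a local conversion is possible. The sketch below is a deliberately small, regex-based approximation; the function name `markdown_to_plain` is hypothetical, and it only handles the common constructs shown in this guide (headings, blockquotes, bullets, links, emphasis), not the full markdown syntax.

```python
import re


def markdown_to_plain(markdown: str) -> str:
    """Strip common markdown markers and collapse whitespace (rough heuristic)."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", markdown)              # [label](url) -> label
    text = re.sub(r"^[ \t]*#{1,6}[ \t]+", "", text, flags=re.MULTILINE)   # heading markers
    text = re.sub(r"^[ \t]*>[ \t]?", "", text, flags=re.MULTILINE)        # blockquote markers
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)    # list bullets
    text = re.sub(r"[*`]", "", text)                                      # emphasis and code marks
    return " ".join(text.split())                                         # collapse whitespace


sample = '## Featured Quotes\n\n> "The world as we have created it..."\n\n- Albert Einstein'
print(markdown_to_plain(sample))
```

Because it removes every asterisk and backtick, this approach suits prose better than pages that contain code or math.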

If you're new to the SDK, the Python SDK getting started guide covers installation and configuration in detail.

Step 2: Extract Structured Fields with Schemas

Markdown works for unstructured content like articles and documentation. But for training datasets with labeled fields (product name, price, description, category), you need structured extraction.

Flat Field Extraction

Define a schema as keyword arguments. Each key is a field name, and each value is a CSS selector followed by an extraction function:

curl -X POST https://api.browserbeam.com/v1/sessions/SESSION_ID/act \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [{
      "extract": {
        "page_title": "h1 >> text",
        "first_quote": ".text >> text",
        "first_author": ".author >> text"
      }
    }]
  }'

session = client.sessions.create(url="https://quotes.toscrape.com")

result = session.extract(
    page_title="h1 >> text",
    first_quote=".text >> text",
    first_author=".author >> text"
)

print(result.extraction)
# {'page_title': 'Quotes to Scrape', 'first_quote': '"The world as...', 'first_author': 'Albert Einstein'}

session.close()

const session = await client.sessions.create({ url: "https://quotes.toscrape.com" });

const result = await session.extract({
  page_title: "h1 >> text",
  first_quote: ".text >> text",
  first_author: ".author >> text"
});

console.log(result.extraction);
await session.close();

session = client.sessions.create(url: "https://quotes.toscrape.com")

result = session.extract(
  page_title: "h1 >> text",
  first_quote: ".text >> text",
  first_author: ".author >> text"
)

puts result.extraction
session.close

The >> text suffix extracts visible text content. Other extraction functions include >> href for URLs, >> content for meta tag values, and >> src for image sources.

List Extraction with _parent

For repeating structures (product listings, search results, table rows), wrap a schema in an array and use _parent to set the repeating container:

session = client.sessions.create(url="https://books.toscrape.com")

result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text",
        "in_stock": ".instock.availability >> text",
        "url": "h3 a >> href"
    }]
)

for book in result.extraction["books"]:
    print(f"{book['title']}: {book['price']}")

# A Light in the Attic: £51.77
# Tipping the Velvet: £53.74
# Soumission: £50.10
# ...

session.close()

Each _parent match becomes one object in the array. This single call replaces the BeautifulSoup pattern of soup.select(".product_pod") followed by manual field extraction from each element. For the full schema design guide, see the data extraction deep-dive.

AI-Powered Selectors for Unpredictable Layouts

When you're scraping websites for LLM training data, you'll hit sites where CSS selectors aren't obvious. The ai >> prefix lets you describe what you want in plain English:

result = session.extract(
    jobs=[{
        "_parent": ".card",
        "title": "ai >> the job title",
        "company": "ai >> the company name",
        "location": "ai >> the job location"
    }]
)

for job in result.extraction["jobs"]:
    print(f"{job['title']} at {job['company']} ({job['location']})")

AI selectors use an LLM to resolve the natural language description to a CSS selector, then cache the result. The first call costs AI tokens. Subsequent calls against the same site structure reuse the cached selector at zero cost. This is valuable when building training data pipelines across many different websites where writing CSS selectors for each layout isn't practical.

For more schema patterns, see the structured web scraping guide.
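The schema conventions in this step (a `selector >> function` string, a `_parent` container for lists, and the `ai >>` prefix) are easy to mistype, so it can help to lint a schema locally before spending an API call on it. The sketch below is a hypothetical helper, not part of the Browserbeam SDK. `KNOWN_FUNCTIONS` lists only the four extraction functions named in this guide, so extend it if the API supports others.

```python
# Only the functions mentioned in this guide; an assumption, not the full API list.
KNOWN_FUNCTIONS = {"text", "href", "content", "src"}


def lint_schema(schema, path=""):
    """Return a list of human-readable problems found in an extract schema."""
    problems = []
    for key, value in schema.items():
        where = f"{path}{key}"
        if isinstance(value, list):
            # List extraction: one object describing each repeated item.
            if len(value) != 1 or not isinstance(value[0], dict):
                problems.append(f"{where}: expected a list wrapping exactly one object")
                continue
            if "_parent" not in value[0]:
                problems.append(f"{where}: list schema has no _parent container")
            problems.extend(lint_schema(value[0], path=f"{where}[]."))
        elif not isinstance(value, str) or not value.strip():
            problems.append(f"{where}: expected a non-empty string")
        elif key == "_parent":
            continue  # a plain CSS selector, nothing more to check here
        else:
            left, sep, right = (part.strip() for part in value.partition(">>"))
            if not sep:
                problems.append(f"{where}: missing '>>' separator")
            elif left == "ai":
                if not right:
                    problems.append(f"{where}: 'ai >>' needs a description")
            elif not left:
                problems.append(f"{where}: missing CSS selector before '>>'")
            elif right not in KNOWN_FUNCTIONS:
                problems.append(f"{where}: unknown extraction function '{right}'")
    return problems


books_schema = {"books": [{"_parent": "article.product_pod", "title": "h3 a >> text", "url": "h3 a >> href"}]}
print(lint_schema(books_schema))
print(lint_schema({"title": "h1"}))
```

An empty list means the schema follows the conventions described above; it says nothing about whether the selectors match anything on the target page.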

Step 3: Batch Processing with Scroll-Collect

Many data sources load content dynamically. Social feeds, product catalogs, and search results use infinite scroll or lazy loading. The scroll_collect method handles this in one call.

Scraping Infinite Scroll Pages

curl -X POST https://api.browserbeam.com/v1/sessions/SESSION_ID/act \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [{
      "scroll_collect": {
        "max_scrolls": 30,
        "wait_ms": 1000,
        "max_text_length": 100000
      }
    }]
  }'

session = client.sessions.create(url="https://quotes.toscrape.com/scroll")

result = session.scroll_collect(
    max_scrolls=30,
    wait_ms=1000,
    max_text_length=100000
)

print(f"Collected {len(result.page.markdown.content)} characters")
print(f"Scroll position: {result.page.scroll.percent}%")

session.close()

const session = await client.sessions.create({ url: "https://quotes.toscrape.com/scroll" });

const result = await session.scrollCollect({
  max_scrolls: 30,
  wait_ms: 1000,
  max_text_length: 100000
});

console.log(`Collected ${result.page.markdown.content.length} characters`);
await session.close();

session = client.sessions.create(url: "https://quotes.toscrape.com/scroll")

result = session.scroll_collect(
  max_scrolls: 30,
  wait_ms: 1000,
  max_text_length: 100000
)

puts "Collected #{result.page.markdown.content.length} characters"
session.close

The method scrolls through the page one viewport at a time, waits for lazy-loaded content at each position, and stops when it reaches the bottom or hits the scroll limit. The result is a single observation containing the full page content. One call replaces the scroll-wait-check loop you'd build with Playwright.
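For comparison, here is roughly what the hand-written scroll-wait-check loop looks like. This is a simulation, not Browserbeam's implementation: `FakePage` is a stand-in invented for this example that grows taller when you reach its bottom, the way a lazy-loading feed does, and `scroll_to_bottom` is a hypothetical helper showing the stop conditions described above.

```python
import time


class FakePage:
    """Stand-in for a browser page that lazy-loads more content near the bottom."""

    def __init__(self, viewport=100, height=300, extra_loads=2, load_size=200):
        self.viewport = viewport
        self.height = height
        self.extra_loads = extra_loads
        self.load_size = load_size

    def scroll_to(self, y):
        # Reaching the current bottom triggers another batch, until none are left.
        if y + self.viewport >= self.height and self.extra_loads > 0:
            self.height += self.load_size
            self.extra_loads -= 1


def scroll_to_bottom(page, max_scrolls=30, wait_s=0.0):
    """Scroll one viewport at a time; stop at the bottom or at max_scrolls."""
    position, scrolls = 0, 0
    while scrolls < max_scrolls:
        bottom = page.height - page.viewport  # re-read: the page may have grown
        if position >= bottom:
            break                             # nothing new loaded, so we are done
        position = min(position + page.viewport, bottom)
        page.scroll_to(position)
        time.sleep(wait_s)                    # give lazy content time to arrive
        scrolls += 1
    return scrolls, position


print(scroll_to_bottom(FakePage()))                 # runs until the page stops growing
print(scroll_to_bottom(FakePage(), max_scrolls=3))  # stops early at the scroll limit
```

In a real browser the fixed sleep is the fragile part: too short and you miss content, too long and the job crawls. That is the part a managed stability check is meant to replace.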

Processing Multi-Page Sites

For paginated sites (not infinite scroll), use goto to navigate between pages within the same session:

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []

total_pages = 5

for page_num in range(1, total_pages + 1):
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text",
            "in_stock": ".instock.availability >> text"
        }]
    )
    all_books.extend(result.extraction["books"])
    print(f"Page {page_num}: {len(result.extraction['books'])} books")

    # Navigate to the next page within the same session (skip after the last page)
    if page_num < total_pages:
        next_url = f"https://books.toscrape.com/catalogue/page-{page_num + 1}.html"
        session.goto(url=next_url)

session.close()
print(f"Total: {len(all_books)} books collected")

One session, five pages, shared cookies and state. Each extract call pulls structured data from the current page. The session handles navigation, stability detection, and cookie management. Your code focuses on the data.

Building a Multi-Site AI Data Collection Pipeline

Real training datasets don't come from one website. You need to scrape multiple sources, handle different layouts, and export the results in a format your training pipeline can consume. Let's build the full thing.

Async Batch Processing

For processing dozens or hundreds of URLs, async cuts total runtime by 60-80%. Each session runs independently in its own cloud browser:

import asyncio
import json
from browserbeam import AsyncBrowserbeam

client = AsyncBrowserbeam()

async def collect_page(url, schema):
    session = await client.sessions.create(url=url)
    try:
        result = await session.extract(**schema)
        markdown = session.page.markdown.content
        return {
            "url": url,
            "markdown": markdown,
            "structured": result.extraction
        }
    finally:
        await session.close()

async def build_dataset(urls, schema, concurrency=5):
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_collect(url):
        async with semaphore:
            return await collect_page(url, schema)

    results = await asyncio.gather(
        *[limited_collect(url) for url in urls],
        return_exceptions=True
    )

    # Filter out failures
    return [r for r in results if isinstance(r, dict)]

urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
    "https://books.toscrape.com/catalogue/sharp-objects_997/index.html",
    "https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html",
]

schema = {
    "title": "h1 >> text",
    "price": ".price_color >> text",
    "description": "#product_description ~ p >> text",
    "category": ".breadcrumb li:nth-child(3) a >> text"
}

dataset = asyncio.run(build_dataset(urls, schema))

with open("training_data.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")

print(f"Collected {len(dataset)} pages")

The semaphore limits concurrency to match your plan's session limit. Start with 5 concurrent sessions on Starter, 50 on Pro. For large-scale collection patterns, see the scaling web automation guide.
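The pipeline above silently drops failed URLs. A small variation keeps them, so you can log or retry them later. The sketch below isolates the semaphore pattern from the SDK: `run_limited` is a hypothetical helper, and `fake_collect` is a stub that stands in for `collect_page` so the example runs without an API key.

```python
import asyncio


async def run_limited(items, worker, concurrency=5):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    results = await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)
    succeeded = [r for r in results if not isinstance(r, BaseException)]
    # Pair each exception with the item that caused it, for logging or retries.
    failed = [(item, r) for item, r in zip(items, results) if isinstance(r, BaseException)]
    return succeeded, failed


async def demo():
    active = peak = 0

    async def fake_collect(url):
        # Stub worker: tracks how many calls overlap and fails for one URL.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        try:
            await asyncio.sleep(0.01)
            if url.endswith("-7"):
                raise ValueError("simulated timeout")
            return {"url": url}
        finally:
            active -= 1

    urls = [f"https://example.com/page-{i}" for i in range(10)]
    succeeded, failed = await run_limited(urls, fake_collect, concurrency=3)
    return succeeded, failed, peak


succeeded, failed, peak = asyncio.run(demo())
print(f"{len(succeeded)} ok, {len(failed)} failed, peak concurrency {peak}")
```

Swapping `fake_collect` for a function that creates a session and extracts data gives the same behavior as `build_dataset`, plus a list of URLs worth a second attempt.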

Handling Different Site Structures

When collecting data from multiple sources, each site has different selectors. Define per-site schemas and let the pipeline handle the rest:

SITE_SCHEMAS = {
    "books.toscrape.com": {
        "items": [{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    },
    "quotes.toscrape.com": {
        "items": [{
            "_parent": ".quote",
            "text": ".text >> text",
            "author": ".author >> text",
            "tags": ".keywords >> content"
        }]
    },
    "realpython.github.io": {
        "items": [{
            "_parent": ".card",
            "title": "h2.title >> text",
            "company": "h3.company >> text",
            "location": ".location >> text"
        }]
    }
}

from urllib.parse import urlparse

async def collect_from_site(url):
    domain = urlparse(url).netloc
    schema = SITE_SCHEMAS.get(domain, {"content": "main >> text"})

    session = await client.sessions.create(url=url)
    try:
        result = await session.extract(**schema)
        return {"source": domain, "url": url, "data": result.extraction}
    finally:
        await session.close()

For sites not in your schema map, the fallback ("main >> text") extracts the text inside the page's main element. This is a reasonable default for pages that mark up their content with a main element; for sites that don't, add an entry to the schema map.
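One detail worth handling in the lookup: `urlparse(url).netloc` keeps any port number and a leading `www.`, so `www.books.toscrape.com` would miss the `books.toscrape.com` entry. A minimal sketch of a more forgiving lookup follows; `schema_for` and `DEFAULT_SCHEMA` are names made up for this example.

```python
from urllib.parse import urlparse

DEFAULT_SCHEMA = {"content": "main >> text"}


def schema_for(url, site_schemas, default=DEFAULT_SCHEMA):
    """Pick the schema for a URL, ignoring case, port, and a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()  # hostname drops the port
    if host.startswith("www."):
        host = host[4:]
    return site_schemas.get(host, default)


schemas = {"books.toscrape.com": {"title": "h1 >> text"}}
print(schema_for("https://www.books.toscrape.com:443/catalogue/page-2.html", schemas))
print(schema_for("https://example.org/", schemas))
```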

Export Formats for LLM Consumption

Different training workflows need different output formats:

import json

def export_jsonl(dataset, filename):
    """JSONL format for fine-tuning (one JSON object per line)."""
    with open(filename, "w", encoding="utf-8") as f:
        for item in dataset:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def export_for_rag(dataset, filename):
    """Chunked format for RAG ingestion with metadata."""
    chunks = []
    for item in dataset:
        chunks.append({
            "text": item["markdown"],
            "metadata": {
                "source_url": item["url"],
                "title": item.get("structured", {}).get("title", ""),
            }
        })
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(chunks, f, indent=2, ensure_ascii=False)

def export_for_fine_tuning(dataset, filename):
    """Instruction-response pairs for supervised fine-tuning."""
    pairs = []
    for item in dataset:
        structured = item.get("structured", {})
        if "title" in structured and "description" in structured:
            pairs.append({
                "instruction": f"Summarize the following product: {structured['title']}",
                "response": structured.get("description", ""),
                "source": item["url"]
            })
    with open(filename, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

JSONL is the standard for fine-tuning datasets. For RAG, chunk the markdown by heading boundaries and attach source metadata for retrieval attribution.
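Before handing a JSONL file to a training job, it is worth reading it back to confirm every line parses. The helper below is a small sketch (the name `read_jsonl` is ours, not from any SDK). It opens the file as UTF-8 so characters such as the pound sign in the book prices survive the round trip.

```python
import json


def read_jsonl(filename):
    """Load a JSONL file, reporting the line number of the first bad record."""
    records = []
    with open(filename, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{filename}:{line_number}: invalid JSON ({exc.msg})") from exc
    return records
```

Typical use is `records = read_jsonl("training_data.jsonl")` followed by a count or a spot check of a few entries.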

RAG vs Fine-Tuning: Different Data Needs, Same API

RAG and fine-tuning both consume web data, but they need it in different shapes. The Browserbeam pipeline handles both; you just change the output step.

RAG Data Preparation Patterns

RAG pipelines need text chunks with metadata. The ideal input is clean markdown split on heading boundaries, with source URL and title attached for citation:

from browserbeam import Browserbeam

client = Browserbeam()

def collect_for_rag(url, max_chunk_size=1500):
    session = client.sessions.create(url=url)
    markdown = session.page.markdown.content
    title = session.page.title
    session.close()

    # Split on H2 headings for natural chunk boundaries
    sections = markdown.split("\n## ")
    chunks = []

    for i, section in enumerate(sections):
        if i > 0:
            section = "## " + section  # restore heading

        if len(section) > max_chunk_size:
            # Split long sections on paragraphs
            paragraphs = section.split("\n\n")
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) > max_chunk_size and current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para if current_chunk else para
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        elif section.strip():
            chunks.append(section.strip())

    return [{
        "text": chunk,
        "metadata": {"source": url, "title": title, "chunk_index": i}
    } for i, chunk in enumerate(chunks)]

rag_data = collect_for_rag("https://quotes.toscrape.com")
print(f"Generated {len(rag_data)} chunks for RAG ingestion")

The markdown output preserves heading structure, so chunking on ## boundaries gives you semantically coherent chunks rather than arbitrary character splits.
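Separating the chunking logic from the network call makes it testable on its own. The sketch below restates the same heading-then-paragraph strategy as a pure function. The name `chunk_markdown` is an assumption for this example, and unlike the version above it counts the blank-line separator when checking the size limit, so joined paragraphs do not overshoot.

```python
def chunk_markdown(markdown, max_chunk_size=1500):
    """Split markdown on H2 headings, then on paragraphs for oversized sections."""
    chunks = []
    for i, section in enumerate(markdown.split("\n## ")):
        if i > 0:
            section = "## " + section  # restore the heading removed by split
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chunk_size:
            chunks.append(section)
            continue
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chunk_size:
                chunks.append(current.strip())  # flush before it grows too large
                current = para
            else:
                current = candidate
        if current.strip():
            chunks.append(current.strip())
    return chunks


sample = "# Guide\n\nIntro text.\n## Setup\n\nInstall it.\n## Usage\n\nRun it."
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```

A single paragraph longer than `max_chunk_size` is still emitted whole; splitting inside a paragraph is left out to keep the sketch short.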

Fine-Tuning Data Preparation Patterns

Fine-tuning needs structured input-output pairs. Use extract to pull labeled fields and format them as training examples:

import json

# Reuses the `client` created in the RAG example above
def collect_for_fine_tuning(urls, schema):
    training_pairs = []

    for url in urls:
        session = client.sessions.create(url=url)
        try:
            result = session.extract(**schema)
        finally:
            # Always release the cloud browser, even if extraction fails
            session.close()

        for item in result.extraction.get("items", [result.extraction]):
            training_pairs.append({
                "messages": [
                    {"role": "user", "content": f"Extract the product details from this page: {url}"},
                    {"role": "assistant", "content": json.dumps(item)}
                ]
            })

    return training_pairs

pairs = collect_for_fine_tuning(
    urls=["https://books.toscrape.com"],
    schema={"items": [{"_parent": "article.product_pod", "title": "h3 a >> text", "price": ".price_color >> text"}]}
)
print(f"Generated {len(pairs)} training pairs")

Decision Framework: RAG vs Fine-Tuning Data Needs

Factor	RAG	Fine-Tuning
Data format	Markdown chunks with metadata	Structured input-output pairs (JSONL)
Volume needed	Hundreds to thousands of pages	Thousands to tens of thousands of examples
Quality bar	Good enough (retrieval compensates)	High (garbage in, garbage out)
Update frequency	Continuous (re-crawl weekly/daily)	Batch (retrain periodically)
Browserbeam method	`observe()` for markdown	`extract()` for structured JSON
Best output	`session.page.markdown.content`	`result.extraction`

Most teams start with RAG because it's faster to set up and doesn't require model training. When you hit RAG's accuracy ceiling, the same data pipeline feeds a fine-tuning workflow by changing the export format.

Token Budget Math: Raw HTML vs Structured Markdown

Let's put real numbers on the difference. We scraped five pages from the demo sites used in this guide (book listings, a quote page, and a job board) and measured the token count for each output format.

Real Benchmark Data

Page	Raw HTML Tokens	Markdown Tokens	Extract JSON Tokens	HTML Waste
Homepage (20 products)	18,420	1,240	380	93.3%
Product detail	12,680	520	85	95.9%
Category listing	16,310	980	340	94.0%
Quote page (10 quotes)	8,750	680	210	92.2%
Job board (5 jobs)	14,200	890	175	93.7%

On average, raw HTML carries 93.8% waste tokens. Markdown cuts that to roughly 37% overhead (headings, formatting markers). Structured JSON extraction eliminates overhead entirely, returning only the data fields you asked for.
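The HTML Waste column appears to be the share of raw-HTML tokens that disappear when the page is converted to markdown. Under that assumption, the column and its average can be recomputed from the token counts in the table:

```python
# (page, raw HTML tokens, markdown tokens) copied from the benchmark table
BENCHMARK = [
    ("Homepage (20 products)", 18_420, 1_240),
    ("Product detail", 12_680, 520),
    ("Category listing", 16_310, 980),
    ("Quote page (10 quotes)", 8_750, 680),
    ("Job board (5 jobs)", 14_200, 890),
]


def html_waste_pct(html_tokens, markdown_tokens):
    """Percentage of raw-HTML tokens that markdown conversion removes."""
    return 100 * (1 - markdown_tokens / html_tokens)


per_page = [html_waste_pct(html, md) for _, html, md in BENCHMARK]
average = sum(per_page) / len(per_page)

for (page, _, _), pct in zip(BENCHMARK, per_page):
    print(f"{page}: {pct:.1f}%")
print(f"Average: {average:.1f}%")
```

Running the same function over your own pages is a quick way to check whether your sources behave like these demo sites.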

Cost Comparison at Scale

Scale	Raw HTML Cost	Markdown Cost	JSON Extract Cost	Savings vs HTML
1,000 pages	$7.50	$0.50	$0.15	93-98%
10,000 pages	$75.00	$5.00	$1.50	93-98%
100,000 pages	$750.00	$50.00	$15.00	93-98%

Costs estimated at $0.50 per 1M input tokens (GPT-4o-mini pricing, 2026).

The savings compound when you factor in context window limits. With raw HTML, a 128K context window fits roughly 6 pages. With markdown, it fits 100+ pages. With structured JSON, you can batch 500+ items in a single LLM call for classification, summarization, or embedding generation.
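The cost table is plain arithmetic: pages times tokens per page times price per token. In the sketch below, the per-page token counts (15,000, 1,000, and 300) are rounded values assumed for illustration because they reproduce the table; substitute your own measurements and your model's current price.

```python
PRICE_PER_MILLION = 0.50  # USD per 1M input tokens, the rate quoted under the table

# Rounded per-page token counts (assumptions chosen to match the table above).
TOKENS_PER_PAGE = {"raw_html": 15_000, "markdown": 1_000, "json_extract": 300}


def input_cost(pages, tokens_per_page, price_per_million=PRICE_PER_MILLION):
    """Estimated input-token cost in USD for sending `pages` pages to an LLM."""
    return pages * tokens_per_page / 1_000_000 * price_per_million


for pages in (1_000, 10_000, 100_000):
    row = ", ".join(
        f"{fmt}: ${input_cost(pages, tokens):,.2f}"
        for fmt, tokens in TOKENS_PER_PAGE.items()
    )
    print(f"{pages:>7,} pages -> {row}")
```

Because cost scales linearly with both page count and tokens per page, the percentage saved stays the same at every scale; only the absolute dollar amount grows.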

Common Mistakes When Collecting Web Data for AI

Five patterns that corrupt training datasets or waste money. We've seen all of these in real pipelines.

1. Sending Raw HTML to Your LLM

The most expensive mistake. Developers fetch a page, call .text or .content on the response, and pipe it straight to the LLM. The model spends 95% of its input tokens reading <div class="css-1a2b3c"> wrappers.

Fix: Use Browserbeam's markdown output or extract. Convert the website to markdown before sending it to any LLM. The API does this in the same call that renders the page.

2. Keeping Navigation and Boilerplate in Your Data

A page's main article might be 500 tokens, but the navigation menu, footer links, sidebar ads, and cookie consent text add another 2,000. When you multiply this across thousands of pages, the noise overwhelms your training signal.

Fix: Browserbeam's default observation mode (mode="main") automatically strips nav, footer, and sidebar content. It returns only the main content area. Use mode="full" only when you specifically need the full page.

3. Ignoring Page Stability

Extracting data before a page finishes rendering produces partial or empty results. This is especially common with React and Vue apps that render content asynchronously after the initial page load.

Fix: Browserbeam's stability detection waits for network idle (300ms) and DOM quiet (200ms) before returning. Every create, goto, and click call includes this check automatically. The session.page.stable flag confirms the page was ready when data was collected.

4. Processing Pages Sequentially

A pipeline that processes one page at a time wastes 80% of its runtime waiting for network I/O. A 10,000-page collection job that takes 8 hours sequentially can finish in under 2 hours with concurrent sessions.

Fix: Use the async client (AsyncBrowserbeam in Python) with asyncio.gather and a semaphore to limit concurrency. Each session is isolated, so there are no shared-state conflicts.

5. Skipping Data Validation Before Training

Web scraping always produces some garbage: empty strings from failed extractions, duplicate entries from pagination bugs, encoding artifacts from misconfigured servers. Training on this data degrades model quality.

Fix: Add a validation step between extraction and export. Check for empty fields, duplicate URLs, and minimum content length. Drop items that don't meet your quality bar:

def validate_item(item):
    required_fields = ["title", "price"]
    for field in required_fields:
        if not item.get(field) or item[field].strip() == "":
            return False
    return True

clean_dataset = [item for item in raw_dataset if validate_item(item)]
print(f"Kept {len(clean_dataset)}/{len(raw_dataset)} items after validation")
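The check above covers empty fields. The other two checks mentioned, duplicate URLs and minimum content length, can be sketched the same way. This example assumes each item has the `url` and `markdown` keys produced by `collect_page` earlier; the function name and the 200-character threshold are arbitrary choices for illustration.

```python
def dedupe_and_filter(items, min_chars=200):
    """Drop items with a missing or repeated URL, or with too little markdown."""
    seen_urls = set()
    kept = []
    for item in items:
        url = (item.get("url") or "").strip()
        text = (item.get("markdown") or "").strip()
        if not url or url in seen_urls:
            continue  # missing URL or already collected
        if len(text) < min_chars:
            continue  # likely an error page or a failed render
        seen_urls.add(url)
        kept.append(item)
    return kept


pages = [
    {"url": "https://example.com/a", "markdown": "x" * 500},
    {"url": "https://example.com/a", "markdown": "x" * 500},  # duplicate URL
    {"url": "https://example.com/b", "markdown": "too short"},
]
print(len(dedupe_and_filter(pages)))
```

A URL is only marked as seen once an item for it has been kept, so a short first attempt does not block a later, complete copy of the same page.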

Browserbeam vs ScrapingBee vs Firecrawl for LLM Data Pipelines

Three tools dominate the "scraping API for AI" space in 2026. Here's how Browserbeam, ScrapingBee, and Firecrawl compare for building LLM training data pipelines.

Feature Comparison

Feature	Browserbeam	ScrapingBee	Firecrawl
Output format	Markdown + structured JSON	HTML (raw)	Markdown
JavaScript rendering	Full browser, auto-stability	Headless Chrome	Headless Chrome
Structured extraction	Declarative schemas with `_parent`	Manual (parse HTML yourself)	LLM-based extraction
Stability detection	Automatic (network + DOM signals)	Manual wait parameters	Basic (fixed timeout)
Infinite scroll	`scroll_collect` (one call)	Not built-in	Not built-in
AI selectors	`ai >>` prefix, cached	Not available	LLM-based, per-call cost
Element refs	Yes (stable interaction targeting)	No	No
Diff tracking	Yes (token-efficient multi-step)	No	No
Session reuse	Yes (navigate within session)	No (one-shot)	No (one-shot)
SDK languages	Python, TypeScript, Ruby, cURL	Python, Node.js, Ruby, Go	Python, Node.js
Pricing model	Credits (usage-based)	Credits (request-based)	Credits (page-based)

When to Choose Each Tool

Choose Browserbeam when you need structured output for LLMs, are building multi-step data collection workflows (login, navigate, extract), or want automatic stability detection and diff tracking. Best fit for AI data pipelines that need clean markdown and structured JSON from JavaScript-heavy sites.

Choose ScrapingBee when you primarily need raw HTML and will do your own parsing, or when you need maximum anti-bot capabilities for heavily protected sites. ScrapingBee is a good scraping proxy with browser rendering, but it returns HTML that you'll need to clean yourself for LLM consumption.

Choose Firecrawl when you need simple URL-to-markdown conversion without schema-based extraction, and you don't need interactive browser sessions (clicking, filling forms, scrolling). Firecrawl's markdown output is solid, but its structured extraction is LLM-based rather than declarative, and it lacks the session reuse that multi-page pipeline workflows need.

If you're evaluating ScrapingBee alternatives for AI agent workflows, the key differentiator is output format. ScrapingBee returns HTML. Browserbeam returns markdown and structured JSON. For LLM training data, that difference eliminates an entire parsing and cleaning step from your pipeline.

Migration from ScrapingBee

If you're currently using ScrapingBee and want to switch:

# ScrapingBee (returns raw HTML, needs parsing)
# response = requests.get(
#     "https://app.scrapingbee.com/api/v1",
#     params={"api_key": "...", "url": "https://books.toscrape.com", "render_js": "true"}
# )
# html = response.text  # Raw HTML, needs BeautifulSoup

# Browserbeam (returns clean markdown + structured JSON)
from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")

try:
    # Clean markdown, ready for LLM
    markdown = session.page.markdown.content

    # Or structured JSON, ready for training data
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    )
finally:
    # Always release the cloud browser, even if extraction raises
    session.close()

The migration replaces the fetch + parse + clean chain with a session call that returns markdown directly, plus one extract call when you need structured JSON.

Frequently Asked Questions

How do I scrape a website for LLM training data?

Create a Browserbeam session with your target URL. The response includes the page content as clean markdown, ready for LLM consumption. For structured datasets, use the extract method with a schema to pull specific fields as JSON. The cloud browser handles JavaScript rendering and page stability automatically, so you get complete page content without managing Playwright or Selenium.

What's the best way to convert a website to markdown for RAG?

Use Browserbeam's default observation mode. Call client.sessions.create(url="...") and read session.page.markdown.content. This returns the main content area as clean markdown with preserved heading structure, which chunking algorithms can split on ## boundaries for semantically coherent RAG chunks. Set max_text_length on the observe call to control output size.
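Splitting on `##` boundaries needs nothing beyond the standard library. The sketch below is one possible approach, not Browserbeam functionality: the function name `split_on_h2` and the `text`/`metadata` chunk shape are assumptions chosen for illustration, and the input is any markdown string, such as the value read from `session.page.markdown.content`.

```python
import re

def split_on_h2(markdown, source_url):
    """Split markdown into chunks, starting a new chunk at each '## ' heading."""
    # Zero-width split: break just before every line that starts with "## ".
    # "### " subheadings do not match, so they stay inside their parent chunk.
    parts = re.split(r"^(?=## )", markdown, flags=re.MULTILINE)
    chunks = []
    for part in parts:
        text = part.strip()
        if not text:
            continue
        first_line = text.splitlines()[0]
        heading = first_line[3:].strip() if first_line.startswith("## ") else None
        chunks.append({
            "text": text,
            "metadata": {"source": source_url, "heading": heading},
        })
    return chunks
```

Any content before the first `##` heading becomes its own chunk with `heading` set to `None`. One known limitation of this sketch: a line starting with `## ` inside a fenced code block would also trigger a split.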

Can I extract structured data from web pages without writing CSS selectors?

Yes. Use the ai >> prefix in your extraction schema. Instead of "h3 a >> text", write "ai >> the product title". Browserbeam's engine resolves the description to a CSS selector using AI and caches the result. The first call costs AI tokens, but subsequent calls against the same page structure reuse the cached selector for free.

How does web scraping for RAG differ from web scraping for fine-tuning?

RAG needs text chunks with source metadata for retrieval. Use session.page.markdown.content and split on heading boundaries. Fine-tuning needs structured input-output pairs. Use session.extract() with a schema to pull labeled fields, then format as JSONL instruction-response pairs. Same scraping API, different export step.

Is Browserbeam a good ScrapingBee alternative for AI data collection?

For AI-specific use cases, yes. The key difference is output format. ScrapingBee returns raw HTML that you need to parse and clean before feeding to an LLM. Browserbeam returns clean markdown and structured JSON directly, eliminating the parsing step. Browserbeam also supports session reuse for multi-page workflows, schema-based extraction, and automatic stability detection, all of which ScrapingBee lacks.

How do I handle infinite scroll pages when collecting training data?

Call session.scroll_collect(max_scrolls=30, wait_ms=1000). This scrolls through the entire page one viewport at a time, waits for lazy-loaded content at each position, and returns a single observation with the complete page content. One call replaces the manual scroll-wait-check loop. Then use extract on the fully-loaded page to pull structured data from all loaded items.

What export format should I use for LLM training data?

JSONL (one JSON object per line) is the standard for fine-tuning datasets. For RAG, use a JSON array of objects with text and metadata fields. Both formats are easy to generate from Browserbeam's output. Use session.page.markdown.content for text-based formats and session.extract() for structured field-based formats.
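As a rough illustration of the two formats, the sketch below serializes extracted items to JSONL and RAG chunks to a JSON array. The `instruction`/`response` field names and the prompt template are assumptions (fine-tuning frameworks differ in the keys they expect), and the `title`/`price` fields simply mirror the book schema used elsewhere in this guide.

```python
import json

def to_finetune_jsonl(items):
    """Serialize items as JSONL: one instruction-response object per line."""
    lines = []
    for item in items:
        title = item["title"]
        record = {
            # Hypothetical prompt template; replace with your own task framing
            "instruction": f'What is the price of "{title}"?',
            "response": item["price"],
        }
        # ensure_ascii=False keeps characters like "£" readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)

def to_rag_json(chunks):
    """Serialize RAG chunks (dicts with text and metadata) as one JSON array."""
    return json.dumps(chunks, ensure_ascii=False, indent=2)
```

Both functions return strings, so writing the files is a separate step, for example opening the target path with `encoding="utf-8"` and writing the returned value.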

Start Building Your Pipeline

You now have every piece of an LLM training data pipeline: a URL-to-markdown converter, a structured extraction engine, a scroll-collect batch processor, an async multi-site crawler, and export formatters for both RAG and fine-tuning workflows.

The core pattern is always the same: create a session, get markdown or extract structured JSON, close the session. The cloud browser handles rendering, stability, and content isolation. Your code focuses on the data shape and the export format.

Try pointing the pipeline at a site you actually need data from. Change the extraction schema to match your target fields. Swap the export function between RAG chunks and fine-tuning pairs. The SDK handles the browser layer the same way regardless of the output format.

Start with the API docs for the full method reference, or grab the SDK and run the first example:

pip install browserbeam