Browserbeam vs Raw HTML: Why AI Agents Prefer Structured Output

April 01, 2026 · 22 min read

Your agent just received 23,847 tokens of raw HTML from a single product page. Inside that wall of markup, the information it actually needs is a product name, a price, and a stock status. That's 12 tokens.

The other 23,835 tokens? Inline CSS, script tags, SVG icons, tracking pixels, base64-encoded images, and 47 nested <div> wrappers that exist for layout, not content. Your LLM reads every single one of them, pays for every single one of them, and extracts zero useful signal from 99.95% of the input.

This isn't a minor optimization opportunity. It's the difference between an agent that costs $50 to process 1,000 pages and one that costs $2. Between an agent that fits 3 pages in its context window and one that fits 50. Between a web scraping pipeline that works in a demo and one that runs in production.

This guide breaks down exactly why raw HTML fails for AI agents, what structured output looks like in practice, and how to migrate from Puppeteer or BeautifulSoup to a system where your agent reads clean data instead of parsing DOM trees.

In this guide, you'll learn:

  • Why raw HTML is the wrong input format for LLMs and AI agents
  • How Browserbeam's structured output model works (markdown, refs, schemas)
  • Side-by-side code comparisons: Puppeteer vs Browserbeam for the same task
  • Real-world output examples from e-commerce, news, and dashboard pages
  • A working scraper that extracts product data with 10 lines vs 60
  • Token cost analysis with actual numbers at scale
  • Common mistakes developers make when feeding HTML to LLMs
  • How to migrate from BeautifulSoup, Cheerio, and Selenium to Browserbeam

TL;DR: Raw HTML costs 15,000-25,000 tokens per page for AI agents, with 95% of those tokens carrying zero useful information. Browserbeam's structured output returns the same data as clean markdown (1,500-3,000 tokens), stable element refs, and declarative JSON extraction. This cuts token costs by 90-95%, eliminates brittle CSS selectors, and lets your agent reason about content instead of parsing markup.


Challenges of Raw HTML Scraping for AI

Before we look at the solution, let's understand why raw HTML is such a poor fit for LLM-based agents. The problems go deeper than just "it's a lot of text."

Token Bloat from Raw HTML

Every HTML page carries overhead that humans never see. The browser renders it visually, stripping away the noise. Your LLM doesn't have that luxury. It reads every character as a token.

Here's a real breakdown of a typical web page's token distribution:

| Content category | Typical token count | Useful for AI? |
| --- | --- | --- |
| Visible text content | 500-2,000 | Yes |
| CSS classes and <style> blocks | 3,000-5,000 | No |
| <script> tags and JS bundles | 5,000-10,000 | No |
| Tracking pixels and analytics | 1,000-3,000 | No |
| SVG icons and base64 images | 1,000-3,000 | No |
| Nested <div> layout wrappers | 2,000-4,000 | No |
| HTML attributes (data-, aria-) | 500-1,500 | Rarely |
| Total | 15,000-25,000 | ~5-10% |
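To sanity-check numbers like these on your own pages, a rough characters-per-token heuristic is enough. This sketch uses the common ~4-characters-per-token rule of thumb for English text rather than a real tokenizer, so treat the counts as estimates:

```python
# Rough token estimate: English text and markup average ~4 characters per token.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

raw_html = (
    '<div class="col-sm-6 product_main"><h1>A Light in the Attic</h1>'
    '<p class="price_color">£51.77</p><script src="/analytics.js"></script></div>'
)
visible = "A Light in the Attic £51.77"

print(estimate_tokens(raw_html))  # markup inflates the count several-fold
print(estimate_tokens(visible))
```

Run this over a few saved pages from your target sites and the 90%+ overhead ratio shows up immediately.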

At current GPT-4o pricing ($2.50 per million input tokens), processing 1,000 pages of raw HTML costs about $50 in input tokens alone. The same 1,000 pages as structured markdown costs about $5. Scale that to an agent hitting 100 sites every hour, year-round, and the raw HTML approach costs roughly $44,000 per year versus $4,400 for structured output.

Token bloat is a cost problem, a speed problem, and a quality problem all at once.

Parsing Complexity and Fragility

Even after paying the token cost, the LLM still has to find the useful content inside the HTML. That parsing step introduces three failure modes.

Class name sensitivity. A CSS selector like .product-list > .item > .price-container > span.price works until the site changes one class name. Web frameworks generate dynamic class names (_abc123), Tailwind uses utility classes (text-green-600 font-bold), and site redesigns rename everything. Your selector breaks with no warning.

Structural assumptions. XPath expressions and CSS selectors assume a specific DOM tree structure. A product price might live inside div > div > span > span today and section > p > strong tomorrow. The data hasn't changed, but your path to it has.

Encoding issues. HTML entities (&amp;, &#x27;), Unicode characters, and mixed encodings require careful handling. A price displayed as £51.77 in the browser might appear as &pound;51.77 or \u00A351.77 in the raw HTML. Your parser needs to handle all variants.
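The Python standard library handles most of this normalization. A minimal sketch that collapses the price variants above into one canonical string:

```python
import html
import unicodedata

def normalize_text(raw: str) -> str:
    """Decode HTML entities and normalize Unicode so all variants compare equal."""
    decoded = html.unescape(raw)  # "&pound;51.77" -> "£51.77"
    return unicodedata.normalize("NFC", decoded).strip()

variants = ["&pound;51.77", "\u00a351.77", "  £51.77  "]
print({normalize_text(v) for v in variants})  # one canonical value: {'£51.77'}
```

The point isn't that this is hard to write; it's that every hand-rolled scraper has to remember to write it, for every field, on every site.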

Why LLMs Struggle with Unstructured DOM

LLMs are text-processing systems. They're excellent at reasoning about natural language and structured data (JSON, markdown, tables). They're poor at parsing deeply nested tree structures with semantic meaning encoded in attribute values.

When you feed raw HTML to an LLM, you're asking it to:

  1. Skip all <script>, <style>, and <noscript> blocks
  2. Ignore CSS classes, data attributes, and tracking parameters
  3. Reconstruct the visual hierarchy from nested <div> wrappers
  4. Map form inputs to their labels (which may be siblings, parents, or connected via for attributes)
  5. Determine which elements are clickable, which are decorative, and which are interactive
  6. Handle all of this within a finite context window

That's asking the model to be an HTML parser before it can be a decision-maker. With a 128K context window, raw HTML limits your agent to processing 5-8 pages before hitting the ceiling. Structured output lets it handle 40-80 pages in the same window.

Raw HTML forces your LLM to be a parser. Structured output lets it be a thinker.


Browserbeam's Structured Output Model

Browserbeam solves the raw HTML problem at the API level. Instead of returning the DOM, it returns three structured representations of the page that LLMs can read and act on directly.

Markdown Page Content

The primary output is the page's visible content converted to clean markdown. Headings become ## markers, lists become bullets, links include their text and URL, and tables render as markdown tables. Everything invisible to a human user (scripts, styles, tracking, layout divs) is stripped.

Here's what Browserbeam returns for a bookstore page:

{
  "markdown": {
    "content": "## All products\n\n1. **A Light in the Attic** - £51.77 - In stock\n2. **Tipping the Velvet** - £53.74 - In stock\n3. **Soumission** - £50.10 - In stock\n4. **Sharp Objects** - £47.82 - In stock\n..."
  }
}

The same page as raw HTML is roughly 18,000 tokens. As Browserbeam markdown, it's about 800 tokens. The LLM reads a clean list of books with prices and stock status. No parsing required.

Markdown also preserves semantic structure. Headings indicate sections. Bold text indicates emphasis. Links indicate navigation targets. Your agent can reason about the page layout from the markdown alone, deciding which section to focus on without reading the full content.

Element Registry and Refs

Every interactive element on the page gets a stable ref: e1, e2, e3, and so on. Each ref includes the element's type, label, and role.

{
  "interactive_elements": [
    {"ref": "e1", "tag": "a", "label": "A Light in the Attic"},
    {"ref": "e2", "tag": "a", "label": "Tipping the Velvet"},
    {"ref": "e3", "tag": "a", "label": "Soumission"},
    {"ref": "e10", "tag": "a", "label": "next"},
    {"ref": "e11", "tag": "input", "label": "Search", "type": "text"}
  ]
}

Your agent clicks e1 to visit a book page. It fills e11 to search. It clicks e10 to paginate. No CSS selectors. No XPath. No brittle chains of class names that break when the site updates its Tailwind config.

The ref system is what makes Browserbeam different from just "HTML to markdown" converters. Raw markdown tells you what's on the page. Refs tell your agent what it can do.

| Element targeting | Raw HTML | Markdown converter | Browserbeam |
| --- | --- | --- | --- |
| How to click | CSS/XPath selector | Not possible | click(ref="e1") |
| How to fill forms | Find input by ID/name | Not possible | fill(ref="e11", value="...") |
| Breaks on redesign? | Yes (selectors change) | N/A | No (refs are position-based) |
| Tokens per element | 50-200 (full tag with attributes) | 0 (no interactivity) | 10-15 (ref + label) |

JSON Extract with Schemas

For data extraction, Browserbeam offers a declarative schema system. You describe the data you want using CSS selectors with >> operators, and Browserbeam returns clean JSON.

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")

result = session.extract(
    _parent="article.product_pod",
    _limit=5,
    title="h3 a >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text",
    url="h3 a >> href"
)

print(result.extraction)

Output:

[
  {"title": "A Light in the Attic", "price": "£51.77", "stock": "In stock", "url": "a-light-in-the-attic_1000/index.html"},
  {"title": "Tipping the Velvet", "price": "£53.74", "stock": "In stock", "url": "tipping-the-velvet_999/index.html"},
  {"title": "Soumission", "price": "£50.10", "stock": "In stock", "url": "soumission_998/index.html"},
  {"title": "Sharp Objects", "price": "£47.82", "stock": "In stock", "url": "sharp-objects_997/index.html"},
  {"title": "Sapiens", "price": "£54.23", "stock": "In stock", "url": "sapiens-a-brief-history-of-humankind_996/index.html"}
]

The extraction runs on the server. Your LLM never sees the page at all for data collection tasks. This is important: for scraping use cases, you don't need the LLM to read the page. You need structured data. The extract step delivers it in a single API call without burning any LLM tokens.

For a complete guide to the schema syntax and advanced extraction patterns, see the structured web scraping guide.


Side-by-Side Code Comparison

The difference between raw HTML scraping and Browserbeam's structured approach is best understood through code. Here's the same task implemented both ways: extract the top 5 books (title, price, stock status) from a bookstore.

Puppeteer + Raw HTML Approach

import puppeteer from "puppeteer";
import * as cheerio from "cheerio";

async function scrapeBooks() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto("https://books.toscrape.com", {
    waitUntil: "networkidle2",
    timeout: 30000,
  });

  const html = await page.content();
  const $ = cheerio.load(html);

  const books: Array<{title: string; price: string; stock: string}> = [];

  $("article.product_pod").each((i, el) => {
    if (i >= 5) return false;

    const title = $(el).find("h3 a").attr("title") || $(el).find("h3 a").text().trim();
    const priceRaw = $(el).find(".price_color").text().trim();
    const stockRaw = $(el).find(".instock.availability").text().trim();

    const price = priceRaw || "N/A";
    const stock = stockRaw.replace(/\s+/g, " ").trim() || "Unknown";

    books.push({ title, price, stock });
  });

  await browser.close();
  return books;
}

scrapeBooks().then(books => console.log(JSON.stringify(books, null, 2)));

That's 33 lines of code. You're managing a browser binary, setting up Puppeteer, loading Cheerio for HTML parsing, writing CSS selectors, handling null values, cleaning whitespace, and limiting the result count manually.

Browserbeam Structured Approach

curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "steps": [
      {
        "extract": {
          "_parent": "article.product_pod",
          "_limit": 5,
          "title": "h3 a >> text",
          "price": ".price_color >> text",
          "stock": ".instock.availability >> text"
        }
      }
    ]
  }'

That's the whole job in one REST call. The Python SDK version shown earlier is 9 lines. No browser binary. No HTML parser. No null checks. No whitespace cleaning. The schema defines the output shape, and Browserbeam returns clean JSON.

Lines of Code and Maintainability

| Metric | Puppeteer + Cheerio | Browserbeam |
| --- | --- | --- |
| Lines of code | 33 | 9 |
| Dependencies | 2 (puppeteer, cheerio) | 1 (browserbeam SDK) |
| Browser management | Manual (install, launch, close) | Managed (API call) |
| HTML parsing | Manual (cheerio/querySelector) | None needed |
| Null handling | Manual (fallback values) | Automatic |
| Whitespace cleaning | Manual (regex) | Automatic |
| What breaks on site redesign | CSS selectors, null checks | Possibly schema selectors |
| Recovery from breakage | Rewrite parser code | Update schema fields |

The maintainability difference compounds over time. When the bookstore site updates its markup, the Puppeteer version requires updating Cheerio selectors, null checks, and whitespace handling. The Browserbeam version needs at most a schema field update. If the site changes its visual class names but keeps the same HTML structure (common with CSS framework migrations), the Browserbeam schema keeps working.


Real-World Output Examples

Abstract comparisons only go so far. Let's look at actual Browserbeam output for three common page types.

E-Commerce Product Page

A product page from books.toscrape.com:

Browserbeam markdown output (~300 tokens):

## A Light in the Attic

**Price:** £51.77
**Availability:** In stock (22 available)
**UPC:** a897fe39b1053632
**Product Type:** Books
**Tax:** £0.00
**Number of reviews:** 0

### Product Description
It's hard to imagine a world without A Light in the Attic...

Browserbeam extract output (from schema):

result = session.extract(
    title="h1 >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text",
    description="#product_description + p >> text",
    upc="tr:nth-child(1) td >> text"
)

# Returns:
# {"title": "A Light in the Attic", "price": "£51.77",
#  "stock": "In stock (22 available)",
#  "description": "It's hard to imagine a world without...",
#  "upc": "a897fe39b1053632"}

The same page as raw HTML is roughly 12,000 tokens. The markdown version is 300. The extracted JSON is 50. Your agent picks the representation that matches the task: markdown for browsing and reasoning, JSON for data collection.

News Article with Dynamic Content

A page from Hacker News:

Browserbeam markdown output (~600 tokens for top 10 stories):

## Hacker News

1. **Story Title One** (example.org) - 142 points by user1 3 hours ago | 89 comments
2. **Story Title Two** (github.com) - 98 points by user2 5 hours ago | 45 comments
3. **Story Title Three** (arxiv.org) - 76 points by user3 2 hours ago | 23 comments
...

Browserbeam extract output:

result = session.extract(
    _parent=".athing",
    _limit=10,
    title=".titleline > a >> text",
    url=".titleline > a >> href",
    rank=".rank >> text"
)

Hacker News is a good test case because its content is entirely dynamic. The front page changes every few minutes. A static HTML scraper captures a snapshot. Browserbeam's live browser captures the current state, including content that loads asynchronously or updates after the initial render.

Dashboards and Card Layouts

For complex pages with multiple sections, such as dashboards and card grids, Browserbeam's markdown preserves the visual hierarchy. A job board like Real Python's fake-jobs page is a good stand-in: each card has a title, company, location, and date:

Browserbeam extract output:

session = client.sessions.create(url="https://realpython.github.io/fake-jobs/")

result = session.extract(
    _parent=".card",
    _limit=5,
    title="h2.title >> text",
    company="h3.company >> text",
    location=".location >> text",
    date="time >> text"
)

# Returns structured data for each job card:
# [{"title": "Senior Python Developer", "company": "Payne, Roberts and Davis",
#   "location": "Stewartbury, AA", "date": "2021-04-08"}, ...]

The raw HTML for this page includes CSS framework classes, card layout divs, responsive breakpoints, and JavaScript event handlers. None of that matters to your agent. The extract schema pulls the four fields that matter.


Example: Scrape a Product List (Puppeteer vs Browserbeam)

Here's a complete, runnable comparison. Both scripts scrape 3 pages of books from the same bookstore, handling pagination.

Puppeteer + Cheerio (~50 lines):

import puppeteer from "puppeteer";
import * as cheerio from "cheerio";

interface Book {
  title: string;
  price: string;
  stock: string;
  url: string;
}

async function scrapeWithPuppeteer(): Promise<Book[]> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allBooks: Book[] = [];

  let currentUrl = "https://books.toscrape.com";

  for (let pageNum = 0; pageNum < 3; pageNum++) {
    await page.goto(currentUrl, { waitUntil: "networkidle2", timeout: 30000 });
    const html = await page.content();
    const $ = cheerio.load(html);

    $("article.product_pod").each((_, el) => {
      const title = $(el).find("h3 a").attr("title") || $(el).find("h3 a").text().trim();
      const price = $(el).find(".price_color").text().trim() || "N/A";
      const stockRaw = $(el).find(".instock.availability").text().trim();
      const stock = stockRaw.replace(/\s+/g, " ").trim() || "Unknown";
      const url = $(el).find("h3 a").attr("href") || "";

      allBooks.push({ title, price, stock, url });
    });

    const nextLink = $("li.next a").attr("href");
    if (nextLink) {
      currentUrl = new URL(nextLink, currentUrl).href;
    } else {
      break;
    }
  }

  await browser.close();
  return allBooks;
}

scrapeWithPuppeteer().then(books => {
  console.log(`Scraped ${books.length} books`);
  console.log(JSON.stringify(books.slice(0, 3), null, 2));
});

Browserbeam (20 lines):

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []

for page_num in range(3):
    result = session.extract(
        _parent="article.product_pod",
        title="h3 a >> text",
        price=".price_color >> text",
        stock=".instock.availability >> text",
        url="h3 a >> href"
    )
    all_books.extend(result.extraction)

    elements = session.page.interactive_elements
    next_btn = next((e for e in elements if "next" in e.get("label", "").lower()), None)
    if next_btn:
        session.click(ref=next_btn["ref"])
    else:
        break

session.close()
print(f"Scraped {len(all_books)} books")

The Puppeteer version manages a browser, loads Cheerio, writes CSS selectors, handles nulls, cleans whitespace, and reconstructs pagination URLs manually. The Browserbeam version defines a schema, clicks "next" by ref, and gets clean JSON at each step.

| Metric | Puppeteer + Cheerio | Browserbeam |
| --- | --- | --- |
| Lines of code | ~50 (plus type definition) | 20 |
| Browser management | puppeteer.launch(), browser.close() | API handles it |
| Pagination | Manual URL construction | Click ref e10 |
| Null handling | 4 fallback conditions | None needed |
| Result format | Custom object construction | JSON from schema |
| Runs on | Machine with Chrome installed | Any machine with HTTP |

Performance and Cost Comparison

Token counts and dollar amounts tell the story more clearly than code comparisons.

| Metric | Raw HTML (Puppeteer/Selenium) | Browserbeam Markdown | Browserbeam Extract |
| --- | --- | --- | --- |
| Tokens per page | 15,000-25,000 | 1,500-3,000 | 50-500 (data only) |
| Cost per 1,000 pages (GPT-4o input) | ~$50 | ~$5 | ~$0.50 |
| Pages per 128K context | 5-8 | 40-80 | 250+ |
| Parse time (LLM reasoning) | High (must find data in noise) | Low (clean markdown) | Zero (data already structured) |
| Parsing accuracy | Variable (depends on HTML complexity) | High (markdown is readable) | Perfect (schema-defined) |

The extract path is the most efficient for data collection. The LLM never sees the page at all. Browserbeam handles rendering, stability detection, and data extraction on the server. Your LLM only receives the structured JSON output.

For LLM-powered browsing tasks where the agent needs to make decisions (click, navigate, fill forms), the markdown representation is the right choice. It gives the model enough context to reason about the page while keeping token costs 90% lower than raw HTML.

For pure data extraction (monitoring prices, scraping listings, collecting datasets), use extract directly and skip the LLM entirely. You still get structured output, but without any LLM token cost.


Common Mistakes with Raw HTML Agents

Five patterns that cost developers time, money, and reliability when building AI agents with browser access.

Parsing Entire DOM Trees

The most common mistake: sending page.content() or document.documentElement.outerHTML directly to the LLM. This includes everything: <head> metadata, <script> bundles, inline styles, hidden elements, and the actual content somewhere in the middle.

Fix: If you must use raw tools, at minimum strip <script>, <style>, <noscript>, and hidden elements before sending to the LLM. Better: use Browserbeam's markdown output, which handles all filtering automatically.
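If you're staying on raw tooling for now, the minimum fix needs nothing beyond the standard library. This sketch skips script, style, and noscript content and keeps only visible text (a real implementation would also drop elements hidden with CSS, which html.parser cannot see):

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping <script>, <style>, and <noscript> content."""
    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth_in_skipped = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.depth_in_skipped += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.depth_in_skipped:
            self.depth_in_skipped -= 1

    def handle_data(self, data):
        if not self.depth_in_skipped and data.strip():
            self.chunks.append(data.strip())

page = (
    "<html><head><style>.price{color:green}</style>"
    "<script>trackPageView()</script></head>"
    "<body><h1>A Light in the Attic</h1><p>£51.77 - In stock</p></body></html>"
)
parser = VisibleTextExtractor()
parser.feed(page)
print(" ".join(parser.chunks))
```

Even this minimal version cuts the token count dramatically, but it still loses the interactivity information (what's clickable, what's a form field) that the ref system preserves.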

Relying on CSS Selectors for LLM Input

Developers often build a "smart" layer that extracts content via CSS selectors and formats it for the LLM. This works until the selectors break, which happens on every site redesign.

Fix: Use Browserbeam's element refs instead of CSS selectors. Refs are positional, not structural. They survive markup changes because they're assigned dynamically based on the rendered page, not the DOM tree structure.

Not Handling Dynamic Content

Raw HTML scrapers that use requests.get() or fetch() miss everything rendered by JavaScript. That includes React/Vue/Angular content, lazy-loaded images, and AJAX-fetched data. The scraper gets an empty shell.

Fix: Use a real browser (Browserbeam runs Chromium) and wait for stability. Browserbeam's stable: true signal tells your agent when all JavaScript has executed, all network requests have completed, and the page is ready to read.

Ignoring Token Costs

Many developers don't track how many tokens their agent consumes per task. An agent that runs 100 tasks per day at 20,000 tokens per page costs $5/day in LLM input tokens alone. That's $1,800/year just for reading pages, before any reasoning or output tokens.

Fix: Monitor token usage per task. Compare markdown (2,000 tokens) vs raw HTML (20,000 tokens) for each page type your agent visits. Use diff tracking for multi-step workflows to avoid re-reading unchanged content. Log the token count for each LLM call and set alerts for spikes.
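A minimal tracker is enough to make costs visible. TokenBudget here is a hypothetical illustration, not a library class; wire it to whatever prompt-token usage field your LLM client returns:

```python
# Hypothetical per-task token tracker: log each LLM call and flag spikes.
class TokenBudget:
    def __init__(self, alert_threshold: int = 10_000):
        self.alert_threshold = alert_threshold
        self.total = 0

    def record(self, label: str, prompt_tokens: int) -> None:
        self.total += prompt_tokens
        status = "ALERT" if prompt_tokens > self.alert_threshold else "ok"
        print(f"[{status}] {label}: {prompt_tokens} tokens (task total: {self.total})")

budget = TokenBudget()
budget.record("product page, raw HTML", 21_340)   # ALERT
budget.record("product page, markdown", 1_980)    # ok
```

Once the alert fires on a page type, that's your signal to switch that step from raw HTML to markdown or extract.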

Building Custom Parsers Instead of Using Extract

Some teams write custom HTML-to-JSON parsers for each data type they need to extract. A product parser, a job listing parser, a news article parser. Each one is 50-100 lines of code with its own set of CSS selectors, null handling, and edge cases.

Fix: Replace custom parsers with Browserbeam's extract schemas. One schema per data type, 3-5 lines each, no parsing code. The schema is declarative ("what data, not how to get it"), so it's easier to maintain, test, and update.

# Instead of 80 lines of BeautifulSoup parsing:
result = session.extract(
    _parent="article.product_pod",
    title="h3 a >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text"
)

Migration from Raw Scraping to Browserbeam

If you have existing scrapers built with BeautifulSoup, Cheerio, or Selenium, here's how to migrate them.

Replacing BeautifulSoup/Cheerio with Extract

BeautifulSoup and Cheerio scrapers follow a pattern: fetch HTML, load it into a parser, select elements, extract text, handle nulls. The Browserbeam equivalent replaces all of that with a schema.

Before (BeautifulSoup):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")

books = []
for article in soup.select("article.product_pod")[:5]:
    title_el = article.select_one("h3 a")
    price_el = article.select_one(".price_color")
    stock_el = article.select_one(".instock.availability")

    books.append({
        "title": title_el.get("title", title_el.text.strip()) if title_el else "N/A",
        "price": price_el.text.strip() if price_el else "N/A",
        "stock": stock_el.text.strip().replace("\n", " ").strip() if stock_el else "Unknown"
    })

After (Browserbeam):

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(
    _parent="article.product_pod",
    _limit=5,
    title="h3 a >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text"
)
books = result.extraction
session.close()

The selector patterns are similar (h3 a, .price_color), so migration is usually mechanical. The key differences: no null handling (Browserbeam returns empty strings, not None), no whitespace cleaning (automatic), and JavaScript-rendered content works without adding Selenium or Playwright.
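To see how mechanical the translation is, here's a hypothetical helper (the tuple format and function name are mine, not part of any SDK) that restates BeautifulSoup-style selector/attribute pairs in the >> schema syntax:

```python
def to_extract_schema(fields: dict[str, tuple[str, str]]) -> dict[str, str]:
    """Map (selector, attribute) pairs to the 'selector >> attribute' syntax."""
    return {name: f"{sel} >> {attr}" for name, (sel, attr) in fields.items()}

# The selectors from the BeautifulSoup version above, restated as a schema:
schema = to_extract_schema({
    "title": ("h3 a", "text"),
    "price": (".price_color", "text"),
    "url": ("h3 a", "href"),
})
print(schema)
# {'title': 'h3 a >> text', 'price': '.price_color >> text', 'url': 'h3 a >> href'}
```

In most migrations you won't need a helper at all; rewriting the handful of selectors by hand takes minutes.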

Replacing Selenium/Puppeteer with Session API

Selenium and Puppeteer scrapers use a browser, but they still return raw HTML that you parse yourself. Migration means replacing the browser management and HTML parsing layers while keeping your extraction logic.

| Selenium/Puppeteer concept | Browserbeam equivalent |
| --- | --- |
| webdriver.Chrome() / puppeteer.launch() | client.sessions.create(url=...) |
| driver.get(url) / page.goto(url) | session.goto(url=...) |
| driver.find_element(By.CSS_SELECTOR, ...) | session.click(ref="e1") |
| element.send_keys("text") | session.fill(ref="e1", value="text") |
| driver.page_source / page.content() | session.page.markdown.content |
| WebDriverWait(driver, 10) | Automatic (stable: true) |
| driver.quit() / browser.close() | session.close() |

The biggest win is eliminating WebDriverWait and explicit sleep timers. Browserbeam's stability detection handles page readiness automatically, so you never wait longer than necessary and never act on a half-loaded page.

Migration Checklist

Follow this checklist when migrating an existing scraper to Browserbeam:

  1. Inventory your selectors. List every CSS selector and XPath expression in your current scraper. These map to Browserbeam extract schema fields.
  2. Replace fetch + parse with sessions.create. One API call replaces requests.get + BeautifulSoup or puppeteer.launch + page.goto + page.content.
  3. Convert selectors to extract schemas. soup.select(".price") becomes price=".price >> text". el.get("href") becomes url="a >> href".
  4. Remove null handling. Browserbeam returns empty strings for missing elements, so you can remove if el else "N/A" patterns.
  5. Remove wait logic. Delete sleep(), WebDriverWait, and waitForSelector calls. Browserbeam handles stability automatically.
  6. Replace pagination logic. Instead of constructing URLs, use session.click(ref=next_button_ref) and extract from the new page.
  7. Test with your target sites. Run the new scraper against the same URLs and compare output. The data should match (or be cleaner) with less code.
  8. Remove browser dependencies. Uninstall Puppeteer, Selenium, ChromeDriver, and any browser binaries from your CI/CD pipeline.

Token Cost Analysis

Hard numbers on what structured output saves. These costs are based on GPT-4o pricing ($2.50 per million input tokens, $10 per million output tokens) as of early 2026.

Raw HTML vs Markdown vs Extract

| Output type | Avg. tokens/page | 1,000 pages | 10,000 pages | 100,000 pages |
| --- | --- | --- | --- | --- |
| Raw HTML | 20,000 | $50.00 | $500.00 | $5,000.00 |
| Browserbeam Markdown | 2,000 | $5.00 | $50.00 | $500.00 |
| Browserbeam Markdown (truncated 1K) | 400 | $1.00 | $10.00 | $100.00 |
| Browserbeam Extract (data only) | 100 | $0.25 | $2.50 | $25.00 |

At scale, the savings are striking. A web scraping pipeline that processes 100,000 pages per month saves $4,500/month by switching from raw HTML to Browserbeam markdown. If the task is pure data extraction (no LLM reasoning needed), the extract path costs $25 versus $5,000.
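The dollar figures above follow from a single multiplication; here's a quick sketch to reproduce them for your own page volumes and token counts:

```python
PRICE_PER_MILLION_INPUT = 2.50  # GPT-4o input pricing used throughout this article

def input_cost(pages: int, tokens_per_page: int) -> float:
    """LLM input cost in USD for a batch of pages."""
    return pages * tokens_per_page / 1_000_000 * PRICE_PER_MILLION_INPUT

for label, tokens in [("Raw HTML", 20_000), ("Markdown", 2_000), ("Extract", 100)]:
    print(f"{label}: ${input_cost(100_000, tokens):,.2f} per 100,000 pages")
```

Swap in your own per-page token measurements (see the estimator earlier in this guide) to project costs before committing to an architecture.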

Cost Savings at Scale

Beyond per-page token costs, structured output saves money in three other ways:

Fewer LLM calls. When your agent receives clean, readable data, it makes better decisions in fewer iterations. A raw HTML agent might take 8-10 iterations to find the right data on a page. A structured output agent typically finishes in 2-3 iterations because the information is immediately visible.

Smaller context windows. Raw HTML fills context windows fast. With a 128K token limit, you can fit about 6 pages of raw HTML or about 60 pages of Browserbeam markdown. That means fewer sessions, fewer re-prompts, and more room for the model's reasoning tokens.

No parsing failures. Raw HTML parsing can fail silently. The LLM might extract the wrong price, miss a field, or hallucinate a value that looks plausible but doesn't exist on the page. Structured extraction schemas return exactly the fields you defined, and empty strings when elements are missing. No hallucinated data.

When Raw HTML Is Acceptable

Raw HTML isn't always wrong. Three situations where it makes sense:

  1. Testing your own app. If you're running E2E tests against your own codebase, you control the markup and can write stable selectors. The token cost doesn't matter because you're not sending HTML to an LLM.
  2. One-off debugging. When you need to inspect the full DOM to diagnose a rendering issue, raw HTML gives you the complete picture that markdown strips away.
  3. Custom parsing requirements. If you need to extract data from HTML attributes, data-* properties, or inline styles that Browserbeam's markdown doesn't surface, raw HTML gives you access to everything.

For every other use case, and especially for AI agent workflows, structured output is the better choice.


Frequently Asked Questions

What is web scraping with structured output?

Web scraping with structured output means extracting data from websites as clean, typed JSON instead of raw HTML strings. Instead of fetching a page and parsing HTML with BeautifulSoup or Cheerio, you define a declarative schema (field names mapped to CSS selectors) and the browser API returns structured data directly. Browserbeam's extract step does this in a single API call.

How does Browserbeam convert HTML to JSON?

Browserbeam runs a real Chromium browser, renders the page (including JavaScript), waits for stability, then extracts data using the schema you provide. The >> operator in selectors specifies what to extract: >> text for text content, >> href for link URLs, >> src for image sources, >> content for meta tag content. The result is a JSON array of objects matching your schema.

Can I scrape JavaScript-rendered pages without Puppeteer?

Yes. Browserbeam runs a managed Chromium browser in the cloud. It handles JavaScript execution, waits for SPAs to render, and returns the fully rendered page state. You don't need to install Puppeteer, Playwright, or any browser binary. Install the Python SDK, make API calls, and get structured data back.

How does structured output reduce LLM token costs?

Raw HTML contains 15,000-25,000 tokens per page, with 90-95% of those tokens being CSS, scripts, and layout markup that carry no useful information. Browserbeam's markdown output strips all of that, returning 1,500-3,000 tokens of clean, readable content. The extract step goes further, returning only the specific fields you requested (50-500 tokens). Lower token counts mean lower LLM API costs and more pages per context window.

Is web scraping with Browserbeam legal?

Web scraping legality depends on what you scrape, how you use it, and the target site's terms of service. Browserbeam is a browser automation tool, not a legal framework. Generally, scraping publicly available data for personal use or research is acceptable in most jurisdictions. Always review the target site's robots.txt and terms of service. For more on responsible scraping practices, see our structured scraping guide.

What is the difference between web scraping and using an API?

A web API provides structured data directly (JSON or XML) through documented endpoints. Web scraping extracts data from the rendered HTML of a web page. APIs are more reliable and efficient when available. Web scraping is necessary when no API exists, when the API lacks the data you need, or when API rate limits are too restrictive. Browserbeam bridges the gap by providing API-quality structured output from web pages that don't offer APIs.

How do Browserbeam's element refs work?

Element refs (e1, e2, e3...) are stable identifiers assigned to every interactive element on the page. Each ref includes the element's tag type, label, and role. Your agent clicks ref="e1" instead of constructing CSS selectors. Refs are assigned dynamically based on the rendered page, so they survive markup changes, CSS framework migrations, and site redesigns. They persist within a session and update on navigation.

Can I migrate my existing BeautifulSoup scrapers to Browserbeam?

Yes. The migration is mostly mechanical. CSS selectors from BeautifulSoup translate directly to Browserbeam extract schema fields: soup.select_one(".price").text becomes price=".price >> text". The main changes are removing null checks (Browserbeam handles missing elements), removing wait logic (stability detection is automatic), and replacing requests.get + BeautifulSoup with client.sessions.create + session.extract. Most scrapers migrate in under an hour.


Conclusion

The gap between raw HTML and structured output is not a performance optimization. It's a category difference. One approach sends your LLM 20,000 tokens of noise and asks it to find the signal. The other sends the signal directly.

For AI agent builders, the choice is straightforward. Use Browserbeam's markdown when your agent needs to reason about a page (browse, navigate, decide). Use extract when you need structured data (scrape, monitor, collect). Use raw HTML only when you're testing your own app and the LLM isn't involved.

The numbers make the case: 90-95% fewer tokens, 10x lower cost per page, 60 lines of parsing code replaced by 5 lines of schema definition, and zero broken selectors when a site changes its CSS framework.

Start with the Browserbeam API docs to explore the full endpoint reference. Install the SDK and try extracting data from a real page:

pip install browserbeam        # Python
npm install @browserbeam/sdk   # TypeScript
gem install browserbeam        # Ruby

Sign up for a free account and run the book scraper from this guide. Compare the output to your current Puppeteer or BeautifulSoup setup. The difference speaks for itself.


No credit card required. 1 hour of free runtime included.