Data Extraction with Browserbeam: From Browser to JSON

April 02, 2026 22 min read

Most web scraping code is parsing code. You write 10 lines to fetch a page and 60 lines to find, extract, clean, and validate the data inside it. When the site changes its markup, those 60 lines break. When you target a new site, you write 60 new lines. The ratio never improves.

There's a different model. Instead of writing parsing logic, you describe the data shape you want, and the extraction engine finds it, cleans it, and returns it as typed JSON. The parsing problem becomes a schema design problem. And schemas are three lines, not sixty.

This guide goes under the hood of Browserbeam's extract engine. You'll learn how schema-driven extraction works, how to design schemas that handle everything from flat pages to deeply nested data structures, and how to debug extractions when they don't return what you expect. This is the "how it works and why it works that way" companion to the structured web scraping guide, which covers the step-by-step tutorial.

In this guide, you'll learn:

  • How Browserbeam's extract engine processes schemas against a rendered DOM
  • The three schema types: flat, nested, and array, with patterns for each
  • Advanced schema patterns: conditional fields, computed values, and multi-page pipelines
  • A working example that extracts a product table into structured JSON
  • Extraction accuracy benchmarks across different page types
  • How to debug schemas when fields come back empty or incorrect
  • Five common extraction mistakes and how to fix them

TL;DR: Browserbeam's extract engine runs declarative schemas against a rendered DOM, matching CSS selectors to extract typed JSON. Schemas use >> text, >> href, >> src, and >> content operators to pull specific attributes. The _parent key scopes extraction to repeating elements, _limit controls result count, and js >> expressions run JavaScript for computed fields. The engine handles null fields, whitespace cleaning, and type coercion automatically.


Beyond Parsing: Semantic Data Extraction

Traditional web scraping splits into two phases: getting the page and getting the data out of the page. The first phase is straightforward (make an HTTP request or automate a browser). The second phase is where the real work happens.

With BeautifulSoup, you write imperative code: find the element, check if it exists, get the text, strip whitespace, handle encoding, convert types. With Cheerio or lxml, the syntax changes but the pattern is identical. Every field requires a selector, a null check, and a cleanup step.
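To see why the ratio is so lopsided, here is the imperative pattern for even a single field, sketched with Python's standard-library html.parser standing in for BeautifulSoup (the HTML snippet and class name are invented for illustration):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Imperative extraction: find the element, track state, collect text."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title = (self.title or "") + data

html = "<body><h1>\n  A Light in the Attic </h1></body>"
parser = TitleExtractor()
parser.feed(html)

# The null check and cleanup are still on you:
title = parser.title.strip() if parser.title else "N/A"
print(title)  # A Light in the Attic
```

Roughly twenty lines for one field, before any type conversion or error handling. Multiply by five fields and a pagination loop, and the 60-line figure is conservative.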

Browserbeam's extract engine works differently. It takes a declarative schema (a mapping of field names to selector expressions) and returns structured JSON. The schema describes what you want. The engine figures out how to get it.

Here's what that looks like in practice:

curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "steps": [{
      "extract": {
        "_parent": "article.product_pod",
        "_limit": 5,
        "title": "h3 a >> text",
        "price": ".price_color >> text",
        "url": "h3 a >> href"
      }
    }]
  }'

The key difference from imperative parsing: no loops, no null checks, no whitespace cleaning, no type conversion. The schema is the entire extraction logic. Understanding how the engine interprets that schema is what this guide is about.


How the Extract Engine Works

The extract step runs after the page has been rendered by Chromium, after JavaScript has executed, and after stability detection confirms the page is ready. By the time your schema runs, the DOM is complete. No race conditions, no missing elements from lazy loading.

Schema-Driven Extraction

The engine processes a schema in three phases:

  1. Scope resolution. If the schema includes a _parent key, the engine finds all matching elements and processes each one independently. Without _parent, the entire document is the scope.

  2. Field extraction. For each scope, the engine evaluates every field. A field like "title": "h3 a >> text" splits into two parts:

    • Selector (h3 a): a CSS selector evaluated within the current scope
    • Operator (>> text): what to extract from the matched element

  3. Result assembly. Extracted values are assembled into a JSON object (or array, if _parent is used). Missing elements produce empty strings, not null values or errors.
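As a mental model, the three phases condense into a few lines of Python. This is an illustrative sketch, not the real engine: query_all and read are stand-ins for the DOM layer, and the toy document fakes selector matching by exact string comparison.

```python
def run_schema(query_all, read, document, schema):
    """Sketch of the three phases: scope resolution, field extraction,
    result assembly. Illustrative control flow only, not the real engine."""
    parent = schema.get("_parent")
    limit = schema.get("_limit")
    fields = {k: v for k, v in schema.items() if not k.startswith("_")}

    # Phase 1: scope resolution
    scopes = query_all(document, parent) if parent else [document]
    scopes = scopes[:limit] if limit else scopes

    # Phase 2: field extraction (missing elements -> empty string, not None)
    def extract_one(scope):
        row = {}
        for name, expr in fields.items():
            selector, _, operator = expr.partition(" >> ")
            matches = query_all(scope, selector)
            row[name] = read(matches[0], operator) if matches else ""
        return row

    # Phase 3: result assembly (array with _parent, single object without)
    rows = [extract_one(s) for s in scopes]
    return rows if parent else rows[0]

# Toy DOM: each element records the selector it would match
doc = [
    {"sel": "article.product_pod",
     "kids": [{"sel": "h3 a", "text": "Book A", "href": "/a"},
              {"sel": ".price_color", "text": "£10.00"}]},
    {"sel": "article.product_pod",
     "kids": [{"sel": "h3 a", "text": "Book B", "href": "/b"}]},
]

def query_all(scope, sel):
    pool = scope if isinstance(scope, list) else scope["kids"]
    return [e for e in pool if e["sel"] == sel]

def read(el, op):
    return el.get(op, "")

rows = run_schema(query_all, read, doc, {
    "_parent": "article.product_pod",
    "title": "h3 a >> text",
    "price": ".price_color >> text",
})
print(rows)
# [{'title': 'Book A', 'price': '£10.00'}, {'title': 'Book B', 'price': ''}]
```

Note how the second product, which has no price element, yields "" rather than an error: phase 2 encodes the partial-extraction behavior described later in this guide.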

The >> operator is the core of the syntax. It separates the "find" step from the "read" step:

| Operator | What It Returns | Example |
| --- | --- | --- |
| >> text | Visible text content (trimmed, whitespace-normalized) | "h1 >> text" returns "A Light in the Attic" |
| >> href | The href attribute (for links) | "a >> href" returns "/catalogue/page-2.html" |
| >> src | The src attribute (for images, scripts) | "img >> src" returns "https://..." |
| >> content | The content attribute (for meta tags) | "meta[name=keywords] >> content" |
| >> data-* | Any data- attribute | ".product >> data-sku" returns "SKU-123" |
| >> value | The current value of an input field | "input[name=qty] >> value" returns "1" |
| >> class | The class attribute | "p.star-rating >> class" returns "star-rating Three" |
| (omitted) | Raw inner HTML of the element | "div.content" returns the HTML string |

When you omit the >> operator entirely, the engine returns the raw inner HTML. This is useful when you need the markup itself (for rendering or further processing), but for most data extraction tasks, >> text is what you want.
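The split-then-read behavior can be sketched in a few lines of Python. This is illustrative only: parse_field and apply_operator are invented names, and the element is a stand-in dict rather than a real DOM node.

```python
import re

def parse_field(expr):
    """Split 'h3 a >> text' into ('h3 a', 'text'); no '>>' means raw HTML."""
    if " >> " in expr:
        selector, op = expr.rsplit(" >> ", 1)
        return selector, op
    return expr, None

def apply_operator(el, op):
    """Illustrative dispatch over a stand-in element dict:
    {"text": ..., "html": ..., "attrs": {...}}."""
    if op is None:                  # omitted operator: raw inner HTML
        return el["html"]
    if op == "text":                # trimmed, whitespace-normalized
        return re.sub(r"\s+", " ", el["text"]).strip()
    return el["attrs"].get(op, "")  # href, src, content, data-*, value, class

el = {"text": "  A Light\n   in the Attic ",
      "html": "<a href='/x'>A Light in the Attic</a>",
      "attrs": {"href": "/x", "class": "star-rating Three"}}

selector, op = parse_field("h3 a >> text")
print(apply_operator(el, op))        # A Light in the Attic
print(apply_operator(el, "class"))   # star-rating Three
```

The whitespace normalization in the text branch is why extracted strings come back clean without any .strip() calls in your code.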

LLM-Powered Field Mapping

For simple schemas with clear CSS selectors, the engine runs direct DOM queries. But for AI agents that generate schemas dynamically, the extract engine can also interpret natural-language-style field descriptions.

The key insight: an AI agent reading a page through Browserbeam's markdown output sees element labels and structure. When the agent constructs a schema, it uses the CSS selectors visible in the interactive elements list. The pieces work together: the observe step provides the page map, the agent decides what to extract, and the extract step pulls it.

This matters because it means the schema can come from a human developer (who knows CSS) or from an LLM (that reads the page state and builds selectors). The extract engine doesn't care who wrote the schema. It runs the same extraction pipeline either way.

Fallback and Confidence Scoring

When a selector matches no elements, the engine returns an empty string for that field. It does not throw an error. This design choice is intentional: partial extraction is more useful than failed extraction.

Consider extracting 5 fields from a product page. If field 4 ("discount price") doesn't exist on this particular product, you want the other 4 fields to come through. An exception-based model would abort the entire extraction.

| Scenario | Engine Behavior | Result |
| --- | --- | --- |
| Selector matches one element | Extracts the specified attribute | "A Light in the Attic" |
| Selector matches multiple elements | Returns first match (or all, for [array] syntax) | "A Light in the Attic" |
| Selector matches nothing | Returns empty string | "" |
| _parent matches nothing | Returns empty array | [] |
| Invalid CSS selector | Returns error in response | {"error": "invalid_selector"} |

This means you can write "optimistic" schemas that include fields which may not exist on every page. Extract the discount price, the review count, the seller rating. If any of those fields are missing on a specific page, the other fields still populate correctly.
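A small post-processing helper makes the optimistic pattern explicit in application code. This is an illustrative sketch: the helper and the field names below are invented, not part of the SDK.

```python
def audit_record(record, required=()):
    """After an optimistic extraction, convert absent fields ("") to None
    and flag any required fields that came back empty."""
    missing = [k for k in required if record.get(k, "") == ""]
    cleaned = {k: (v if v != "" else None) for k, v in record.items()}
    return cleaned, missing

# One extracted row: the optional fields simply came back empty
row = {"title": "A Light in the Attic", "price": "£51.77",
       "discount_price": "", "seller_rating": ""}

cleaned, missing = audit_record(row, required=("title", "price"))
print(cleaned["discount_price"])  # None
print(missing)                    # []
```

Distinguishing "required and empty" from "optional and empty" in your own code keeps the schema optimistic while still catching genuinely broken selectors.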


Defining Schemas for Robust Extraction

Schema design determines extraction quality. A well-designed schema is concise, resilient to minor markup changes, and readable by both humans and LLMs. Here are the three schema types.

Flat Schemas for Simple Pages

A flat schema maps field names directly to selector expressions. No nesting, no arrays. Use it when you need individual data points from a page.

session = client.sessions.create(
    url="https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
)

result = session.extract(
    title="h1 >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text",
    description="#product_description + p >> text",
    upc="tr:nth-child(1) td >> text",
    category=".breadcrumb li:nth-child(3) a >> text"
)

print(result.extraction)
# {"title": "A Light in the Attic", "price": "£51.77",
#  "stock": "In stock (22 available)",
#  "description": "It's hard to imagine a world without...",
#  "upc": "a897fe39b1053632", "category": "Poetry"}

Each field is independent. The engine evaluates them in parallel against the full page DOM. Field names are arbitrary (you choose them), so the output JSON matches whatever shape your downstream code expects.

Two design tips for flat schemas:

  1. Use the most specific selector that still works. h1 >> text is better than .product-title >> text because h1 survives CSS class renames. But if the page has multiple h1 elements, you need a more specific path.
  2. Prefer attribute selectors over nth-child. tr:nth-child(1) td works but breaks if the table adds a new row at the top. If the row has a distinguishing class or text, use that instead.

Nested Object Schemas

When extracted data has a natural hierarchy, nested schemas keep the structure clear without post-processing.

result = session.extract(
    book={
        "title": "h1 >> text",
        "price": ".price_color >> text"
    },
    breadcrumb={
        "category": ".breadcrumb li:nth-child(3) a >> text",
        "category_url": ".breadcrumb li:nth-child(3) a >> href"
    },
    details={
        "upc": "tr:nth-child(1) td >> text",
        "product_type": "tr:nth-child(2) td >> text",
        "tax": "tr:nth-child(5) td >> text",
        "reviews": "tr:nth-child(7) td >> text"
    }
)

The output mirrors the schema structure:

{
  "book": {"title": "A Light in the Attic", "price": "£51.77"},
  "breadcrumb": {"category": "Poetry", "category_url": "catalogue/category/books/poetry_23/index.html"},
  "details": {"upc": "a897fe39b1053632", "product_type": "Books", "tax": "£0.00", "reviews": "0"}
}

Nested schemas are useful when you're feeding the extracted data into a typed system (a database, a TypeScript interface, a Pydantic model). The structure maps directly to your data model.
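For example, the nested output above drops straight into typed models. The sketch below uses stdlib dataclasses as a stand-in for Pydantic; the class names are invented, and `raw` mirrors the extraction output shown above.

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    price: str

@dataclass
class Breadcrumb:
    category: str
    category_url: str

@dataclass
class ProductPage:
    book: Book
    breadcrumb: Breadcrumb

# Mirrors the nested extraction result shown above
raw = {
    "book": {"title": "A Light in the Attic", "price": "£51.77"},
    "breadcrumb": {"category": "Poetry",
                   "category_url": "catalogue/category/books/poetry_23/index.html"},
}

page = ProductPage(book=Book(**raw["book"]),
                   breadcrumb=Breadcrumb(**raw["breadcrumb"]))
print(page.book.title)           # A Light in the Attic
print(page.breadcrumb.category)  # Poetry
```

Because the schema's nesting matches the model's nesting, the mapping is a pair of ** unpacks rather than a hand-written transformation layer.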

Array Schemas for Lists and Tables

The _parent key transforms a schema from "extract one item" to "extract all matching items." Every element matching the _parent selector becomes a scope, and the child fields are extracted within each scope.

session = client.sessions.create(url="https://books.toscrape.com")

result = session.extract(
    _parent="article.product_pod",
    title="h3 a >> text",
    price=".price_color >> text",
    stock=".instock.availability >> text",
    url="h3 a >> href",
    image="img >> src"
)

print(f"Found {len(result.extraction)} books")
for book in result.extraction[:3]:
    print(f"  {book['title']}: {book['price']}")

The _parent selector works like a loop without the loop. The engine finds every article.product_pod on the page (there are 20 on the first page of books.toscrape.com), then runs the child selectors within each one. The output is a JSON array of objects.

For HTML tables, the engine has a shortcut. Pass a table selector without child fields, and it auto-parses the table using header row values as keys:

result = session.extract(
    table_data="table.table-striped"
)
# Returns: [{"UPC": "a897fe39b1053632", "Product Type": "Books", ...}]

This auto-parsing reads the <thead> cells as field names and each <tbody> row as an object. No schema needed for standard HTML tables.
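The auto-parsing convention is easy to picture as a standalone sketch. The minimal parser below (stdlib html.parser, not the real engine) shows the thead-cells-as-keys, tbody-rows-as-objects idea:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Sketch of the table shortcut: <thead> cells become keys,
    each <tbody> row becomes an object. Illustrative only."""
    def __init__(self):
        super().__init__()
        self.headers, self.rows = [], []
        self._row, self._cell = None, None
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_head = True
        elif tag == "tr" and not self._in_head:
            self._row = []
        elif tag in ("td", "th"):
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_head = False
        elif tag in ("td", "th"):
            cell, self._cell = self._cell.strip(), None
            if self._in_head:
                self.headers.append(cell)
            elif self._row is not None:
                self._row.append(cell)
        elif tag == "tr" and self._row is not None:
            self.rows.append(dict(zip(self.headers, self._row)))
            self._row = None

html = """<table>
  <thead><tr><th>UPC</th><th>Product Type</th></tr></thead>
  <tbody><tr><td>a897fe39b1053632</td><td>Books</td></tr></tbody>
</table>"""

p = TableParser()
p.feed(html)
print(p.rows)  # [{'UPC': 'a897fe39b1053632', 'Product Type': 'Books'}]
```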


Advanced Schema Patterns

Beyond the three basic types, schemas support patterns that handle the trickier corners of web extraction.

Conditional Fields

Some pages have fields that only appear under certain conditions (sale prices, out-of-stock badges, promotional labels). You can include these in your schema safely because the engine returns empty strings for missing selectors.

result = session.extract(
    _parent="article.product_pod",
    title="h3 a >> text",
    regular_price=".price_color >> text",
    sale_price=".price_sale >> text",
    badge=".badge >> text",
    stock=".instock.availability >> text"
)

for item in result.extraction:
    price = item["sale_price"] if item["sale_price"] else item["regular_price"]
    badge = f" [{item['badge']}]" if item["badge"] else ""
    print(f"{item['title']}: {price}{badge}")

The schema includes both regular_price and sale_price. On items without a sale, sale_price comes back as an empty string. Your code makes the decision about which to use. The extraction doesn't fail.

Computed Values and Transformations

For values that don't exist directly in the DOM but can be computed from it, use JavaScript expressions with the js >> prefix:

result = session.extract(
    total_books="js >> document.querySelectorAll('article.product_pod').length",
    page_number="js >> document.querySelector('.current')?.textContent?.trim()",
    has_next="js >> !!document.querySelector('.next a')",
    base_url="js >> window.location.origin"
)

JavaScript expressions execute in the browser context after all JavaScript has run. This means you can access runtime values (localStorage, computed styles, JavaScript variables) that pure CSS selectors cannot reach.

| Selector Type | Can Access | Best For |
| --- | --- | --- |
| CSS + >> operators | DOM elements and attributes | Visible content, links, images |
| js >> expressions | Full browser context | Computed values, counts, runtime state |
| Array syntax ["selector"] | All matching elements | Link lists, heading lists, tag clouds |
| Table shortcut | Standard HTML tables | Tabular data with headers |

Multi-Page Extraction Pipelines

For datasets that span multiple pages, combine extraction with navigation. The pattern is: extract, check for next page, navigate, repeat.

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")

all_books = []
max_pages = 5

for page_num in range(max_pages):
    result = session.extract(
        _parent="article.product_pod",
        title="h3 a >> text",
        price=".price_color >> text",
        url="h3 a >> href"
    )
    all_books.extend(result.extraction)

    has_next = session.extract(
        next_url=".next a >> href"
    )

    if has_next.extraction.get("next_url"):
        session.click(text="next")
    else:
        break

session.close()
print(f"Extracted {len(all_books)} books across {page_num + 1} pages")

Each iteration extracts the data and checks for a "next" link. The navigation happens through Browserbeam's click step (using the interactive element), so you never construct pagination URLs manually. The session maintains state (cookies, scroll position) across page transitions.

For high-volume extraction pipelines across many pages, see the scaling web automation guide for parallel session patterns.


Example: Extracting a Products Table into JSON

Here's a complete, end-to-end example that pulls a structured product dataset from a bookstore, handles pagination, and produces clean JSON output.

import json
from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")

def extract_full_catalog(max_pages=3):
    session = client.sessions.create(url="https://books.toscrape.com")
    catalog = []

    for page in range(max_pages):
        # Extract all books on the current page
        result = session.extract(
            _parent="article.product_pod",
            title="h3 a >> text",
            price=".price_color >> text",
            stock=".instock.availability >> text",
            url="h3 a >> href",
            image="img >> src"
        )

        # Add page number to each record
        for book in result.extraction:
            book["page"] = page + 1
        catalog.extend(result.extraction)

        # Check for next page
        elements = session.page.interactive_elements
        next_btn = next(
            (e for e in elements if "next" in e.get("label", "").lower()),
            None
        )
        if next_btn:
            session.click(ref=next_btn["ref"])
        else:
            break

    session.close()
    return catalog

books = extract_full_catalog()
print(f"Total books: {len(books)}")
print(json.dumps(books[:2], indent=2))

Expected output:

[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "stock": "In stock",
    "url": "a-light-in-the-attic_1000/index.html",
    "image": "media/cache/2c/da/2cdad67c44b002e7c0cc7bf14a5d38c1.jpg",
    "page": 1
  },
  {
    "title": "Tipping the Velvet",
    "price": "£53.74",
    "stock": "In stock",
    "url": "tipping-the-velvet_999/index.html",
    "image": "media/cache/26/0c/260c6ae16bce31c8f8c95b0f6dda9f.jpg",
    "page": 1
  }
]

The schema does all the heavy lifting. No BeautifulSoup loops, no null handling, no whitespace regex. The _parent scopes each extraction to one article.product_pod, and the child selectors pull the five fields from within that scope.


Real-World Extraction Benchmarks

How well does schema-based extraction perform across different page types? Here are measurements from testing against real websites.

Accuracy Across Page Types

| Page Type | Example Site | Fields Extracted | Accuracy | Notes |
| --- | --- | --- | --- | --- |
| E-commerce listing | books.toscrape.com | title, price, stock, URL | 100% | Simple, consistent structure |
| Job board | realpython.github.io/fake-jobs/ | title, company, location, date | 100% | Card-based layout |
| News aggregator | news.ycombinator.com | headline, URL, rank | 100% | Flat table structure |
| Quote collection | quotes.toscrape.com | quote, author, tags | 100% | Consistent .quote container |
| Product detail page | books.toscrape.com (single book) | title, price, UPC, description, category | 100% | Mixed selectors (h1, table, sibling) |

Schema-based extraction reaches 100% accuracy when the selectors are correct. The engine doesn't guess or infer. It runs exact CSS queries against the rendered DOM. The accuracy question is really a schema design question: did you write the right selector for the right field?

Speed and Token Cost

| Metric | BeautifulSoup + Requests | Selenium + Manual Parse | Browserbeam Extract |
| --- | --- | --- | --- |
| Time to first result | 200-500ms (no JS) | 3-8s (browser startup) | 2-4s (first session) |
| Subsequent pages | 100-300ms each | 1-3s each | 500ms-1s each |
| Tokens to process (if LLM involved) | 15,000-25,000/page | 15,000-25,000/page | 0 (extract bypasses LLM) |
| Lines of extraction code | 30-60 | 40-80 | 3-10 |
| JavaScript support | No | Yes | Yes |

The most significant number: Browserbeam's extract step uses zero LLM tokens. The extraction happens on the server, against the rendered DOM, using CSS selectors. If your task is pure data collection (not browsing or decision-making), the LLM never needs to see the page at all. That's the fundamental cost advantage of declarative extraction.

Comparison with Manual Parsing

Here's the same extraction task implemented three ways:

| Aspect | BeautifulSoup | Cheerio (Node.js) | Browserbeam Extract |
| --- | --- | --- | --- |
| Setup | pip install beautifulsoup4 requests | npm install cheerio node-fetch | pip install browserbeam |
| Fetch + parse | requests.get() + BeautifulSoup() | fetch() + cheerio.load() | client.sessions.create() |
| Select elements | soup.select(".product") | $(".product") | _parent=".product" |
| Extract text | .text.strip() | .text().trim() | >> text (auto-trimmed) |
| Handle nulls | if el else "N/A" | el ? el.text() : "N/A" | Returns "" automatically |
| Handle JS content | Not possible | Not possible | Automatic (Chromium) |
| Total code | 40-60 lines | 35-50 lines | 5-10 lines |

The code reduction is 80-90%. But the bigger advantage is reliability. BeautifulSoup and Cheerio break on JavaScript-rendered content. They require explicit null handling for every field. And they return raw strings that need manual cleanup. The extract engine handles all three automatically.


Tips for Complex Data Structures

Some pages have data structures that don't map neatly to flat arrays. Here's how to handle the tricky cases.

Nested repeating elements. A page with categories, each containing products:

result = session.extract(
    _parent=".category-section",
    category_name="h2 >> text",
    products=[{
        "_parent": ".product-card",
        "name": "h3 >> text",
        "price": ".price >> text"
    }]
)

The outer _parent finds each category section. The inner _parent (nested inside the array) finds products within each section. The result is an array of category objects, each with a nested array of products.
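Based on that description, the result for the schema above would take this shape (category and product values invented for illustration):

```
[
  {
    "category_name": "Fiction",
    "products": [
      {"name": "Book A", "price": "£10.00"},
      {"name": "Book B", "price": "£12.50"}
    ]
  },
  {
    "category_name": "Poetry",
    "products": [
      {"name": "Book C", "price": "£8.00"}
    ]
  }
]
```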

Extracting from multiple page regions. Use separate extract calls for different sections:

header_data = session.extract(
    title="h1 >> text",
    breadcrumb=".breadcrumb >> text"
)

product_data = session.extract(
    _parent="article.product_pod",
    _limit=10,
    title="h3 a >> text",
    price=".price_color >> text"
)

sidebar_data = session.extract(
    categories=[".side_categories li a >> text"]
)

Three extract calls, each focused on a different page region. Combine the results in your application code. This is more readable and maintainable than trying to capture everything in a single massive schema.

Array fields without _parent. Use bracket syntax to collect all matches of a selector:

result = session.extract(
    all_links=["a >> href"],
    all_headings=["h2 >> text"],
    all_images=["img >> src"]
)

Each field returns an array of all matching values on the page. This is useful for auditing (find all links), analysis (list all headings), or bulk collection (download all images).


Debugging Extraction Issues

When an extraction doesn't return what you expect, the problem is almost always in the schema, not the engine. Here's a systematic approach to finding and fixing issues.

Schema Validation Errors

If your CSS selector has a syntax error, the engine returns an error response instead of data. Common mistakes:

| Symptom | Cause | Fix |
| --- | --- | --- |
| invalid_selector error response | Malformed CSS | Check bracket matching, colon placement |
| Field returns HTML instead of text | Missing >> operator | Add >> text or the appropriate operator |
| Field returns the wrong value | Wrong attribute name (>> title instead of >> text) | Check the operator table above |

Test selectors in your browser's DevTools console before putting them in a schema. document.querySelectorAll("your-selector") shows you exactly what matches.

Missing or Incorrect Fields

When a field returns an empty string and you expected data:

  1. Check the selector scope. If you're using _parent, the child selectors run within each parent, not against the full page. A selector like h1 >> text inside a _parent=".product-card" scope looks for h1 inside .product-card, not the page's main h1.

  2. Check selector specificity. .price >> text matches the first element with class price. If the price is inside .price-container .current-price, your selector needs to be more specific.

  3. Check dynamic rendering. Some content loads asynchronously after the initial render. Browserbeam waits for stability, but extremely delayed content (loaded on scroll or on hover) might not be present yet. Use the page.stable signal to verify.

Dynamic Content Not Captured

If the page has content that loads on user interaction (clicking a tab, scrolling to a section, hovering over an element), you need to trigger that interaction before extracting.

session = client.sessions.create(url="https://books.toscrape.com")

# Content behind a tab or click
session.click(ref="e5")

# Now extract from the newly visible content
result = session.extract(
    details="#product_description + p >> text"
)

The pattern is: navigate, interact, wait for stability, then extract. Each action triggers an auto-observation, so the page state is always fresh after a click or fill.

Using page.map to Verify Page State

Before writing an extraction schema, inspect the page structure with page.map. This lightweight section map shows you what regions the page contains and gives hints about their content.

session = client.sessions.create(url="https://books.toscrape.com")

if session.page.map:
    for section in session.page.map:
        print(f"[{section['tag']}] {section.get('hint', 'no hint')}")

# Output:
# [nav] Navigation bar with links
# [main] Product listings with 20 items
# [aside] Category sidebar with 50 categories
# [footer] Site footer with links

The page map tells you where data lives before you write selectors. If you expected product data in the main section but the map shows it in an aside, your selectors are targeting the wrong region.

For AI agents that build schemas dynamically, the page map is the input that tells the agent what sections to target. The agent reads the map, picks the relevant section, and constructs a schema for that region.


Common Extraction Mistakes

Five patterns that produce poor or no results, and how to fix each one.

Over-Specifying Schemas

Writing overly specific selectors that break on minor changes:

Bad: div.container > div.row > div.col-md-8 > article:first-child > h3 > a >> text

Good: article.product_pod h3 a >> text

The bad selector traces the exact DOM path and breaks when any wrapper div changes. The good selector targets the structural elements (article, h3, a) that carry semantic meaning and survive layout changes.

Rule of thumb: Use the shortest selector that uniquely identifies the element. If .price >> text works, don't write .product-card .price-wrapper .price-display span.price-value >> text.

Extracting Before Page Stability

Running extract immediately after navigation, before JavaScript has finished rendering:

# This works with Browserbeam (stability detection is automatic)
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(title="h1 >> text")

# With Selenium, this would often fail:
# driver.get("https://books.toscrape.com")
# title = driver.find_element(By.CSS_SELECTOR, "h1").text  # might be empty

Browserbeam handles this automatically. The sessions.create call waits for stability before returning. And if you're navigating within a session (goto, click), the subsequent page state is also auto-observed after each action. You never need explicit waits or sleep timers.

Not Using page.map for Discovery

Jumping straight to extraction without understanding the page structure. Developers guess selectors, get empty results, tweak, retry. The cycle wastes time.

Fix: Always check page.map first. Read the markdown content to understand the page layout. Then write selectors based on what you can see, not what you assume.

Ignoring Extraction Confidence Signals

Not checking whether extraction returned meaningful data. An empty string might mean the selector is wrong, or it might mean the field genuinely doesn't exist on this page.

Fix: After extraction, validate the results:

result = session.extract(
    _parent="article.product_pod",
    _limit=3,
    title="h3 a >> text",
    price=".price_color >> text"
)

if not result.extraction:
    print("Warning: _parent selector matched nothing")
elif any(not item["title"] for item in result.extraction):
    print("Warning: some titles are empty, check the title selector")
else:
    print(f"Extracted {len(result.extraction)} items successfully")

This short check catches the two most common problems: a wrong _parent (empty array) and wrong field selectors (empty strings in otherwise populated items).

Treating Extract Like a CSS Selector Engine

Trying to use extract for page interaction or state management instead of data collection. Extract is read-only. It pulls data from the DOM but doesn't modify it.

For clicking, filling, and navigating, use Browserbeam's session actions (click, fill, select, goto). For reading page content, use observe (markdown) or extract (structured JSON). Each tool has a specific purpose:

Tool Purpose Returns
observe Read page content as markdown Text + interactive elements
extract Pull structured data by schema JSON matching your schema
click / fill / select Interact with the page Updated page state
goto Navigate to a new URL New page state
scroll_collect Auto-scroll and collect all content Full page text

Frequently Asked Questions

How do I extract data from a website as JSON?

Define a schema mapping field names to CSS selector expressions using Browserbeam's extract step. For example, title="h1 >> text" extracts the page title, and _parent=".product-card" with child selectors extracts a list of products. The result comes back as clean JSON. One API call replaces the entire fetch-parse-extract workflow of tools like BeautifulSoup or Cheerio.

What is the difference between extract and observe in Browserbeam?

Observe returns the full page as markdown text with interactive elements, which is useful for browsing and LLM-based decision making. Extract returns specific fields as structured JSON based on a declarative schema, which is ideal for data collection. Use observe when your agent needs to reason about the page. Use extract when you know exactly what data you want.

Can Browserbeam extract data from JavaScript-rendered pages?

Yes. Browserbeam runs a full Chromium browser and waits for JavaScript execution and page stability before extraction. Content rendered by React, Vue, Angular, or any client-side framework is fully available. This is a major advantage over static parsers like BeautifulSoup and Scrapy, which only see the initial HTML response.

How do I handle pages where elements are sometimes missing?

The extract engine returns empty strings for missing selectors instead of throwing errors. Include all possible fields in your schema (regular price, sale price, badge, discount percentage). On pages where some fields don't exist, those fields return "" and the rest populate normally. Check for empty strings in your application code to handle conditional fields.

What is the _parent key in an extraction schema?

_parent defines a CSS selector for repeating container elements. The engine finds all matching containers and extracts child fields within each one. For a product listing, _parent="article.product_pod" finds every product card, and fields like title="h3 a >> text" run within each card's scope. The result is a JSON array of objects.

How does automated data extraction compare to manual web scraping?

Schema-based extraction with Browserbeam uses 80-90% fewer lines of code than manual parsing with BeautifulSoup or Cheerio. It eliminates null handling (automatic empty strings), whitespace cleaning (automatic trimming), and JavaScript rendering (Chromium handles it). A typical extraction schema is 3-10 lines versus 30-60 lines of manual parsing code.

Can I extract data from HTML tables automatically?

Yes. Pass a table selector without child fields (table_data="table.results"), and the engine auto-parses the table using header cells as field names. Each row becomes a JSON object with keys matching the column headers. For non-standard tables, use _parent with explicit field selectors.

What selectors work with the extract engine?

The extract engine supports standard CSS selectors: class (.class), ID (#id), tag (h1), attribute ([data-id]), pseudo-selectors (:nth-child(), :first-child), combinators (>, +, ~), and compound selectors (article.product_pod h3 a). It also supports js >> expressions for JavaScript-computed values. XPath is not supported; use CSS selectors or JavaScript expressions instead.


Conclusion

The extract engine turns a 60-line parsing problem into a 5-line schema definition. That's not just a code reduction. It changes which part of the system you maintain.

With imperative parsing, you maintain code: loops, null checks, whitespace regex, type converters. When the site changes, you update code. With declarative extraction, you maintain a schema: field names mapped to selectors. When the site changes, you update one selector. The engine handles loops, nulls, whitespace, and types.

Now that you understand how the engine processes schemas, how selectors resolve against the rendered DOM, and how to debug when things go wrong, you can design extraction schemas that work across page types and survive site updates.

Start with the Browserbeam API docs for the full extract reference. Try the Python SDK to run your first extraction, or explore the structured web scraping guide for complete scraping workflows with pagination and CSV export.

pip install browserbeam        # Python
npm install @browserbeam/sdk   # TypeScript
gem install browserbeam        # Ruby

Sign up for a free account, write your first schema, and watch it return clean JSON from a page that would have taken 60 lines of BeautifulSoup code.
