Building a Web Scraping Agent in Python with Browserbeam

April 03, 2026 24 min read

By the end of this tutorial, you'll have a Python web scraping agent that logs into a website, navigates through authenticated pages, extracts structured data, handles pagination, and exports the results to JSON. All without installing a browser, writing HTML parsers, or managing WebDriver binaries.

We'll build it step by step: set up the SDK, create sessions, navigate and interact with pages, handle login flows, extract data with schemas, and wire up the patterns that make a scraper reliable in production. Every code example runs as-is against real websites.

This guide is for Python developers who want to build web scraping agents that go beyond simple HTTP requests. If your target pages require JavaScript rendering, authentication, or multi-step navigation, you need a real browser. Browserbeam gives you one through a Python API.

In this tutorial, we'll build:

  • A configured Python client with proper error handling and session management
  • A login-and-scrape workflow that authenticates and extracts protected data
  • A multi-page scraping pipeline with pagination and data export
  • Three production agent recipes: price monitor, lead builder, and change detector
  • A debugging toolkit for diagnosing extraction failures
  • A migration path from Selenium and Playwright with side-by-side code
  • Performance-optimized patterns with async, session reuse, and call minimization

TL;DR: Build a web scraping agent in Python using the Browserbeam SDK. Install with pip install browserbeam, create sessions with client.sessions.create(url=...), interact via click, fill, and extract methods using stable element refs. The SDK handles browser management, JavaScript rendering, and page stability detection. Supports both sync and async, with typed error handling for production use.


Setting Up Browserbeam's Python SDK

Let's start with installation and configuration. The setup takes 30 seconds.

pip install browserbeam

No browser binary, no ChromeDriver, no system dependencies. The SDK talks to Browserbeam's cloud browsers over HTTPS.

Set your API key as an environment variable so every script picks it up automatically:

export BROWSERBEAM_API_KEY="your_api_key_here"

Now create a client:

from browserbeam import Browserbeam

client = Browserbeam()

That's the full setup. For scripts that need explicit configuration:

client = Browserbeam(
    api_key="your_api_key_here",
    timeout=120.0
)

Option    Default                      What It Controls
api_key   BROWSERBEAM_API_KEY env var  Authentication
base_url  https://api.browserbeam.com  API endpoint
timeout   120.0                        HTTP request timeout (seconds)

If you're new to the SDK, the getting started guide covers every method in detail. This tutorial focuses on building a complete agent, so we'll move fast through the basics.


Creating and Managing Browser Sessions

A session is an isolated browser tab in the cloud. It has its own cookies, local storage, and viewport. Let's create one and explore what comes back.

session = client.sessions.create(
    url="https://books.toscrape.com",
    auto_dismiss_blockers=True
)

print(f"Title: {session.page.title}")
print(f"Stable: {session.page.stable}")
print(f"Elements: {len(session.page.interactive_elements)}")

That single call creates a Chromium instance, navigates to the URL, waits for stability, and returns the full page state. The auto_dismiss_blockers option handles cookie consent banners without extra code.

Session Options and Configuration

Sessions accept options that control viewport, locale, proxy, and resource blocking:

session = client.sessions.create(
    url="https://books.toscrape.com",
    viewport={"width": 1280, "height": 720},
    locale="en-US",
    timezone="America/New_York",
    block_resources=["image", "font"],
    auto_dismiss_blockers=True,
    timeout=300
)

Option                 What It Does                          When to Use
viewport               Sets browser window size              Responsive content, mobile scraping
locale                 Sets browser language/region          Geo-targeted content
timezone               Sets browser timezone                 Location-specific pricing, events
proxy                  Routes traffic through a proxy        Geo-targeting, anonymity
block_resources        Blocks image/font/stylesheet loading  Faster scraping, lower bandwidth
auto_dismiss_blockers  Dismisses cookie banners              Almost always True for scraping
timeout                Session timeout in seconds            Long-running workflows

Pro tip: Set block_resources=["image", "font"] for scraping tasks. It speeds up page loads by 30-50% and reduces bandwidth. Only skip this for tasks where visual rendering matters (screenshots, visual verification).

Once you have a session, use goto to navigate within it. Cookies and local storage persist across navigations:

session.goto(url="https://books.toscrape.com/catalogue/page-2.html")
print(f"New page: {session.page.title}")

For JavaScript-heavy pages that render content after the initial load, add wait conditions:

session.goto(
    url="https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    wait_until="document.querySelectorAll('article.product_pod').length > 0",
    wait_timeout=10000
)

The wait_until parameter takes a JavaScript expression that must become truthy before the navigation completes. This is more precise than Browserbeam's default stability detection for pages with complex loading sequences.

Reading Page State with observe()

After any action (click, fill, goto), the SDK auto-observes the page. But you can also manually refresh the page state:

# Default: main content area
state = session.observe()

# Full page including nav, sidebar, footer
full_state = session.observe(mode="full", max_text_length=20000)

# Just a specific section
scoped = session.observe(scope="#product-details")

The page.map property (included on the first observe) gives you a structural outline of the page:

for section in session.page.map:
    print(f"[{section['tag']}] {section.get('hint', '')}")

# [nav] Navigation bar with links
# [main] Product listings with 20 items
# [aside] Category sidebar with 50 categories
# [footer] Site footer

This is invaluable when scraping an unfamiliar site. You see the page structure before writing a single selector.


Example: Login and Scrape Data

Most valuable data sits behind a login wall. Let's build a scraper that authenticates, navigates to a protected page, and extracts data.

Filling Login Forms

We'll use the quotes.toscrape.com login page, which has a simple username/password form:

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com/login")

# Check the page state to find form fields
for el in session.page.interactive_elements:
    print(f"  {el['ref']}: [{el['tag']}] {el.get('label', el.get('type', 'unknown'))}")

# Fill the login form
session.fill(ref="e1", value="testuser")
session.fill(ref="e2", value="testpassword")
session.click(ref="e3")  # Login button

print(f"After login: {session.page.title}")
print(f"URL: {session.url}")

After the click, Browserbeam navigates to the post-login page, waits for stability, and returns the new page state. The session now carries the authentication cookies for all subsequent requests.

For forms with many fields, fill_form handles everything in one call:

session.fill_form(
    fields={
        "Username": "testuser",
        "Password": "testpassword"
    },
    submit=True
)

Extracting Data After Authentication

With the session authenticated, we can navigate to protected pages and extract data. The cookies persist across goto calls:

# We're now logged in. Extract quotes from the authenticated view.
result = session.extract(
    _parent=".quote",
    text=".text >> text",
    author=".author >> text",
    tags=".keywords >> content"
)

print(f"Found {len(result.extraction)} quotes")
for quote in result.extraction[:3]:
    print(f"  \"{quote['text'][:50]}...\" - {quote['author']}")

The extraction schema works the same whether or not the page required authentication. The session handles the auth state. The schema handles the data.

Handling Multi-Page Flows

Many web scraping tasks require navigating through multiple pages after login. Here's the pattern: authenticate once, then loop through pages:

from browserbeam import Browserbeam
import json

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com/login")

# Authenticate
session.fill_form(fields={"Username": "testuser", "Password": "testpassword"}, submit=True)

# Scrape multiple pages
all_quotes = []
max_pages = 5

for page_num in range(max_pages):
    result = session.extract(
        _parent=".quote",
        text=".text >> text",
        author=".author >> text",
        tags=".keywords >> content"
    )
    all_quotes.extend(result.extraction)
    print(f"Page {page_num + 1}: {len(result.extraction)} quotes")

    # Check for next page link
    elements = session.page.interactive_elements
    next_btn = next(
        (e for e in elements if "next" in e.get("label", "").lower()),
        None
    )
    if next_btn:
        session.click(ref=next_btn["ref"])
    else:
        break

session.close()
print(f"\nTotal: {len(all_quotes)} quotes")

# Export to JSON
with open("quotes.json", "w") as f:
    json.dump(all_quotes, f, indent=2)

One session, one login, multiple pages. The auth cookies carry through every click navigation. We find the "next" button by scanning the interactive elements list for a label containing "next", which works regardless of the button's CSS class or element type.
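The next-button lookup in the loop above is plain Python over the interactive elements list, so it can be factored into a reusable helper. A minimal sketch, assuming only the ref/label dict shape the earlier examples print:

```python
def find_next_button(elements):
    """Return the first interactive element whose label mentions 'next', or None."""
    for el in elements:
        if "next" in el.get("label", "").lower():
            return el
    return None


# Element dicts in the shape the interactive_elements examples show
elements = [
    {"ref": "e1", "tag": "a", "label": "previous"},
    {"ref": "e7", "tag": "a", "label": "Next page"},
]
print(find_next_button(elements)["ref"])  # -> e7
```

Returning None instead of raising keeps the pagination loop's `if next_btn:` check simple.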


Leveraging Browserbeam's Stability and Diff Features

Two features make Python web scraping with Browserbeam particularly reliable: stability detection and diff tracking. Understanding how they work helps you build agents that don't break on slow or dynamic pages.

Stability detection monitors three signals: network idle (no pending HTTP requests for 300ms), DOM quiet (no DOM mutations for 500ms), and CSS animations complete. The session.page.stable flag is True only when all three conditions pass. This replaces time.sleep() and Selenium's WebDriverWait:

session = client.sessions.create(url="https://books.toscrape.com")

# No sleep needed. page.stable is already True by the time create() returns.
print(f"Stable: {session.page.stable}")

session.click(ref="e1")
# After the click, the SDK waits for stability before returning.
# session.page now reflects the fully-loaded new state.
print(f"Still stable: {session.page.stable}")

Diff tracking reduces data transfer and token costs on multi-step workflows. After the first observation, subsequent observations include a changes object showing what's different:

session = client.sessions.create(url="https://books.toscrape.com")

# First observation: full page state
print(f"Full content length: {len(session.page.markdown.content)}")

# Click a book
session.click(ref="e1")

# Second observation: changes only
if session.page.changes:
    print(f"Content changed: {session.page.changes.get('content_changed', False)}")
    print(f"Elements added: {session.page.changes.get('elements_added', 0)}")
    print(f"Elements removed: {session.page.changes.get('elements_removed', 0)}")

For AI-powered agents that send page state to an LLM, diff tracking cuts token usage by 60-80% on multi-step workflows. Instead of re-reading the full page after every click, the agent only processes what changed.


Common Agent Patterns and Recipes

These three recipes solve the most common Python web scraping use cases. Each one is a complete, runnable script.

Price Monitoring Bot

Track prices across multiple product pages and detect changes over time:

from browserbeam import Browserbeam
import json
from pathlib import Path

client = Browserbeam()

products = [
    {"name": "A Light in the Attic", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
    {"name": "Tipping the Velvet", "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"},
    {"name": "Soumission", "url": "https://books.toscrape.com/catalogue/soumission_998/index.html"},
]

baseline_file = Path("prices.json")
baseline = json.loads(baseline_file.read_text()) if baseline_file.exists() else {}
changes = []

for product in products:
    session = client.sessions.create(url=product["url"])
    result = session.extract(
        price=".price_color >> text",
        stock=".instock.availability >> text"
    )
    session.close()

    current = result.extraction["price"]
    previous = baseline.get(product["name"])

    if previous and current != previous:
        changes.append(f"{product['name']}: {previous} -> {current}")

    baseline[product["name"]] = current

baseline_file.write_text(json.dumps(baseline, indent=2))

if changes:
    print("Price changes detected:")
    for change in changes:
        print(f"  {change}")
else:
    print("No price changes")

Schedule this with cron or a task scheduler. Add notification logic (Slack webhook, email) for the price change alert, and you have a production monitoring system.
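As a sketch of that notification step, you could post the change list to a Slack incoming webhook using only the standard library. The webhook URL and message format here are assumptions for illustration, not part of the SDK:

```python
import json
import urllib.request


def build_alert_payload(changes):
    """Format the change list as a Slack incoming-webhook payload."""
    lines = "\n".join(f"- {change}" for change in changes)
    return {"text": f"Price changes detected:\n{lines}"}


def notify_slack(webhook_url, changes):
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_alert_payload(changes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Call notify_slack(webhook_url, changes) in place of the print loop whenever changes is non-empty.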

Lead List Builder

Visit multiple websites, extract structured information, and build a lead database:

from browserbeam import Browserbeam
import json

client = Browserbeam()

sources = [
    {"name": "Fake Jobs", "url": "https://realpython.github.io/fake-jobs/"},
    {"name": "Books to Scrape", "url": "https://books.toscrape.com"},
    {"name": "Quotes", "url": "https://quotes.toscrape.com"},
]

leads = []

for source in sources:
    session = None
    try:
        session = client.sessions.create(
            url=source["url"],
            auto_dismiss_blockers=True
        )

        # Extract page-level info
        page_info = session.extract(
            title="h1 >> text",
            description="meta[name=description] >> content",
            all_links=["a >> href"]
        )

        leads.append({
            "source": source["name"],
            "url": source["url"],
            "title": page_info.extraction.get("title", ""),
            "description": page_info.extraction.get("description", ""),
            "link_count": len(page_info.extraction.get("all_links", []))
        })
    except Exception as e:
        print(f"Skipping {source['name']}: {e}")
    finally:
        if session:
            session.close()

with open("leads.json", "w") as f:
    json.dump(leads, f, indent=2)

print(f"Built {len(leads)} leads")
for lead in leads:
    print(f"  {lead['source']}: {lead['title']} ({lead['link_count']} links)")

Each session is independent. If one site fails, the others continue. The auto_dismiss_blockers option handles cookie consent popups automatically.

Content Change Detection

Monitor pages for content changes by comparing the current markdown output against a stored baseline:

from browserbeam import Browserbeam
import hashlib
import json
from pathlib import Path

client = Browserbeam()

pages = [
    {"name": "Hacker News", "url": "https://news.ycombinator.com"},
    {"name": "Quotes", "url": "https://quotes.toscrape.com"},
]

state_file = Path("content_hashes.json")
stored = json.loads(state_file.read_text()) if state_file.exists() else {}
changes = []

for page in pages:
    session = client.sessions.create(url=page["url"])
    content = session.page.markdown.content
    session.close()

    content_hash = hashlib.md5(content.encode()).hexdigest()
    previous_hash = stored.get(page["name"])

    if previous_hash and content_hash != previous_hash:
        changes.append(page["name"])

    stored[page["name"]] = content_hash

state_file.write_text(json.dumps(stored, indent=2))

if changes:
    print(f"Content changed on: {', '.join(changes)}")
else:
    print("No content changes detected")

This uses Browserbeam's markdown output as the content fingerprint. Markdown strips scripts, styles, and layout markup, so you're comparing actual content changes, not CSS updates or tracking pixel rotations.
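One optional refinement: normalize the markdown before hashing so trivial whitespace shifts don't register as changes. The normalization rules below are an assumption about what you want to ignore, not SDK behavior:

```python
import hashlib


def content_fingerprint(markdown):
    """Hash markdown content, ignoring blank lines and leading/trailing whitespace."""
    lines = (line.strip() for line in markdown.splitlines())
    normalized = "\n".join(line for line in lines if line)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Swap this in for the raw hashlib.md5 call in the script above to cut down on false-positive change alerts.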


Debugging and Troubleshooting

When a scraper doesn't return the data you expect, here's a systematic approach to finding and fixing the problem.

Connection Errors

If sessions.create() fails with a connection error, check these first:

Symptom              Likely Cause                    Fix
ConnectionError      No internet or API down         Check api.browserbeam.com status
AuthenticationError  Invalid API key                 Verify BROWSERBEAM_API_KEY env var
RateLimitError       Too many concurrent sessions    Wait for retry_after seconds
TimeoutError         Session creation took too long  Increase timeout parameter

Wrap session creation in error handling:

from browserbeam import Browserbeam, RateLimitError
import time

client = Browserbeam()

def create_session_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.sessions.create(url=url)
        except RateLimitError as e:
            wait = getattr(e, "retry_after", 5)
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    # All attempts were rate limited; fail loudly instead of returning None
    raise RuntimeError(f"Could not create session for {url} after {max_retries} attempts")

session = create_session_with_retry("https://books.toscrape.com")

Timeout and Stability Issues

If a page takes too long to stabilize (the create or goto call hangs), the page might have:

  • Continuous network polling (analytics, chat widgets, ad refreshes)
  • CSS animations that never complete
  • WebSocket connections that keep the "network idle" signal from firing

Fix: Use wait_for or wait_until to override default stability detection with a specific condition:

session = client.sessions.create(
    url="https://books.toscrape.com",
    timeout=30
)

# If a goto hangs, use explicit wait conditions
session.goto(
    url="https://books.toscrape.com/catalogue/page-2.html",
    wait_for="article.product_pod",
    wait_timeout=10000
)

The wait_for parameter takes a CSS selector. The wait_until parameter takes a JavaScript expression. Both are faster and more reliable than relying solely on stability detection for complex pages.

Unexpected Extraction Results

When extract returns empty strings or wrong data:

  1. Check the selector. Open your browser's DevTools, run document.querySelectorAll("your-selector") in the console, and verify it matches the right elements.

  2. Check the scope. If using _parent, child selectors run within each parent scope. "h1 >> text" inside _parent=".product-card" looks for h1 inside .product-card, not the page's main heading.

  3. Check stability. Is the content dynamically loaded after the initial render? Print session.page.stable to verify the page was ready when extraction ran.

# Debug an extraction issue
session = client.sessions.create(url="https://books.toscrape.com")

# First, see what the page contains
print(f"Stable: {session.page.stable}")
print(f"Content preview:\n{session.page.markdown.content[:500]}")

# Try a small extraction to test selectors
test = session.extract(
    _parent="article.product_pod",
    _limit=1,
    title="h3 a >> text",
    price=".price_color >> text"
)
print(f"Test result: {test.extraction}")
session.close()

Reading page.map for Diagnosis

The page.map reveals the page's structural layout. When your selectors return nothing, the content might be in a different section than you expect:

session = client.sessions.create(url="https://books.toscrape.com")

print("Page sections:")
for section in session.page.map:
    print(f"  [{section['tag']}] {section.get('hint', 'no hint')}")

print(f"\nInteractive elements ({len(session.page.interactive_elements)}):")
for el in session.page.interactive_elements[:10]:
    print(f"  {el['ref']}: [{el['tag']}] {el.get('label', 'unlabeled')}")

session.close()

If page.map shows the data in an aside section but your selector targets main, that explains the empty results. Adjust your selector or use the scope parameter in observe() to focus on the right section.


Migrating from Selenium/Playwright

If you have existing scrapers built with Selenium or Playwright, here's how to translate them to Browserbeam.

Selenium to Browserbeam

Selenium Pattern                           Browserbeam Equivalent
webdriver.Chrome()                         client.sessions.create(url=...)
driver.get(url)                            session.goto(url=...)
driver.find_element(By.CSS_SELECTOR, sel)  session.click(ref="e1") or session.click(text="...")
element.send_keys("text")                  session.fill(ref="e1", value="text")
WebDriverWait(driver, 10).until(...)       Automatic (stability detection)
driver.page_source                         session.page.markdown.content
driver.quit()                              session.close()

Playwright to Browserbeam

Playwright Pattern                      Browserbeam Equivalent
browser = playwright.chromium.launch()  client = Browserbeam()
page = browser.new_page()               session = client.sessions.create(url=...)
page.goto(url)                          session.goto(url=...)
page.click("selector")                  session.click(ref="e1")
page.fill("selector", "value")          session.fill(ref="e1", value="value")
page.wait_for_selector("sel")           Automatic, or wait_for="sel" on goto
page.content()                          session.page.markdown.content
browser.close()                         session.close()

Side-by-Side Code Comparison

The same task (scrape 3 pages of book data) in all three frameworks:

Selenium (45 lines):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
all_books = []

try:
    driver.get("https://books.toscrape.com")

    for page_num in range(3):
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.product_pod"))
        )
        articles = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")

        for article in articles:
            title_el = article.find_element(By.CSS_SELECTOR, "h3 a")
            price_el = article.find_element(By.CSS_SELECTOR, ".price_color")
            title = title_el.get_attribute("title") or title_el.text.strip()
            price = price_el.text.strip()
            all_books.append({"title": title, "price": price})

        try:
            next_btn = driver.find_element(By.CSS_SELECTOR, ".next a")
            next_btn.click()
        except Exception:
            break
finally:
    driver.quit()

print(f"Scraped {len(all_books)} books")

Playwright (35 lines):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    all_books = []

    page.goto("https://books.toscrape.com")

    for page_num in range(3):
        page.wait_for_selector("article.product_pod")
        articles = page.query_selector_all("article.product_pod")

        for article in articles:
            title = article.query_selector("h3 a").get_attribute("title") or ""
            price = article.query_selector(".price_color").text_content().strip()
            all_books.append({"title": title, "price": price})

        next_btn = page.query_selector(".next a")
        if next_btn:
            next_btn.click()
            page.wait_for_load_state("networkidle")
        else:
            break

    browser.close()
    print(f"Scraped {len(all_books)} books")

Browserbeam (20 lines):

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []

for page_num in range(3):
    result = session.extract(
        _parent="article.product_pod",
        title="h3 a >> text",
        price=".price_color >> text"
    )
    all_books.extend(result.extraction)

    elements = session.page.interactive_elements
    next_btn = next((e for e in elements if "next" in e.get("label", "").lower()), None)
    if next_btn:
        session.click(ref=next_btn["ref"])
    else:
        break

session.close()
print(f"Scraped {len(all_books)} books")

Metric              Selenium                         Playwright                                Browserbeam
Lines of code       ~45                              ~35                                       ~20
Browser management  Install Chrome + ChromeDriver    Install Playwright browsers               None (cloud)
Wait logic          Explicit WebDriverWait           wait_for_selector / wait_for_load_state   Automatic
Element targeting   CSS selectors                    CSS selectors                             Refs or text
Data extraction     Manual .text, .get_attribute()   Manual .text_content(), .get_attribute()  Declarative schema
Null handling       Manual try/except per element    Manual or "" per element                  Automatic (empty strings)

The biggest difference isn't lines of code. It's maintenance. When books.toscrape.com changes its markup, the Selenium and Playwright versions need selector updates and null-check adjustments. The Browserbeam version needs at most a schema field update.


Performance Tips

Three patterns that make your Python web scraping scripts faster and cheaper.

Async vs Sync Client

For single-page scrapes, sync is fine. For batch processing (10+ URLs), async cuts total runtime significantly by running sessions in parallel:

import asyncio
from browserbeam import AsyncBrowserbeam

client = AsyncBrowserbeam()

async def scrape_book(url):
    session = await client.sessions.create(url=url)
    result = await session.extract(
        title="h1 >> text",
        price=".price_color >> text",
        stock=".instock.availability >> text"
    )
    await session.close()
    return result.extraction

async def main():
    urls = [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        "https://books.toscrape.com/catalogue/soumission_998/index.html",
    ]
    results = await asyncio.gather(*[scrape_book(url) for url in urls])
    for book in results:
        print(f"{book['title']}: {book['price']}")

asyncio.run(main())

Three pages scraped concurrently. Each session is isolated, so there's no shared state or cookie conflicts. For more on parallel processing patterns, see the scaling web automation guide.
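At larger batch sizes, an unbounded asyncio.gather can hit your concurrent-session limit and trigger RateLimitError. A small generic helper (plain asyncio, independent of the SDK) caps how many coroutines run at once:

```python
import asyncio


async def gather_limited(coros, limit=5):
    """Run coroutines with at most `limit` in flight at once, preserving order."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

You could then call asyncio.run(gather_limited([scrape_book(u) for u in urls], limit=5)) in place of the bare gather; the limit value is an assumption you should match to your plan's session cap.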

Session Reuse Patterns

Creating a new session for every page is expensive. If you're scraping multiple pages on the same site, reuse a single session:

session = client.sessions.create(url="https://books.toscrape.com")

# Reuse the same session for multiple pages
urls = [
    "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    "https://books.toscrape.com/catalogue/category/books/romance_8/index.html",
    "https://books.toscrape.com/catalogue/category/books/science_22/index.html",
]

for url in urls:
    session.goto(url=url)
    result = session.extract(
        _parent="article.product_pod",
        _limit=5,
        title="h3 a >> text",
        price=".price_color >> text"
    )
    print(f"{url.split('/')[-2]}: {len(result.extraction)} books")

session.close()

One session, three goto calls. Cookies persist, the browser context stays warm, and you skip session creation overhead for pages 2 and 3.

When to create new sessions vs reuse:

Scenario                         Approach
Same site, multiple pages        Reuse session (goto)
Different sites                  New session per site
Need fresh cookies/state         New session
Long-running workflow (30+ min)  New session (avoids timeout)

Minimizing API Calls

Every method call is an API round trip. Reduce calls by combining operations:

# Instead of: create + observe + extract (3 calls)
session = client.sessions.create(url="https://books.toscrape.com")
# create() already includes an observation, so page state is ready

# Extract immediately (1 additional call)
result = session.extract(
    _parent="article.product_pod",
    _limit=5,
    title="h3 a >> text",
    price=".price_color >> text"
)
session.close()

You don't need to call observe() after create(). The creation response already includes the full page state. Similarly, after click() or fill(), the page state auto-updates. Explicit observe() calls are only needed when you want to change the observation mode or scope.


Common Mistakes

Five patterns that trip up developers building Python web scraping agents.

Not Closing Sessions

Open sessions consume cloud resources and keep billing active. If your script crashes without calling session.close(), the session stays open until it times out.

Fix: Use try/finally or a context manager pattern:

session = client.sessions.create(url="https://books.toscrape.com")
try:
    result = session.extract(title="h1 >> text")
    print(result.extraction)
finally:
    session.close()
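If you prefer a context manager, the try/finally can be wrapped once with contextlib. This is a sketch in user code, assuming only the create and close calls shown above:

```python
from contextlib import contextmanager


@contextmanager
def browser_session(client, url, **options):
    """Create a session and guarantee close(), even if the body raises."""
    session = client.sessions.create(url=url, **options)
    try:
        yield session
    finally:
        session.close()


# Usage:
# with browser_session(client, "https://books.toscrape.com") as session:
#     result = session.extract(title="h1 >> text")
```

Every script that uses the wrapper gets cleanup for free, with no try/finally boilerplate at each call site.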

Ignoring page.stable Signal

Extracting from a page that hasn't finished loading returns partial or empty data. Browserbeam handles this automatically (it waits for stability before returning), but if you're using explicit goto with custom wait_for conditions, check session.page.stable after navigation.

Overusing observe() Instead of extract()

A common anti-pattern is calling observe() to get the markdown, then parsing it in Python to find specific data. This wastes time and adds fragile parsing code.

Fix: Use extract() with a schema. Let Browserbeam's engine find the data on the server side. You get clean JSON instead of markdown that you need to parse yourself.

# Don't do this:
content = session.page.markdown.content
# ... 20 lines of regex or string parsing ...

# Do this:
result = session.extract(
    title="h1 >> text",
    price=".price_color >> text"
)
# result.extraction is clean JSON

Missing Error Handling

A single failed click() or extract() call crashes the entire script. Production scrapers need error handling around every interaction.

Fix: Wrap interactions in try/except and use the SDK's typed exceptions:

from browserbeam import Browserbeam, RateLimitError, SessionNotFoundError

try:
    session = client.sessions.create(url="https://books.toscrape.com")
    session.click(ref="e99")
except RateLimitError as e:
    print(f"Rate limited. Retry after {e.retry_after}s")
except SessionNotFoundError:
    print("Session expired. Creating new one...")
    session = client.sessions.create(url="https://books.toscrape.com")
except Exception as e:
    print(f"Unexpected error: {e}")

Hardcoding Selectors Instead of Using Refs

Another anti-pattern is writing session.click(text="Submit") everywhere instead of using the element refs from the page state. Text matching is fragile: button labels change, get translated, or become icons.

Fix: Read the interactive elements list and use refs:

for el in session.page.interactive_elements:
    if el["tag"] == "button" and "submit" in el.get("label", "").lower():
        session.click(ref=el["ref"])
        break

Refs are stable within a session. They don't depend on text content or CSS classes.


Frequently Asked Questions

How do I get started with web scraping in Python using Browserbeam?

Install the SDK with pip install browserbeam, set your API key as the BROWSERBEAM_API_KEY environment variable, and create a client with Browserbeam(). Then call client.sessions.create(url="https://...") to open a browser session. Use session.extract() with a declarative schema to pull structured data. See the full Python SDK guide for detailed setup.

Can Browserbeam handle JavaScript-rendered pages?

Yes. Every session runs a full Chromium browser in the cloud. JavaScript executes, SPAs render, and the SDK waits for page stability before returning data. Content from React, Vue, Angular, and any client-side framework is fully available. This is the main advantage over static parsers like BeautifulSoup and lxml.

How does Browserbeam compare to Selenium for Python web scraping?

Selenium requires installing Chrome and ChromeDriver, writing explicit wait conditions, and manually extracting data from raw HTML. Browserbeam handles browser management in the cloud, detects page stability automatically, and returns structured data via declarative schemas. A typical scraper is 20 lines with Browserbeam versus 45 with Selenium. See the side-by-side comparison above.

Do I need to install a browser to use Browserbeam?

No. Browserbeam runs managed Chromium instances in the cloud. The Python SDK communicates with these browsers over HTTPS. Your local machine needs only Python 3.8+ and the browserbeam pip package. No Chrome, no ChromeDriver, no Playwright browser binaries.

How do I handle authentication and login flows?

Create a session, navigate to the login page, fill the form fields using session.fill() or session.fill_form(), and click the submit button. Authentication cookies persist for the entire session, so subsequent goto and click calls access protected pages automatically.

What is the best Python web scraping framework for beginners?

For beginners who need JavaScript support and don't want to manage browsers, Browserbeam is the simplest option: pip install, one method call to create a session, and declarative schemas for data extraction. For static HTML pages without JavaScript, BeautifulSoup with Requests is a good starting point but won't work on modern dynamic sites.

Can I run Browserbeam scrapers on a schedule?

Yes. Browserbeam scripts are standard Python scripts with no browser dependencies. Deploy them anywhere Python runs: cron jobs, AWS Lambda, Google Cloud Functions, GitHub Actions, or any task scheduler. The cloud browser runs on Browserbeam's infrastructure, so your deployment environment stays lightweight.
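For example, a crontab entry that runs the price monitor from earlier at the top of every hour (the interpreter path, script path, and log location are placeholders):

```shell
0 * * * * /usr/bin/python3 /opt/scrapers/price_monitor.py >> /var/log/price_monitor.log 2>&1
```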

How do I extract data from a table or list of items?

Use the _parent key in your extraction schema. Set _parent to a CSS selector matching the repeating container (.product-card, tr, .quote), and define child selectors for each field. Browserbeam extracts every matching item and returns a JSON array. For HTML tables with headers, pass just the table selector and the engine auto-parses rows using column headers as keys. See the data extraction guide for advanced patterns.


Conclusion

You now have a complete Python web scraping agent: a configured SDK, session management with proper cleanup, login and multi-page flows, three production recipes, a debugging toolkit, and the patterns to migrate from Selenium or Playwright.

The core workflow is always the same: create a session, interact with the page (click, fill, navigate), extract structured data with a schema, and close the session. The SDK handles the browser, stability detection, JavaScript rendering, and cookie management. Your code focuses on the data.

Try changing the extraction schema in the book scraper example. Pull different fields, target different pages, combine the price monitor with the change detector. The SDK handles all of these the same way.

Start with the API docs for the full method reference, or jump straight into the code:

pip install browserbeam

Sign up for a free account and run the login-and-scrape example against quotes.toscrape.com. You'll have structured JSON from a protected page in under 5 minutes.
