Your LLM can browse the web, fill forms, and pull structured data from any page. Not by parsing raw HTML (that costs 15,000+ tokens per page), but through a browser API that sends back clean markdown, clickable element refs, and a signal that tells your agent when the page is ready.
This guide walks through building an AI browser agent with Browserbeam. You'll go from zero to a working agent that navigates pages, interacts with elements, and extracts structured data, all in a handful of API calls. We'll cover the architecture that makes this possible, write working code in Python and TypeScript, and handle the real-world problems that trip up most agent builders.
In this guide, you'll learn:
- What AI browser agents are and why structured output beats raw HTML
- How Browserbeam's architecture delivers token-efficient page state
- How to build a working agent in Python and TypeScript, step by step
- Agent architecture patterns for production workflows
- Real-world use cases with code snippets you can run today
- Common mistakes and how to avoid them
- How Browserbeam compares to Selenium, Playwright, and Puppeteer
TL;DR: AI browser agents pair an LLM with a browser session so your model can browse, click, and extract data autonomously. Browserbeam's API returns structured markdown, stable element refs, and change diffs instead of raw HTML, cutting token usage by up to 95%. This guide shows you how to build one, step by step.
What Are AI Browser Agents (and Why Build One)?
An AI browser agent is an LLM with browser access. Your model decides what to do. The browser executes it. The results come back in a format the model can reason about.
That last part is where most setups fall apart.
Tools like Puppeteer and Playwright give your agent raw HTML. A typical page dumps 50,000 to 100,000 characters into the context window. Most of it is noise: script tags, inline styles, tracking pixels, deeply nested divs that carry zero useful information. Your agent wastes tokens reading boilerplate instead of making decisions.
An AI browser agent built on a structured API works differently. Instead of a wall of HTML, the agent receives:
- Page content as clean markdown, stripped of scripts and layout noise
- Interactive elements with stable refs (e1, e2, e3), so the agent never constructs brittle CSS selectors
- Form field maps showing which inputs belong to which form
- A stability signal (stable: true) that tells the agent when the page is done loading
- Change diffs on repeat observations, so the agent sees what changed instead of re-reading the whole page
Here's what that difference looks like in practice:
| Aspect | Raw HTML (Puppeteer/Playwright) | Structured API (Browserbeam) |
|---|---|---|
| Tokens per page | ~15,000-25,000 | ~1,500-3,000 |
| Element targeting | CSS selectors (break when markup changes) | Stable refs like e1, e2 |
| Page readiness | sleep(5000) and hope | stable: true signal |
| Change detection | Re-scrape the entire page | changes object with content delta |
| Cookie banners | Write dismissal code yourself | auto_dismiss_blockers: true |
| Infrastructure | Manage Chromium, handle crashes, monitor memory | One API call |
We tested this across 500 pages and saw a 95% reduction in tokens compared to raw HTML. That's not optimization. That's a different category of tool.
Raw HTML is not an API response. It's a liability.
When Raw HTML Fails: Why Agents Need Structured Browser Output
Your LLM reads tokens. Every token it spends parsing irrelevant markup is a token it can't spend reasoning about the task. That's the core problem with feeding raw HTML to an agent.
The Token Cost of Parsing Raw DOM
Take a simple product page on an e-commerce site. The visible content (a product title, price, description, and a few reviews) fits in about 500 tokens. But the raw HTML for that same page includes:
- Inline CSS and <style> blocks (3,000-5,000 tokens)
- JavaScript bundles and <script> tags (5,000-10,000 tokens)
- Tracking pixels, analytics snippets, and ad containers (1,000-3,000 tokens)
- Deeply nested <div> wrappers for layout (2,000-4,000 tokens)
- SVG icons, base64 images, and metadata (1,000-3,000 tokens)
The total? Somewhere between 15,000 and 25,000 tokens for a page where the useful content is 500 tokens. Your agent spends 95% of its context window on noise.
That's not a minor inefficiency. At current LLM pricing, processing 1,000 pages of raw HTML costs roughly $15-25 in input tokens alone. The same 1,000 pages through Browserbeam's structured output costs under $2. Scale that to an agent running hourly, and the difference is hundreds of dollars per month.
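The arithmetic behind those numbers is simple enough to sketch. This assumes a flat input price of $1 per million tokens, which is illustrative only; actual LLM pricing varies by model:

```python
# Back-of-envelope cost comparison for 1,000 pages, using the token
# counts above and an assumed input price of $1 per 1M tokens.
PRICE_PER_TOKEN = 1.00 / 1_000_000

def batch_cost(tokens_per_page, pages=1_000):
    """Input-token cost for processing `pages` pages."""
    return tokens_per_page * pages * PRICE_PER_TOKEN

raw_html = batch_cost(20_000)    # mid-range raw HTML estimate
structured = batch_cost(1_500)   # structured markdown estimate

print(f"Raw HTML:   ${raw_html:.2f}")
print(f"Structured: ${structured:.2f}")
print(f"Savings:    {1 - structured / raw_html:.0%}")
```

At these assumed prices, the raw-HTML batch lands in the $15-25 range from the text, and the structured batch stays under $2.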
Dynamic Content, Shadow DOM, and iFrames
Raw HTML scraping assumes the content you want is in the initial HTML response. Modern web apps break that assumption in three ways.
Single-page applications render content with JavaScript after the initial load. Your raw HTML scraper gets an empty <div id="root"></div> and nothing else. The actual content exists only after React, Vue, or Angular hydrates the DOM.
Shadow DOM components encapsulate their markup behind a shadow boundary. Standard DOM queries like document.querySelector cannot reach inside a shadow root. Web components, custom widgets, and many modern UI frameworks use shadow DOM by default.
iFrames load content in a separate document context. Forms, payment widgets, and embedded content sit inside iframes that raw scraping tools either skip entirely or require separate requests to access.
Browserbeam handles all three. It runs a full Chromium browser, waits for JavaScript to execute, traverses shadow DOM boundaries, and returns the rendered page state. Your agent gets the content a human user would see, not the raw source. That's what makes it a real alternative to tools like Selenium or Playwright for dynamic content scraping.
Side-by-Side: Raw Scraping vs Browserbeam Output
Here's what the same page looks like through raw scraping versus Browserbeam. Consider a search results page with three results:
Raw HTML (abbreviated, still ~8,000 tokens):
<div class="search-wrapper" data-v-2f9a1b3c>
<div class="results-container" role="main">
<div class="result-card bg-white shadow-md rounded-lg p-4 mb-3
hover:shadow-lg transition-shadow duration-200"
data-result-id="1" data-tracking="imp_1_abc">
<div class="flex items-start gap-3">
<div class="result-icon flex-shrink-0">
<svg xmlns="http://www.w3.org/2000/svg"
class="h-5 w-5 text-blue-500"
viewBox="0 0 20 20" fill="currentColor">...</svg>
</div>
<div class="result-content flex-1 min-w-0">
<h3 class="text-lg font-semibold text-blue-700
hover:underline cursor-pointer truncate">
<a href="/products/widget-pro"
data-tracking="click_1_abc">Widget Pro</a>
</h3>
<p class="text-sm text-gray-600 mt-1 line-clamp-2">
Professional-grade widget for teams...</p>
<span class="text-green-600 font-bold">$49.99</span>
</div>
</div>
</div>
<!-- ...two more result cards with similar nesting... -->
</div>
<script>
window.__TRACKING__.push({event:"search_render",count:3});
</script>
</div>
Browserbeam structured output (~200 tokens):
{
"markdown": {
"content": "## Search Results\n\n1. **Widget Pro** - Professional-grade widget for teams... $49.99\n2. **Widget Basic** - Starter widget for small projects... $19.99\n3. **Widget Enterprise** - Custom solution with SLA... Contact us"
},
"interactive_elements": [
{"ref": "e1", "tag": "a", "label": "Widget Pro", "in": "main"},
{"ref": "e2", "tag": "a", "label": "Widget Basic", "in": "main"},
{"ref": "e3", "tag": "a", "label": "Widget Enterprise", "in": "main"}
]
}
Your agent reads three lines of markdown and three element refs. It knows the products, the prices, and how to click each one. The raw HTML version buries that same information under layers of CSS utility classes, tracking attributes, and SVG icons that the LLM must parse and discard.
Structured output turns a parsing problem into a reading problem.
How Browserbeam Works for AI Browser Agents
Browserbeam is a cloud browser automation API that gives your agent access to a managed headless browser. You send HTTP requests. Chromium executes them. The response comes back as structured, token-efficient data your agent can act on.
The core loop looks like this:
- Create a session with a target URL
- Observe the page to get markdown, element refs, forms, and scroll position
- Act on elements using refs (click, fill, select, scroll)
- Read the diff to see what changed after each action
- Close the session when done
Every response includes a page object with the full state. Your agent always knows what the page looks like, what it can interact with, and whether the page is still loading.
Structured Output, Not Raw HTML
When your agent observes a page, it gets a response like this:
{
"session_id": "ses_abc123",
"expires_at": "2026-04-03T14:05:00Z",
"request_id": "req_1a2b3c4d5e6f",
"completed": 1,
"page": {
"url": "https://books.toscrape.com",
"title": "All products | Books to Scrape - Sandbox",
"stable": true,
"markdown": {
"content": "# All products\n\n1000 results - showing 1 to 20..."
},
"interactive_elements": [
{"ref": "e1", "tag": "input", "role": "search", "label": "Search", "in": "nav", "form": "f1"},
{"ref": "e2", "tag": "button", "label": "Submit", "in": "nav", "form": "f1"},
{"ref": "e3", "tag": "a", "role": "link", "label": "More information", "in": "main"}
],
"forms": [
{"ref": "f1", "action": "/search", "method": "GET", "fields": ["e1", "e2"]}
],
"map": [
{"section": "nav", "selector": "nav", "hint": "Search (1 form, 2 elements)"},
{"section": "main", "selector": "main", "hint": "Example Domain (1 link)"},
{"section": "footer", "selector": "footer", "hint": "More information... (1 link)"}
],
"changes": null,
"scroll": {"y": 0, "height": 2400, "viewport": 720, "percent": 30}
},
"media": [],
"extraction": null,
"blockers_dismissed": ["cookie_consent"],
"error": null
}
Your agent sees the page title, clean content, every interactive element with a stable ref, form structure, and a map of page sections with hints. The map tells the agent where content lives (nav, main, footer) without reading the full page. No DOM traversal. No selector guesswork. The agent reads that e1 is a search input and e2 is the submit button. It fills e1, clicks e2, done.
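One way to use this response is to flatten it into a compact prompt for the LLM. A minimal sketch, assuming the page object has the shape shown above; the page_to_prompt helper is our own, not part of the SDK:

```python
def page_to_prompt(page: dict) -> str:
    """Render a Browserbeam-style page object as a compact prompt block."""
    lines = [
        f"URL: {page['url']}",
        f"Title: {page['title']}",
        "",
        page["markdown"]["content"],
        "",
        "Interactive elements:",
    ]
    for el in page.get("interactive_elements", []):
        # Each element becomes one line the model can pick a ref from.
        lines.append(
            f"  {el['ref']}: <{el['tag']}> {el.get('label', '?')} (in {el.get('in', '?')})"
        )
    return "\n".join(lines)

page = {
    "url": "https://books.toscrape.com",
    "title": "All products",
    "markdown": {"content": "# All products\n\n1000 results"},
    "interactive_elements": [
        {"ref": "e1", "tag": "input", "label": "Search", "in": "nav"},
        {"ref": "e2", "tag": "button", "label": "Submit", "in": "nav"},
    ],
}
print(page_to_prompt(page))
```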
Diff Tracking Cuts Repeat Costs
On the first observation, changes is null. On every observation after that, Browserbeam sends a diff showing exactly what changed:
{
"changes": {
"content_changed": true,
"content_delta": "## Search Results\n\n1. First result...\n2. Second result...",
"elements_added": [
{"ref": "e4", "tag": "a", "label": "First result", "in": "main"}
],
"elements_removed": []
}
}
Your agent reads the delta, not the whole page again. For multi-step workflows where the agent clicks through several pages, this keeps token usage flat instead of multiplying with every step.
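Folding the diff into cached state might look like this. A sketch assuming the changes shape shown above; apply_changes is a hypothetical helper, not an SDK method:

```python
def apply_changes(elements, changes):
    """Merge a Browserbeam-style changes diff into a cached element list."""
    if not changes:
        return elements  # first observation: nothing to merge
    removed = {el["ref"] for el in changes.get("elements_removed", [])}
    kept = [el for el in elements if el["ref"] not in removed]
    return kept + changes.get("elements_added", [])

cached = [{"ref": "e1", "tag": "input", "label": "Search"}]
diff = {
    "content_changed": True,
    "elements_added": [{"ref": "e4", "tag": "a", "label": "First result"}],
    "elements_removed": [],
}
elements = apply_changes(cached, diff)
print([el["ref"] for el in elements])  # ['e1', 'e4']
```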
Stability Detection
Most automation scripts use sleep(3000) and hope the page is loaded. Browserbeam watches three signals: network idle (no requests for 300ms), DOM quiet (no mutations for 500ms), and CSS animations complete. When all three pass, stable is true. When they don't, stable is false with timing metadata so your agent can decide whether to wait or proceed.
No guesswork. No wasted time.
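The three-signal rule can be sketched as a pure function. The 300ms and 500ms thresholds come from the text above; representing the signals as millisecond timestamps is our assumption:

```python
NETWORK_IDLE_MS = 300  # no network requests for this long
DOM_QUIET_MS = 500     # no DOM mutations for this long

def is_stable(now_ms, last_request_ms, last_mutation_ms, animations_done):
    """All three signals must pass for the page to count as stable."""
    network_idle = now_ms - last_request_ms >= NETWORK_IDLE_MS
    dom_quiet = now_ms - last_mutation_ms >= DOM_QUIET_MS
    return network_idle and dom_quiet and animations_done

print(is_stable(1000, 600, 400, True))   # True: both quiet windows elapsed
print(is_stable(1000, 800, 400, True))   # False: a request fired 200ms ago
```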
Building Your First AI Browser Agent
Let's build an agent that navigates to a page, fills a search form, and extracts structured results. Four steps, working code, no boilerplate.
Step 1: Create a Session and Observe the Page
The first API call creates a browser session and navigates to your target URL. Browserbeam automatically observes the page after navigation, so you get the full page state in the response.
curl -X POST https://api.browserbeam.com/v1/sessions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://books.toscrape.com"}'
from browserbeam import Browserbeam
client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")
print(session.page.title)
for el in session.page.interactive_elements:
print(f" {el['ref']}: {el['tag']} - {el.get('label', 'no label')}")
import Browserbeam from "@browserbeam/sdk";
const client = new Browserbeam({ apiKey: "YOUR_API_KEY" });
const session = await client.sessions.create({ url: "https://books.toscrape.com" });
console.log(session.page?.title);
for (const el of session.page?.interactive_elements ?? []) {
console.log(` ${el.ref}: ${el.tag} - ${el.label ?? "no label"}`);
}
require "browserbeam"
client = Browserbeam::Client.new(api_key: "YOUR_API_KEY")
session = client.sessions.create(url: "https://books.toscrape.com")
puts session.page.title
session.page.interactive_elements.each do |el|
puts " #{el['ref']}: #{el['tag']} - #{el['label'] || 'no label'}"
end
The response includes page.interactive_elements, a list of every clickable, fillable, and selectable element on the page. Each element has a ref like e1 or e2 that your agent uses to target it in the next step. No CSS selectors needed.
Step 2: Interact Using Element Refs
Your agent now knows which elements exist and what they do. It picks the right ref and acts. Fill an input, click a button, select a dropdown option. All by ref.
curl -X POST https://api.browserbeam.com/v1/sessions/ses_abc123/act \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"steps": [
{"fill": {"ref": "e1", "value": "browser automation"}},
{"click": {"ref": "e2"}}
]
}'
session.fill(ref="e1", value="browser automation")
session.click(ref="e2")
print(session.page.changes)
print(session.page.markdown.content[:500])
await session.fill({ ref: "e1", value: "browser automation" });
await session.click({ ref: "e2" });
console.log(session.page?.changes);
console.log(session.page?.markdown?.content?.slice(0, 500));
session.fill(ref: "e1", value: "browser automation")
session.click(ref: "e2")
puts session.page.changes
puts session.page.markdown.content[0..499]
After each action, Browserbeam auto-observes the page and returns fresh state. The changes object tells your agent exactly what happened: new content, added elements, removed elements. Your agent reads the diff instead of re-processing the entire page.
Step 3: Extract Structured Data
Once your agent reaches the right page, use the extract step to pull structured data into JSON. Define a schema with CSS selectors, and Browserbeam returns clean, typed results.
curl -X POST https://api.browserbeam.com/v1/sessions/ses_abc123/act \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"steps": [{
"extract": {
"results": [{
"_parent": ".result-item",
"title": "h3 >> text",
"url": "a >> href",
"description": "p >> text"
}]
}
}]
}'
result = session.extract(
results=[{
"_parent": ".result-item",
"title": "h3 >> text",
"url": "a >> href",
"description": "p >> text"
}]
)
for item in result.extraction["results"]:
print(f"{item['title']} -> {item['url']}")
const result = await session.extract({
results: [{
_parent: ".result-item",
title: "h3 >> text",
url: "a >> href",
description: "p >> text"
}]
});
for (const item of result.extraction?.results ?? []) {
console.log(`${item.title} -> ${item.url}`);
}
result = session.extract(
results: [{
"_parent" => ".result-item",
"title" => "h3 >> text",
"url" => "a >> href",
"description" => "p >> text"
}]
)
result.extraction["results"].each do |item|
puts "#{item['title']} -> #{item['url']}"
end
The _parent key tells Browserbeam to find all matching containers and extract each field from within them. The result is a JSON array of objects. No regex parsing. No LLM token cost for web data extraction. Just structured data, ready to use.
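Extracted values arrive as display strings, so a small normalizer is often useful before storing them. A sketch; the price field and the parse_price helper are illustrative additions, not part of the extraction schema shown above:

```python
import re

def parse_price(text):
    """Pull the first decimal number out of a display price string."""
    match = re.search(r"(\d+(?:\.\d+)?)", text.replace(",", ""))
    return float(match.group(1)) if match else None

# Hypothetical extraction result with a price field added for illustration.
extraction = {"results": [
    {"title": "Widget Pro", "url": "/products/widget-pro", "price": "$49.99"},
    {"title": "Widget Max", "url": "/products/widget-max", "price": "$1,019.00"},
]}
for item in extraction["results"]:
    item["price_value"] = parse_price(item["price"])

print([item["price_value"] for item in extraction["results"]])
```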
Step 4: Close the Session
Always close sessions when your agent is done. Open sessions hold browser resources and count against your concurrency limit.
curl -X DELETE https://api.browserbeam.com/v1/sessions/ses_abc123 \
-H "Authorization: Bearer YOUR_API_KEY"
session.close()
await session.close();
session.close
You can also include {"close": {}} as the last step in any request to auto-destroy the session when the steps finish. This is useful for single-call workflows where you don't need the session afterwards.
That's the complete AI agent workflow: create, observe, act, extract, close. Four API calls, structured data at every step, and your LLM never sees a single line of raw HTML.
Agent Architecture Patterns
The four-step tutorial above gives you the mechanics. Real-world agents need more structure. Here are the patterns that work in production.
The Observe-Act-Extract Loop
Most browser agents follow a three-phase loop. The agent observes the page, decides on an action, executes it, and checks the result before looping again.
from browserbeam import Browserbeam
client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://books.toscrape.com")
goal = "Find the price of 'A Light in the Attic'"
max_iterations = 10
for i in range(max_iterations):
page_state = session.page.markdown.content
elements = session.page.interactive_elements
next_action = ask_llm(goal, page_state, elements)
if next_action["type"] == "click":
session.click(ref=next_action["ref"])
elif next_action["type"] == "fill":
session.fill(ref=next_action["ref"], value=next_action["value"])
elif next_action["type"] == "extract":
result = session.extract(**next_action["schema"])
break
elif next_action["type"] == "done":
break
session.close()
The key is that every iteration gets fresh state from Browserbeam. The agent never works with stale data because each action triggers an auto-observation. This AI orchestration pattern works with any LLM that supports tool use or function calling.
Multi-Step Planning with Fallback
Simple agents make one decision at a time. More capable agents plan multiple steps ahead and recover when things go wrong.
The pattern works like this:
- Plan: The LLM receives the goal and current page state, then outputs a sequence of steps
- Execute: Your code sends all steps in a single act request
- Verify: Check the response. If error is null, the plan succeeded. If not, the agent re-plans from the current state
- Retry: The error response includes the full page state, so the agent has context to adjust
steps = [
{"fill": {"ref": "e1", "value": "quarterly revenue report"}},
{"click": {"ref": "e2"}},
{"extract": {"title": "h1 >> text", "data": "table >> text"}}
]
result = session.act(steps=steps)
if result.error:
page_state = result.page.markdown.content
new_steps = ask_llm_to_replan(goal, page_state, result.error)
result = session.act(steps=new_steps)
Sending multiple steps in one request is faster than sending them one at a time. The tradeoff is that if step 2 fails, step 3 never executes. For critical workflows, send steps individually. For speed, batch them.
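The step-by-step option can be wrapped in a small driver that stops at the first failure so the caller can re-plan from that point. A sketch using a fake executor standing in for session.act:

```python
def run_steps_individually(execute, steps):
    """Run steps one at a time; stop at the first error so the caller
    can re-plan. `execute` stands in for a per-step session.act call."""
    completed = []
    for step in steps:
        result = execute(step)
        if result.get("error"):
            return completed, result["error"]
        completed.append(step)
    return completed, None

# Fake executor for demonstration: the click fails, the extract never runs.
calls = []
def fake_execute(step):
    calls.append(step)
    return {"error": "element_not_found"} if "click" in step else {"error": None}

steps = [{"fill": {"ref": "e1"}}, {"click": {"ref": "e9"}}, {"extract": {}}]
done, err = run_steps_individually(fake_execute, steps)
print(len(done), err)  # 1 element_not_found
```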
Parallel Sessions for Batch Workflows
When your agent needs to visit multiple URLs (competitor pricing, job listings across sites, product catalogs), run sessions in parallel. Each Browserbeam session is independent, so there's no shared state to worry about.
import asyncio
from browserbeam import AsyncBrowserbeam
client = AsyncBrowserbeam(api_key="YOUR_API_KEY")
urls = [
"https://competitor-a.com/pricing",
"https://competitor-b.com/pricing",
"https://competitor-c.com/pricing",
]
async def scrape_pricing(url):
session = await client.sessions.create(url=url)
result = await session.extract(
plans=[{
"_parent": ".plan-card",
"name": "h3 >> text",
"price": ".price >> text"
}]
)
await session.close()
return {"url": url, "plans": result.extraction["plans"]}
async def main():
    # gather must run inside an event loop; asyncio.run expects a coroutine
    return await asyncio.gather(*(scrape_pricing(u) for u in urls))
results = asyncio.run(main())
Three sites, three sessions, one batch of results. The Python SDK makes async sessions straightforward with the AsyncBrowserbeam client.
When to Use scroll_collect vs Step-by-Step
Two patterns handle long pages. Choosing the wrong one costs time or data.
| Scenario | Use | Why |
|---|---|---|
| Infinite scroll (social feeds, product grids) | scroll_collect | Captures all lazy-loaded content in one step |
| Paginated results (search engines, directories) | Step-by-step navigation | Each page is a separate URL; scroll won't trigger pagination |
| Long static pages (articles, documentation) | scroll_collect | Collects content below the fold that wasn't in the initial viewport |
| Pages with "Load More" buttons | Step-by-step click | Scroll alone won't trigger the button click |
scroll_collect is faster when it works because it's a single API call. Step-by-step gives you more control when the page requires explicit interaction to reveal content. When in doubt, try scroll_collect first. If the content you need isn't in the result, switch to clicking the "Load More" or "Next" button in a loop.
Handling Common Challenges
Real-world pages are messier than examples. Here's how to handle the problems your agent will actually encounter.
Dynamic Content and Single-Page Apps
SPAs render content with JavaScript after the initial page load. Browserbeam's stability detection handles most cases automatically. When it doesn't, use wait_for to wait for a specific element or wait_until to wait for a JavaScript condition.
{"goto": {
"url": "https://books.toscrape.com/catalogue/page-2.html",
"wait_until": "document.querySelectorAll('article.product_pod').length > 0",
"wait_timeout": 10000
}}
Infinite Scroll and Lazy Loading
Pages that load content as you scroll need the scroll_collect step. It scrolls through the entire page in viewport-sized increments, waits for lazy-loaded content at each position, and captures everything in one unified observation.
{"scroll_collect": {"max_scrolls": 30, "max_text_length": 50000}}
One step replaces dozens of scroll-and-wait loops. The result includes all dynamically loaded content, ready for extraction.
Cookie Banners and Popups
Cookie consent dialogs, chat widgets, and marketing popups block your agent's view. Browserbeam auto-dismisses common blockers by default (auto_dismiss_blockers: true). The response tells your agent what was dismissed:
"blockers_dismissed": ["cookie_consent"]
No extra logic. No popup-specific selectors.
Error Recovery
When a step fails (element not found, navigation timeout, CAPTCHA detected), Browserbeam stops execution but still returns the current page state. Your agent sees the error and the page, so it can decide what to do next.
{
"completed": 1,
"error": {
"step": 1,
"action": "click",
"code": "element_not_found",
"message": "No visible element found matching ref \"e5\""
},
"page": {
"url": "https://books.toscrape.com",
"title": "All products | Books to Scrape - Sandbox",
"stable": true,
"markdown": {"content": "..."},
"interactive_elements": [...]
}
}
The agent can re-observe, try a different ref, or take a screenshot for debugging. The page state is always there.
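A simple triage table turns error codes into a next move. Only element_not_found and captcha_detected appear in this guide; the other codes below are assumptions for illustration:

```python
# Hypothetical triage table -- codes other than element_not_found and
# captcha_detected are assumptions, not documented Browserbeam codes.
RETRYABLE = {"navigation_timeout", "rate_limited"}
REPLAN = {"element_not_found"}
ESCALATE = {"captcha_detected"}

def next_move(error):
    """Decide what the agent does after a failed step."""
    code = error["code"]
    if code in RETRYABLE:
        return "retry"     # transient: back off and repeat the step
    if code in REPLAN:
        return "replan"    # re-read page state, ask the LLM for new steps
    if code in ESCALATE:
        return "escalate"  # hand off to a human
    return "abort"

print(next_move({"code": "element_not_found", "step": 1}))  # replan
```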
Quick Reference: When to Use What
| Challenge | Solution | How |
|---|---|---|
| Page still loading | Stability detection | Check page.stable (automatic) |
| SPA with JS rendering | Wait for condition | wait_until on goto step |
| Infinite scroll | Collect all content | scroll_collect step |
| Cookie banners, popups | Auto-dismiss | auto_dismiss_blockers: true (default) |
| Step fails mid-workflow | Re-plan from page state | Error response includes full page object |
| Session cleanup | Close when done | session.close() or {"close": {}} step |
Real-World Use Cases
Patterns become clearer with concrete examples. These are the three most common agent workflows we see in production.
Price Monitoring Across Competitor Sites
A common use case for AI data collection: visit competitor product pages daily, extract current prices, and alert your team when something changes. The agent navigates to each competitor, extracts pricing data, and compares it against your stored baseline.
from browserbeam import Browserbeam
client = Browserbeam(api_key="YOUR_API_KEY")
bookstores = [
{"name": "Books to Scrape", "url": "https://books.toscrape.com"},
{"name": "Quotes to Scrape", "url": "https://quotes.toscrape.com"},
]
for store in bookstores:
session = client.sessions.create(url=store["url"])
result = session.extract(
items=[{
"_parent": "article.product_pod, .quote",
"name": "h3 a, .text >> text",
"price": ".price_color >> text",
}]
)
session.close()
for item in result.extraction["items"]:
check_for_price_change(store["name"], item)
The _parent pattern groups related fields, so you get structured objects instead of flat lists. Run this on a daily cron job and you have a competitive intelligence feed without manual checking.
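The check_for_price_change helper above is left undefined; one minimal way to implement it is an in-memory baseline comparison. A sketch only -- production code would persist the baseline in a database:

```python
# In-memory baseline keyed by (store, item name). Illustrative only.
baseline = {("Books to Scrape", "A Light in the Attic"): "£51.77"}

def check_for_price_change(store, item, baseline=baseline):
    """Compare an extracted item against the stored baseline price."""
    key = (store, item["name"])
    old = baseline.get(key)
    baseline[key] = item["price"]  # always record the latest observation
    if old is not None and old != item["price"]:
        return f"{store}: {item['name']} changed {old} -> {item['price']}"
    return None  # new item, or price unchanged

alert = check_for_price_change(
    "Books to Scrape", {"name": "A Light in the Attic", "price": "£45.00"}
)
print(alert)
```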
Lead Enrichment from Company Pages
Sales teams need more than a company name. An AI web bot can visit a prospect's website, extract the company description, team size signals, technology stack, and contact information, then feed it into your CRM.
session = client.sessions.create(url="https://realpython.github.io/fake-jobs/")
result = session.extract(
page_title="h1 >> text",
page_description=".subtitle >> text",
listings=[{
"_parent": ".card",
"_limit": 5,
"title": "h2.title >> text",
"company": "h3.company >> text",
"location": ".location >> text"
}]
)
for listing in result.extraction["listings"]:
enrich_lead(listing["company"], listing["location"])
session.close()
The agent extracts structured data in one call. No need to visit individual pages when the listing contains the information you need.
Content Aggregation for Research Agents
Research agents collect information from multiple sources and synthesize it. A content aggregation workflow visits news sites, documentation pages, or forum threads, extracts relevant sections, and passes them to the LLM for summarization.
sources = [
"https://news.ycombinator.com",
"https://books.toscrape.com",
"https://quotes.toscrape.com",
]
collected = []
for url in sources:
session = client.sessions.create(url=url)
result = session.scroll_collect(max_scrolls=10, max_text_length=30000)
collected.append({
"url": url,
"title": result.page.title,
"content": result.page.markdown.content
})
session.close()
summary = ask_llm("Summarize the key trends from these sources:", collected)
scroll_collect captures the full page content including lazy-loaded sections. The LLM receives clean markdown, not HTML, so the summarization step is token-efficient.
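Before handing the collected pages to the LLM, it helps to cap each source's contribution so one long page doesn't crowd out the others. A sketch using characters as a rough proxy for tokens (~4 characters per token is a common heuristic):

```python
def trim_to_budget(collected, max_total_chars=60_000):
    """Split a character budget evenly across sources and truncate each.
    Characters stand in for tokens here (~4 chars/token heuristic)."""
    per_source = max_total_chars // max(len(collected), 1)
    return [
        {**doc, "content": doc["content"][:per_source]}
        for doc in collected
    ]

collected = [
    {"url": "https://news.ycombinator.com", "title": "HN", "content": "x" * 50_000},
    {"url": "https://books.toscrape.com", "title": "Books", "content": "y" * 10_000},
]
trimmed = trim_to_budget(collected, max_total_chars=20_000)
print([len(doc["content"]) for doc in trimmed])  # [10000, 10000]
```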
For agents that need to secure their browsing sessions, especially when visiting untrusted sources, Browserbeam's session isolation ensures each session runs in its own browser context with no shared cookies or storage.
Common Mistakes When Building Web Agents
After working with hundreds of agent implementations, these are the five mistakes we see most often. Every one of them is avoidable.
Mistake 1: Skipping close and Leaking Sessions
Open sessions consume server resources and count against your concurrency limit. If your agent crashes or your code throws an exception before calling session.close(), the session stays open until it times out (typically 5 minutes).
The fix: Use a try/finally block or include {"close": {}} as the last step in your request.
session = client.sessions.create(url="https://books.toscrape.com")
try:
result = session.extract(title="h1 >> text")
process(result)
finally:
session.close()
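A context manager makes the close guarantee reusable across your codebase. The wrapper below is our own sketch (the SDK may or may not support with natively), demonstrated with a fake client standing in for Browserbeam:

```python
from contextlib import contextmanager

@contextmanager
def browser_session(client, url):
    """Guarantee session.close() even if the body raises."""
    session = client.sessions.create(url=url)
    try:
        yield session
    finally:
        session.close()

# Minimal fakes so the pattern can be demonstrated without the real SDK.
class FakeSession:
    closed = False
    def close(self):
        self.closed = True

class FakeClient:
    class sessions:
        @staticmethod
        def create(url):
            return FakeSession()

with browser_session(FakeClient(), "https://books.toscrape.com") as s:
    pass  # observe / act / extract here
print(s.closed)  # True
```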
Mistake 2: Sending Full Page HTML to the LLM
Some developers fetch pages through Browserbeam, then use execute_js to grab document.body.innerHTML and send that to the LLM. This defeats the purpose. You're paying for structured output and ignoring it.
The fix: Use session.page.markdown.content for page text and session.page.interactive_elements for clickable items. If you need specific data, use session.extract() with a schema. The structured output is already optimized for LLM consumption.
Mistake 3: Ignoring the Stability Signal
When page.stable is false, the page is still loading. Acting on an unstable page means your agent might click elements that haven't rendered yet or extract incomplete content.
The fix: Check page.stable before acting. If it's false, call session.observe() to wait for stability. For SPAs with custom loading states, use wait_until with a JavaScript condition.
session = client.sessions.create(url="https://books.toscrape.com")
if not session.page.stable:
session.observe()
result = session.extract(books=[{"_parent": "article.product_pod", "_limit": 3, "title": "h3 a >> text"}])
Mistake 4: Hardcoding CSS Selectors Instead of Using Refs
Refs like e1, e2, e3 are assigned fresh on every observation. They point to the elements currently on the page. CSS selectors, by contrast, break when developers change class names, restructure the DOM, or add new elements.
The fix: Use refs from page.interactive_elements for clicks, fills, and selections. Use CSS selectors only in extract schemas where you need to target content containers.
| Method | Use Case | Stability |
|---|---|---|
| Refs (e1, e2) | Clicking buttons, filling inputs, selecting options | High (fresh each observation) |
| CSS selectors | Extraction schemas targeting content structure | Medium (depends on site changes) |
| XPath | Avoid entirely | Low (extremely brittle) |
Mistake 5: No Retry Logic for Transient Failures
Network timeouts, rate limits, and temporary page errors happen in production. An agent without retry logic fails on the first hiccup.
The fix: Wrap your agent's main loop in retry logic with exponential backoff. Browserbeam returns specific error codes so you can distinguish between retryable errors (timeouts, rate limits) and permanent ones (invalid session, authentication required).
import time

def run_with_retry(fn, max_retries=3, retryable=(TimeoutError,)):
    """Retry fn with exponential backoff; permanent errors re-raise at once."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            # Transient failure (e.g. timeout, rate limit): back off and retry.
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s
These five mistakes account for most of the agent failures we debug. Fix them, and your browser automation pipeline becomes significantly more reliable.
Browserbeam vs Traditional Automation Tools
If you're choosing between Browserbeam and traditional tools like Selenium, Playwright, or Puppeteer, the decision depends on what your agent needs. Each tool solves a different problem.
| Feature | Browserbeam | Selenium | Playwright | Puppeteer |
|---|---|---|---|---|
| Output format | Structured markdown + refs | Raw HTML / DOM | Raw HTML / DOM | Raw HTML / DOM |
| Infrastructure | Cloud-managed (API) | Self-managed browsers | Self-managed browsers | Self-managed browsers |
| Element targeting | Stable refs (e1, e2) | CSS/XPath selectors | CSS/XPath selectors | CSS/XPath selectors |
| Change detection | Built-in diff tracking | None (re-scrape) | None (re-scrape) | None (re-scrape) |
| Page readiness | Stability signal (automatic) | Explicit waits | Auto-wait for actions | Explicit waits |
| Token cost for LLMs | ~1,500-3,000/page | ~15,000-25,000/page | ~15,000-25,000/page | ~15,000-25,000/page |
| Cookie/popup handling | Automatic | Manual code | Manual code | Manual code |
| Anti-bot handling | Proxy support, fingerprint mgmt | Limited | Stealth plugins | Stealth plugins |
| Setup time | pip install browserbeam | Browser drivers + config | npx playwright install | npm install puppeteer |
When Browserbeam Is the Right Choice
Browserbeam fits best when your application involves an LLM or AI agent making decisions based on page content. The structured output, diff tracking, and stability detection are specifically designed for agent workflows where token efficiency and reliable page state matter.
Use Browserbeam when:
- Your LLM needs to reason about web pages (agent workflows, web data extraction)
- You want to skip infrastructure management (no browsers to deploy or maintain)
- You need change detection across repeated observations
- You're building multi-step workflows where the agent navigates through several pages
- You want automatic handling of cookie banners, popups, and common blockers
When Traditional Tools Still Make Sense
Selenium, Playwright, and Puppeteer are the right choice when you need fine-grained browser control and don't need LLM-optimized output.
Use traditional tools when:
- You're running automated test suites (Playwright and Selenium excel at this)
- You need raw CDP (Chrome DevTools Protocol) access for debugging or performance profiling
- Your scripts don't involve an LLM and don't need structured output
- You need to run browsers locally with specific configurations
- You're building visual regression testing pipelines
The key distinction: Browserbeam is a browser API for AI agents. Playwright and Puppeteer are browser automation libraries for developers. If your code makes the decisions, traditional tools work fine. If your LLM makes the decisions, structured output wins.
Frequently Asked Questions
What are browser agents?
Browser agents are AI systems that can interact with web pages autonomously. They combine an LLM (for decision-making) with a browser session (for execution). The agent observes a page, decides what to click or fill, executes the action, and reads the result. With Browserbeam, the browser returns structured data at each step so the LLM can reason efficiently without parsing raw HTML.
Do I need to install Playwright or Puppeteer?
No. Browserbeam runs a headless browser (Chromium) in the cloud. You send HTTP requests and get structured responses. There's no browser binary to install, no dependency to manage, and no crash recovery to build. The SDKs for Python, TypeScript, and Ruby wrap these API calls into method calls on a session object.
How does Browserbeam handle dynamic content that loads after the initial page render?
Browserbeam's stability detection watches three signals before reporting a page as ready: network idle (no pending requests for 300ms), DOM quiet (no mutations for 500ms), and CSS animations complete. For SPAs that need custom readiness logic, use wait_for (CSS selector) or wait_until (JavaScript expression) on the goto step to define your own condition.
How much does structured output reduce token usage?
In our tests across 500 pages, structured markdown output used 95% fewer tokens than raw HTML for the same pages. A page that costs 20,000 tokens as HTML costs around 1,500 tokens as Browserbeam markdown. Diff tracking reduces costs further on multi-step workflows, since your agent only processes what changed between observations.
Can I use Browserbeam with any LLM?
Yes. Browserbeam is a REST API that returns JSON. Any LLM that supports function calling or tool use (OpenAI, Anthropic, Google, open-source models) can drive a Browserbeam session. The structured output format works especially well with LLM tool use patterns, where the model picks actions from a set of available functions.
What happens if my agent gets stuck or a page has a CAPTCHA?
Browserbeam detects CAPTCHAs and returns a captcha_detected error code along with the current page state. Your agent can log the issue, try a different approach, or escalate to a human. The key is that execution stops cleanly and your agent always has context to make the next decision.
What to Build Next
You now have the building blocks for an autonomous browser agent. A session, structured observations, element refs, diff tracking, and error recovery. Each piece is a single API call.
The next step is chaining these calls into a loop. Give your LLM a goal ("find the cheapest flight from NYC to London"), a set of available actions (navigate, fill, click, extract, scroll), and the page state from Browserbeam. The model picks the next action. Browserbeam executes it. The model reads the diff and decides again. That's an autonomous web agent, and you can build it in under 100 lines of code.
Start with the Browserbeam API docs for the full reference, sign up for a free account, or install the SDK and try it yourself:
pip install browserbeam # Python
npm install @browserbeam/sdk # TypeScript
gem install browserbeam # Ruby
Build something. Break something. Then build it better.