By the end of this guide, you'll have a working web scraping pipeline that extracts structured JSON from any website. No regex. No BeautifulSoup parsing loops. No Selenium WebDriver setup. Just a declarative schema that tells Browserbeam what data you want, and it comes back as clean JSON.
We'll start with the basics of schema-based extraction, build a complete Python web scraping project step by step, handle pagination and infinite scroll, and cover the patterns that work for e-commerce, job boards, and news sites. Every code example runs as-is. Copy it, change the URL, change the schema, and you have a scraper for your use case.
This guide is for developers who are tired of writing fragile CSS selector chains and XPath expressions that break whenever a site changes its markup. If you've ever spent more time maintaining a scraper than building one, structured extraction will change how you think about web scraping.
In this guide, you'll build:
- A declarative extraction schema that pulls structured JSON from any page
- A complete Python web scraping project with navigation, extraction, and CSV export
- Pagination and infinite scroll handlers for multi-page data collection
- Real-world scrapers for e-commerce products, job listings, and news articles
- A migration path from BeautifulSoup and Scrapy to Browserbeam
- Performance-optimized scrapers for web scraping at scale
TL;DR: Structured web scraping replaces manual HTML parsing with declarative schemas. Define what you want ("title": "h1 >> text", "price": ".price >> text") and Browserbeam extracts it as typed JSON. Works on JavaScript-rendered pages, handles cookie banners and stability detection automatically, and scales from a single page to thousands of URLs with parallel sessions.
Why Structured Scraping Matters in 2026
Traditional Python web scraping workflows follow the same pattern: fetch HTML, parse it with BeautifulSoup or lxml, write CSS selectors or XPath expressions, handle edge cases when elements are missing, and hope the site doesn't change its markup next week.
That workflow has three problems:
- It's fragile. CSS selectors break when a site redesigns, changes class names, or wraps content in a new div. A selector like .product-list > .item > .price-container > span.price can break from a single markup change.
- It can't handle JavaScript. Static HTML parsers like BeautifulSoup and Scrapy only see the initial HTML response. They miss content rendered by React, Vue, Angular, or any client-side framework. For dynamic pages, you need a real browser.
- It's verbose. Extracting a product listing with name, price, URL, and image from 50 items means writing loops, null checks, and string cleaning for each field. That's 30-50 lines of parsing code for every data type.
Structured extraction flips this model. Instead of writing parsing code, you define a schema:
{
"products": [{
"_parent": ".product-card",
"name": "h3 >> text",
"price": ".price >> text",
"url": "a >> href",
"image": "img >> src"
}]
}
Browserbeam runs a real browser, renders JavaScript, waits for the page to stabilize, and returns structured JSON matching your schema. One API call replaces the entire fetch-parse-extract cycle.
| Approach | JavaScript Support | Lines of Code | Breaks When Site Changes |
|---|---|---|---|
| BeautifulSoup + Requests | No | 30-50 per data type | Frequently |
| Scrapy | No (without Splash) | 40-60 per spider | Frequently |
| Selenium + manual parsing | Yes | 50-80 per scraper | Often |
| Browserbeam structured extract | Yes | 5-10 per schema | Rarely (attribute-based) |
The >> text and >> href selectors target element attributes, not deeply nested CSS paths. When a site wraps its price in a new <div>, the >> text selector still works because it reads the text content of whatever element matches .price. That difference is what makes attribute-based extraction hold up when a site's markup changes.
Defining a Schema for Extraction
The extraction schema is the core of structured web scraping with Browserbeam. Every schema maps field names to CSS selector + attribute pairs. Let's walk through the three schema types, from simple to complex.
Flat Schemas
A flat schema extracts individual fields from the page. Each key is your field name. Each value is a "selector >> attribute" string.
curl -X POST https://api.browserbeam.com/v1/sessions/$SESSION_ID/act \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"steps": [{
"action": "extract",
"schema": {
"title": "h1 >> text",
"description": "meta[name=description] >> content",
"canonical_url": "link[rel=canonical] >> href"
}
}]
}'
Python:
result = session.extract(
title="h1 >> text",
description="meta[name=description] >> content",
canonical_url="link[rel=canonical] >> href"
)
print(result.extraction)
# {"title": "Example Product", "description": "...", "canonical_url": "https://..."}
TypeScript:
const result = await session.extract({
title: "h1 >> text",
description: "meta[name=description] >> content",
canonical_url: "link[rel=canonical] >> href"
});
console.log(result.extraction);
Ruby:
result = session.extract(
title: "h1 >> text",
description: "meta[name=description] >> content",
canonical_url: "link[rel=canonical] >> href"
)
puts result.extraction
The >> operator separates the CSS selector from the attribute you want. Common attributes:
| Attribute | What It Extracts | Example |
|---|---|---|
| >> text | Visible text content of the element | "h1 >> text" returns "Product Name" |
| >> href | Link URL from <a> tags | "a.nav-link >> href" returns "/about" |
| >> src | Image source URL | "img.hero >> src" returns "https://..." |
| >> content | Content attribute from <meta> tags | "meta[name=description] >> content" |
| >> data-* | Any data attribute | ".product >> data-sku" returns "SKU-123" |
| >> value | Input field value | "input[name=email] >> value" |
Nested Object Schemas
When you need to group related fields, nest them inside an object:
result = session.extract(
product={
"name": "h1.product-title >> text",
"price": ".price-current >> text",
"seller": {
"name": ".seller-info .name >> text",
"rating": ".seller-info .rating >> text",
"url": ".seller-info a >> href"
}
}
)
# result.extraction["product"]["seller"]["name"] => "TechStore"
Nested schemas keep your data organized without post-processing. The JSON output mirrors the schema structure exactly.
Array/List Schemas
Most web scraping targets repeating elements: product listings, search results, table rows. The _parent key tells Browserbeam which container to repeat over:
result = session.extract(
products=[{
"_parent": ".product-card",
"_limit": 5,
"name": "h3 >> text",
"price": ".price >> text",
"url": "a >> href",
"image": "img >> src"
}]
)
for product in result.extraction["products"]:
print(f"{product['name']}: {product['price']}")
The _parent selector finds every .product-card element on the page. For each match, Browserbeam extracts the nested fields relative to that container. The _limit key caps the results, which is useful for testing your schema on a small sample before running the full extraction.
Pro Tip: Start every new extraction with _limit: 3. Verify the output matches what you expect, then remove the limit to extract everything.
Tutorial: Scraping a Real-World Site
Let's build a complete Python web scraping project from scratch. We'll scrape a product listing page, extract structured data, handle multiple pages, and export to JSON and CSV.
Setting Up the Project
First, let's install the SDK and set up our project:
pip install browserbeam
export BROWSERBEAM_API_KEY="your_api_key_here"
Create a new file called scraper.py:
from browserbeam import Browserbeam
import json
import csv
client = Browserbeam()
That's the entire setup. No browser binary to download. No WebDriver to configure. The Browserbeam SDK connects to cloud browsers through the web scraping API, so your local environment stays clean.
Navigating and Observing the Page
Let's create a session and see what's on the page:
session = client.sessions.create(
url="https://books.toscrape.com",
auto_dismiss_blockers=True
)
print(f"Title: {session.page.title}")
print(f"Stable: {session.page.stable}")
print(f"Elements: {len(session.page.interactive_elements)}")
# Check the page map to understand the structure
if session.page.map:
for section in session.page.map:
print(f" [{section['tag']}] {section.get('hint', '')}")
The auto_dismiss_blockers option handles cookie consent banners automatically. The page.stable flag tells us the page has finished rendering. No sleep(5) needed.
The page map is especially useful when you're scraping a site for the first time. It shows you the page sections (nav, main, aside, footer) with hints about their content, so you know where to target your extraction schema.
Extracting Data with a Schema
Now let's extract the book listings:
result = session.extract(
books=[{
"_parent": "article.product_pod",
"title": "h3 a >> text",
"price": ".price_color >> text",
"availability": ".availability >> text",
"url": "h3 a >> href",
"rating": "p.star-rating >> class"
}]
)
books = result.extraction["books"]
print(f"Found {len(books)} books")
for book in books[:3]:
print(f" {book['title']}: {book['price']}")
One method call. No loops. No null checks. No BeautifulSoup find_all chains. Browserbeam finds every article.product_pod, extracts the specified fields from each one, and returns a list of objects.
Exporting to JSON or CSV
With structured data in hand, exporting is straightforward:
# Export to JSON
with open("books.json", "w") as f:
json.dump(books, f, indent=2)
# Export to CSV
if books:
with open("books.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=books[0].keys())
writer.writeheader()
writer.writerows(books)
session.close()
print(f"Exported {len(books)} books to books.json and books.csv")
Because the extraction returns clean dictionaries, Python's built-in json and csv modules handle the export directly. No data cleaning step needed.
Dealing with Pagination and Infinite Scroll
Most real-world web scraping targets span multiple pages. Browserbeam handles both click-based pagination and infinite scroll.
Click-based pagination follows the pattern: extract data, find the "Next" button, click it, repeat.
all_books = []
session = client.sessions.create(url="https://books.toscrape.com")
while True:
result = session.extract(
books=[{
"_parent": "article.product_pod",
"title": "h3 a >> text",
"price": ".price_color >> text"
}]
)
all_books.extend(result.extraction["books"])
try:
session.click(text="next")
session.wait(ms=1000)
except Exception:
break # No more pages
session.close()
print(f"Scraped {len(all_books)} books across multiple pages")
Infinite scroll pages load content as you scroll down. The scroll_collect method handles this automatically:
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")
# Scroll through the entire page, loading lazy content
collected = session.scroll_collect(
max_scrolls=20,
wait_ms=500
)
# Now extract from the fully-loaded page
result = session.extract(
quotes=[{
"_parent": ".quote",
"text": ".text >> text",
"author": ".author >> text",
"tags": ".keywords >> content"
}]
)
session.close()
The scroll_collect method scrolls the page, waits for new content to load, scrolls again, and repeats until no new content appears or it hits the max_scrolls limit. The default max_text_length is 100,000 characters, enough for most long pages.
Common Scraping Patterns and Recipes
These three patterns cover the majority of web scraping use cases. Each one includes a working schema you can adapt to your target site.
E-Commerce Product Listings
E-commerce web scraping is the most common use case. Products have consistent structures: name, price, image, rating, URL.
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(
products=[{
"_parent": "article.product_pod",
"name": "h3 a >> text",
"price": ".price_color >> text",
"rating": "p.star-rating >> class",
"url": "h3 a >> href",
"image": "img >> src",
"in_stock": ".instock.availability >> text"
}]
)
session.close()
This schema maps directly to the HTML structure on Books to Scrape. The p.star-rating >> class selector returns the rating class (e.g. "star-rating Three"), which you can parse into a numeric value. The same pattern works on any product listing page once you identify the right parent container and field selectors.
Job Board Scraping
Job listings follow a predictable structure across most boards. Here's an example using Real Python's Fake Jobs practice site:
session = client.sessions.create(url="https://realpython.github.io/fake-jobs/")
result = session.extract(
jobs=[{
"_parent": ".card",
"_limit": 10,
"title": "h2.title >> text",
"company": "h3.company >> text",
"location": ".location >> text",
"posted": "time >> text",
"url": ".card-footer-item >> href"
}]
)
for job in result.extraction["jobs"]:
print(f"{job['title']} at {job['company']} - {job['location']}")
session.close()
News and Article Collection
For news aggregation, we extract article metadata and optionally the full text:
session = client.sessions.create(url="https://news.ycombinator.com")
# Extract story listings from the front page
result = session.extract(
stories=[{
"_parent": ".athing",
"_limit": 10,
"headline": ".titleline > a >> text",
"url": ".titleline > a >> href",
"rank": ".rank >> text"
}]
)
# Optionally visit each story for full text
for story in result.extraction["stories"][:3]:
if story["url"].startswith("http"):
session.goto(url=story["url"])
full = session.extract(
body="article, main >> text"
)
story["full_text"] = full.extraction.get("body", "")
session.close()
Debugging Failed Extractions
Extraction doesn't always work on the first try. Here are the four most common issues and how to fix each one.
Schema Mismatch Errors
The most frequent problem: your CSS selector doesn't match any elements on the page. The extraction returns null or an empty array.
How to debug: Call observe with format="html" to see the actual HTML structure:
result = session.observe(format="html", scope=".product-list")
print(result.page.markdown) # Raw HTML of the scoped section, with actual class names
Compare the class names in the HTML output against your schema selectors. Sites often use dynamic class names (like .css-1a2b3c) that change on every deploy. In those cases, target semantic elements (article, h2, a) instead of classes.
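You can catch generated class names before they end up in a schema by linting your selectors. A rough heuristic sketch — the prefix list covers a few common CSS-in-JS naming schemes and is an assumption, not an exhaustive inventory:

```python
import re

# Selectors like ".css-1a2b3c" are emitted by CSS-in-JS tooling and tend to
# change on every deploy; flag them before building a schema around them.
GENERATED_CLASS = re.compile(r"^\.(?:css|sc|jss|emotion)-[a-z0-9]+$", re.IGNORECASE)

def looks_generated(selector: str) -> bool:
    """True if a selector looks like a build-generated class name."""
    return bool(GENERATED_CLASS.match(selector))

print(looks_generated(".css-1a2b3c"))   # True
print(looks_generated(".price_color"))  # False
```

If a selector trips the check, fall back to semantic elements or stable attributes like data-testid.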
Dynamic Content Not Loaded
If your extraction returns empty results on pages that clearly have content, the issue is usually timing. JavaScript-rendered content takes time to appear in the DOM.
How to fix: The stable: true signal usually handles this, but some single-page apps load content lazily after the page reports as stable. Use wait to hold for the specific content you need:
session = client.sessions.create(url="https://books.toscrape.com")
session.wait(selector="article.product_pod", timeout=10000)
result = session.extract(books=[{"_parent": "article.product_pod", "title": "h3 a >> text"}])
Handling Anti-Scraping Measures
Some sites serve different content to automated browsers. If your extraction returns empty results or a captcha page, check these settings:
- User agent: Set a realistic user agent in session options
- Viewport size: Use a standard viewport (1280x720 is the default)
- Resource blocking: Avoid blocking images or stylesheets if the site checks for them
- Rate limiting: Add delays between requests with session.wait(ms=2000)
For more on this topic, see the securing AI browser agents guide.
session = client.sessions.create(
url="https://books.toscrape.com",
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
auto_dismiss_blockers=True,
viewport_width=1280,
viewport_height=720
)
Using page.map for Troubleshooting
When you can't figure out where content lives on a page, the page map shows you every section and its contents:
session = client.sessions.create(url="https://books.toscrape.com")
# The first observe includes a page map automatically
for section in session.page.map:
print(f"[{section['tag']}] {section.get('selector', '')} - {section.get('hint', '')}")
# If the content you want is in a sidebar or footer, use mode="full"
full_result = session.observe(mode="full")
print(full_result.page.markdown)
The page map reveals content that's hidden in sidebars, footers, or nested iframes. If your extraction targets main content but the data lives in an aside, the page map tells you where to look.
Migrating from BeautifulSoup and Scrapy
If you're currently using BeautifulSoup or Scrapy for Python web scraping projects, migrating to Browserbeam is straightforward. The core change: you replace parsing code with a declarative schema.
BeautifulSoup to Browserbeam
Here's the same extraction task in both tools:
BeautifulSoup (traditional approach):
import requests
from bs4 import BeautifulSoup
response = requests.get("https://books.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")
books = []
for card in soup.select("article.product_pod"):
title = card.select_one("h3 a")
price = card.select_one(".price_color")
link = card.select_one("h3 a")
image = card.select_one("img")
books.append({
"title": title.text.strip() if title else None,
"price": price.text.strip() if price else None,
"url": link["href"] if link else None,
"image": image["src"] if image else None
})
Browserbeam (structured approach):
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(
books=[{
"_parent": "article.product_pod",
"title": "h3 a >> text",
"price": ".price_color >> text",
"url": "h3 a >> href",
"image": "img >> src"
}]
)
books = result.extraction["books"]
session.close()
| BeautifulSoup | Browserbeam |
|---|---|
| requests.get(url) + BeautifulSoup(html) | client.sessions.create(url=url) |
| soup.select("article.product_pod") loop | _parent: "article.product_pod" in schema |
| card.select_one("h3 a").text.strip() | "title": "h3 a >> text" |
| card.select_one("h3 a")["href"] | "url": "h3 a >> href" |
| Manual null checks for every field | Automatic (returns null if missing) |
| No JavaScript support | Full JavaScript rendering |
| ~15 lines of parsing code | 5 lines of schema |
Scrapy to Browserbeam
Scrapy's spider model is more complex but follows the same pattern. Here's how the concepts map:
| Scrapy Concept | Browserbeam Equivalent |
|---|---|
| scrapy.Request(url, callback=self.parse) | session = client.sessions.create(url=url) |
| response.css(".title::text").get() | session.extract(title=".title >> text") |
| response.css(".item").getall() + loop | session.extract(items=[{"_parent": ".item", ...}]) |
| yield scrapy.Request(next_url) | session.goto(url=next_url) |
| Spider middleware + pipelines | Python functions + try/finally |
| scrapy crawl spider_name | python scraper.py |
The biggest difference: Scrapy is a framework. You write spiders, configure pipelines, manage settings files, and run a crawl command. Browserbeam is a library. You write Python scripts. For simple to moderate scraping tasks, the library approach is faster to set up and easier to debug.
When Traditional Parsers Still Make Sense
Browserbeam is not always the right tool. Traditional parsers win in specific scenarios:
| Scenario | Best Tool | Why |
|---|---|---|
| Static HTML, no JavaScript | BeautifulSoup + Requests | Faster, no browser overhead |
| Large-scale crawling with sitemaps | Scrapy | Built-in crawl management, politeness |
| Simple RSS/XML feeds | feedparser or lxml | Purpose-built for structured feeds |
| Pages behind strict anti-bot that blocks API access | Manual Selenium with profile | Full local browser control |
| JavaScript-rendered content | Browserbeam | Cloud browser, structured extraction |
| Pages with forms, logins, interactions | Browserbeam | High-level interaction methods |
| Data extraction for LLM/AI pipelines | Browserbeam | Token-efficient output, agent integration |
If your target is a static HTML page that doesn't run JavaScript and you don't need to interact with it, requests + BeautifulSoup is simpler and faster. For everything else, structured extraction saves time and maintenance effort.
Performance Tips for Large-Scale Scraping
When your scraping workload grows from a handful of pages to thousands, these three techniques keep things fast and cost-effective.
Parallel Sessions for Speed
Browserbeam runs each session on isolated cloud infrastructure, so parallel sessions don't compete for local resources. Use Python's asyncio for concurrent scraping:
import asyncio
from browserbeam import AsyncBrowserbeam
client = AsyncBrowserbeam()
async def scrape_url(url, schema):
session = await client.sessions.create(url=url)
try:
result = await session.extract(**schema)
return {"url": url, "data": result.extraction}
finally:
await session.close()
async def batch_scrape(urls, schema, concurrency=10):
semaphore = asyncio.Semaphore(concurrency)
async def limited(url):
async with semaphore:
return await scrape_url(url, schema)
return await asyncio.gather(*[limited(u) for u in urls])
# Scrape 50 catalogue pages with 10 concurrent sessions
urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
schema = {"title": "h1 >> text", "books": [{"_parent": "article.product_pod", "_limit": 5, "title": "h3 a >> text", "price": ".price_color >> text"}]}
results = asyncio.run(batch_scrape(urls, schema))
The semaphore controls how many sessions run at once. Start with 5-10 concurrent sessions and increase based on your Browserbeam plan limits.
Minimizing Token Usage with Selective Extraction
If you're feeding scraped data to an LLM in an AI workflow, the full page markdown is often larger than necessary. Use extract with a targeted schema to pull only the fields you need:
# Expensive: reading full page markdown (5,000+ tokens)
observed = session.observe()
page_content = observed.page.markdown
# Cheap: extracting exactly what you need (~50 tokens)
result = session.extract(price=".price_color >> text", title="h3 a >> text")
data = result.extraction
For AI agent workflows, extract first, then send the structured data to the model. This cuts token costs by 90% or more.
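The savings are easy to sanity-check with the common rule of thumb of roughly four characters per token. This is a heuristic, not a real tokenizer, and the page size below is illustrative:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

full_page = "word " * 4_000  # stand-in for a full-page markdown dump
extracted = '{"price": "£51.77", "title": "A Light in the Attic"}'

print(estimate_tokens(full_page))  # 5000
print(estimate_tokens(extracted))  # well under 100
```

Two orders of magnitude fewer tokens per page adds up quickly when an agent visits hundreds of URLs.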
Session Reuse and Caching
Opening a new browser session for every URL is expensive. When scraping multiple pages on the same domain, reuse the session:
session = client.sessions.create(url="https://books.toscrape.com")
# Scrape category listing from the sidebar
categories = session.extract(
    cats=[{"_parent": ".side_categories ul li", "_limit": 5, "name": "a >> text", "url": "a >> href"}]
)
# Visit each category in the same session
for cat in categories.extraction["cats"]:
session.goto(url=cat["url"])
books = session.extract(
items=[{"_parent": "article.product_pod", "title": "h3 a >> text", "price": ".price_color >> text"}]
)
print(f"{cat['name']}: {len(books.extraction['items'])} books")
session.close()
One session, multiple pages. Cookies persist, authentication state carries over, and you avoid the overhead of creating and destroying browser instances for every request. For large-scale scraping pipelines, see the infrastructure best practices guide.
Common Scraping Mistakes
Five mistakes that waste time and cause silent data loss. Each one is easy to avoid once you know what to look for.
Scraping Without Waiting for Stability
The most common mistake in Python web scraping scripts: extracting data before the page has finished rendering.
What happens: You get empty results or partial data because JavaScript hasn't populated the DOM yet.
The fix: Browserbeam's stable: true signal handles this automatically for most pages. For single-page apps that load data asynchronously after the initial render, add an explicit wait:
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")
session.wait(selector=".quote", timeout=10000)
result = session.extract(quotes=[{"_parent": ".quote", "text": ".text >> text"}])
Over-Extracting Unnecessary Fields
Pulling every field from every element when you only need three fields wastes bandwidth and makes your code harder to debug.
The fix: Start with the minimum fields you need. Add more only when your downstream process requires them. Use _limit: 3 to test schemas before running full extractions.
Ignoring robots.txt and Rate Limits
Automated web scraping without respecting robots.txt and rate limits can get your IP blocked or create legal issues.
The fix: Check the site's robots.txt before scraping. Add delays between requests (session.wait(ms=2000)). Use Browserbeam's built-in rate limiting, which automatically backs off when the API returns rate limit errors. For responsible web scraping best practices, always identify yourself with a realistic user agent and respect Crawl-delay directives.
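The robots.txt check itself needs nothing beyond Python's standard library. A minimal sketch using urllib.robotparser — the rules string and user agent are placeholders; in practice you'd point RobotFileParser.set_url at the live /robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /admin/\nCrawl-delay: 2\n"

print(is_allowed(rules, "my-scraper/1.0", "https://example.com/admin/panel"))  # False
print(is_allowed(rules, "my-scraper/1.0", "https://example.com/catalogue/"))   # True
```

RobotFileParser also exposes crawl_delay(user_agent), which returns the Crawl-delay value (2 here) and maps directly onto session.wait(ms=2000).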
Hardcoding CSS Selectors
Selectors like .css-1a2b3c or div:nth-child(3) > span.price are guaranteed to break. Dynamic class names change on every build. Position-based selectors break when elements are added or removed.
The fix: Target semantic elements and stable attributes:
# Fragile (will break):
result = session.extract(price=".css-1a2b3c >> text")
# Robust (targets semantic meaning):
result = session.extract(price="[data-testid=price] >> text")
# Also robust (targets element type and role):
result = session.extract(price=".price, [itemprop=price] >> text")
Prefer data-testid, itemprop, aria-label, and semantic class names over generated or deeply nested selectors.
Not Handling Edge Cases (Empty Pages, 404s)
Production scrapers encounter empty pages, 404 errors, captcha challenges, and timeout failures. If your scraper crashes on the first error, you lose all the data collected so far.
The fix: Wrap extractions in try/except and collect results incrementally:
results = []
errors = []
for url in urls:
try:
session = client.sessions.create(url=url)
result = session.extract(title="h1 >> text", price=".price >> text")
results.append({"url": url, **result.extraction})
except Exception as e:
errors.append({"url": url, "error": str(e)})
finally:
try:
session.close()
except Exception:
pass
print(f"Success: {len(results)}, Errors: {len(errors)}")
Frequently Asked Questions
How do I scrape a website with Python?
Install the Browserbeam SDK with pip install browserbeam, create a session with your target URL, and use the extract method with a declarative schema. For example: session.extract(title="h1 >> text", price=".price >> text"). Browserbeam handles JavaScript rendering, cookie banners, and page stability automatically. See the Python SDK guide for a complete walkthrough.
How do I scrape dynamic websites that use JavaScript?
Browserbeam runs a real Chrome browser in the cloud, so it renders JavaScript the same way your browser does. React, Vue, Angular, and other frontend frameworks work out of the box. Create a session with the URL, and the page content in the response includes all JavaScript-rendered content. Use session.wait(selector=".my-element") if the content loads asynchronously.
What is the best web scraping API in 2026?
For structured data extraction, Browserbeam's web scraping API combines a cloud browser with declarative schemas. You define what data you want, and it returns typed JSON. For raw HTML access without structured extraction, tools like Browserless or ScrapingBee are alternatives. For static HTML only, a simple HTTP client with BeautifulSoup works. The right choice depends on whether your target pages use JavaScript and how structured you need the output.
How do I scrape without getting blocked?
Use a realistic user agent, respect robots.txt directives, add delays between requests, and avoid aggressive concurrent scraping. Browserbeam's auto_dismiss_blockers handles cookie banners, and the cloud-hosted browser uses standard browser fingerprints. For sites with stronger anti-bot measures, see the security guide for session isolation and proxy strategies.
Can I use web scraping for machine learning and LLM training data?
Yes, structured web scraping is one of the most efficient ways to collect training data. Browserbeam's extract method returns clean JSON that feeds directly into data pipelines without HTML cleaning steps. For LLM workflows, the structured output is already token-efficient. See our guide on building intelligent web agents for integrating scraped data with AI pipelines.
Is web scraping legal?
Web scraping public data is generally legal, but you must respect terms of service, robots.txt, and data protection regulations like GDPR and CCPA. Avoid scraping personal data without consent, respect rate limits, and don't circumvent access controls. The legality depends on what you scrape, how you use it, and which jurisdiction applies. When in doubt, consult legal counsel.
How does Browserbeam compare to Selenium for web scraping?
Selenium requires installing a browser driver, managing browser binaries, and writing low-level interaction code (find element, send keys, click). Browserbeam runs browsers in the cloud and provides high-level methods like extract, fill, and click with stable element refs. The key difference for scraping: Browserbeam's extract returns structured JSON directly, while Selenium requires you to write parsing code on top of the browser output.
How do I handle pagination when web scraping?
For click-based pagination, extract data from the current page, click the "Next" button using session.click(text="Next"), wait for the new content, and repeat. For infinite scroll pages, use session.scroll_collect() which automatically scrolls and loads lazy content. Both approaches work within a single Browserbeam session, so cookies and state persist across pages.
Start Scraping
You've got the schema syntax, a working project template, and patterns for the three most common scraping use cases. Here's what to do next.
Pick a site you need to scrape. Start with a flat schema for a single page. Test it with _limit: 3. Once the data looks right, remove the limit and add pagination. Then scale it with parallel sessions.
If you're building scrapers for an AI agent workflow, the Python SDK guide covers the full API surface. For production deployments, the scaling guide covers queue architecture and capacity planning.
The API docs have the complete reference for every extraction attribute and session option. Sign up for a free account to get your API key, or install the SDK and start building:
pip install browserbeam # Python
npm install @browserbeam/sdk # TypeScript / Node.js
gem install browserbeam # Ruby
Try it on a real site. Change the schema. Break something. Fix it. That's how web scraping gets good.