Web Scraping in 2026: What Changed and What Still Works (Guide)

April 09, 2026 · 22 min read

The way developers approach web scraping changed in the last two years. Teams that relied on BeautifulSoup and Requests for everything now find that half the web requires a real browser to render. AI-powered scrapers are replacing hand-written CSS selectors. Anti-bot systems got smarter, and the legal picture around scraping got clearer (in some jurisdictions) and murkier (in others).

The old playbook still works for static HTML. But for the growing majority of sites built on React, Vue, and Angular, the rules are different. The teams that adapted early are extracting data faster, more reliably, and at lower cost than the ones still patching Selenium scripts.

This guide covers what changed in web scraping since 2024, which tools still work and why, when you need a real browser, and how to build a modern scraping stack that handles the web as it exists in 2026.

What you'll learn:

  • What shifted in the web scraping world between 2024 and 2026 (and why it matters)
  • Which web scraping tools still work for different use cases
  • How to decide between HTTP-only scraping and browser-based scraping
  • How to build a modern scraping stack with structured data extraction
  • A working Browserbeam example that scrapes a real site in four languages
  • Legal and ethical boundaries every scraper should respect
  • Performance patterns for scaling from 10 pages to 10,000

TL;DR: Web scraping in 2026 splits into two worlds: static HTML (where BeautifulSoup and Requests still dominate) and JavaScript-rendered content (where you need a real browser). Cloud browser APIs like Browserbeam handle the browser infrastructure so you focus on data extraction instead of Chromium crashes. The biggest shift is AI-powered scraping, where LLMs replace hand-written selectors with schema-based extraction.


The Web Scraping World in 2026

Web scraping is an $8.3 billion industry in 2026, according to Grand View Research. The demand keeps growing because the web keeps growing. But the techniques that worked two years ago don't always work today.

What Changed Since 2024

Four things shifted:

  1. JavaScript-first web apps became the default. Over 80% of the top 10,000 websites now rely on client-side JavaScript rendering, according to HTTP Archive data. Static HTML pages are the minority. If your scraper only makes HTTP requests and parses HTML, it misses most modern sites.

  2. AI-powered extraction replaced manual selectors. Instead of writing CSS selectors by hand and updating them every time a site redesigns, teams now pass a schema to an LLM or extraction API and get structured JSON back. The selector maintenance burden dropped to near zero for teams that made the switch.

  3. Anti-bot systems got significantly better. Cloudflare, DataDome, and PerimeterX (now HUMAN Security) deploy behavioral analysis, TLS fingerprinting, and browser environment checks that catch most basic automation. Simple requests with a spoofed User-Agent stopped working on protected sites somewhere around mid-2025.

  4. Cloud browser APIs matured. Running headless browsers locally or on your own servers is no longer the only option. Managed services handle browser lifecycle, crash recovery, and scaling. You send API calls, you get structured data back.

The Rise of AI Web Scraping

AI web scraping is the biggest shift in how teams extract data. Instead of:

# The old way: brittle CSS selectors
titles = soup.select("div.product-card > h2.title > a")
prices = soup.select("div.product-card > span.price")

Teams now describe what they want:

# The new way: schema-based extraction
result = session.extract(
    products=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text"
    }]
)

The second approach survives site redesigns. When a site changes its class names from product-card to product-item, hand-written selectors break. Schema-based extraction adapts because it targets the content structure, not the CSS implementation.

This matters at scale. Teams maintaining scrapers for 100+ sites used to spend 30-40% of engineering time on selector maintenance. With AI-powered extraction, that drops to under 5%.

New Anti-Bot Challenges

Anti-bot detection in 2026 checks for signals that are hard to fake:

Detection Method      | What It Checks                              | Difficulty to Bypass
----------------------|---------------------------------------------|-------------------------------------
TLS fingerprinting    | Cipher suites, extensions, protocol version | Hard (requires custom TLS stack)
Browser environment   | Navigator properties, WebGL, Canvas         | Medium (requires real browser)
Behavioral analysis   | Mouse movement, scroll patterns, timing     | Hard (requires realistic automation)
IP reputation         | Datacenter IP ranges, request patterns      | Medium (requires residential proxies)
JavaScript challenges | Dynamic tokens, obfuscated scripts          | Medium-Hard (requires JS execution)

The key takeaway: if a site uses modern anti-bot protection, you need a real browser, not an HTTP client pretending to be one. More on this in the "When You Need a Real Browser" section below.


Tools That Still Work (and Why)

Not every scraping job needs a browser. The right tool depends on the target site and the data you need.

BeautifulSoup and Requests

For static HTML pages, BeautifulSoup with Requests is still the fastest and simplest option. It's lightweight, well-documented, and handles 90% of static sites.

Best for: Government data portals, documentation sites, wikis, blogs without JavaScript frameworks, RSS/XML feeds, and any site where the HTML source contains the data you need.

Limitation: Zero JavaScript execution. If the data loads after the initial HTML response (via AJAX, React rendering, or dynamic imports), BeautifulSoup will never see it.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com")
soup = BeautifulSoup(response.text, "html.parser")

for book in soup.select("article.product_pod"):
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    print(f"{title}: {price}")

This still works perfectly for books.toscrape.com because the data is in the HTML source. No JavaScript required.

Selenium, Playwright, Puppeteer

When you need a real browser, these three are the established options. Each runs a full Chromium (or Firefox/WebKit) instance and lets you control it programmatically.

Selenium has the longest track record and the widest language support (Python, Java, C#, Ruby, JavaScript). It's the default choice for QA teams. For scraping, it works but carries overhead: WebDriver protocol latency, clunky API for data extraction, and limited built-in wait strategies.

Playwright (by Microsoft) is the modern choice for new projects. It supports Chromium, Firefox, and WebKit. It has better auto-waiting, network interception, and a cleaner API than Selenium. Most teams starting fresh in 2026 pick Playwright over Selenium.

Puppeteer is JavaScript-only and primarily targets Chrome. It's lighter than Playwright but less versatile. Google maintains it, and it integrates tightly with the Chrome DevTools Protocol (CDP). If you're building in Node.js and only need Chrome, Puppeteer is a solid option.

The catch with all three: you manage the browser infrastructure. Chromium uses 200-500MB of RAM per instance. At 50 concurrent sessions, that's 10-25GB of RAM for browsers alone. Add crash recovery, memory leak management, and browser version updates, and you're spending engineering time on infrastructure instead of scraping logic. For scaling patterns and how to avoid this trap, see our scaling web automation guide.

Cloud Browser APIs

Cloud browser APIs handle the browser infrastructure for you. You send HTTP requests, the API runs a browser in the cloud, and you get structured data back. No Chromium to install, no crashes to recover from, no memory leaks to debug.

Browserbeam is one such API. It returns structured markdown, element refs for interaction, and supports schema-based extraction, all through a REST API. Other options include Browserless, Browserbase, and ScrapingBee.

The tradeoff: you pay per session instead of managing your own servers. For most teams, the per-session cost is lower than the engineering time saved. For a detailed comparison, see our cloud browser APIs comparison.


When You Need a Real Browser

The decision between HTTP-only scraping and browser-based scraping comes down to one question: is the data in the HTML source?

JavaScript-Rendered Content

Open the target URL in your browser. Right-click, View Source. If the data you need is in the HTML, use Requests + BeautifulSoup. If the data only appears after JavaScript runs (the HTML source shows empty divs or loading spinners), you need a browser.
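That manual check can be scripted. A minimal sketch using only the standard library (the function names and marker string are illustrative, not part of any tool's API): fetch the raw, unrendered HTML and look for a piece of data you expect to find.

```python
from urllib.request import urlopen

def data_in_source(html: str, marker: str) -> bool:
    # The raw HTML is exactly what an HTTP client sees: no JavaScript has run.
    return marker in html

def needs_browser(url: str, marker: str) -> bool:
    # Fetch the unrendered source and check whether the expected data is there.
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return not data_in_source(html, marker)
```

If the marker is missing from the raw source but visible in your browser, the data is rendered client-side and you need a browser tool.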

Scenario                         | HTML Source  | After JS     | Tool Needed
---------------------------------|--------------|--------------|----------------------------------
Static blog                      | Data present | Same         | HTTP client (Requests, httpx)
Server-rendered React (SSR)      | Data present | Enhanced     | HTTP client (data is in source)
Client-rendered React (CSR)      | Empty shell  | Data loads   | Browser (Playwright, Browserbeam)
Single Page App (SPA)            | Minimal HTML | Full app     | Browser
Static file downloads (CSV, PDF) | Direct file  | N/A          | HTTP client
API with CORS restrictions       | N/A          | Data via XHR | HTTP client (call API directly)

Pro tip: before using a browser for a JavaScript-rendered site, check the Network tab in DevTools. The data often comes from an API endpoint that you can call directly with an HTTP client. This is faster and cheaper than rendering the full page.

Authentication Flows

Sites that require login present a challenge for HTTP-only scraping. You can sometimes replicate the login by posting credentials to the auth endpoint and managing cookies manually. But OAuth flows, CAPTCHA-protected logins, and multi-factor authentication require a real browser.

With Browserbeam, authentication flows become straightforward:

curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/login",
    "steps": [
      {"fill": {"label": "Username", "value": "testuser"}},
      {"fill": {"label": "Password", "value": "testpass"}},
      {"click": {"text": "Login"}}
    ]
  }'

The session handles the login flow, including any JavaScript form validation, and the session stays authenticated for subsequent requests.

Dynamic Pagination and Infinite Scroll

Infinite scroll pages and "Load More" buttons are invisible to HTTP clients. The content only appears after user interaction triggers JavaScript to fetch and render new items.

Browserbeam's scroll_collect method handles this automatically. It scrolls the page, waits for new content to load, and collects the full page content:

from browserbeam import Browserbeam

client = Browserbeam(api_key="YOUR_API_KEY")
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")
session.scroll_collect(max_scrolls=10, wait_ms=1000)
print(session.page.markdown.content)

No manual scroll loops. No guessing when to stop. The method handles the timing and content detection for you. For a deeper look at handling pagination patterns, check our structured web scraping guide.


Building a Modern Web Scraping Stack

A production scraping stack in 2026 typically has three layers: an HTTP layer for static content, a browser layer for JavaScript-rendered content, and a structuring layer that turns raw HTML or markdown into clean JSON.

Task Queue (Redis / SQS)
↓ distribute URLs
Static sites HTTP Client (httpx)
BeautifulSoup
JS-rendered sites Browser API (Browserbeam)
Schema Extraction
↓ structured output
Structured JSON → Storage

Choosing Your HTTP Layer

For static scraping in Python, you have two main options:

Library  | Speed | Async Support             | Best For
---------|-------|---------------------------|---------------------------------------
requests | Good  | No (use requests-futures) | Simple scripts, quick prototypes
httpx    | Good  | Yes (native async/await)  | Production scrapers, high concurrency

For new projects, httpx is the better choice. It supports async natively, has a nearly identical API to requests, and works well with asyncio for concurrent scraping.

import asyncio

import httpx
from bs4 import BeautifulSoup

async def scrape_page(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return [h2.text for h2 in soup.select("h2")]

async def main():
    async with httpx.AsyncClient() as client:
        urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 6)]
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return results

asyncio.run(main())

Adding Browser Rendering

When you hit a JavaScript-rendered page, route it to your browser layer. The cleanest approach is a routing function that checks the target site and picks the right tool:

from urllib.parse import urlparse

from browserbeam import Browserbeam
import httpx

BROWSER_REQUIRED = {"app.example.com", "dashboard.example.com"}

def scrape_url(url, bb_client, http_client):
    domain = urlparse(url).netloc

    if domain in BROWSER_REQUIRED:
        session = bb_client.sessions.create(url=url)
        data = session.page.markdown.content
        session.close()
        return data
    else:
        response = http_client.get(url)
        return response.text

Over time, you'll build a mapping of which domains need a browser. Start with HTTP-only and add domains to the browser set as you discover they need JavaScript rendering.

Structuring Extracted Data

Raw HTML or markdown is the intermediate format. The end goal is structured JSON that your application can use. Two approaches work well:

Schema-based extraction (with Browserbeam): define the fields you want, and the API returns structured JSON.

result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text",
        "stock": ".instock.availability >> text",
        "url": "h3 a >> href"
    }]
)
for book in result.extraction["books"]:
    print(f"{book['title']}: {book['price']}")

BeautifulSoup parsing (for HTTP-scraped content): parse the HTML yourself and build the data structure.
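A minimal sketch of that route, reusing the books.toscrape.com markup from earlier in this guide (the HTML snippet here is a stand-in for a real response body):

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="catalogue/a-light-in-the-attic_1000/index.html">A Light...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

def parse_books(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        books.append({
            "title": link["title"],             # full title lives in the attribute
            "price": article.select_one(".price_color").text,
            "url": link["href"],
        })
    return books

books = parse_books(html)
```

The output shape matches what the schema-based call above returns, so downstream code doesn't care which layer produced a record.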

Both approaches produce the same output: clean JSON. The schema-based approach is less code and more resilient to site changes. The BeautifulSoup approach gives you more control and works without a browser API.


Web Scraping with Browserbeam: A Practical Example

Here's a working example that scrapes book data from books.toscrape.com, a practice site designed for web scraping.

Setting Up the Client

First, sign up for a Browserbeam account and grab your API key. Then install the SDK for your language:

# No SDK needed for cURL. Just use your API key:
curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://books.toscrape.com"}'

That's it. One API call creates a browser session, navigates to the URL, waits for the page to stabilize, and returns the page state. No browser to install. No WebDriver to configure.

Now let's extract structured book data. Browserbeam's extract method takes a schema that describes the data you want:

curl -X POST https://api.browserbeam.com/v1/sessions/SESSION_ID/act \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [{
      "extract": {
        "books": [{
          "_parent": "article.product_pod",
          "title": "h3 a >> text",
          "price": ".price_color >> text",
          "stock": ".instock.availability >> text"
        }]
      }
    }]
  }'

The _parent selector tells Browserbeam to find all matching elements and extract the specified fields from each one. The result is a clean JSON array of book objects. No HTML parsing. No CSS selector maintenance.

Handling Common Edge Cases

Multi-page scraping: Navigate to the next page and extract again.

all_books = []

for page_num in range(1, 4):
    if page_num > 1:
        session.goto(f"https://books.toscrape.com/catalogue/page-{page_num}.html")

    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    )
    all_books.extend(result.extraction["books"])

session.close()
print(f"Scraped {len(all_books)} books across 3 pages")

Cookie banners and popups: Browserbeam's auto_dismiss_blockers parameter handles these automatically:

session = client.sessions.create(
    url="https://books.toscrape.com",
    auto_dismiss_blockers=True
)

Timeouts: Set an explicit timeout to avoid sessions hanging indefinitely:

session = client.sessions.create(
    url="https://books.toscrape.com",
    timeout=60  # Session auto-closes after 60 seconds
)

Always close your sessions when done. Open sessions consume resources and keep billing active. If your code might crash before reaching session.close(), the timeout parameter is your safety net.
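One way to make that guarantee explicit in code is contextlib.closing (or a plain try/finally), which runs close() even if extraction raises. Sketched here with a stand-in session object, since the cleanup pattern, not any particular SDK, is the point:

```python
from contextlib import closing

class StubSession:
    """Stand-in for a real browser session; records whether close() ran."""
    def __init__(self):
        self.closed = False

    def extract(self):
        raise RuntimeError("simulated mid-scrape failure")

    def close(self):
        self.closed = True

session = StubSession()
try:
    with closing(session):      # guarantees close() on exit, even on error
        session.extract()       # raises, but close() still runs
except RuntimeError:
    pass
```

contextlib.closing works with any object exposing a close() method, so the same pattern applies to real session objects.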

For a more detailed walkthrough of building a full scraping agent, see the Python web scraping agent tutorial.


Legal and Ethical Boundaries

Web scraping legality is one of the most searched topics in this space, and for good reason. The legal picture is complicated, varies by jurisdiction, and is still evolving.

robots.txt and Terms of Service

The robots.txt file at a site's root (e.g., https://example.com/robots.txt) tells crawlers which paths are allowed and which are restricted. It's not legally binding in most jurisdictions, but ignoring it is both disrespectful and risky.

Check robots.txt before scraping any site. If a path is disallowed, don't scrape it. If the site's Terms of Service explicitly prohibit scraping, reconsider whether the data is worth the legal risk.

In the US, the 2022 hiQ Labs v. LinkedIn ruling established that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). But this applies specifically to public data. Scraping behind a login wall or circumventing technical access controls is a different legal question.

In the EU, the GDPR adds constraints around personal data. Even if the data is publicly visible, collecting and storing personal information (names, email addresses, phone numbers) requires a legal basis under GDPR.

Rate Limiting and Respectful Scraping

Respectful scraping means not overwhelming the target server. A good rule of thumb:

  • 1-2 requests per second for small sites
  • 5-10 requests per second for large sites with CDN infrastructure
  • Always check Crawl-delay in robots.txt if specified
  • Back off on 429 (Too Many Requests) responses with exponential delays

import time

def respectful_scrape(urls, delay=1.0):
    # scrape_single is whatever per-URL fetch-and-parse function you use
    results = []
    for url in urls:
        result = scrape_single(url)
        results.append(result)
        time.sleep(delay)  # Respect the server
    return results

With Browserbeam, rate limiting is handled at the API level. Your sessions run on Browserbeam's infrastructure, so you're not sending direct traffic to the target site from your IP. But you should still pace your requests to be respectful of the target site's resources.

GDPR and Data Privacy

If you scrape personal data from EU-accessible websites, GDPR applies regardless of where your company is based. The key rules:

  1. Have a legal basis for collecting the data (legitimate interest, consent, or another GDPR basis)
  2. Minimize data collection to only what you need
  3. Don't store personal data longer than necessary
  4. Honor data deletion requests if individuals contact you
  5. Document your data processing activities
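Rule 2 (data minimization) is easy to enforce mechanically: strip any fields you didn't explicitly ask for before a record hits storage. A sketch, where the allowlisted field names are examples from this guide, not a complete personal-data taxonomy:

```python
ALLOWED_FIELDS = {"title", "price", "stock"}  # only what the pipeline needs

def minimize(record: dict) -> dict:
    # Drop everything not on the allowlist, including any personal data
    # (names, emails, phone numbers) the extraction happened to pick up.
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {"title": "A Light in the Attic", "price": "£51.77",
       "seller_email": "someone@example.com"}
clean = minimize(raw)
```

An allowlist is safer than a blocklist here: new fields added by a site redesign are dropped by default instead of silently collected.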

For web scraping projects, the safest approach is to avoid collecting personal data altogether. Scrape product prices, job listings, news headlines, and public statistics instead. If you must collect personal data, consult a privacy lawyer before building the scraper.


Performance and Scaling Patterns

Once your scraper works for 10 pages, the next question is: how do I make it work for 10,000?

Async Scraping with Python

For HTTP-only scraping, Python's asyncio with httpx gives you concurrency without threads:

import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_page(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").text,
        })
    return books

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    async with httpx.AsyncClient() as client:
        tasks = [scrape_page(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    all_books = [book for page in results for book in page]
    print(f"Scraped {len(all_books)} books from {len(urls)} pages")

asyncio.run(main())

This scrapes 50 pages concurrently. The bottleneck is the target server's response time, not your code.

For browser-based scraping with Browserbeam, use the async client:

from browserbeam import AsyncBrowserbeam
import asyncio

async def scrape_with_browser(client, url):
    session = await client.sessions.create(url=url)
    result = await session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    )
    await session.close()
    return result.extraction["books"]

async def main():
    client = AsyncBrowserbeam(api_key="YOUR_API_KEY")
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 11)
    ]
    tasks = [scrape_with_browser(client, url) for url in urls]
    results = await asyncio.gather(*tasks)
    await client.close()
    all_books = [book for page in results for book in page]
    print(f"Scraped {len(all_books)} books")

asyncio.run(main())

Parallel Browser Sessions

Browserbeam sessions are independent and isolated. You can run as many in parallel as your plan allows. Each session gets its own browser context with separate cookies, storage, and memory.

The pattern for parallel browser sessions:

  1. Create a pool of sessions (e.g., 10 concurrent)
  2. Assign URLs from a queue to available sessions
  3. Extract data and close sessions as they complete
  4. Create new sessions for the next batch
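The steps above boil down to bounded concurrency. A sketch using asyncio.Semaphore, with a stand-in coroutine where the real session create/extract/close calls would go:

```python
import asyncio

async def scrape_one(url: str) -> str:
    # Stand-in for: create session, extract, close.
    await asyncio.sleep(0)  # simulate I/O
    return f"data from {url}"

async def scrape_all(urls: list[str], max_concurrent: int = 10) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:  # at most max_concurrent sessions run at once
            return await scrape_one(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(scrape_all([f"page-{i}" for i in range(25)]))
```

The semaphore is the whole trick: every URL is queued up front, but only ten sessions are live at any moment.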

For production workloads at this scale, use a task queue (Redis, SQS, or RabbitMQ) to manage URL distribution. See our scaling web automation guide for queue architecture patterns.

Caching and Deduplication

Don't scrape the same page twice unless you need fresh data. A simple caching layer saves both time and cost:

import hashlib
import time

cache = {}

def cached_scrape(url, scrape_fn, ttl_hours=24):
    cache_key = hashlib.sha256(url.encode()).hexdigest()

    if cache_key in cache:
        cached_at, data = cache[cache_key]
        age_hours = (time.time() - cached_at) / 3600
        if age_hours < ttl_hours:
            return data

    data = scrape_fn(url)
    cache[cache_key] = (time.time(), data)
    return data

For production, replace the in-memory dict with Redis or a database. Set the TTL based on how often the source data changes: hourly for prices, daily for product catalogs, weekly for documentation.


Web Scraping Tools Comparison (2026)

Choosing the right web scraping tool depends on your use case, scale, and technical requirements. Here's how the major options compare:

Tool                     | Type               | JS Support        | Async          | Best For                         | Pricing
-------------------------|--------------------|-------------------|----------------|----------------------------------|-------------------
BeautifulSoup + Requests | Library            | No                | No (use httpx) | Static HTML parsing              | Free (open source)
Selenium                 | Browser automation | Yes               | Limited        | QA testing, legacy scraping      | Free (open source)
Playwright               | Browser automation | Yes               | Yes            | Modern browser automation        | Free (open source)
Puppeteer                | Browser automation | Yes               | Yes            | Chrome-specific automation       | Free (open source)
Browserbeam              | Cloud browser API  | Yes               | Yes            | AI agents, structured extraction | Usage-based
Scrapy                   | Framework          | No (needs Splash) | Yes            | Large-scale crawling             | Free (open source)

When to Use Each Tool

BeautifulSoup + Requests: You're scraping static HTML, need maximum speed, and don't need JavaScript. Perfect for data pipelines that hit simple, stable pages.

Playwright: You need local browser control with a modern API. Good for complex interactions, testing, and scraping projects where you manage your own infrastructure.

Browserbeam: You need browser rendering without managing Chromium. Good for AI agent workflows, structured data extraction, and teams that want to focus on scraping logic instead of browser infrastructure. See the Python SDK getting started guide to try it.

Scrapy: You're building a large-scale crawl (millions of pages) with custom pipelines, middleware, and deduplication. Scrapy's framework handles the orchestration.

Decision Framework: Picking Your Tool

Ask these four questions:

  1. Does the target site need JavaScript? No = BeautifulSoup. Yes = continue.
  2. Do you want to manage browser infrastructure? Yes = Playwright/Puppeteer. No = Browserbeam.
  3. Do you need structured data extraction? Yes = Browserbeam (schema-based). No = any browser tool works.
  4. Are you building at scale (1000+ pages/day)? Yes = consider Scrapy (for crawling) or Browserbeam (for browser rendering). No = any tool works.
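If you route many targets programmatically, the four questions can be encoded as a small helper (the return values are just labels for this sketch, not an official taxonomy):

```python
def pick_tool(needs_js: bool, manage_infra: bool,
              needs_schema_extraction: bool = False) -> str:
    # Mirrors the four questions above, asked in order.
    if not needs_js:
        return "beautifulsoup"        # static HTML: HTTP client + parser
    if manage_infra:
        return "playwright"           # you run the browsers yourself
    if needs_schema_extraction:
        return "browserbeam-extract"  # schema-based extraction via API
    return "browserbeam"              # managed browser, raw page content

choice = pick_tool(needs_js=True, manage_infra=False,
                   needs_schema_extraction=True)
```

At scale, layer the fourth question on top: route crawl-heavy static work to Scrapy and browser-rendering work to the cloud API.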

Common Mistakes

Across the teams I've worked with, these five mistakes appear again and again. All of them are avoidable.

Scraping Without Checking robots.txt

Always check robots.txt first. It takes 30 seconds and prevents both legal issues and IP bans. Add a robots.txt check to your scraper's initialization:

import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent="*"):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

Ignoring Rate Limits

Sending 100 requests per second to a small site will get your IP banned within minutes. Even large sites will throttle you. Watch for 429 responses and implement exponential backoff:

import time
import random

def backoff_request(url, client, max_retries=5):
    for attempt in range(max_retries):
        response = client.get(url)
        if response.status_code == 429:
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
            continue
        return response
    raise Exception(f"Failed after {max_retries} retries: {url}")

Not Handling Stale Selectors

CSS selectors break when sites update their HTML. A selector that worked yesterday returns empty results today. The fix: validate your extraction results and alert on anomalies.

def validate_extraction(data, expected_fields, min_items=1):
    if not data or len(data) < min_items:
        raise ValueError(f"Expected at least {min_items} items, got {len(data or [])}")
    for item in data:
        for field in expected_fields:
            if field not in item or not item[field]:
                raise ValueError(f"Missing or empty field: {field} in {item}")
    return True

Schema-based extraction (like Browserbeam's extract method) reduces this problem because the schema targets content semantics rather than specific CSS class names. But validation is still important.

Over-Engineering the First Version

Teams spend weeks building retry logic, caching layers, proxy rotation, and distributed queues before they have a single working scraper. Start simple. Get a working scraper for one page. Then add complexity as you need it.

A working scraper that handles 10 pages is more valuable than an over-engineered framework that handles zero.

Running Headless Browsers Locally in Production

Running Chromium on your production servers is a footgun at scale. Memory leaks, zombie processes, crash recovery, browser version management, and security patching all become your problem. At 50+ concurrent sessions, this becomes a full-time ops job.

Cloud browser APIs exist specifically to solve this. Let Browserbeam (or a similar service) manage the browser infrastructure. Your scraper sends API calls and receives structured data. No Chromium processes to babysit.


Frequently Asked Questions

How to scrape a website in 2026?

Start by checking whether the site needs JavaScript. If the data is in the HTML source (View Source in your browser), use Python with Requests and BeautifulSoup. If the data loads via JavaScript, use a browser tool like Playwright or a cloud browser API like Browserbeam. Always check robots.txt before scraping and respect rate limits.

Is web scraping legal?

Web scraping of publicly accessible data is generally legal in the US after the 2022 hiQ Labs v. LinkedIn ruling. In the EU, scraping is allowed but subject to GDPR when personal data is involved. Scraping behind login walls, circumventing access controls, or violating a site's Terms of Service carries higher legal risk. When in doubt, consult a lawyer.

What is the best tool for web scraping?

It depends on your use case. For static HTML, BeautifulSoup with Python is hard to beat. For JavaScript-rendered pages, Playwright gives you full local control. For teams that want structured extraction without managing browsers, Browserbeam handles the infrastructure. See the comparison table above for a detailed breakdown.

Can I scrape JavaScript-heavy sites without a browser?

Sometimes. Check the Network tab in DevTools to see if the data comes from an API endpoint you can call directly. If it does, you can skip the browser entirely and call the API with an HTTP client. If the data is rendered client-side with no accessible API, you need a browser.

How do I avoid getting blocked while web scraping?

Respect robots.txt, keep your request rate at 1-2 per second for small sites, rotate User-Agent strings, and use residential proxies for sensitive targets. Cloud browser APIs like Browserbeam help because sessions run on managed infrastructure with real browser fingerprints, not your application server's IP.

What is the difference between web scraping and web crawling?

Web crawling is discovering and indexing pages (following links across a site). Web scraping is extracting specific data from those pages. A crawler finds URLs. A scraper extracts data from them. Most production systems combine both: crawl to discover, scrape to extract.

Do I need a cloud browser API for web scraping?

Not always. For static sites, HTTP clients work fine. For JavaScript-rendered sites at small scale, local Playwright is sufficient. Cloud browser APIs pay off when you need browser rendering at scale (50+ concurrent sessions), want structured data extraction without managing Chromium, or are building AI agents that need browser access. The cloud browser API comparison covers the tradeoffs in detail.


The Teams That Build Smart Scraping Stacks Win

Web scraping in 2026 is two different disciplines. Static HTML scraping hasn't changed much. Python, Requests, and BeautifulSoup still do the job. The real shift is in JavaScript-rendered content, where you need a real browser, structured extraction, and infrastructure that scales without becoming an ops burden.

The teams getting the best results use a layered approach: HTTP clients for static content, cloud browser APIs for dynamic content, and schema-based extraction for both. They check robots.txt, respect rate limits, and build caching into every pipeline. They don't over-engineer the first version, and they don't run Chromium on their production servers.

Start with the Browserbeam API docs for the full API reference. Build your first scraper with the Python SDK guide. Scale it using the patterns in the scaling web automation guide. Or try the structured web scraping guide for a deep dive on extraction schemas.

What will you scrape first?
