Web Scraping with Python: The Complete Guide (2026)

May 04, 2026 · 18 min read


By the end of this guide, you'll have working Python scrapers that fetch web pages, parse HTML, extract structured data, handle JavaScript-heavy sites, and paginate through hundreds of pages. We'll build each one from scratch, using real websites and real selectors.

Python dominates the web scraping ecosystem, and for good reason. Between Requests, BeautifulSoup, Scrapy, Selenium, Playwright, and cloud browser APIs like Browserbeam, the Python ecosystem covers every scraping scenario you'll run into. The hard part isn't finding tools. It's knowing which tool to use, when, and how to avoid the pitfalls that trip up most scrapers.

This guide covers the full spectrum: from a five-line BeautifulSoup script to production-grade scrapers that rotate proxies, manage cookies, and handle infinite scroll. Every code example runs against real, publicly accessible websites. Copy, paste, run.

What you'll learn in this guide:

  • How to fetch and parse HTML with Requests and BeautifulSoup
  • How to scrape JavaScript-rendered pages with Selenium and Browserbeam
  • How to extract structured data using CSS selectors and extraction schemas
  • How to handle pagination, infinite scroll, and multi-page scraping
  • How to use proxies, manage cookies, and avoid getting blocked
  • Three real-world scraping projects you can build today
  • How to choose the right Python scraping library for your use case

TL;DR: Python web scraping starts with Requests + BeautifulSoup for static HTML pages, but most modern sites need a real browser. Selenium, Playwright, and cloud browser APIs like Browserbeam handle JavaScript rendering, cookie banners, and dynamic content. This guide walks through both approaches with working code, then covers pagination, proxies, cookies, and three real-world projects.


What Is Web Scraping and Why Python?

Web scraping is the process of extracting data from websites programmatically. Instead of copying text by hand, you write a script that fetches a web page, finds the data you need, and saves it in a structured format like JSON or CSV.

The basic workflow has three steps: fetch the page, parse the HTML, and extract the target data. Everything else (pagination, authentication, JavaScript rendering) builds on top of that core loop.

How Web Scraping Works

Every web page is an HTML document. When your browser loads a page, it downloads that HTML, executes any JavaScript, applies CSS, and renders the result. A web scraper does the same thing, minus the rendering.

A simple scraper sends an HTTP GET request to a URL, receives the HTML response, and uses a parser to locate specific elements. CSS selectors or XPath expressions tell the parser which elements contain the data you want. The scraper reads the text or attributes from those elements, then moves to the next page.
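To make that concrete, here's the whole loop in miniature using only the standard library. The HTML string is a stand-in for a fetched page, and the parser class is a toy version of what BeautifulSoup does for you:

```python
from html.parser import HTMLParser

# Stand-in for the HTML a real fetch (e.g. requests.get) would return.
html = '<ul><li class="item">First</li><li class="item">Second</li></ul>'

class ItemParser(HTMLParser):
    """Collects the text of every <li class="item"> element."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items.append(data)

parser = ItemParser()
parser.feed(html)
print(parser.items)  # ['First', 'Second']
```

In practice you'd hand the HTML to a real parser library instead of writing the state machine yourself, but the fetch-parse-extract shape is the same.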

The challenge is that modern websites don't serve all their content in the initial HTML. Many sites fetch data with JavaScript after the first response arrives. For those sites, you need a tool that executes JavaScript, which means a real browser or a browser API.

Why Python Is the Go-To Language for Scraping

Python has one of the deepest scraping toolchains in any language. The combination of Requests (for HTTP), BeautifulSoup (for parsing), and Scrapy (for large-scale crawling) has been the default starting point for scraping work in most teams I've shipped with.

The syntax is readable and concise, which matters when you're writing throwaway scripts to grab data quickly. The library ecosystem covers every step of the pipeline, from HTTP and parsing to crawling, browser automation, and data export. And Python's data tooling (pandas, CSV module, JSON module) lets you clean and analyze scraped data in the same script that fetched it, without context switching to another language.
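As a small illustration (with made-up price strings in the format books.toscrape.com uses), cleaning and aggregating can happen right after extraction, no second language required:

```python
# Illustrative scraped values; real scrapes return strings like these.
raw_prices = ["£51.77", "£53.74", "£50.10"]

# Strip the currency symbol and convert to float for analysis.
prices = [float(p.lstrip("£$")) for p in raw_prices]

average = sum(prices) / len(prices)
print(f"Average price: £{average:.2f}")  # Average price: £51.87
```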

For JavaScript-heavy sites, Python also has strong browser automation options: Selenium, Playwright, and cloud browser APIs.


Python Web Scraping Libraries Compared

Before writing code, you need to pick the right tool. Here's how the five Python web scraping libraries you'll encounter most often in 2026 compare.

| Library | Type | JavaScript Support | Learning Curve | Best For |
| --- | --- | --- | --- | --- |
| Requests + BeautifulSoup | HTTP client + parser | No | Low | Static HTML pages, quick scripts |
| Scrapy | Full framework | No (without plugins) | Medium | Large-scale crawling, pipelines |
| Selenium | Browser automation | Yes | Medium | Legacy projects, testing + scraping |
| Playwright | Browser automation | Yes | Medium | Modern browser automation, testing |
| Browserbeam | Cloud browser API | Yes | Low | JavaScript sites, no infrastructure to manage |

Requests + BeautifulSoup

The classic combination. Requests handles HTTP, BeautifulSoup handles HTML parsing. Together they cover most scraping tasks where the data is in the initial HTML response: static pages, server-rendered content, and anything that doesn't require JavaScript execution.

Strengths: Fast, lightweight, zero setup. You can scrape a page in five lines of code. No browser binary to install.

Weaknesses: Can't handle JavaScript-rendered content. No built-in session management for complex workflows. You handle cookies, headers, and retries manually.

Scrapy

Scrapy is a full-featured web crawling framework. It handles request scheduling, rate limiting, data pipelines, and export formats out of the box. For large crawling jobs (thousands of pages), Scrapy's async architecture lets it fetch many URLs in parallel, while a sequential Requests loop blocks on each call.

Strengths: Built-in concurrency, middleware system, data pipelines, and export to JSON/CSV/databases. Production-ready for large crawls.

Weaknesses: Steeper learning curve. Overkill for simple one-off scripts. No JavaScript support without adding Splash or Playwright integration.

Selenium

Selenium launched as a browser testing tool and became the default Python scraping tool for JavaScript-heavy sites. It controls a real browser (Chrome, Firefox) through WebDriver.

Strengths: Handles any JavaScript-rendered page. Large community, extensive documentation. Works with Chrome, Firefox, Edge.

Weaknesses: Slow compared to HTTP-based scraping. Requires a browser binary and WebDriver on every machine. Brittle selectors. Resource-heavy for large-scale scraping.

Playwright

Playwright is the modern alternative to Selenium, built by Microsoft. It controls Chromium, Firefox, and WebKit with a cleaner API, auto-waiting, and better reliability.

Strengths: Faster than Selenium, auto-waits for elements, built-in screenshot and PDF support. Better API design. Supports headed and headless modes.

Weaknesses: Still requires a local browser binary. Resource-heavy at scale. You manage the browser lifecycle yourself.

Browserbeam

Browserbeam is a cloud browser API. Instead of running a browser locally, you send HTTP requests to Browserbeam's API, and it returns structured page data. No Chromium to install, no WebDriver to manage.

Strengths: No local browser dependency. Returns structured data (markdown, element refs, extraction schemas) instead of raw HTML. Built-in stability detection, cookie banner dismissal, and proxy support. Python SDK wraps the API in a clean interface. Free trial available.

Weaknesses: Requires an API key. Not free for high-volume use. Newer product with a smaller community than Selenium or Scrapy.


Setting Up Your Python Scraping Environment

Let's set up a clean environment with the libraries we'll use throughout this guide.

Installing Core Libraries

Start with a virtual environment to keep dependencies isolated:

python -m venv scraper-env
source scraper-env/bin/activate  # macOS/Linux
# scraper-env\Scripts\activate   # Windows

Install the core libraries:

pip install requests beautifulsoup4 lxml browserbeam

That gives you Requests (HTTP client), BeautifulSoup (HTML parser), lxml (fast parser backend), and the Browserbeam Python SDK. If you want Selenium or Playwright too:

pip install selenium
# or
pip install playwright && playwright install chromium

Project Structure for Scraping Scripts

For anything beyond a throwaway script, organize your code:

my-scraper/
├── scrape.py          # main scraping script
├── exporters.py       # JSON/CSV export functions
├── config.py          # URLs, selectors, settings
├── requirements.txt   # pinned dependencies
└── data/              # output directory

Keep your selectors and target URLs in a config file. When a website changes its layout (and it will), you update one file instead of hunting through code.
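A minimal config.py for the books.toscrape.com examples later in this guide might look like this; the constant names are a suggested convention, not a requirement:

```python
# config.py — keep URLs and selectors in one place so a layout
# change on the target site means a one-file edit.
BASE_URL = "https://books.toscrape.com"

SELECTORS = {
    "book": "article.product_pod",   # container for each item
    "title": "h3 a",                 # title lives in the link's title attr
    "price": ".price_color",         # element text holds the price
}

REQUEST_DELAY_SECONDS = 1.0          # polite delay between pages
```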


Your First Python Web Scraper: Step by Step

Let's build a scraper that extracts book titles and prices from books.toscrape.com, a site built specifically for scraping practice.

Step 1: Fetch a Web Page

import requests

url = "https://books.toscrape.com"
response = requests.get(url)

print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} characters")

requests.get() sends an HTTP GET request and returns the response. A 200 status means success. The .text attribute contains the raw HTML.

If you get a 403 (forbidden) or 429 (too many requests), the site is blocking or rate-limiting your request. We'll cover how to handle that in the advanced patterns section.

Step 2: Parse the HTML

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")
print(soup.title.text)  # "All products | Books to Scrape - Sandbox"

BeautifulSoup takes the raw HTML string and builds a parse tree you can query. The "lxml" parser is faster than the default html.parser. Both work, but lxml handles malformed HTML more gracefully.

Step 3: Extract Data with CSS Selectors

Now let's pull out the book data. Each book on the page is wrapped in an article.product_pod element:

books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one(".price_color").text
    in_stock = article.select_one(".instock.availability")
    stock_text = in_stock.text.strip() if in_stock else "Unknown"

    books.append({
        "title": title,
        "price": price,
        "availability": stock_text
    })

print(f"Found {len(books)} books")
for book in books[:3]:
    print(f"  {book['title']}: {book['price']}")

soup.select() uses CSS selectors, the same syntax you'd use in browser DevTools. select_one() returns the first match. The ["title"] syntax reads an HTML attribute. .text reads the element's text content.

Pro tip: Open the target page in your browser, right-click an element, and choose "Inspect" to find the right selector. Look for unique class names or IDs that identify the data you want.

Step 4: Export to JSON or CSV

import json

with open("data/books.json", "w") as f:
    json.dump(books, f, indent=2)

print(f"Saved {len(books)} books to data/books.json")

For CSV output:

import csv

with open("data/books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "availability"])
    writer.writeheader()
    writer.writerows(books)

That's a complete scraper in about 20 lines of Python. Fetch, parse, extract, save. For static HTML pages, Requests + BeautifulSoup is all you need.


Scraping JavaScript-Heavy Sites with Python

Most modern websites render content with JavaScript. If you view the page source and the data isn't in the HTML, a simple HTTP request won't work. You need a tool that runs JavaScript.

Why Static Scrapers Fail on Modern Sites

Try scraping a React or Vue.js application with Requests:

response = requests.get("https://quotes.toscrape.com/scroll")
soup = BeautifulSoup(response.text, "lxml")
quotes = soup.select(".quote")
print(f"Found {len(quotes)} quotes")  # 0 quotes

Zero results. The page fetches quotes via JavaScript after the initial HTML arrives. Requests only sees the empty shell.

Scraping with Selenium

Selenium opens a real browser, executes JavaScript, and gives you the fully rendered DOM:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com")

# Wait for content to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
)

articles = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
for article in articles[:3]:
    title = article.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
    price = article.find_element(By.CSS_SELECTOR, ".price_color").text
    print(f"{title}: {price}")

driver.quit()

Selenium works, but it's verbose. You manage the browser lifecycle, write explicit waits, and handle crashes yourself. For large-scale scraping, running dozens of Chrome instances locally gets expensive fast.

Scraping with a Cloud Browser API

A cloud browser API like Browserbeam handles the browser infrastructure for you. No local Chrome, no WebDriver, no memory management. You send an API call and get structured data back.

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(
    url="https://books.toscrape.com",
    auto_dismiss_blockers=True
)

print(f"Title: {session.page.title}")
print(f"Stable: {session.page.stable}")

result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text"
    }]
)

for book in result.extraction["books"][:3]:
    print(f"  {book['title']}: {book['price']}")

session.close()

The extract method uses a schema to define what data you want. _parent scopes the extraction to each matching container, and the selector strings (h3 a >> text) tell Browserbeam which element and attribute to pull. The result is clean JSON, not raw HTML you need to parse.

Pro tip: When your selector returns a >> href for a relative link, normalize it with urllib.parse.urljoin before storing. Sites use a mix of absolute, root-relative, and document-relative URLs, and a missed normalization step is the most common reason scraped links fail to resolve later.
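A quick illustration of what urljoin does with each URL style (example paths only):

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# urljoin resolves document-relative, root-relative, and absolute hrefs.
print(urljoin(page_url, "page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html
print(urljoin(page_url, "/media/cache/a.jpg"))
# https://books.toscrape.com/media/cache/a.jpg
print(urljoin(page_url, "https://example.com/x"))
# https://example.com/x
```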

For a deeper walkthrough of the Browserbeam Python SDK, see the getting started guide.

| Approach | Lines of Code | Local Dependencies | JavaScript Support | Speed |
| --- | --- | --- | --- | --- |
| Requests + BeautifulSoup | ~15 | None | No | Fast |
| Selenium | ~20 | Chrome + ChromeDriver | Yes | Slow |
| Browserbeam SDK | ~15 | None | Yes | Medium |

Handling Pagination, Multiple Pages, and Infinite Scroll

Real scraping jobs rarely stop at one page. Products span hundreds of pages. Search results paginate. News feeds scroll infinitely.

The simplest pagination pattern: find the "next" link and follow it until it disappears.

import requests
from bs4 import BeautifulSoup
import time

all_books = []
url = "https://books.toscrape.com/catalogue/page-1.html"

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text
        all_books.append({"title": title, "price": price})

    next_btn = soup.select_one("li.next a")
    if next_btn:
        next_path = next_btn["href"]
        base = "https://books.toscrape.com/catalogue/"
        url = base + next_path
        time.sleep(1)
    else:
        url = None

print(f"Total books scraped: {len(all_books)}")

The time.sleep(1) between requests is important. Hammering a server with rapid requests gets you blocked and puts unnecessary load on the target site. One second between pages is a reasonable starting point.

Offset and Cursor-Based Pagination

Some sites use API endpoints with page or cursor parameters instead of next-page links. Here's a working example against the Hacker News Algolia API, which paginates results with a page parameter:

import requests
import time

all_stories = []
page = 0
max_pages = 3

while page < max_pages:
    response = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"tags": "story", "hitsPerPage": 20, "page": page}
    )
    data = response.json()
    hits = data.get("hits", [])

    if not hits:
        break

    all_stories.extend(hits)
    page += 1
    time.sleep(0.5)

print(f"Collected {len(all_stories)} stories across {page} pages")

If the API uses cursor-based pagination instead, replace page with the cursor value returned in each response (often called next, cursor, or after).
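The cursor loop has the same shape regardless of the field name. Here fetch_page is a stand-in for a real API call, so the control flow is visible without a network dependency:

```python
def fetch_page(cursor=None):
    """Stand-in for a real API call returning items plus a next cursor."""
    pages = {
        None: {"items": [1, 2], "next": "abc"},
        "abc": {"items": [3, 4], "next": "def"},
        "def": {"items": [5], "next": None},
    }
    return pages[cursor]

items = []
cursor = None
while True:
    data = fetch_page(cursor)
    items.extend(data["items"])
    cursor = data["next"]      # stop when the API returns no cursor
    if cursor is None:
        break

print(items)  # [1, 2, 3, 4, 5]
```

Swap fetch_page for a real requests.get call and read the cursor from whatever field the API documents.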

Infinite Scroll Pages

Infinite scroll pages load more content as you scroll down. No next-page link exists. You need a browser that can scroll and wait for new content.

With Browserbeam, the scroll_collect method handles this automatically:

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")

result = session.scroll_collect(max_scrolls=10, wait_ms=1000)
print(f"Page length after scrolling: {len(result.page.markdown.content)} chars")

quotes = session.extract(
    quotes=[{
        "_parent": ".quote",
        "text": ".text >> text",
        "author": ".author >> text"
    }]
)

print(f"Extracted {len(quotes.extraction['quotes'])} quotes")
session.close()

scroll_collect scrolls the page repeatedly, waiting for new content to load between each scroll. The max_scrolls parameter caps how far it goes. After scrolling, you extract data from the fully loaded page.

For more on handling pagination and lazy loading patterns, read the structured web scraping guide.


Advanced Python Web Scraping Patterns

Once you've built basic scrapers, these patterns will make them more reliable and harder to detect.

Using Proxies for Web Scraping

Proxies route your requests through different IP addresses. This helps when a site blocks your IP after too many requests, or when you need to access geo-restricted content.

With Requests:

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080"
}

response = requests.get("https://books.toscrape.com", proxies=proxies)

With Browserbeam, pass the proxy when creating a session:

session = client.sessions.create(
    url="https://books.toscrape.com",
    proxy="http://user:pass@proxy.example.com:8080"
)

Datacenter proxies are cheap and fast, but easier for sites to detect. Residential proxies use real consumer IP addresses and are harder to block, but cost more per gigabyte. For most scraping tasks, datacenter proxies work fine. Switch to residential proxies when a site actively blocks datacenter IP ranges.

Pro tip: Don't reach for residential proxies until your datacenter setup actually fails. Residential bandwidth is roughly 10x the cost of datacenter, so the wrong default eats your budget. Start datacenter, monitor block rates, and upgrade per-domain when the data shows you need to.
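Here's one way to sketch that monitoring. The function names and the 5% threshold are illustrative, not from any particular library:

```python
from collections import defaultdict

# Hypothetical per-domain block tracking.
stats = defaultdict(lambda: {"requests": 0, "blocks": 0})

def record(domain, status_code):
    """Count a request and whether it looked blocked (403/429)."""
    stats[domain]["requests"] += 1
    if status_code in (403, 429):
        stats[domain]["blocks"] += 1

def needs_residential(domain, threshold=0.05):
    """Upgrade only after enough traffic shows a real block rate."""
    s = stats[domain]
    return s["requests"] >= 100 and s["blocks"] / s["requests"] > threshold

# Simulated traffic: 10% of requests to one domain are blocked.
for i in range(200):
    record("example.com", 403 if i % 10 == 0 else 200)

print(needs_residential("example.com"))  # True
```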

Managing Cookies and Sessions

Some sites require cookies for pagination, authentication, or session tracking. The requests.Session object persists cookies across requests:

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/125.0.0.0 Safari/537.36"
})

# First request: site sets cookies
session.get("https://books.toscrape.com")

# Subsequent requests: cookies are sent automatically
response = session.get("https://books.toscrape.com/catalogue/page-2.html")
print(f"Cookies: {dict(session.cookies)}")

Setting a realistic User-Agent header is one of the simplest things you can do to avoid blocks. Many sites reject requests with Python's default user agent (python-requests/2.x).

Pro tip: Rotate between a small pool of recent browser User-Agent strings instead of using one. Sites that detect bot traffic often look for a fixed UA hitting many pages in a row; rotating across three or four real Chrome and Firefox UAs reduces that signal.
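A minimal rotation helper might look like this; the UA strings are examples and should be refreshed as browser versions move on:

```python
import random

# A small pool of realistic desktop User-Agent strings (examples).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def random_headers():
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"][:30])
```

Pass the result as the headers argument of each requests.get call.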

Rate Limiting and Retries

Aggressive scraping gets you blocked. Respect rate limits, add delays between requests, and implement retries with exponential backoff for transient failures:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch_with_delay(url, delay=1.0):
    response = session.get(url)
    response.raise_for_status()
    time.sleep(delay)
    return response

The Retry adapter handles server errors and rate limit responses (429) automatically. The backoff_factor of 1 means the retry delays are 1s, 2s, 4s. Combined with a polite delay between requests, this pattern handles most transient failures without manual intervention.
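Those delays come from urllib3's backoff formula, backoff_factor * 2 ** (retry_number - 1). You can compute them directly (urllib3 2.x applies the sleep from the first retry; older versions skipped it):

```python
def backoff_delays(backoff_factor, total_retries):
    """Delays applied before each retry, per urllib3's backoff formula."""
    return [backoff_factor * (2 ** (n - 1)) for n in range(1, total_retries + 1)]

print(backoff_delays(1, 3))    # [1, 2, 4]
print(backoff_delays(0.5, 4))  # [0.5, 1.0, 2.0, 4.0]
```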


Three Real-World Python Scraping Projects

Let's apply everything we've covered to three practical projects. Each one uses a different scraping approach and solves a real problem.

Project 1: Price Monitoring Bot

Track book prices and detect changes over time. This pattern works for any e-commerce site.

import json
import os
from datetime import datetime
from browserbeam import Browserbeam

PRICES_FILE = "data/price_history.json"

def load_history():
    if os.path.exists(PRICES_FILE):
        with open(PRICES_FILE) as f:
            return json.load(f)
    return {}

def save_history(history):
    os.makedirs("data", exist_ok=True)
    with open(PRICES_FILE, "w") as f:
        json.dump(history, f, indent=2)

def scrape_prices():
    client = Browserbeam()
    session = client.sessions.create(
        url="https://books.toscrape.com",
        auto_dismiss_blockers=True
    )
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    )
    session.close()
    return result.extraction["books"]

def check_prices():
    history = load_history()
    current = scrape_prices()
    changes = []
    timestamp = datetime.now().isoformat()

    for book in current:
        title = book["title"]
        price = book["price"]

        if title in history and history[title]["last_price"] != price:
            changes.append({
                "title": title,
                "old_price": history[title]["last_price"],
                "new_price": price
            })

        history[title] = {"last_price": price, "checked_at": timestamp}

    save_history(history)
    return changes

changes = check_prices()
if changes:
    for c in changes:
        print(f"PRICE CHANGE: {c['title']}: {c['old_price']} -> {c['new_price']}")
else:
    print("No price changes detected.")

Run this on a daily cron job and you've got a working price monitor. Add email or Slack notifications and you have a complete alerting system. For a more advanced version with GPT-powered analysis, see the competitive intelligence agent guide.

Pro tip: Store snapshots with a timestamp instead of overwriting the same key every run. A history file lets you answer "how often does this product change price?" later, which is the question that actually matters for negotiation and pricing strategy.
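One way to implement that, reusing the history dict shape from the project above but appending snapshots instead of overwriting (the book title is illustrative):

```python
from datetime import datetime, timezone

def record_snapshot(history, title, price):
    """Append a timestamped snapshot instead of overwriting last_price."""
    entry = history.setdefault(title, {"snapshots": []})
    entry["snapshots"].append({
        "price": price,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    })
    return entry

history = {}
record_snapshot(history, "Sharp Objects", "£47.82")
record_snapshot(history, "Sharp Objects", "£45.17")

snapshots = history["Sharp Objects"]["snapshots"]
print(len(snapshots), snapshots[-1]["price"])  # 2 £45.17
```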

Project 2: Job Board Aggregator

Scrape job listings from Fake Python Jobs and structure them for analysis:

import requests
from bs4 import BeautifulSoup
import json

response = requests.get("https://realpython.github.io/fake-jobs/")
soup = BeautifulSoup(response.text, "lxml")

jobs = []
for card in soup.select(".card"):
    title_el = card.select_one("h2.title")
    company_el = card.select_one("h3.company")
    location_el = card.select_one(".location")
    date_el = card.select_one("time")

    if title_el and company_el:
        jobs.append({
            "title": title_el.text.strip(),
            "company": company_el.text.strip(),
            "location": location_el.text.strip() if location_el else "",
            "date": date_el.text.strip() if date_el else ""
        })

print(f"Found {len(jobs)} job listings")

# Filter by keyword
python_jobs = [j for j in jobs if "python" in j["title"].lower()]
print(f"Python-related jobs: {len(python_jobs)}")

with open("data/jobs.json", "w") as f:
    json.dump(jobs, f, indent=2)

This is a static page, so Requests + BeautifulSoup is enough. No browser needed.

Project 3: News Headline Tracker

Monitor Hacker News for trending stories:

from browserbeam import Browserbeam
import json

client = Browserbeam()
session = client.sessions.create(url="https://news.ycombinator.com")

result = session.extract(
    stories=[{
        "_parent": ".athing",
        "rank": ".rank >> text",
        "title": ".titleline > a >> text",
        "url": ".titleline > a >> href"
    }]
)

stories = result.extraction["stories"]
print(f"Top {len(stories)} stories on Hacker News:\n")
for story in stories[:10]:
    print(f"  {story['rank']} {story['title']}")
    print(f"     {story['url']}\n")

session.close()

with open("data/hn_stories.json", "w") as f:
    json.dump(stories, f, indent=2)

This scraper runs in about two seconds and gives you structured JSON for every story on the front page. Schedule it hourly to build a dataset of trending tech stories over time.


Common Python Web Scraping Mistakes

After building dozens of scrapers, these are the mistakes that cause the most debugging time.

Mistake 1: Not Checking the Robots.txt

Most websites publish a robots.txt file that specifies which pages crawlers should avoid. Ignoring it can get your IP blocked and, in some cases, create legal issues.

import requests

robots = requests.get("https://books.toscrape.com/robots.txt")
print(robots.text)

Check robots.txt before scraping a new site. Respect Disallow rules and Crawl-delay directives.
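The standard library can also evaluate the rules for you. urllib.robotparser accepts inline rules for testing; in production you'd point it at the live file with set_url() and read(). The rules below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules.
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("my-scraper", "https://example.com/catalogue/"))  # True
print(rp.can_fetch("my-scraper", "https://example.com/private/x"))   # False
```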

Mistake 2: Using the Default User-Agent

Python's Requests library identifies itself as python-requests/2.x.x. Many sites block this outright.

Fix: Always set a realistic browser user agent:

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/125.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)

Mistake 3: Not Handling Errors Gracefully

Scrapers fail. Pages change. Servers go down. If your script crashes on the first error, you lose everything scraped up to that point.

Fix: Wrap each page fetch in a try/except and continue:

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # parse and extract...
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        continue

Mistake 4: Scraping Too Fast

Sending hundreds of requests per second overwhelms the target server and gets you blocked immediately.

Fix: Add a delay between requests. One to two seconds per page is a reasonable default. For high-volume jobs, use Scrapy's built-in rate limiting or a cloud browser API that manages request timing for you.

Mistake 5: Building Fragile Selectors

Selectors like div > div > div > span:nth-child(3) break the moment a site changes its layout.

Fix: Use class names, IDs, and semantic attributes instead of position-based selectors. Prefer .price_color over div:nth-child(4) > span. With Browserbeam, the extract method uses stable schemas that survive minor layout changes better than brittle CSS paths.


How to Choose the Right Python Scraping Library

There's no single best tool. The right choice depends on what you're scraping and how much infrastructure you want to manage.

Does the site need JavaScript?

  • No → How many pages?
      • A few → Requests + BeautifulSoup
      • Thousands → Scrapy
  • Yes → Do you want to manage browsers locally?
      • Yes → Playwright
      • No → Browserbeam

When to Use Requests + BeautifulSoup

Use this combination when the target page is static HTML. Open the page in your browser, view the page source (Ctrl+U), and check if the data you need is in the HTML. If it is, you don't need a browser.

Good for: blogs, documentation sites, government data portals, Wikipedia, static e-commerce pages, and any site that serves content in the initial HTML response.

When to Use a Browser-Based Tool

Use Selenium or Playwright when the page loads data with JavaScript, requires user interaction (clicking, scrolling, form filling), or has anti-bot protections that check for browser fingerprints.

Selenium is the right choice if you already have Selenium infrastructure, need cross-browser testing alongside scraping, or your team knows the Selenium API. Playwright is the better choice for new projects: it's faster, has auto-waiting, and the API is cleaner.

When to Use a Cloud Browser API

Use Browserbeam or a similar cloud browser API when you need JavaScript execution without managing browser infrastructure. This is the right choice when you're deploying scrapers to servers (where installing Chrome is a pain), when you want structured data instead of raw HTML, or when you want built-in proxy support and cookie banner handling.

For a detailed comparison of cloud browser options, read the cloud browser APIs compared guide.

| Scenario | Recommended Tool | Why |
| --- | --- | --- |
| Static HTML, one-off script | Requests + BeautifulSoup | Simplest, fastest, no dependencies |
| Static HTML, thousands of pages | Scrapy | Built-in concurrency and pipelines |
| JavaScript site, local development | Playwright | Modern API, auto-waiting |
| JavaScript site, server deployment | Browserbeam | No browser binary needed |
| Login + multi-step workflow | Browserbeam or Playwright | Session management, cookie persistence |
| Site with anti-bot detection | Browserbeam | Cloud browser with residential proxy support |

Python Web Scraping Best Practices

A quick reference for building scrapers that last.

| Practice | Why It Matters |
| --- | --- |
| Check robots.txt first | Legal compliance, avoids aggressive scraping of restricted pages |
| Set a realistic User-Agent | Prevents instant blocks from the default Python header |
| Add a 1-2s delay between requests | Respects server resources, avoids rate limits |
| Handle errors with try/except | Prevents losing hours of scraped data on a single failure |
| Use stable selectors (classes, IDs) | Survives site layout changes |
| Store data incrementally | Don't wait until the end to save; write after each page |
| Log what you scrape | Debugging a failed scrape is impossible without logs |
| Keep selectors in a config file | One-file update when a site changes layout |
| Test on a small sample first | Catch selector bugs before running a full crawl |

These practices apply regardless of which library you use. The scraper that runs reliably for six months is worth more than the scraper that's 10% faster but breaks every week.


Frequently Asked Questions

How do I scrape a website with Python?

Install the requests and beautifulsoup4 libraries, send a GET request to the target URL, parse the HTML with BeautifulSoup, and use CSS selectors to extract the data you need. For JavaScript-heavy sites, use Selenium, Playwright, or a cloud browser API like Browserbeam. See the step-by-step tutorial in this guide for working code examples.

Is Python good for web scraping?

Python is the language most teams reach for first when they start a scraping project. It has mature libraries for every part of the pipeline: Requests for HTTP, BeautifulSoup for parsing, Scrapy for large-scale crawling, and Selenium/Playwright/Browserbeam for JavaScript-heavy sites. The readable syntax and strong data processing ecosystem (pandas, JSON, CSV modules) make Python a natural fit for scraping work in 2026.

What is the best Python web scraping library?

There's no single best library. Requests + BeautifulSoup is best for static pages. Scrapy is best for large-scale crawling. Playwright is best for modern browser automation. Browserbeam is best when you need JavaScript support without managing browser infrastructure. The "How to Choose" section above has a decision framework.

Can I scrape dynamic JavaScript websites with Python?

Yes. Use Selenium, Playwright, or a cloud browser API like Browserbeam. These tools run a real browser that executes JavaScript and returns the fully rendered page. Standard HTTP libraries (Requests, urllib) only see the initial HTML and miss any content loaded by JavaScript.

How do I handle pagination in Python web scraping?

Find the "next page" link in the HTML and follow it in a loop until it disappears. Add a delay between requests to avoid overwhelming the server. For API-based pagination, increment the offset or page parameter. For infinite scroll pages, use a browser tool that can scroll and wait for content to load.

How do I avoid getting blocked while web scraping with Python?

Set a realistic browser User-Agent header, add delays between requests (1-2 seconds), respect robots.txt rules, and rotate IP addresses with proxies if needed. For sites with aggressive anti-bot detection, use a cloud browser API with residential proxy support. Avoid scraping faster than a human would browse.

Do I need proxies for web scraping?

Not always. For small-scale scraping of public sites, your regular IP address works fine. Proxies become necessary when a site blocks your IP after repeated requests, when you need to access geo-restricted content, or when you're scraping at volume and need to distribute requests across multiple addresses.

What is the difference between web scraping and web crawling?

Web scraping extracts specific data from web pages (prices, titles, contact info). Web crawling discovers and visits pages by following links across a site or across the internet. In practice, most scraping projects include some crawling (following pagination links), and most crawlers extract some data. The distinction is about the primary goal: data extraction vs. page discovery.


Conclusion

You now have the tools and patterns to scrape almost anything with Python. We started with Requests + BeautifulSoup for static pages, moved to Selenium and Browserbeam for JavaScript-heavy sites, handled pagination and infinite scroll, added proxies and cookies for tougher targets, and built three working projects you can extend.

The key takeaway: pick the simplest tool that handles your target site. Start with Requests + BeautifulSoup. If the data isn't in the static HTML, move to a browser-based tool. If you don't want to manage browser infrastructure, try a cloud browser API.

Every scraper in this guide runs against real websites with real selectors. Open a terminal, install the libraries, and start building. Try modifying the price monitor to track a product you actually care about, or point the news tracker at a different site and adjust the extraction schema.

For your next step, the Python SDK getting started guide covers every Browserbeam method in detail. The building a web scraping agent tutorial shows how to wire these patterns into an autonomous agent. And the scaling web automation guide covers production deployment.

What will you scrape first?
