
By the end of this guide, you'll have working Python scrapers that fetch web pages, parse HTML, extract structured data, handle JavaScript-heavy sites, and paginate through hundreds of pages. We'll build each one from scratch, using real websites and real selectors.
Python dominates the web scraping ecosystem, and for good reason. Between Requests, BeautifulSoup, Scrapy, Selenium, Playwright, and cloud browser APIs like Browserbeam, the Python ecosystem covers every scraping scenario you'll run into. The hard part isn't finding tools. It's knowing which tool to use, when, and how to avoid the pitfalls that trip up most scrapers.
This guide covers the full spectrum: from a five-line BeautifulSoup script to production-grade scrapers that rotate proxies, manage cookies, and handle infinite scroll. Every code example runs against real, publicly accessible websites. Copy, paste, run.
What you'll learn in this guide:
- How to fetch and parse HTML with Requests and BeautifulSoup
- How to scrape JavaScript-rendered pages with Selenium and Browserbeam
- How to extract structured data using CSS selectors and extraction schemas
- How to handle pagination, infinite scroll, and multi-page scraping
- How to use proxies, manage cookies, and avoid getting blocked
- Three real-world scraping projects you can build today
- How to choose the right Python scraping library for your use case
TL;DR: Python web scraping starts with Requests + BeautifulSoup for static HTML pages, but most modern sites need a real browser. Selenium, Playwright, and cloud browser APIs like Browserbeam handle JavaScript rendering, cookie banners, and dynamic content. This guide walks through both approaches with working code, then covers pagination, proxies, cookies, and three real-world projects.
What Is Web Scraping and Why Python?
Web scraping is the process of extracting data from websites programmatically. Instead of copying text by hand, you write a script that fetches a web page, finds the data you need, and saves it in a structured format like JSON or CSV.
The basic workflow has three steps: fetch the page, parse the HTML, and extract the target data. Everything else (pagination, authentication, JavaScript rendering) builds on top of that core loop.
How Web Scraping Works
Every web page is an HTML document. When your browser loads a page, it downloads that HTML, executes any JavaScript, applies CSS, and renders the result. A web scraper does the same thing, minus the rendering.
A simple scraper sends an HTTP GET request to a URL, receives the HTML response, and uses a parser to locate specific elements. CSS selectors or XPath expressions tell the parser which elements contain the data you want. The scraper reads the text or attributes from those elements, then moves to the next page.
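As a preview of that loop (the same fetch-parse-extract pattern we'll build step by step later in this guide, against the practice site used throughout):
import requests
from bs4 import BeautifulSoup

# Fetch: one HTTP GET request returns the raw HTML
html = requests.get("https://books.toscrape.com").text
# Parse: build a queryable tree from the HTML string
soup = BeautifulSoup(html, "lxml")
# Extract: a CSS selector picks out the elements holding the data
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
print(titles[:5])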
The challenge is that modern websites don't serve all their content in the initial HTML. Many sites load data with JavaScript after the page loads. For those sites, you need a tool that executes JavaScript, which means a real browser or a browser API.
Why Python Is the Go-To Language for Scraping
Python has one of the deepest scraping toolchains in any language. The combination of Requests (for HTTP), BeautifulSoup (for parsing), and Scrapy (for large-scale crawling) has been the default starting point for scraping work in most teams I've shipped with.
The syntax is readable and concise, which matters when you're writing throwaway scripts to grab data quickly. The library ecosystem covers every step of the pipeline, from HTTP and parsing to crawling, browser automation, and data export. And Python's data tooling (pandas, CSV module, JSON module) lets you clean and analyze scraped data in the same script that fetched it, without context switching to another language.
For JavaScript-heavy sites, Python also has strong browser automation options: Selenium, Playwright, and cloud browser APIs.
Python Web Scraping Libraries Compared
Before writing code, you need to pick the right tool. Here's how the five Python web scraping libraries you'll encounter most often in 2026 compare.
| Library | Type | JavaScript Support | Learning Curve | Best For |
|---|---|---|---|---|
| Requests + BeautifulSoup | HTTP client + parser | No | Low | Static HTML pages, quick scripts |
| Scrapy | Full framework | No (without plugins) | Medium | Large-scale crawling, pipelines |
| Selenium | Browser automation | Yes | Medium | Legacy projects, testing + scraping |
| Playwright | Browser automation | Yes | Medium | Modern browser automation, testing |
| Browserbeam | Cloud browser API | Yes | Low | JavaScript sites, no infrastructure to manage |
Requests + BeautifulSoup
The classic combination. Requests handles HTTP, BeautifulSoup handles HTML parsing. Together they cover most scraping tasks where the data is in the initial HTML response: static pages, server-rendered content, and anything that doesn't require JavaScript execution.
Strengths: Fast, lightweight, zero setup. You can scrape a page in five lines of code. No browser binary to install.
Weaknesses: Can't handle JavaScript-rendered content. No built-in session management for complex workflows. You handle cookies, headers, and retries manually.
Scrapy
Scrapy is a full-featured web crawling framework. It handles request scheduling, rate limiting, data pipelines, and export formats out of the box. For large crawling jobs (thousands of pages), Scrapy's async architecture lets it fetch many URLs in parallel, while a sequential Requests loop blocks on each call.
Strengths: Built-in concurrency, middleware system, data pipelines, and export to JSON/CSV/databases. Production-ready for large crawls.
Weaknesses: Steeper learning curve. Overkill for simple one-off scripts. No JavaScript support without adding Splash or Playwright integration.
Selenium
Selenium launched as a browser testing tool and became the default Python scraping tool for JavaScript-heavy sites. It controls a real browser (Chrome, Firefox) through WebDriver.
Strengths: Handles any JavaScript-rendered page. Large community, extensive documentation. Works with Chrome, Firefox, Edge.
Weaknesses: Slow compared to HTTP-based scraping. Requires a browser binary and WebDriver on every machine. Brittle selectors. Resource-heavy for large-scale scraping.
Playwright
Playwright is the modern alternative to Selenium, built by Microsoft. It controls Chromium, Firefox, and WebKit with a cleaner API, auto-waiting, and better reliability.
Strengths: Faster than Selenium, auto-waits for elements, built-in screenshot and PDF support. Better API design. Supports headed and headless modes.
Weaknesses: Still requires a local browser binary. Resource-heavy at scale. You manage the browser lifecycle yourself.
Browserbeam
Browserbeam is a cloud browser API. Instead of running a browser locally, you send HTTP requests to Browserbeam's API, and it returns structured page data. No Chromium to install, no WebDriver to manage.
Strengths: No local browser dependency. Returns structured data (markdown, element refs, extraction schemas) instead of raw HTML. Built-in stability detection, cookie banner dismissal, and proxy support. Python SDK wraps the API in a clean interface. Free trial available.
Weaknesses: Requires an API key. Not free for high-volume use. Newer product with a smaller community than Selenium or Scrapy.
Setting Up Your Python Scraping Environment
Let's set up a clean environment with the libraries we'll use throughout this guide.
Installing Core Libraries
Start with a virtual environment to keep dependencies isolated:
python -m venv scraper-env
source scraper-env/bin/activate # macOS/Linux
# scraper-env\Scripts\activate # Windows
Install the core libraries:
pip install requests beautifulsoup4 lxml browserbeam
That gives you Requests (HTTP client), BeautifulSoup (HTML parser), lxml (fast parser backend), and the Browserbeam Python SDK. If you want Selenium or Playwright too:
pip install selenium
# or
pip install playwright && playwright install chromium
Project Structure for Scraping Scripts
For anything beyond a throwaway script, organize your code:
my-scraper/
├── scrape.py # main scraping script
├── exporters.py # JSON/CSV export functions
├── config.py # URLs, selectors, settings
├── requirements.txt # pinned dependencies
└── data/ # output directory
Keep your selectors and target URLs in a config file. When a website changes its layout (and it will), you update one file instead of hunting through code.
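A minimal config.py for the book scraper we're about to build might look like this (the constant names are just a suggested convention, not something the rest of this guide depends on):
# config.py: selectors and URLs live here so a layout change means one edit
BASE_URL = "https://books.toscrape.com"
SELECTORS = {
    "book_container": "article.product_pod",
    "title": "h3 a",
    "price": ".price_color",
}
REQUEST_DELAY_SECONDS = 1.0
OUTPUT_DIR = "data"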
Your First Python Web Scraper: Step by Step
Let's build a scraper that extracts book titles and prices from books.toscrape.com, a site built specifically for scraping practice.
Step 1: Fetch a Web Page
import requests
url = "https://books.toscrape.com"
response = requests.get(url)
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
requests.get() sends an HTTP GET request and returns the response. A 200 status means success. The .text attribute contains the raw HTML.
If you get a 403 or 429, the site is blocking your request. We'll cover how to handle that in the advanced patterns section.
Step 2: Parse the HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.text) # "All products | Books to Scrape - Sandbox"
BeautifulSoup takes the raw HTML string and builds a parse tree you can query. The "lxml" parser is faster than the default html.parser. Both work, but lxml handles malformed HTML more gracefully.
Step 3: Extract Data with CSS Selectors
Now let's pull out the book data. Each book on the page is wrapped in an article.product_pod element:
books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one(".price_color").text
    in_stock = article.select_one(".instock.availability")
    stock_text = in_stock.text.strip() if in_stock else "Unknown"
    books.append({
        "title": title,
        "price": price,
        "availability": stock_text
    })

print(f"Found {len(books)} books")
for book in books[:3]:
    print(f" {book['title']}: {book['price']}")
soup.select() uses CSS selectors, the same syntax you'd use in browser DevTools. select_one() returns the first match. The ["title"] syntax reads an HTML attribute. .text reads the element's text content.
Pro tip: Open the target page in your browser, right-click an element, and choose "Inspect" to find the right selector. Look for unique class names or IDs that identify the data you want.
Step 4: Export to JSON or CSV
import json
import os

# Make sure the output directory exists before writing
os.makedirs("data", exist_ok=True)
with open("data/books.json", "w") as f:
    json.dump(books, f, indent=2)
print(f"Saved {len(books)} books to data/books.json")
For CSV output:
import csv
with open("data/books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "availability"])
    writer.writeheader()
    writer.writerows(books)
That's a complete scraper in about 20 lines of Python. Fetch, parse, extract, save. For static HTML pages, Requests + BeautifulSoup is all you need.
Scraping JavaScript-Heavy Sites with Python
Most modern websites render content with JavaScript. If you view the page source and the data isn't in the HTML, a simple HTTP request won't work. You need a tool that runs JavaScript.
Why Static Scrapers Fail on Modern Sites
Try scraping a React or Vue.js application with Requests:
response = requests.get("https://quotes.toscrape.com/scroll")
soup = BeautifulSoup(response.text, "lxml")
quotes = soup.select(".quote")
print(f"Found {len(quotes)} quotes") # 0 quotes
Zero results. The page loads quotes dynamically via JavaScript after the initial HTML loads. Requests only sees the empty shell.
Scraping with Selenium
Selenium opens a real browser, executes JavaScript, and gives you the fully rendered DOM:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://books.toscrape.com")
# Wait for content to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
)
articles = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
for article in articles[:3]:
    title = article.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
    price = article.find_element(By.CSS_SELECTOR, ".price_color").text
    print(f"{title}: {price}")
driver.quit()
Selenium works, but it's verbose. You manage the browser lifecycle, write explicit waits, and handle crashes yourself. For large-scale scraping, running dozens of Chrome instances locally gets expensive fast.
Scraping with a Cloud Browser API
A cloud browser API like Browserbeam handles the browser infrastructure for you. No local Chrome, no WebDriver, no memory management. You send an API call and get structured data back.
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(
    url="https://books.toscrape.com",
    auto_dismiss_blockers=True
)
print(f"Title: {session.page.title}")
print(f"Stable: {session.page.stable}")
result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text"
    }]
)
for book in result.extraction["books"][:3]:
    print(f" {book['title']}: {book['price']}")
session.close()
The extract method uses a schema to define what data you want. _parent scopes the extraction to each matching container, and the selector strings (h3 a >> text) tell Browserbeam which element and attribute to pull. The result is clean JSON, not raw HTML you need to parse.
Pro tip: When your selector returns a >> href for a relative link, normalize it with urllib.parse.urljoin before storing. Sites use a mix of absolute, root-relative, and document-relative URLs, and a missed normalization step is the most common reason scraped links fail to resolve later.
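For example, with the standard library:
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"
# Document-relative hrefs resolve against the page they were scraped from
print(urljoin(page_url, "page-2.html"))
# -> https://books.toscrape.com/catalogue/page-2.html
# Root-relative and absolute hrefs are handled correctly too
print(urljoin(page_url, "/index.html"))
# -> https://books.toscrape.com/index.html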
For a deeper walkthrough of the Browserbeam Python SDK, see the getting started guide.
| Approach | Lines of Code | Local Dependencies | JavaScript Support | Speed |
|---|---|---|---|---|
| Requests + BeautifulSoup | ~15 | None | No | Fast |
| Selenium | ~20 | Chrome + ChromeDriver | Yes | Slow |
| Browserbeam SDK | ~15 | None | Yes | Medium |
Handling Pagination, Multiple Pages, and Infinite Scroll
Real scraping jobs rarely stop at one page. Products span hundreds of pages. Search results paginate. News feeds scroll infinitely.
Following Next-Page Links
The simplest pagination pattern: find the "next" link and follow it until it disappears.
import requests
from bs4 import BeautifulSoup
import time
all_books = []
url = "https://books.toscrape.com/catalogue/page-1.html"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").text
        all_books.append({"title": title, "price": price})
    next_btn = soup.select_one("li.next a")
    if next_btn:
        next_path = next_btn["href"]
        base = "https://books.toscrape.com/catalogue/"
        url = base + next_path
        time.sleep(1)
    else:
        url = None
print(f"Total books scraped: {len(all_books)}")
The time.sleep(1) between requests is important. Hammering a server with rapid requests gets you blocked and puts unnecessary load on the target site. One second between pages is a reasonable starting point.
Offset and Cursor-Based Pagination
Some sites use API endpoints with page or cursor parameters instead of next-page links. Here's a working example against the Hacker News Algolia API, which paginates results with a page parameter:
import requests
import time
all_stories = []
page = 0
max_pages = 3
while page < max_pages:
    response = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"tags": "story", "hitsPerPage": 20, "page": page}
    )
    data = response.json()
    hits = data.get("hits", [])
    if not hits:
        break
    all_stories.extend(hits)
    page += 1
    time.sleep(0.5)
print(f"Collected {len(all_stories)} stories across {page} pages")
If the API uses cursor-based pagination instead, replace page with the cursor value returned in each response (often called next, cursor, or after).
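A cursor-based loop looks almost the same. Here's a sketch against a hypothetical endpoint; the URL and the after/next field names are placeholders, since every API names them differently:
import requests
import time

items = []
cursor = None  # most APIs treat a missing cursor as "start from the beginning"
while True:
    params = {"limit": 50}
    if cursor:
        params["after"] = cursor  # hypothetical cursor parameter name
    response = requests.get("https://api.example.com/items", params=params)
    data = response.json()
    items.extend(data.get("items", []))
    cursor = data.get("next")  # cursor for the next page, absent on the last page
    if not cursor:
        break
    time.sleep(0.5)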
Infinite Scroll Pages
Infinite scroll pages load more content as you scroll down. No next-page link exists. You need a browser that can scroll and wait for new content.
With Browserbeam, the scroll_collect method handles this automatically:
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com/scroll")
result = session.scroll_collect(max_scrolls=10, wait_ms=1000)
print(f"Page length after scrolling: {len(result.page.markdown.content)} chars")
quotes = session.extract(
    quotes=[{
        "_parent": ".quote",
        "text": ".text >> text",
        "author": ".author >> text"
    }]
)
print(f"Extracted {len(quotes.extraction['quotes'])} quotes")
session.close()
scroll_collect scrolls the page repeatedly, waiting for new content to load between each scroll. The max_scrolls parameter caps how far it goes. After scrolling, you extract data from the fully loaded page.
For more on handling pagination and lazy loading patterns, read the structured web scraping guide.
Advanced Python Web Scraping Patterns
Once you've built basic scrapers, these patterns will make them more reliable and harder to detect.
Using Proxies for Web Scraping
Proxies route your requests through different IP addresses. This helps when a site blocks your IP after too many requests, or when you need to access geo-restricted content.
With Requests:
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080"
}
response = requests.get("https://books.toscrape.com", proxies=proxies)
With Browserbeam, pass the proxy when creating a session:
session = client.sessions.create(
    url="https://books.toscrape.com",
    proxy="http://user:pass@proxy.example.com:8080"
)
Datacenter proxies are cheap and fast, but easier for sites to detect. Residential proxies use real consumer IP addresses and are harder to block, but cost more per gigabyte. For most scraping tasks, datacenter proxies work fine. Switch to residential proxies when a site actively blocks datacenter IP ranges.
Pro tip: Don't reach for residential proxies until your datacenter setup actually fails. Residential bandwidth is roughly 10x the cost of datacenter, so the wrong default eats your budget. Start datacenter, monitor block rates, and upgrade per-domain when the data shows you need to.
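If you do end up with a pool of proxies, rotating through them per request takes only a few lines. A sketch with placeholder proxy URLs:
import itertools
import requests

# Placeholder endpoints; substitute the URLs your proxy provider gives you
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch_via_proxy(url):
    proxy = next(proxy_cycle)  # round-robin through the pool
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)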
Managing Cookies and Sessions
Some sites require cookies for pagination, authentication, or session tracking. The requests.Session object persists cookies across requests:
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/125.0.0.0 Safari/537.36"
})
# First request: site sets cookies
session.get("https://books.toscrape.com")
# Subsequent requests: cookies are sent automatically
response = session.get("https://books.toscrape.com/catalogue/page-2.html")
print(f"Cookies: {dict(session.cookies)}")
Setting a realistic User-Agent header is one of the simplest things you can do to avoid blocks. Many sites reject requests with Python's default user agent (python-requests/2.x).
Pro tip: Rotate between a small pool of recent browser User-Agent strings instead of using one. Sites that detect bot traffic often look for a fixed UA hitting many pages in a row; rotating across three or four real Chrome and Firefox UAs reduces that signal.
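A minimal version of that rotation (the UA strings below are examples; swap in current ones when you build this for real):
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:126.0) "
    "Gecko/20100101 Firefox/126.0",
]

# Pick a different realistic UA for each request
response = requests.get(
    "https://books.toscrape.com",
    headers={"User-Agent": random.choice(USER_AGENTS)},
)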
Rate Limiting and Retries
Aggressive scraping gets you blocked. Respect rate limits, add delays between requests, and implement retries with exponential backoff for transient failures:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch_with_delay(url, delay=1.0):
    response = session.get(url)
    response.raise_for_status()
    time.sleep(delay)
    return response
The Retry adapter handles server errors and rate limit responses (429) automatically. A backoff_factor of 1 makes the wait between retries grow exponentially (on the order of 2s, then 4s; the exact schedule depends on your urllib3 version). Combined with a polite delay between requests, this pattern handles most transient failures without manual intervention.
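Using the helper in a loop might look like this (the URLs are just the first few catalogue pages from the earlier pagination example):
urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]
for url in urls:
    try:
        response = fetch_with_delay(url)
        print(f"{url}: {response.status_code}")
    except requests.RequestException as e:
        # failed requests, including exhausted retries, land here
        print(f"Giving up on {url}: {e}")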
Three Real-World Python Scraping Projects
Let's apply everything we've covered to three practical projects. Each one uses a different scraping approach and solves a real problem.
Project 1: Price Monitoring Bot
Track book prices and detect changes over time. This pattern works for any e-commerce site.
import json
import os
from datetime import datetime
from browserbeam import Browserbeam
PRICES_FILE = "data/price_history.json"
def load_history():
    if os.path.exists(PRICES_FILE):
        with open(PRICES_FILE) as f:
            return json.load(f)
    return {}

def save_history(history):
    os.makedirs("data", exist_ok=True)
    with open(PRICES_FILE, "w") as f:
        json.dump(history, f, indent=2)

def scrape_prices():
    client = Browserbeam()
    session = client.sessions.create(
        url="https://books.toscrape.com",
        auto_dismiss_blockers=True
    )
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text"
        }]
    )
    session.close()
    return result.extraction["books"]

def check_prices():
    history = load_history()
    current = scrape_prices()
    changes = []
    timestamp = datetime.now().isoformat()
    for book in current:
        title = book["title"]
        price = book["price"]
        if title in history and history[title]["last_price"] != price:
            changes.append({
                "title": title,
                "old_price": history[title]["last_price"],
                "new_price": price
            })
        history[title] = {"last_price": price, "checked_at": timestamp}
    save_history(history)
    return changes

changes = check_prices()
if changes:
    for c in changes:
        print(f"PRICE CHANGE: {c['title']}: {c['old_price']} -> {c['new_price']}")
else:
    print("No price changes detected.")
Run this on a daily cron job and you've got a working price monitor. Add email or Slack notifications and you have a complete alerting system. For a more advanced version with GPT-powered analysis, see the competitive intelligence agent guide.
Pro tip: Store snapshots with a timestamp instead of overwriting the same key every run. A history file lets you answer "how often does this product change price?" later, which is the question that actually matters for negotiation and pricing strategy.
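One way to keep that history is to append a dated snapshot on every run. This sketch reuses the helpers defined above but stores a snapshots list per title instead of the single last_price field, so it's an alternative storage layout rather than a drop-in addition:
def record_snapshot():
    history = load_history()
    timestamp = datetime.now().isoformat()
    for book in scrape_prices():
        # Keep every observation instead of overwriting the latest price
        entry = history.setdefault(book["title"], {"snapshots": []})
        entry["snapshots"].append({"price": book["price"], "at": timestamp})
    save_history(history)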
Project 2: Job Board Aggregator
Scrape job listings from Fake Python Jobs and structure them for analysis:
import requests
from bs4 import BeautifulSoup
import json
response = requests.get("https://realpython.github.io/fake-jobs/")
soup = BeautifulSoup(response.text, "lxml")
jobs = []
for card in soup.select(".card"):
    title_el = card.select_one("h2.title")
    company_el = card.select_one("h3.company")
    location_el = card.select_one(".location")
    date_el = card.select_one("time")
    if title_el and company_el:
        jobs.append({
            "title": title_el.text.strip(),
            "company": company_el.text.strip(),
            "location": location_el.text.strip() if location_el else "",
            "date": date_el.text.strip() if date_el else ""
        })
print(f"Found {len(jobs)} job listings")
# Filter by keyword
python_jobs = [j for j in jobs if "python" in j["title"].lower()]
print(f"Python-related jobs: {len(python_jobs)}")
with open("data/jobs.json", "w") as f:
    json.dump(jobs, f, indent=2)
This is a static page, so Requests + BeautifulSoup is enough. No browser needed.
Project 3: News Headline Tracker
Monitor Hacker News for trending stories:
from browserbeam import Browserbeam
import json
client = Browserbeam()
session = client.sessions.create(url="https://news.ycombinator.com")
result = session.extract(
    stories=[{
        "_parent": ".athing",
        "rank": ".rank >> text",
        "title": ".titleline > a >> text",
        "url": ".titleline > a >> href"
    }]
)
stories = result.extraction["stories"]
print(f"Top {len(stories)} stories on Hacker News:\n")
for story in stories[:10]:
    print(f" {story['rank']} {story['title']}")
    print(f" {story['url']}\n")
session.close()

with open("data/hn_stories.json", "w") as f:
    json.dump(stories, f, indent=2)
This scraper runs in about two seconds and gives you structured JSON for every story on the front page. Schedule it hourly to build a dataset of trending tech stories over time.
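If you'd rather prototype without cron, a simple in-process loop works; wrap the scrape above in a function and call something like this (a sketch; for production, cron or a task queue is more robust):
import time

def run_hourly(job, interval_seconds=3600):
    while True:
        try:
            job()  # e.g. a scrape_hn() function wrapping the code above
        except Exception as e:
            print(f"Run failed, retrying next cycle: {e}")
        time.sleep(interval_seconds)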
Common Python Web Scraping Mistakes
After building dozens of scrapers, these are the mistakes that cause the most debugging time.
Mistake 1: Not Checking the Robots.txt
Most websites publish a robots.txt file that specifies which pages scrapers should avoid. Ignoring it can get your IP blocked and, in some cases, create legal issues.
import requests
robots = requests.get("https://books.toscrape.com/robots.txt")
print(robots.text)
Check robots.txt before scraping a new site. Respect Disallow rules and Crawl-delay directives.
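Python's standard library can also evaluate the rules for you:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://books.toscrape.com/robots.txt")
parser.read()
# Check whether a specific path is allowed for your user agent
print(parser.can_fetch("my-scraper", "https://books.toscrape.com/catalogue/page-1.html"))
print(parser.crawl_delay("my-scraper"))  # None if there's no Crawl-delay directive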
Mistake 2: Using the Default User-Agent
Python's Requests library identifies itself as python-requests/2.x.x. Many sites block this outright.
Fix: Always set a realistic browser user agent:
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/125.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
Mistake 3: Not Handling Errors Gracefully
Scrapers fail. Pages change. Servers go down. If your script crashes on the first error, you lose everything scraped up to that point.
Fix: Wrap each page fetch in a try/except and continue:
for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # parse and extract...
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        continue
Mistake 4: Scraping Too Fast
Sending hundreds of requests per second overwhelms the target server and gets you blocked immediately.
Fix: Add a delay between requests. One to two seconds per page is a reasonable default. For high-volume jobs, use Scrapy's built-in rate limiting or a cloud browser API that manages request timing for you.
Mistake 5: Building Fragile Selectors
Selectors like div > div > div > span:nth-child(3) break the moment a site changes its layout.
Fix: Use class names, IDs, and semantic attributes instead of position-based selectors. Prefer .price_color over div:nth-child(4) > span. With Browserbeam, the extract method uses stable schemas that survive minor layout changes better than brittle CSS paths.
How to Choose the Right Python Scraping Library
There's no single best tool. The right choice depends on what you're scraping and how much infrastructure you want to manage.
When to Use Requests + BeautifulSoup
Use this combination when the target page is static HTML. Open the page in your browser, view the page source (Ctrl+U), and check if the data you need is in the HTML. If it is, you don't need a browser.
Good for: blogs, documentation sites, government data portals, Wikipedia, static e-commerce pages, and any site that serves content in the initial HTML response.
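You can also check programmatically: fetch the page without a browser and see whether the selector you care about matches anything. The selector here is the one used throughout this guide; swap in your own:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://books.toscrape.com").text
soup = BeautifulSoup(html, "lxml")
# If the raw HTML already contains matches, you don't need a browser
matches = soup.select("article.product_pod")
print("Static HTML is enough" if matches else "Needs JavaScript rendering")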
When to Use a Browser-Based Tool
Use Selenium or Playwright when the page loads data with JavaScript, requires user interaction (clicking, scrolling, form filling), or has anti-bot protections that check for browser fingerprints.
Selenium is the right choice if you already have Selenium infrastructure, need cross-browser testing alongside scraping, or your team knows the Selenium API. Playwright is the better choice for new projects: it's faster, has auto-waiting, and the API is cleaner.
When to Use a Cloud Browser API
Use Browserbeam or a similar cloud browser API when you need JavaScript execution without managing browser infrastructure. This is the right choice when you're deploying scrapers to servers (where installing Chrome is a pain), when you want structured data instead of raw HTML, or when you want built-in proxy support and cookie banner handling.
For a detailed comparison of cloud browser options, read the cloud browser APIs compared guide.
| Scenario | Recommended Tool | Why |
|---|---|---|
| Static HTML, one-off script | Requests + BeautifulSoup | Simplest, fastest, no dependencies |
| Static HTML, thousands of pages | Scrapy | Built-in concurrency and pipelines |
| JavaScript site, local development | Playwright | Modern API, auto-waiting |
| JavaScript site, server deployment | Browserbeam | No browser binary needed |
| Login + multi-step workflow | Browserbeam or Playwright | Session management, cookie persistence |
| Site with anti-bot detection | Browserbeam | Cloud browser with residential proxy support |
Python Web Scraping Best Practices
A quick reference for building scrapers that last.
| Practice | Why It Matters |
|---|---|
| Check robots.txt first | Legal compliance, avoids aggressive scraping of restricted pages |
| Set a realistic User-Agent | Prevents instant blocks from default Python header |
| Add 1-2s delay between requests | Respects server resources, avoids rate limits |
| Handle errors with try/except | Prevents losing hours of scraped data on a single failure |
| Use stable selectors (classes, IDs) | Survives site layout changes |
| Store data incrementally | Don't wait until the end to save; write after each page |
| Log what you scrape | Debugging a failed scrape is impossible without logs |
| Keep selectors in a config file | One-file update when a site changes layout |
| Test on a small sample first | Catch selector bugs before running a full crawl |
These practices apply regardless of which library you use. The scraper that runs reliably for six months is worth more than the scraper that's 10% faster but breaks every week.
Frequently Asked Questions
How do I scrape a website with Python?
Install the requests and beautifulsoup4 libraries, send a GET request to the target URL, parse the HTML with BeautifulSoup, and use CSS selectors to extract the data you need. For JavaScript-heavy sites, use Selenium, Playwright, or a cloud browser API like Browserbeam. See the step-by-step tutorial in this guide for working code examples.
Is Python good for web scraping?
Python is the language most teams reach for first when they start a scraping project. It has mature libraries for every part of the pipeline: Requests for HTTP, BeautifulSoup for parsing, Scrapy for large-scale crawling, and Selenium/Playwright/Browserbeam for JavaScript-heavy sites. The readable syntax and strong data processing ecosystem (pandas, JSON, CSV modules) make Python a natural fit for scraping work in 2026.
What is the best Python web scraping library?
There's no single best library. Requests + BeautifulSoup is best for static pages. Scrapy is best for large-scale crawling. Playwright is best for modern browser automation. Browserbeam is best when you need JavaScript support without managing browser infrastructure. The "How to Choose" section above has a decision framework.
Can I scrape dynamic JavaScript websites with Python?
Yes. Use Selenium, Playwright, or a cloud browser API like Browserbeam. These tools run a real browser that executes JavaScript and returns the fully rendered page. Standard HTTP libraries (Requests, urllib) only see the initial HTML and miss any content loaded by JavaScript.
How do I handle pagination in Python web scraping?
Find the "next page" link in the HTML and follow it in a loop until it disappears. Add a delay between requests to avoid overwhelming the server. For API-based pagination, increment the offset or page parameter. For infinite scroll pages, use a browser tool that can scroll and wait for content to load.
How do I avoid getting blocked while web scraping with Python?
Set a realistic browser User-Agent header, add delays between requests (1-2 seconds), respect robots.txt rules, and rotate IP addresses with proxies if needed. For sites with aggressive anti-bot detection, use a cloud browser API with residential proxy support. Avoid scraping faster than a human would browse.
Do I need proxies for web scraping?
Not always. For small-scale scraping of public sites, your regular IP address works fine. Proxies become necessary when a site blocks your IP after repeated requests, when you need to access geo-restricted content, or when you're scraping at volume and need to distribute requests across multiple addresses.
What is the difference between web scraping and web crawling?
Web scraping extracts specific data from web pages (prices, titles, contact info). Web crawling discovers and visits pages by following links across a site or across the internet. In practice, most scraping projects include some crawling (following pagination links), and most crawlers extract some data. The distinction is about the primary goal: data extraction vs. page discovery.
Conclusion
You now have the tools and patterns to scrape almost anything with Python. We started with Requests + BeautifulSoup for static pages, moved to Selenium and Browserbeam for JavaScript-heavy sites, handled pagination and infinite scroll, added proxies and cookies for tougher targets, and built three working projects you can extend.
The key takeaway: pick the simplest tool that handles your target site. Start with Requests + BeautifulSoup. If the data isn't in the static HTML, move to a browser-based tool. If you don't want to manage browser infrastructure, try a cloud browser API.
Every scraper in this guide runs against real websites with real selectors. Open a terminal, install the libraries, and start building. Try modifying the price monitor to track a product you actually care about, or point the news tracker at a different site and adjust the extraction schema.
For your next step, the Python SDK getting started guide covers every Browserbeam method in detail. The building a web scraping agent tutorial shows how to wire these patterns into an autonomous agent. And the scaling web automation guide covers production deployment.
What will you scrape first?