
You've outgrown Scrapy. Maybe your spiders can't render JavaScript. Maybe the callback architecture fights you every time you need a simple sequential workflow. Maybe you're building an AI agent and Scrapy's raw HTML output floods your LLM's context window. Whatever the trigger, you're looking for Scrapy alternatives, and there are real options in 2026.
The Python web scraping library landscape has split into two camps: frameworks you run yourself (Crawlee, Playwright, Selenium) and managed APIs that handle the browser infrastructure for you (Browserbeam, ScrapingBee, Firecrawl, Browserless). Each camp makes different tradeoffs on control, output format, and operational overhead.
This guide evaluates seven tools across both camps. No hand-waving. Real code, comparison tables, and a decision framework so you can pick the right tool without discovering the limitations three months into a project.
In this guide, you'll learn:
- Why developers leave Scrapy and what triggers the switch
- The difference between framework alternatives and managed API alternatives
- How Crawlee, Playwright, and Selenium compare to Scrapy as framework replacements
- How Browserbeam, ScrapingBee, Firecrawl, and Browserless compare as managed API alternatives
- A head-to-head feature matrix across all seven tools plus Scrapy
- A decision framework for choosing based on your use case
- How to migrate from Scrapy to a managed API with before/after code
TL;DR: For AI agent workflows, Browserbeam returns structured markdown and JSON instead of raw HTML, cutting token usage by 60-80%. For Python-native crawling with JavaScript support, Crawlee is the closest Scrapy replacement. For raw browser control, Playwright gives you the most power. For maximum anti-bot coverage without infrastructure, ScrapingBee handles it. The right choice depends on whether you need structured output, local control, or managed infrastructure.
Why Developers Leave Scrapy (and Where They Go)
Scrapy is a Python web scraping framework first released in 2008. It works well for what it was built for: crawling static HTML pages at scale with a spider-based architecture. But the web has changed, and three problems push teams to look for alternatives.
The JavaScript Problem
Scrapy doesn't render JavaScript. It downloads raw HTML, which means any content loaded by React, Vue, Angular, or vanilla JS fetch calls is invisible. You can bolt on Splash or Scrapy-Playwright, but both add complexity and failure modes. Splash hasn't seen an update in years. Scrapy-Playwright works but turns your simple spider into a hybrid that's harder to debug.
The question "Can Scrapy handle JavaScript?" has a technically-yes-but-practically-no answer. It can, with plugins. But every plugin adds a layer between you and the browser, and each layer is a potential point of failure.
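To make the "plugin layer" concrete, here is roughly what enabling Scrapy-Playwright adds to a project. This is a sketch based on the plugin's documented settings as I understand them; the `PLAYWRIGHT_META` name is a hypothetical helper for illustration, and you should check the plugin README for the version you install.

```python
# settings.py fragment for the scrapy-playwright plugin (a sketch, not a
# complete project configuration).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# The plugin relies on the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Hypothetical helper: requests opt in to browser rendering through their meta
# dict, e.g. scrapy.Request(url, meta=PLAYWRIGHT_META) inside a spider.
PLAYWRIGHT_META = {"playwright": True}
```

Every request that carries the flag now runs through a browser page instead of Scrapy's HTTP downloader, which is where the extra debugging surface comes from.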
The Memory and Concurrency Wall
Scrapy's Twisted reactor is efficient for I/O-bound crawling. But add JavaScript rendering (via Splash or Playwright) and each request now spawns a browser context. Memory usage jumps from megabytes to gigabytes. Scrapy concurrency settings that work for static HTML requests don't translate to browser-rendered pages.
| Pain Point | Impact | What Alternatives Solve |
|---|---|---|
| No JavaScript rendering | Missing content on 60%+ of modern sites | All alternatives render JS natively |
| Callback architecture | Complex control flow for sequential tasks | Sync/async APIs with linear code |
| Raw HTML output | Requires BeautifulSoup/lxml parsing | Structured output (markdown, JSON) |
| Local browser management | Memory leaks, zombie processes, version mismatches | Cloud browsers handle infrastructure |
| Anti-bot detection | Requires custom middleware, proxy rotation | Built-in stealth, CAPTCHA solving |
The AI Agent Gap
If you're building an LLM-powered web scraping agent, Scrapy's output is the wrong format. An LLM doesn't need raw HTML with navigation bars, scripts, and CSS classes. It needs clean text or structured data. Converting Scrapy's HTML output to something an LLM can process requires a full parsing pipeline on top of the scraping pipeline. Newer tools return AI-ready formats directly.
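A small standard-library sketch shows how much of a page is markup rather than content. The sample HTML and the choice of skipped tags are assumptions for illustration; real pipelines usually need more rules than this.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping <script>, <style>, and <nav> content."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page, loosely shaped like a product listing.
SAMPLE_HTML = """
<html><head><style>.price{color:red}</style>
<script>window.dataLayer = [];</script></head>
<body><nav><a href="/">Home</a><a href="/books">Books</a></nav>
<article class="product_pod"><h3><a href="/b/1">A Light in the Attic</a></h3>
<p class="price_color">£51.77</p></article></body></html>
"""

text = visible_text(SAMPLE_HTML)
print(text)
print(f"{len(SAMPLE_HTML)} chars of HTML -> {len(text)} chars of text")
```

This is the parsing step you own when a tool hands you raw HTML. Tools that return markdown or JSON do the equivalent work before the response reaches you.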
Python Web Scraping Libraries in 2026: The Landscape
The best Python scraping library depends on what you're building. The market has two distinct categories, and choosing within the wrong category wastes more time than choosing the wrong tool within the right one.
Frameworks vs Managed APIs
| Dimension | Frameworks | Managed APIs |
|---|---|---|
| Examples | Crawlee, Playwright, Selenium | Browserbeam, ScrapingBee, Firecrawl, Browserless |
| You manage | Browser binaries, drivers, infrastructure | Nothing (API key only) |
| Output | Raw HTML (you parse it) | Markdown, JSON, or structured data |
| JavaScript | Yes (browser-based) | Yes (cloud browser) |
| Concurrency | Your servers, your RAM | Provider scales for you |
| Cost model | Server costs + engineering time | Monthly subscription or usage-based |
| Best for | Full control, custom logic, testing | Production pipelines, AI agents, scale |
What AI Agents Need from a Scraping Tool
AI agents have specific requirements that traditional scraping tools don't address:
- Token-efficient output. Raw HTML wastes 60-80% of tokens on markup, scripts, and navigation. Agents need clean markdown or structured JSON.
- Interactive sessions. Agents need to click, fill forms, and navigate across pages within a single session. One-shot scraping APIs don't support this.
- Stability detection. Agents can't call time.sleep() and guess when a page is ready. They need a signal that says "the page is stable, read it now."
- Structured extraction. Returning typed JSON from a declarative schema is more reliable than having an LLM parse HTML.
Not every tool on this list meets all four requirements. The comparison tables below make the gaps clear.
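To illustrate what "stability detection" means in contrast to a fixed sleep, here is a generic polling sketch. It is not any vendor's implementation: `wait_until_stable` and its parameters are hypothetical names, and `snapshot` stands in for whatever you can observe cheaply (a DOM hash, a count of pending requests).

```python
import time
from typing import Callable


def wait_until_stable(
    snapshot: Callable[[], str],
    quiet_period: float = 0.5,
    timeout: float = 10.0,
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once snapshot() has stayed unchanged for quiet_period seconds.

    Returns False if the timeout elapses first.
    """
    deadline = clock() + timeout
    last_value = snapshot()
    last_change = clock()
    while clock() < deadline:
        sleep(poll_interval)
        current = snapshot()
        now = clock()
        if current != last_value:
            # The page is still changing: restart the quiet-period timer.
            last_value = current
            last_change = now
        elif now - last_change >= quiet_period:
            return True
    return False
```

The caller gets a yes/no signal instead of guessing a delay. The `clock` and `sleep` parameters exist only so the function can be exercised without real waiting.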
Framework Alternative 1: Crawlee (Apify)
Crawlee is a web crawling and scraping library maintained by Apify. It started as a Node.js library and added Python support in 2024. It's the closest spiritual successor to Scrapy: a framework for building crawlers with request queuing, automatic retries, and pluggable browser backends.
What Crawlee Does Well
- JavaScript rendering built-in. Crawlee supports Playwright and BeautifulSoup as backends. You pick the right one per crawler without bolting on plugins.
- Request queue and retry logic. Similar to Scrapy's scheduler but with automatic retries, error tracking, and configurable concurrency.
- Apify platform integration. Deploy crawlers to Apify Cloud for managed execution, scheduling, and storage. This gives you the convenience of a managed API with the flexibility of a framework.
- Maturing Python support. The Crawlee Python SDK mirrors the Node.js API closely.
Where Crawlee Falls Short
- Younger ecosystem. Scrapy has thousands of community plugins and StackOverflow answers. Crawlee's Python ecosystem is still growing.
- Output is still raw HTML. Crawlee gives you the page source. You still need to parse it with selectors or feed it through an extraction pipeline.
- No structured extraction. There's no built-in way to declare "give me the title and price from every product card as JSON." You write selectors yourself.
Crawlee vs Scrapy: Side-by-Side
# Scrapy: callback-based spider
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::text").get(),
                "price": book.css(".price_color::text").get(),
            }
        # Pagination means scheduling another callback
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

# Crawlee (Python): async sequential
import asyncio

# Recent Crawlee for Python releases export crawlers from crawlee.crawlers;
# older releases used crawlee.playwright_crawler.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        books = await context.page.query_selector_all("article.product_pod")
        for book in books:
            title = await (await book.query_selector("h3 a")).inner_text()
            price = await (await book.query_selector(".price_color")).inner_text()
            await context.push_data({"title": title, "price": price})

    await crawler.run(["https://books.toscrape.com"])


asyncio.run(main())
Crawlee replaces the callback architecture with async handlers. The code reads top-to-bottom. But you still write selectors and parse HTML manually. For a deeper Playwright comparison, see the Puppeteer vs Playwright vs Browserbeam guide.
Framework Alternative 2: Playwright
Playwright is a browser automation framework from Microsoft. It's not a scraping framework like Scrapy. It's a browser control library. But many teams use it for scraping because it handles JavaScript rendering, multi-browser support, and complex interactions that Scrapy can't.
What Playwright Does Well
- Full browser control. Chromium, Firefox, and WebKit. Network interception, request modification, cookie management, screenshot capture.
- Multi-language support. Python, Node.js, Java, .NET. Scrapy is Python-only.
- Auto-wait. Playwright waits for elements to be actionable before interacting. No explicit sleep calls for most operations.
- Active development. Microsoft maintains it with monthly releases, unlike Scrapy's slower release cadence.
Where Playwright Falls Short
- Not a scraping framework. No request queue, no retry logic, no built-in data pipeline. You build all of that yourself.
- Local infrastructure. You manage browser binaries, memory, and process lifecycle. At scale, this becomes a significant operational burden.
- Raw HTML output. page.content() returns the full DOM. You parse it with BeautifulSoup, lxml, or Playwright's own selectors.
- No structured extraction. Like Crawlee, there's no way to declare a schema and get typed JSON back.
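"You build all of that yourself" usually starts with retries. A minimal sketch of the kind of wrapper you end up writing around Playwright calls follows. The `with_retries` name and its defaults are hypothetical, and a production version would catch narrower exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task(), retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

You would wrap a page fetch as `with_retries(lambda: fetch(url))`, where `fetch` is your own function. Scrapy and Crawlee ship this behavior. With plain Playwright it is your code to maintain.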
Scrapy vs Playwright: Side-by-Side
# Playwright: sequential, browser-based
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://books.toscrape.com")
    page.wait_for_selector("article.product_pod")

    books = page.query_selector_all("article.product_pod")
    for book in books:
        title = book.query_selector("h3 a").inner_text()
        price = book.query_selector(".price_color").inner_text()
        print(f"{title}: {price}")

    browser.close()
Playwright gives you imperative control over the browser. The tradeoff versus Scrapy: you gain JavaScript rendering and interaction capabilities, but you lose the crawler framework (queuing, retries, pipelines). For teams that need both, Crawlee wraps Playwright with crawler infrastructure.
Framework Alternative 3: Selenium
Selenium is the original browser automation tool, dating back to 2004. It still has the largest community and the most learning resources. For many developers, the Scrapy vs Selenium question comes down to whether they need a crawler (Scrapy) or a browser (Selenium).
What Selenium Does Well
- Massive community. More StackOverflow answers, more tutorials, more plugins than any other browser automation tool.
- Enterprise adoption. The standard for QA testing. If your organization already uses Selenium for testing, using it for scraping avoids introducing a new tool.
- WebDriver protocol. Works with Chrome, Firefox, Edge, and Safari through a standard protocol.
Where Selenium Falls Short
- Slow. WebDriver protocol adds overhead compared to Playwright's direct protocol. Tests and scrapes run slower.
- Verbose API. Simple tasks require more code than Playwright or Browserbeam. The Java-influenced API design shows its age in Python.
- Driver management. Selenium needs a browser driver (ChromeDriver, GeckoDriver) that matches your browser version. Selenium Manager, bundled since Selenium 4.6, resolves drivers automatically, but pinned or offline environments still run into version mismatches.
- No auto-wait. Explicit waits with WebDriverWait + expected_conditions are manual and verbose compared to Playwright's auto-wait or Browserbeam's stability detection.
Scrapy vs Selenium: Side-by-Side
# Selenium: verbose, driver-based
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com")
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
)

books = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
for book in books:
    title = book.find_element(By.CSS_SELECTOR, "h3 a").text
    price = book.find_element(By.CSS_SELECTOR, ".price_color").text
    print(f"{title}: {price}")

driver.quit()
Selenium works. It's just more code for the same result. If you're already in the Selenium ecosystem for testing, adding scraping is straightforward. If you're starting fresh, Playwright or a managed API is more efficient.
Managed API Alternative 1: Browserbeam
Browserbeam is a cloud browser API that returns structured output designed for AI agents and data pipelines. Instead of managing a browser locally, you call a REST API and get back clean markdown, structured JSON, and interaction primitives.
What Browserbeam Does Well
- Structured output. Every observation returns clean markdown (not raw HTML) with optional page maps, interactive element lists, and link inventories. This is the key differentiator for LLM agent workflows.
- Declarative extraction. Define a schema (field names mapped to CSS selectors) and get typed JSON back. No BeautifulSoup, no lxml, no parsing code.
- Stability detection. Automatic network idle + DOM mutation checking means no time.sleep() or WebDriverWait. The page is stable when data is returned.
- Interactive sessions. Click, fill, scroll, navigate within a persistent session. Not just one-shot page fetching.
- Anti-bot handling. Built-in stealth, proxy rotation, and CAPTCHA solving without custom middleware.
Output Format for AI Agents
Where Scrapy returns raw HTML that an LLM struggles with, Browserbeam returns markdown that fits directly into a prompt. The difference matters at scale:
from browserbeam import Browserbeam
client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")
# Clean markdown, not raw HTML
markdown = session.page.markdown.content
print(f"Markdown length: {len(markdown)} characters")
print(markdown[:300])
session.close()
A typical page that's 45,000 characters of raw HTML becomes 1,500-3,000 characters of clean markdown. For an LLM processing hundreds of pages, that's the difference between burning through your token budget in an hour and running for a week. For a deep dive on extraction schemas, see the data extraction guide.
Code Example: Structured Extraction
The workflow that needs a spider class, callbacks, selectors, and yield statements in Scrapy collapses into one declarative schema with Browserbeam. The same extraction in cURL, Python, TypeScript, and Ruby:
# cURL
curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "steps": [{
      "extract": {
        "books": [{
          "_parent": "article.product_pod",
          "title": "h3 a >> text",
          "price": ".price_color >> text",
          "url": "h3 a >> href"
        }]
      }
    }]
  }'

# Python
from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text",
        "url": "h3 a >> href"
    }]
)
for book in result.extraction["books"][:5]:
    print(f"{book['title']}: {book['price']}")
session.close()

// TypeScript
import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam();
const session = await client.sessions.create({ url: "https://books.toscrape.com" });
const result = await session.extract({
  books: [{
    _parent: "article.product_pod",
    title: "h3 a >> text",
    price: ".price_color >> text",
    url: "h3 a >> href"
  }]
});
console.log(result.extraction.books.slice(0, 5));
await session.close();

# Ruby
require "browserbeam"

client = Browserbeam::Client.new
session = client.sessions.create(url: "https://books.toscrape.com")
result = session.extract(
  books: [{
    _parent: "article.product_pod",
    title: "h3 a >> text",
    price: ".price_color >> text",
    url: "h3 a >> href"
  }]
)
result.extraction["books"].first(5).each { |b| puts "#{b['title']}: #{b['price']}" }
session.close
No spider class. No callback chain. No browser binary. The schema declares what you want and the engine extracts it as typed JSON. For a step-by-step walkthrough of the Python SDK, see the getting started guide.
Managed API Alternative 2: ScrapingBee
ScrapingBee is a web scraping API focused on anti-bot handling and JavaScript rendering. It's a single-endpoint service: send a URL, get back rendered HTML or extracted data.
What ScrapingBee Does Well
- Anti-bot coverage. Residential proxies, headless browser rotation, and premium proxy pools for heavily protected sites.
- Simple API. One endpoint (/api/v1), one request, one response. Low learning curve.
- CSS extraction. The extract_rules parameter lets you define selectors to pull specific data, similar to Browserbeam's extract but less flexible.
- Google Search scraping. Dedicated endpoint for SERP scraping with structured output.
- Generous free tier. 1,000 credits on signup.
Where ScrapingBee Falls Short
- No interactive sessions. Each request is stateless. You can't click buttons, fill forms, or navigate within a session. This makes it unsuitable for multi-step agent workflows.
- Raw HTML output by default. Without extract_rules, you get raw HTML and parse it yourself.
- Fixed waits, not stability detection. The wait parameter is a fixed millisecond delay, not signal-based. The wait_for parameter waits for a CSS selector, which is better but still not automatic stability detection.
- No markdown output. For LLM consumption, you need to convert the HTML yourself.
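For a sense of what the single-endpoint model looks like in practice, here is a sketch that only builds a request URL and sends nothing. The endpoint and the `api_key`, `url`, `extract_rules`, and `wait_for` parameter names reflect ScrapingBee's public documentation as I understand it. The `build_scrapingbee_url` helper and the selectors are hypothetical, so verify against the current API reference before relying on it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current ScrapingBee API reference.
SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_scrapingbee_url(api_key, target_url, extract_rules=None, wait_for=None):
    """Return a GET URL for a single stateless scrape (no request is sent)."""
    params = {"api_key": api_key, "url": target_url}
    if extract_rules is not None:
        # Extraction rules travel as a JSON-encoded query parameter.
        params["extract_rules"] = json.dumps(extract_rules)
    if wait_for is not None:
        # CSS selector to wait for before the page is returned.
        params["wait_for"] = wait_for
    return f"{SCRAPINGBEE_ENDPOINT}?{urlencode(params)}"


request_url = build_scrapingbee_url(
    api_key="YOUR_API_KEY",
    target_url="https://books.toscrape.com",
    extract_rules={"first_title": "article.product_pod h3 a", "first_price": ".price_color"},
    wait_for="article.product_pod",
)
print(request_url)
```

Everything about the scrape lives in that one URL, which is why there is no way to click, fill, or carry state into a second request.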
Managed API Alternative 3: Firecrawl
Firecrawl is an LLM-ready web scraping API focused on turning websites into clean data for AI. It's open source (the core is AGPL-3.0 licensed, with MIT-licensed SDKs) and also available as a hosted service.
What Firecrawl Does Well
- LLM-optimized output. Returns clean markdown or structured data from any URL. The /scrape endpoint handles JavaScript rendering, pop-up dismissal, and content extraction in one call.
- Crawl endpoint. The /crawl endpoint follows links and returns multiple pages, similar to Scrapy's crawling behavior but managed.
- Open source. Self-host if you need data residency or cost control. 30,000+ GitHub stars.
- Agent endpoint. The /agent endpoint handles multi-step web research autonomously, though it's designed for research tasks rather than arbitrary browser interactions.
- Middle ground. Firecrawl fills the gap between raw scraping APIs and full browser automation.
Where Firecrawl Falls Short
- Limited interactivity. The /scrape and /crawl endpoints are one-shot. You can't maintain a session, click specific elements by selector, or fill forms across pages.
- No element refs or DOM diffs. Each scrape is independent. There's no concept of "what changed since my last observation."
- Credits-based pricing. Each action consumes credits, and complex pages consume more. Costs can be unpredictable compared to fixed subscription plans.
Managed API Alternative 4: Browserless
Browserless is a hosted browser service that gives you Playwright and Puppeteer access through a WebSocket or REST API. It's the longest-running cloud browser provider, founded in 2017. It's also a common target for developers searching for a Browserless alternative.
What Browserless Does Well
- Raw browser access. Connect your local Playwright or Puppeteer scripts to Browserless's cloud browsers via a WebSocket URL. Your existing scripts work with minimal changes.
- Multi-engine support. Chrome, Firefox, and WebKit. Most cloud browser providers only support Chromium.
- Self-hosting option. Run Browserless in your own infrastructure with a commercial license.
- Maturity. In production since 2017. Battle-tested at scale with enterprise customers.
- New extraction endpoints. /smart-scrape, /search, /map, and /crawl endpoints add AI-powered extraction on top of the raw browser.
Where Browserless Falls Short
- Raw HTML by default. The core service returns whatever Playwright or Puppeteer returns. The newer extraction endpoints add structure but are still maturing.
- Unit-based pricing. Each session consumes units in 30-second blocks. Short sessions that finish in 5 seconds still consume a full unit.
- Steeper learning curve. You need to know Playwright or Puppeteer to use the core service. The newer REST endpoints are simpler but less documented.
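The "minimal changes" claim comes down to swapping a local launch for a remote connection. Here is a sketch using Playwright's `connect_over_cdp`. The host is a placeholder, the `token` query parameter is an assumption about how the provider authenticates, and both helper functions are hypothetical names.

```python
from urllib.parse import urlencode


def browserless_ws_endpoint(host: str, token: str) -> str:
    """Build a WebSocket URL for a remote browser (host and token are placeholders)."""
    return f"wss://{host}?{urlencode({'token': token})}"


def fetch_title(ws_endpoint: str, url: str) -> str:
    """Open url in a remote Chromium over CDP and return the page title."""
    # Imported lazily so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Replaces p.chromium.launch(): the browser runs on the provider's side.
        browser = p.chromium.connect_over_cdp(ws_endpoint)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            browser.close()
```

Calling `fetch_title(browserless_ws_endpoint("your-host.example", "YOUR_TOKEN"), "https://books.toscrape.com")` would run the same script you run locally. What comes back is still whatever Playwright returns, so parsing stays your job.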
Head-to-Head: Features That Matter for AI Agents
Feature Comparison Table
This is the table that answers the top web scraping tools comparison question. Eight tools, twelve dimensions:
Token Cost Comparison: Scrapy Output vs Structured Output
For AI agent use cases, output format determines token cost. Here's what the same page produces across different tools:
from browserbeam import Browserbeam
client = Browserbeam()
# Observe in "main" mode for clean, boilerplate-free content
session = client.sessions.create(url="https://books.toscrape.com")
result = session.observe(mode="main")
raw_html_estimate = 45000 # typical full HTML for this page
markdown_length = len(result.page.markdown.content)
# Rough token estimate: 1 token ~= 4 characters
raw_tokens = raw_html_estimate // 4
markdown_tokens = markdown_length // 4
print(f"Raw HTML: ~{raw_html_estimate:,} chars (~{raw_tokens:,} tokens)")
print(f"Markdown: ~{markdown_length:,} chars (~{markdown_tokens:,} tokens)")
print(f"Reduction: {100 - (markdown_length * 100 // raw_html_estimate)}%")
session.close()
| Output Format | Typical Characters | Approx. Tokens | Reduction |
|---|---|---|---|
| Raw HTML (Scrapy, Selenium) | ~45,000 | ~11,250 | Baseline |
| Markdown (Browserbeam, Firecrawl) | ~1,500 | ~375 | 97% |
| Structured JSON (Browserbeam extract) | ~800 | ~200 | 98% |
At 1,000 pages per day, the gap between raw HTML (~11,250 tokens per page) and markdown (~375 tokens per page) is roughly 10.9 million tokens per day. That's real money when you're feeding pages to an LLM.
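The arithmetic behind the table is easy to reproduce with your own page sizes. This sketch uses the character counts from the table above and the same rough 4-characters-per-token heuristic. The heuristic is an approximation, not a tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic used in this guide, not a real tokenizer


def approx_tokens(chars: int) -> int:
    return chars // CHARS_PER_TOKEN


def reduction_pct(baseline_chars: int, chars: int) -> int:
    """Percentage saved relative to the baseline, rounded to a whole number."""
    return round(100 * (1 - chars / baseline_chars))


# Character counts taken from the table above.
RAW_HTML, MARKDOWN, JSON_EXTRACT = 45_000, 1_500, 800
PAGES_PER_DAY = 1_000

daily_savings = (approx_tokens(RAW_HTML) - approx_tokens(MARKDOWN)) * PAGES_PER_DAY

print(f"Raw HTML: ~{approx_tokens(RAW_HTML):,} tokens per page")
print(f"Markdown: ~{approx_tokens(MARKDOWN):,} tokens ({reduction_pct(RAW_HTML, MARKDOWN)}% smaller)")
print(f"JSON:     ~{approx_tokens(JSON_EXTRACT):,} tokens ({reduction_pct(RAW_HTML, JSON_EXTRACT)}% smaller)")
print(f"Saved per day at {PAGES_PER_DAY:,} pages: ~{daily_savings:,} tokens")
```

Swap in the sizes you measure on your own target pages. The ratio matters more than the absolute numbers.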
Decision Framework: When to Use What
The best web scraping framework for your project depends on four factors: output format needs, infrastructure preferences, interactivity requirements, and budget constraints.
You Need Structured Output for LLMs
Use Browserbeam or Firecrawl.
If your scraping feeds an AI agent or LLM pipeline, structured output is the single most important feature. Both Browserbeam and Firecrawl return markdown and structured data. Browserbeam adds interactive sessions (click, fill, navigate) and DOM diffs. Firecrawl adds autonomous crawling and a research agent. Choose based on whether you need interactivity.
You Need a Full Python Framework
Use Crawlee.
If you want a Scrapy-like framework with JavaScript support, request queuing, and the option to deploy to a managed platform (Apify Cloud), Crawlee is the direct successor. The Crawlee Python SDK mirrors the established Node.js API. You keep the framework mindset while gaining browser rendering.
You Need Raw Browser Control
Use Playwright (local) or Browserless (cloud).
If you need to intercept network requests, inject JavaScript, capture screenshots, or run cross-browser tests alongside scraping, Playwright gives you full CDP access. If you want to run Playwright scripts without managing browser infrastructure, point them at Browserless's WebSocket endpoint.
You Need Maximum Anti-Bot Coverage
Use ScrapingBee.
If your primary challenge is getting past anti-bot protections on heavily protected sites, ScrapingBee's residential proxy pools and specialized browser rotation handle the toughest targets. The tradeoff: you give up interactive sessions and structured output. For a broader cloud API comparison, see the cloud browser API guide.
| Use Case | Best Tool | Why |
|---|---|---|
| AI agent browsing the web | Browserbeam | Structured output, interactive sessions, stability detection |
| LLM knowledge base building | Firecrawl | Autonomous crawling, markdown output, /crawl endpoint |
| Python web crawling at scale | Crawlee | Modern framework, JS rendering, Apify deployment |
| Cross-browser testing + scraping | Playwright | Full CDP access, multi-browser, best auto-wait |
| Scraping protected e-commerce sites | ScrapingBee | Strongest anti-bot, residential proxies |
| Running Playwright in the cloud | Browserless | WebSocket connection, multi-engine, self-host option |
| Legacy QA automation + scraping | Selenium | Largest community, enterprise standard |
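The table can also be read as a short chain of questions. The sketch below encodes one possible ordering of them. The function name, the flag names, and the priority order are my assumptions layered on the recommendations above, not a rule from any vendor.

```python
def pick_tool(
    feeds_llm: bool = False,
    needs_interaction: bool = False,
    heavy_anti_bot: bool = False,
    existing_selenium_suite: bool = False,
    existing_playwright_scripts: bool = False,
    wants_framework: bool = False,
) -> str:
    """Map project requirements to the tool this guide recommends."""
    if feeds_llm:
        # Both return markdown; interactivity is the tie-breaker.
        return "Browserbeam" if needs_interaction else "Firecrawl"
    if heavy_anti_bot:
        return "ScrapingBee"
    if existing_selenium_suite:
        return "Selenium"
    if existing_playwright_scripts:
        return "Browserless"  # run the scripts you already have in the cloud
    if wants_framework:
        return "Crawlee"
    return "Playwright"  # default: raw local browser control


print(pick_tool(feeds_llm=True, needs_interaction=True))
print(pick_tool(wants_framework=True))
```

If two flags apply to your project, the ordering here is the part to argue about with your team.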
Migration Guide: Moving from Scrapy to a Managed API
If you're migrating from Scrapy to a managed API, the architecture changes more than the logic. Scrapy's spider-callback-pipeline pattern maps to a simpler request-extract-process pattern.
What Changes and What Stays the Same
| Scrapy Concept | Managed API Equivalent |
|---|---|
| Spider class | A function or script |
| start_urls | session = client.sessions.create(url="...") |
| response.css() / response.xpath() | session.extract(schema=...) or session.observe() |
| yield item | Append to a list or write to database |
| response.follow(next_page) | session.goto(url=next_url) |
| Item pipelines | Process data after extraction (same logic, no framework) |
| Scrapy middleware | Not needed (API handles proxies, retries, rendering) |
| scrapy crawl myspider | python my_script.py |
Before/After Code Example
# BEFORE: Scrapy spider (callback architecture)
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::text").get(),
                "price": book.css(".price_color::text").get(),
                # Join all text nodes: the first one can be whitespace only
                "in_stock": "".join(
                    book.css(".instock.availability::text").getall()
                ).strip(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

# AFTER: Browserbeam (linear flow)
from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []
total_pages = 3
for page_num in range(1, total_pages + 1):
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text",
            "in_stock": ".instock.availability >> text"
        }]
    )
    all_books.extend(result.extraction["books"])
    print(f"Page {page_num}: {len(result.extraction['books'])} books")
    # Navigate onward unless this was the last page we wanted
    if page_num < total_pages:
        next_url = f"https://books.toscrape.com/catalogue/page-{page_num + 1}.html"
        session.goto(url=next_url)
session.close()
print(f"Total: {len(all_books)} books across {total_pages} pages")
The Scrapy version requires understanding spiders, callbacks, response.follow, item dictionaries, and running scrapy crawl. The Browserbeam version is a Python script with a for loop. For building complete agent workflows on top of this pattern, see the web scraping agent guide.
Common Mistakes When Switching from Scrapy
1. Keeping Scrapy's Callback Architecture in a Synchronous API
Scrapy's callback pattern (yield response.follow(url, self.parse)) exists because Scrapy uses an async reactor. When you move to a synchronous API, you don't need callbacks. Write linear code: navigate, extract, process, navigate to the next page. Forcing a callback pattern onto a synchronous API creates unnecessary complexity.
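Stripped of any particular SDK, the linear pattern is a plain loop. In this sketch `fetch_page` is a hypothetical callable you supply. It takes a URL and returns the extracted items plus the next URL, or None when there is no next page.

```python
from typing import Callable, List, Optional, Tuple

PageResult = Tuple[List[dict], Optional[str]]


def crawl_linear(
    fetch_page: Callable[[str], PageResult],
    start_url: str,
    max_pages: int = 10,
) -> List[dict]:
    """Navigate, extract, move on: no callbacks, no scheduler."""
    items: List[dict] = []
    url: Optional[str] = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        page_items, url = fetch_page(url)  # url becomes the next page (or None)
        items.extend(page_items)
        pages_seen += 1
    return items
```

The control flow that Scrapy spreads across `parse` callbacks and the scheduler is the `while` loop here. The `max_pages` cap guards against pagination loops.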
2. Not Using Structured Extraction (Still Parsing HTML)
Teams migrate to a new tool but keep using BeautifulSoup to parse HTML. If your new tool offers structured extraction (Browserbeam's extract, ScrapingBee's extract_rules, Firecrawl's structured output), use it. Declarative schemas are easier to maintain than imperative parsing code. For schema design patterns, see the structured web scraping guide.
3. Over-Engineering Concurrency
Scrapy's CONCURRENT_REQUESTS and DOWNLOAD_DELAY settings require careful tuning. With a managed API, the provider handles concurrency and rate limiting. Don't build your own async request queue on top of a managed API. Start with sequential requests and add concurrency only when you've confirmed it works correctly.
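One way to follow that advice is to keep concurrency behind a single knob that defaults to sequential. This is a sketch using the standard library's thread pool. `fetch` is whatever function performs one request, and the `fetch_all` name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_all(
    fetch: Callable[[str], T],
    urls: Iterable[str],
    max_workers: int = 1,
) -> List[T]:
    """Fetch URLs sequentially by default; opt in to a small thread pool later."""
    urls = list(urls)
    if max_workers <= 1:
        return [fetch(url) for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return list(pool.map(fetch, urls))
```

Start with `max_workers=1`, confirm the output is correct, then raise it in small steps while staying inside your provider's rate limits.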
4. Ignoring JavaScript Rendering When Moving to a New Tool
Some teams switch from Scrapy to a new tool but keep targeting static HTML endpoints. If you're already switching tools, take the opportunity to target the JavaScript-rendered page instead of the API endpoint you reverse-engineered. The rendered page is more stable (site redesigns change the API, the visible product card stays) and requires less reverse engineering.
5. Choosing Based on Price Instead of Output Format
A tool that costs $29/month but returns structured JSON might save more than a free framework that returns raw HTML you spend 10 hours/week parsing. Factor in engineering time, not just API costs. The cheapest tool is the one that minimizes total cost including your time. For a perspective on how structured output feeds LLM pipelines, see the LLM training data pipeline guide.
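A back-of-the-envelope version of that comparison follows. The hourly rate and the weeks-per-month factor are assumptions for illustration, so plug in your own numbers.

```python
def monthly_cost(
    subscription_usd: float,
    engineer_hours_per_week: float,
    hourly_rate_usd: float,
    weeks_per_month: float = 4.33,
) -> float:
    """Total monthly cost: subscription plus engineering time spent on upkeep."""
    return subscription_usd + engineer_hours_per_week * weeks_per_month * hourly_rate_usd


ASSUMED_HOURLY_RATE = 75.0  # hypothetical fully loaded engineering rate

managed_api = monthly_cost(29, engineer_hours_per_week=0, hourly_rate_usd=ASSUMED_HOURLY_RATE)
free_framework = monthly_cost(0, engineer_hours_per_week=10, hourly_rate_usd=ASSUMED_HOURLY_RATE)

print(f"Managed API:    ${managed_api:,.2f}/month")
print(f"Free framework: ${free_framework:,.2f}/month")
```

A managed API still costs some engineering time, so the zero in the first call is optimistic. Even so, the parsing hours usually dominate the subscription fee.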
Frequently Asked Questions
What are the best Scrapy alternatives for Python in 2026?
The top Scrapy alternatives depend on your needs. For a framework replacement with JavaScript support, Crawlee (by Apify) is the closest match. For a managed API with structured output, Browserbeam returns markdown and JSON directly. For raw browser control, Playwright is the industry standard. For anti-bot protection, ScrapingBee handles the toughest sites. All four render JavaScript, which is Scrapy's primary limitation.
Can Scrapy handle JavaScript-rendered pages?
Technically yes, with plugins like Scrapy-Playwright or Scrapy-Splash. Practically, these plugins add complexity and failure modes. Scrapy-Splash depends on the Splash rendering service, which hasn't been actively maintained. Scrapy-Playwright works better but turns your spider into a hybrid that's harder to debug. If JavaScript rendering is a core requirement, consider a tool that supports it natively: Crawlee, Playwright, or a cloud browser API like Browserbeam.
What is the best python scraping library for AI agents?
For AI agents, the best Python scraping library returns structured, token-efficient output. Browserbeam returns clean markdown (about 97% smaller than raw HTML in the example measured earlier in this guide) and typed JSON from declarative schemas. Firecrawl returns markdown and has an autonomous agent endpoint for research tasks. Both are better choices than Scrapy, Playwright, or Selenium for agent workflows, because they eliminate the HTML parsing step entirely.
Scrapy vs BeautifulSoup: which should I use?
Scrapy and BeautifulSoup solve different problems. Scrapy is a full crawling framework (scheduling, retries, pipelines) that can download pages and parse them. BeautifulSoup is a parsing library that only parses HTML you already have. They're often used together: Scrapy downloads pages, BeautifulSoup or Scrapy's built-in selectors parse them. If you need crawling, use Scrapy (or Crawlee). If you just need to parse a single HTML document, use BeautifulSoup.
Is Crawlee better than Scrapy for web crawling?
Crawlee is better for modern web crawling that involves JavaScript-rendered pages. It supports Playwright and BeautifulSoup backends, has built-in request queuing and retries, and deploys to Apify Cloud for managed execution. Scrapy is better for high-volume static HTML crawling where JavaScript rendering isn't needed. Scrapy's ecosystem (plugins, documentation, community) is larger. Crawlee's Python SDK is newer but growing. If your crawl targets are mostly JavaScript-heavy sites, Crawlee wins. If they're mostly static HTML, Scrapy is still solid.
What is the best web scraping framework for beginners?
For beginners who want to learn web scraping fundamentals, BeautifulSoup with the requests library is the simplest starting point. For beginners building a real project, Browserbeam has the lowest barrier: install the SDK, create a session with a URL, and get clean data back without managing browsers. Scrapy has a steeper learning curve (spiders, callbacks, pipelines) but teaches important concepts. Playwright is powerful but requires understanding browser lifecycle management.
How do I migrate from Scrapy to a managed API?
Replace the spider class with a script. Replace start_urls with client.sessions.create(url="..."). Replace response.css() selectors with session.extract(schema=...) for structured data or session.observe() for markdown. Replace response.follow() with session.goto(). Remove middleware (the API handles proxies and rendering). Remove item pipelines (process data inline or write to your database directly). The architectural shift is from a framework (Scrapy manages the event loop) to a library (you manage the control flow).
Start Building Without Scrapy's Overhead
Seven alternatives, two categories, one decision that matters most: do you want a framework or a managed API?
If you want a framework, Crawlee is the modern Scrapy replacement with JavaScript support. Playwright gives you raw browser control. Selenium works if you're already in the ecosystem.
If you want a managed API, the choice depends on output format. Browserbeam returns structured markdown and JSON for AI agent workflows. Firecrawl returns markdown for LLM data pipelines. ScrapingBee handles anti-bot for protected sites. Browserless hosts your Playwright scripts in the cloud.
For most teams building AI agents or data pipelines in 2026, the managed API path saves more time than the framework path. No browser management, no parsing code, no infrastructure scaling. The best API for AI agents is the one that returns data in the format your agent needs.
Grab the SDK and try the extraction example from this post:
pip install browserbeam # Python
npm install @browserbeam/sdk # TypeScript
gem install browserbeam # Ruby
Sign up for a free account and extract your first structured dataset. The API docs have the full reference for extraction schemas, interactive sessions, and all the features covered in this comparison.