Best Scrapy Alternatives for AI Agents in 2026

You've outgrown Scrapy. Maybe your spiders can't render JavaScript. Maybe the callback architecture fights you every time you need a simple sequential workflow. Maybe you're building an AI agent and Scrapy's raw HTML output floods your LLM's context window. Whatever the trigger, you're looking for Scrapy alternatives, and there are real options in 2026.

The Python web scraping library landscape has split into two camps: frameworks you run yourself (Crawlee, Playwright, Selenium) and managed APIs that handle the browser infrastructure for you (Browserbeam, ScrapingBee, Firecrawl, Browserless). Each camp makes different tradeoffs on control, output format, and operational overhead.

This guide evaluates seven tools across both camps. No hand-waving. Real code, comparison tables, and a decision framework so you can pick the right tool without discovering the limitations three months into a project.

In this guide, you'll learn:

Why developers leave Scrapy and what triggers the switch
The difference between framework alternatives and managed API alternatives
How Crawlee, Playwright, and Selenium compare to Scrapy as framework replacements
How Browserbeam, ScrapingBee, Firecrawl, and Browserless compare as managed API alternatives
A head-to-head feature matrix across all seven tools plus Scrapy
A decision framework for choosing based on your use case
How to migrate from Scrapy to a managed API with before/after code

TL;DR: For AI agent workflows, Browserbeam returns structured markdown and JSON instead of raw HTML, cutting token usage by 60-80%. For Python-native crawling with JavaScript support, Crawlee is the closest Scrapy replacement. For raw browser control, Playwright gives you the most power. For maximum anti-bot coverage without infrastructure, ScrapingBee handles it. The right choice depends on whether you need structured output, local control, or managed infrastructure.

Why Developers Leave Scrapy (and Where They Go)

Scrapy is a long-established Python web scraping framework, first released in 2008. It works well for what it was built for: crawling static HTML pages at scale with a spider-based architecture. But the web has changed, and three problems push teams to look for alternatives.

The JavaScript Problem

Scrapy doesn't render JavaScript. It downloads raw HTML, which means any content loaded by React, Vue, Angular, or vanilla JS fetch calls is invisible. You can bolt on Splash or Scrapy-Playwright, but both add complexity and failure modes. Splash hasn't seen an update in years. Scrapy-Playwright works but turns your simple spider into a hybrid that's harder to debug.

The question "Can Scrapy handle JavaScript?" has a technically-yes-but-practically-no answer. It can, with plugins. But every plugin adds a layer between you and the browser, and each layer is a potential point of failure.
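
To make the gap concrete, here is a minimal sketch that uses only Python's standard library. The HTML string and the `product-list` id are made up for illustration. The point is that a plain HTTP download sees the empty container and the script source, not the elements the script would create in a real browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the product list is filled in by JavaScript at runtime.
RAW_HTML = """
<html><body>
  <div id="product-list"></div>
  <script>
    document.getElementById("product-list").innerHTML =
      "<p class='product'>Widget</p>";
  </script>
</body></html>
"""


class ProductCounter(HTMLParser):
    """Counts real <p class="product"> elements in the downloaded markup."""

    def __init__(self):
        super().__init__()
        self.products = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "product":
            self.products += 1


counter = ProductCounter()
counter.feed(RAW_HTML)
counter.close()

# The markup inside <script> is treated as script text, not as elements.
print(counter.products)  # prints 0
```

A browser-based tool would execute the script first and then see one product. A static-HTML scraper never does.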

The Memory and Concurrency Wall

Scrapy's Twisted reactor is efficient for I/O-bound crawling. But add JavaScript rendering (via Splash or Playwright) and each request now spawns a browser context. Memory usage jumps from megabytes to gigabytes. Scrapy concurrency settings that work for static HTML requests don't translate to browser-rendered pages.

Pain Point	Impact	What Alternatives Solve
No JavaScript rendering	Missing content on 60%+ of modern sites	All alternatives render JS natively
Callback architecture	Complex control flow for sequential tasks	Sync/async APIs with linear code
Raw HTML output	Requires BeautifulSoup/lxml parsing	Structured output (markdown, JSON)
Local browser management	Memory leaks, zombie processes, version mismatches	Cloud browsers handle infrastructure
Anti-bot detection	Requires custom middleware, proxy rotation	Built-in stealth, CAPTCHA solving

The AI Agent Gap

If you're building an LLM-powered web scraping agent, Scrapy's output is the wrong format. An LLM doesn't need raw HTML with navigation bars, scripts, and CSS classes. It needs clean text or structured data. Converting Scrapy's HTML output to something an LLM can process requires a full parsing pipeline on top of the scraping pipeline. Newer tools return AI-ready formats directly.
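
As a rough idea of what that extra parsing pipeline involves, the sketch below strips scripts, styles, and navigation from a small, invented HTML fragment using only the standard library. The `TextOnly` class and the sample markup are assumptions for illustration. A production pipeline would also need to handle boilerplate detection, tables, and links.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, as a static-HTML scraper would hand it to you.
RAW_HTML = (
    "<html><head><style>.nav{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/cart'>Cart</a></nav>"
    "<main><h1>Blue Widget</h1><p>Price: 19.99</p></main>"
    "<script>trackPageView();</script></body></html>"
)


class TextOnly(HTMLParser):
    """Collects visible text, skipping elements an LLM rarely needs."""

    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


parser = TextOnly()
parser.feed(RAW_HTML)
parser.close()

clean_text = "\n".join(parser.chunks)
print(clean_text)
print(f"{len(RAW_HTML)} chars of HTML -> {len(clean_text)} chars of text")
```

Even on this tiny fragment, most of the characters are markup the model never needed to see.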

Python Web Scraping Libraries in 2026: The Landscape

The best Python scraping library depends on what you're building. The market has two distinct categories, and choosing within the wrong category wastes more time than choosing the wrong tool within the right one.

Frameworks vs Managed APIs

Dimension	Frameworks	Managed APIs
Examples	Crawlee, Playwright, Selenium	Browserbeam, ScrapingBee, Firecrawl, Browserless
You manage	Browser binaries, drivers, infrastructure	Nothing (API key only)
Output	Raw HTML (you parse it)	Markdown, JSON, or structured data
JavaScript	Yes (browser-based)	Yes (cloud browser)
Concurrency	Your servers, your RAM	Provider scales for you
Cost model	Server costs + engineering time	Monthly subscription or usage-based
Best for	Full control, custom logic, testing	Production pipelines, AI agents, scale

What AI Agents Need from a Scraping Tool

AI agents have specific requirements that traditional scraping tools don't address:

Token-efficient output. Raw HTML wastes 60-80% of tokens on markup, scripts, and navigation. Agents need clean markdown or structured JSON.
Interactive sessions. Agents need to click, fill forms, and navigate across pages within a single session. One-shot scraping APIs don't support this.
Stability detection. Agents can't call time.sleep() and guess when a page is ready. They need a signal that says "the page is stable, read it now."
Structured extraction. Returning typed JSON from a declarative schema is more reliable than having an LLM parse HTML.

Not every tool on this list meets all four requirements. The comparison tables below make the gaps clear.
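
The stability-detection requirement is easier to reason about with a small sketch. The function below is a simplified, hypothetical stand-in: it polls a snapshot callable until the value stops changing for a few consecutive checks. Real tools typically watch network activity and DOM mutation events rather than comparing snapshots, and `wait_until_stable` and `fake_dom` are names invented for this example.

```python
import time
from typing import Callable


def wait_until_stable(
    snapshot: Callable[[], str],
    quiet_checks: int = 3,
    interval: float = 0.05,
    timeout: float = 5.0,
) -> str:
    """Poll `snapshot` until it returns the same value `quiet_checks` times in a row."""
    deadline = time.monotonic() + timeout
    last = snapshot()
    streak = 1
    while streak < quiet_checks:
        if time.monotonic() > deadline:
            raise TimeoutError("page never settled")
        time.sleep(interval)
        current = snapshot()
        # Extend the streak on an unchanged snapshot, otherwise start over.
        streak = streak + 1 if current == last else 1
        last = current
    return last


# Stand-in for a page whose DOM keeps changing for the first few reads.
frames = iter(["<div>", "<div>a", "<div>ab"])


def fake_dom() -> str:
    return next(frames, "<div>abc</div>")


print(wait_until_stable(fake_dom))  # prints <div>abc</div>
```

The difference from a fixed sleep is that the wait ends as soon as the page is quiet and fails loudly if it never is.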

Framework Alternative 1: Crawlee (Apify)

Crawlee is a web crawling and scraping library maintained by Apify. It started as a Node.js library and added Python support in 2024. It's the closest spiritual successor to Scrapy: a framework for building crawlers with request queuing, automatic retries, and pluggable browser backends.

What Crawlee Does Well

JavaScript rendering built-in. Crawlee supports Playwright and BeautifulSoup as backends. You pick the right one per crawler without bolting on plugins.
Request queue and retry logic. Similar to Scrapy's scheduler but with automatic retries, error tracking, and configurable concurrency.
Apify platform integration. Deploy crawlers to Apify Cloud for managed execution, scheduling, and storage. This gives you the convenience of a managed API with the flexibility of a framework.
Python support is maturing. The Python SDK mirrors the Node.js API closely.

Where Crawlee Falls Short

Younger ecosystem. Scrapy has thousands of community plugins and StackOverflow answers. Crawlee's Python ecosystem is still growing.
Output is still raw HTML. Crawlee gives you the page source. You still need to parse it with selectors or feed it through an extraction pipeline.
No structured extraction. There's no built-in way to declare "give me the title and price from every product card as JSON." You write selectors yourself.

Crawlee vs Scrapy: Side-by-Side

# Scrapy: callback-based spider
# import scrapy
#
# class BookSpider(scrapy.Spider):
#     name = "books"
#     start_urls = ["https://books.toscrape.com"]
#
#     def parse(self, response):
#         for book in response.css("article.product_pod"):
#             yield {
#                 "title": book.css("h3 a::text").get(),
#                 "price": book.css(".price_color::text").get(),
#             }
#         next_page = response.css("li.next a::attr(href)").get()
#         if next_page:
#             yield response.follow(next_page, self.parse)

# Crawlee (Python): async handler, linear code
# import asyncio
#
# # Import path for Crawlee 0.5+; older releases used crawlee.playwright_crawler
# from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
#
#
# async def main() -> None:
#     crawler = PlaywrightCrawler()
#
#     @crawler.router.default_handler
#     async def handler(context: PlaywrightCrawlingContext) -> None:
#         books = await context.page.query_selector_all("article.product_pod")
#         for book in books:
#             title = await (await book.query_selector("h3 a")).inner_text()
#             price = await (await book.query_selector(".price_color")).inner_text()
#             await context.push_data({"title": title, "price": price})
#
#     await crawler.run(["https://books.toscrape.com"])
#
#
# asyncio.run(main())

Crawlee replaces the callback architecture with async handlers. The code reads top-to-bottom. But you still write selectors and parse HTML manually. For a deeper Playwright comparison, see the Puppeteer vs Playwright vs Browserbeam guide.

Framework Alternative 2: Playwright

Playwright is a browser automation framework from Microsoft. It's not a scraping framework like Scrapy. It's a browser control library. But many teams use it for scraping because it handles JavaScript rendering, multi-browser support, and complex interactions that Scrapy can't.

What Playwright Does Well

Full browser control. Chromium, Firefox, and WebKit. Network interception, request modification, cookie management, screenshot capture.
Multi-language support. Python, Node.js, Java, .NET. Scrapy is Python-only.
Auto-wait. Playwright waits for elements to be actionable before interacting. No explicit sleep calls for most operations.
Active development. Microsoft maintains it with monthly releases, unlike Scrapy's slower release cadence.

Where Playwright Falls Short

Not a scraping framework. No request queue, no retry logic, no built-in data pipeline. You build all of that yourself.
Local infrastructure. You manage browser binaries, memory, and process lifecycle. At scale, this becomes a significant operational burden.
Raw HTML output. page.content() returns the full DOM. You parse it with BeautifulSoup, lxml, or Playwright's own selectors.
No structured extraction. Like Crawlee, there's no way to declare a schema and get typed JSON back.
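
To give a sense of what "build all of that yourself" means, here is a minimal sketch of a request queue with retries, written against a pluggable `fetch` callable so it runs without a browser. The `crawl` and `fake_fetch` names are hypothetical. In a real Playwright scraper, `fetch` would wrap `page.goto()` and your extraction code.

```python
from collections import deque
from typing import Callable


def crawl(
    start_urls: list[str],
    fetch: Callable[[str], str],
    max_retries: int = 2,
) -> tuple[dict[str, str], list[str]]:
    """Process URLs in FIFO order, re-queueing failures up to `max_retries` times."""
    queue = deque((url, 0) for url in start_urls)
    results: dict[str, str] = {}
    failed: list[str] = []
    while queue:
        url, attempts = queue.popleft()
        try:
            results[url] = fetch(url)
        except Exception:
            if attempts < max_retries:
                queue.append((url, attempts + 1))  # retry later, at the back
            else:
                failed.append(url)
    return results, failed


# Hypothetical fetcher: "/flaky" fails once, "/down" always fails.
calls: dict[str, int] = {}


def fake_fetch(url: str) -> str:
    calls[url] = calls.get(url, 0) + 1
    if url == "/down" or (url == "/flaky" and calls[url] == 1):
        raise ConnectionError(url)
    return f"<html>{url}</html>"


results, failed = crawl(["/ok", "/flaky", "/down"], fake_fetch)
print(sorted(results), failed)  # prints ['/flaky', '/ok'] ['/down']
```

This covers only ordering and retries. Deduplication, backoff, persistence, and concurrency limits are further pieces a crawler framework gives you for free.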

Scrapy vs Playwright: Side-by-Side

# Playwright: sequential, browser-based
# from playwright.sync_api import sync_playwright
#
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page()
#     page.goto("https://books.toscrape.com")
#     page.wait_for_selector("article.product_pod")
#
#     books = page.query_selector_all("article.product_pod")
#     for book in books:
#         title = book.query_selector("h3 a").inner_text()
#         price = book.query_selector(".price_color").inner_text()
#         print(f"{title}: {price}")
#
#     browser.close()

Playwright gives you imperative control over the browser. The tradeoff versus Scrapy: you gain JavaScript rendering and interaction capabilities, but you lose the crawler framework (queuing, retries, pipelines). For teams that need both, Crawlee wraps Playwright with crawler infrastructure.

Framework Alternative 3: Selenium

Selenium is the original browser automation tool, dating back to 2004. It still has the largest community and the most learning resources. For many developers, the Scrapy vs Selenium question comes down to whether they need a crawler (Scrapy) or a browser (Selenium).

What Selenium Does Well

Massive community. More StackOverflow answers, more tutorials, more plugins than any other browser automation tool.
Enterprise adoption. The standard for QA testing. If your organization already uses Selenium for testing, using it for scraping avoids introducing a new tool.
WebDriver protocol. Works with Chrome, Firefox, Edge, and Safari through a standard protocol.

Where Selenium Falls Short

Slow. WebDriver protocol adds overhead compared to Playwright's direct protocol. Tests and scrapes run slower.
Verbose API. Simple tasks require more code than Playwright or Browserbeam. The Java-influenced API design shows its age in Python.
Driver management. Selenium Manager (bundled since Selenium 4.6) downloads a matching driver automatically, but pinned, offline, or containerized environments often still manage ChromeDriver or GeckoDriver by hand, and version mismatches remain a common source of failures.
No auto-wait. Selenium offers implicit and explicit waits, but WebDriverWait + expected_conditions is manual and verbose compared to Playwright's auto-wait or Browserbeam's stability detection.

Scrapy vs Selenium: Side-by-Side

# Selenium: verbose, driver-based
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
#
# driver = webdriver.Chrome()
# driver.get("https://books.toscrape.com")
# WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, "article.product_pod"))
# )
#
# books = driver.find_elements(By.CSS_SELECTOR, "article.product_pod")
# for book in books:
#     title = book.find_element(By.CSS_SELECTOR, "h3 a").text
#     price = book.find_element(By.CSS_SELECTOR, ".price_color").text
#     print(f"{title}: {price}")
#
# driver.quit()

Selenium works. It's just more code for the same result. If you're already in the Selenium ecosystem for testing, adding scraping is straightforward. If you're starting fresh, Playwright or a managed API is more efficient.

Managed API Alternative 1: Browserbeam

Browserbeam is a cloud browser API that returns structured output designed for AI agents and data pipelines. Instead of managing a browser locally, you call a REST API and get back clean markdown, structured JSON, and interaction primitives.

What Browserbeam Does Well

Structured output. Every observation returns clean markdown (not raw HTML) with optional page maps, interactive element lists, and link inventories. This is the key differentiator for LLM agent workflows.
Declarative extraction. Define a schema (field names mapped to CSS selectors) and get typed JSON back. No BeautifulSoup, no lxml, no parsing code.
Stability detection. Automatic network idle + DOM mutation checking means no time.sleep() or WebDriverWait. The page is stable when data is returned.
Interactive sessions. Click, fill, scroll, navigate within a persistent session. Not just one-shot page fetching.
Anti-bot handling. Built-in stealth, proxy rotation, and CAPTCHA solving without custom middleware.

Output Format for AI Agents

Where Scrapy returns raw HTML that an LLM struggles with, Browserbeam returns markdown that fits directly into a prompt. The difference matters at scale:

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://quotes.toscrape.com")

# Clean markdown, not raw HTML
markdown = session.page.markdown.content
print(f"Markdown length: {len(markdown)} characters")
print(markdown[:300])

session.close()

A typical page that's 45,000 characters of raw HTML becomes 1,500-3,000 characters of clean markdown. For an LLM processing hundreds of pages, that's the difference between burning through your token budget in an hour and running for a week. For a deep dive on extraction schemas, see the data extraction guide.

Code Example: Structured Extraction

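The workflow that needs a spider class, callbacks, selectors, and yield statements in Scrapy becomes a single declarative schema with Browserbeam. The same extraction is shown below as a raw REST call, then with the Python, TypeScript, and Ruby SDKs: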

curl -X POST https://api.browserbeam.com/v1/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com",
    "steps": [{
      "extract": {
        "books": [{
          "_parent": "article.product_pod",
          "title": "h3 a >> text",
          "price": ".price_color >> text",
          "url": "h3 a >> href"
        }]
      }
    }]
  }'

from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")

result = session.extract(
    books=[{
        "_parent": "article.product_pod",
        "title": "h3 a >> text",
        "price": ".price_color >> text",
        "url": "h3 a >> href"
    }]
)

for book in result.extraction["books"][:5]:
    print(f"{book['title']}: {book['price']}")

session.close()

import Browserbeam from "@browserbeam/sdk";

const client = new Browserbeam();
const session = await client.sessions.create({ url: "https://books.toscrape.com" });

const result = await session.extract({
  books: [{
    _parent: "article.product_pod",
    title: "h3 a >> text",
    price: ".price_color >> text",
    url: "h3 a >> href"
  }]
});

console.log(result.extraction.books.slice(0, 5));
await session.close();

require "browserbeam"

client = Browserbeam::Client.new
session = client.sessions.create(url: "https://books.toscrape.com")

result = session.extract(
  books: [{
    _parent: "article.product_pod",
    title: "h3 a >> text",
    price: ".price_color >> text",
    url: "h3 a >> href"
  }]
)

result.extraction["books"].first(5).each { |b| puts "#{b['title']}: #{b['price']}" }
session.close

No spider class. No callback chain. No browser binary. The schema declares what you want and the engine extracts it as typed JSON. For a step-by-step walkthrough of the Python SDK, see the getting started guide.
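
If the idea of declarative extraction is new, the sketch below shows the general technique with nothing but the standard library. It is not Browserbeam's implementation: it uses `xml.etree.ElementTree` paths instead of CSS selectors, works only on well-formed markup, and the `SCHEMA` layout and `extract` function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for a rendered product listing (hypothetical markup).
PAGE = """
<div>
  <article class="product_pod">
    <h3><a href="/book-1">A Light in the Attic</a></h3>
    <p class="price_color">51.77</p>
  </article>
  <article class="product_pod">
    <h3><a href="/book-2">Tipping the Velvet</a></h3>
    <p class="price_color">53.74</p>
  </article>
</div>
"""

# Schema: a parent path plus, per field, (path relative to parent, what to read).
SCHEMA = {
    "parent": ".//article[@class='product_pod']",
    "fields": {
        "title": ("h3/a", "text"),
        "price": ("p[@class='price_color']", "text"),
        "url": ("h3/a", "href"),
    },
}


def extract(page: str, schema: dict) -> list[dict]:
    """Apply the schema to every parent node and return one dict per match."""
    root = ET.fromstring(page)
    rows = []
    for parent in root.findall(schema["parent"]):
        row = {}
        for name, (path, source) in schema["fields"].items():
            node = parent.find(path)
            if node is None:
                row[name] = None            # field missing on this card
            elif source == "text":
                row[name] = (node.text or "").strip()
            else:
                row[name] = node.get(source)  # read an attribute, e.g. href
        rows.append(row)
    return rows


books = extract(PAGE, SCHEMA)
print(books[0])
```

The design choice worth noticing is that the schema is plain data. Changing what you extract means editing a dictionary, not rewriting parsing code.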

Managed API Alternative 2: ScrapingBee

ScrapingBee is a web scraping API focused on anti-bot handling and JavaScript rendering. It's a single-endpoint service: send a URL, get back rendered HTML or extracted data.

What ScrapingBee Does Well

Anti-bot coverage. Residential proxies, headless browser rotation, and premium proxy pools for heavily protected sites.
Simple API. One endpoint (/api/v1), one request, one response. Low learning curve.
CSS extraction. The extract_rules parameter lets you define selectors to pull specific data, similar to Browserbeam's extract but less flexible.
Google Search scraping. Dedicated endpoint for SERP scraping with structured output.
Generous free tier. 1,000 credits on signup.

Where ScrapingBee Falls Short

No interactive sessions. Each request is stateless. You can't click buttons, fill forms, or navigate within a session. This makes it unsuitable for multi-step agent workflows.
Raw HTML output by default. Without extract_rules, you get raw HTML and parse it yourself.
Fixed waits, not stability detection. The wait parameter is a fixed millisecond delay, not signal-based. The wait_for parameter waits for a CSS selector, which is better but still not automatic stability detection.
No markdown output. For LLM consumption, you need to convert the HTML yourself.

Managed API Alternative 3: Firecrawl

Firecrawl is an LLM-ready web scraping API focused on turning websites into clean data for AI. It's open source (the core is AGPL-3.0 licensed, with MIT-licensed SDKs) and also available as a hosted service.

What Firecrawl Does Well

LLM-optimized output. Returns clean markdown or structured data from any URL. The /scrape endpoint handles JavaScript rendering, pop-up dismissal, and content extraction in one call.
Crawl endpoint. The /crawl endpoint follows links and returns multiple pages, similar to Scrapy's crawling behavior but managed.
Open source. Self-host if you need data residency or cost control. 30,000+ GitHub stars.
Agent endpoint. The /agent endpoint handles multi-step web research autonomously, though it's designed for research tasks rather than arbitrary browser interactions.
Middle ground. Firecrawl fills the gap between raw scraping APIs and full browser automation, which is why it appears on most shortlists for LLM data pipelines.

Where Firecrawl Falls Short

Limited interactivity. The /scrape and /crawl endpoints are one-shot. You can't maintain a session, click specific elements by selector, or fill forms across pages.
No element refs or DOM diffs. Each scrape is independent. There's no concept of "what changed since my last observation."
Credits-based pricing. Each action consumes credits, and complex pages consume more. Costs can be unpredictable compared to fixed subscription plans.

Managed API Alternative 4: Browserless

Browserless is a hosted browser service that gives you Playwright and Puppeteer access through a WebSocket or REST API. It's the longest-running cloud browser provider, founded in 2017, and it's the baseline most developers have in mind when they search for a Browserless alternative.

What Browserless Does Well

Raw browser access. Connect your local Playwright or Puppeteer scripts to Browserless's cloud browsers via a WebSocket URL. Your existing scripts work with minimal changes.
Multi-engine support. Chrome, Firefox, and WebKit. Most cloud browser providers only support Chromium.
Self-hosting option. Run Browserless in your own infrastructure with a commercial license.
Maturity. In production since 2017. Battle-tested at scale with enterprise customers.
New extraction endpoints. /smart-scrape, /search, /map, and /crawl endpoints add AI-powered extraction on top of the raw browser.

Where Browserless Falls Short

Raw HTML by default. The core service returns whatever Playwright or Puppeteer returns. The newer extraction endpoints add structure but are still maturing.
Unit-based pricing. Each session consumes units in 30-second blocks. Short sessions that finish in 5 seconds still consume a full unit.
Steeper learning curve. You need to know Playwright or Puppeteer to use the core service. The newer REST endpoints are simpler but less documented.
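
The effect of block-based billing on short sessions is simple arithmetic. The sketch below assumes durations are rounded up to whole 30-second blocks, as described above. The `units_billed` function is hypothetical, and the provider's actual billing rules are the authority.

```python
import math


def units_billed(session_seconds: float, block_seconds: int = 30) -> int:
    """Round a session's duration up to whole billing blocks (minimum one)."""
    return max(1, math.ceil(session_seconds / block_seconds))


for seconds in (5, 30, 31, 95):
    print(f"{seconds:>3}s session -> {units_billed(seconds)} unit(s)")
```

Under this assumption a 5-second session and a 30-second session cost the same, so batching several page visits into one session is usually cheaper than opening many short ones.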

Head-to-Head: Features That Matter for AI Agents

Feature Comparison Table

This table compares all eight tools (Scrapy plus the seven alternatives) across twelve dimensions:

Feature	Scrapy	Crawlee	Playwright	Selenium	Browserbeam	ScrapingBee	Firecrawl	Browserless
JS rendering	Plugin only	Built-in	Built-in	Built-in	Built-in	Built-in	Built-in	Built-in
Output format	Raw HTML	Raw HTML	Raw HTML	Raw HTML	Markdown + JSON	HTML (+ extract)	Markdown + JSON	Raw HTML (+ extract)
Structured extraction	No	No	No	No	Declarative schema	CSS extract rules	LLM-based	REST endpoint
Interactive sessions	No	Limited	Yes	Yes	Yes	No	No	Yes (via Playwright)
Stability detection	No	No	`networkidle`	No	Network idle + DOM quiet	Fixed `wait` param	Auto	No
Anti-bot	Manual middleware	Plugin	Manual	Manual	Built-in	Built-in (strong)	Built-in	Built-in
Infrastructure	Local	Local / Apify Cloud	Local	Local	Cloud	Cloud	Cloud / self-host	Cloud / self-host
Language	Python	Python, Node.js	Multi-lang	Multi-lang	Python, JS, Ruby	Any (REST)	Any (REST)	JS
Concurrency	Async (Twisted)	Async	Async	Thread-based	API-managed	API-managed	API-managed	API-managed
DOM diff	No	No	No	No	Yes	No	No	No
Pricing	Free (OSS)	Free (OSS) + Apify	Free (OSS)	Free (OSS)	$29-299/mo	$49-599/mo	Free (OSS) + hosted	$10-200+/mo
Best for	Static HTML crawling	Modern Scrapy alt	Browser automation	Enterprise QA	AI agents, extraction	Anti-bot scraping	LLM pipelines	Hosted Playwright

Token Cost Comparison: Scrapy Output vs Structured Output

For AI agent use cases, output format determines token cost. Here's what the same page produces across different tools:

from browserbeam import Browserbeam

client = Browserbeam()

# Observe in "main" mode for clean, boilerplate-free content
session = client.sessions.create(url="https://books.toscrape.com")
result = session.observe(mode="main")

raw_html_estimate = 45000   # typical full HTML for this page
markdown_length = len(result.page.markdown.content)

# Rough token estimate: 1 token ~= 4 characters
raw_tokens = raw_html_estimate // 4
markdown_tokens = markdown_length // 4

print(f"Raw HTML: ~{raw_html_estimate:,} chars (~{raw_tokens:,} tokens)")
print(f"Markdown: ~{markdown_length:,} chars (~{markdown_tokens:,} tokens)")
print(f"Reduction: {100 - (markdown_length * 100 // raw_html_estimate)}%")

session.close()

Output Format	Typical Characters	Approx. Tokens	Reduction
Raw HTML (Scrapy, Selenium)	~45,000	~11,250	Baseline
Markdown (Browserbeam, Firecrawl)	~1,500	~375	97%
Structured JSON (Browserbeam extract)	~800	~200	98%

At 1,000 pages per day, the token difference between raw HTML and structured markdown is roughly 10.9 million tokens per day (about 10,875 tokens saved per page). That's real money when you're feeding pages to an LLM.
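
The daily figure follows directly from the table. Here is the arithmetic as a small sketch, using the same rough 4-characters-per-token heuristic. The function names are made up for this example, and real tokenizers will give somewhat different counts.

```python
CHARS_PER_TOKEN = 4  # rough heuristic: 1 token ~= 4 characters


def approx_tokens(chars: int) -> int:
    """Estimate a token count from a character count."""
    return chars // CHARS_PER_TOKEN


def daily_tokens_saved(raw_chars: int, clean_chars: int, pages_per_day: int) -> int:
    """Tokens saved per day by sending clean output instead of raw HTML."""
    per_page = approx_tokens(raw_chars) - approx_tokens(clean_chars)
    return per_page * pages_per_day


saved = daily_tokens_saved(raw_chars=45_000, clean_chars=1_500, pages_per_day=1_000)
print(f"{saved:,} tokens saved per day")  # prints "10,875,000 tokens saved per day"
```

Multiply the result by your model's input price per token to turn it into a budget line.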

Decision Framework: When to Use What

The best web scraping framework for your project depends on four factors: output format needs, infrastructure preferences, interactivity requirements, and budget constraints.

You Need Structured Output for LLMs

Use Browserbeam or Firecrawl.

If your scraping feeds an AI agent or LLM pipeline, structured output is the single most important feature. Both Browserbeam and Firecrawl return markdown and structured data. Browserbeam adds interactive sessions (click, fill, navigate) and DOM diffs. Firecrawl adds autonomous crawling and a research agent. Choose based on whether you need interactivity.

You Need a Full Python Framework

Use Crawlee.

If you want a Scrapy-like framework with JavaScript support, request queuing, and the option to deploy to a managed platform (Apify Cloud), Crawlee is the direct successor. Crawlee's Python SDK mirrors the established Node.js API. You keep the framework mindset while gaining browser rendering.

You Need Raw Browser Control

Use Playwright (local) or Browserless (cloud).

If you need to intercept network requests, inject JavaScript, capture screenshots, or run cross-browser tests alongside scraping, Playwright gives you full CDP access. If you want to run Playwright scripts without managing browser infrastructure, point them at Browserless's WebSocket endpoint.

You Need Maximum Anti-Bot Coverage

Use ScrapingBee.

If your primary challenge is getting past anti-bot protections on heavily protected sites, ScrapingBee's residential proxy pools and specialized browser rotation handle the toughest targets. The tradeoff: you give up interactive sessions and structured output. For a broader cloud API comparison, see the cloud browser API guide.

Use Case	Best Tool	Why
AI agent browsing the web	Browserbeam	Structured output, interactive sessions, stability detection
LLM knowledge base building	Firecrawl	Autonomous crawling, markdown output, `/crawl` endpoint
Python web crawling at scale	Crawlee	Modern framework, JS rendering, Apify deployment
Cross-browser testing + scraping	Playwright	Full CDP access, multi-browser, best auto-wait
Scraping protected e-commerce sites	ScrapingBee	Strongest anti-bot, residential proxies
Running Playwright in the cloud	Browserless	WebSocket connection, multi-engine, self-host option
Legacy QA automation + scraping	Selenium	Largest community, enterprise standard

Migration Guide: Moving from Scrapy to a Managed API

If you're migrating from Scrapy to a managed API, the architecture changes more than the logic. Scrapy's spider-callback-pipeline pattern maps to a simpler request-extract-process pattern.

What Changes and What Stays the Same

Scrapy Concept	Managed API Equivalent
Spider class	A function or script
`start_urls`	`session = client.sessions.create(url="...")`
`response.css()` / `response.xpath()`	`session.extract(schema=...)` or `session.observe()`
`yield item`	Append to a list or write to database
`response.follow(next_page)`	`session.goto(url=next_url)`
Item pipelines	Process data after extraction (same logic, no framework)
Scrapy middleware	Not needed (API handles proxies, retries, rendering)
`scrapy crawl myspider`	`python my_script.py`

Before/After Code Example

# BEFORE: Scrapy spider (callback architecture)
# import scrapy
#
# class BookSpider(scrapy.Spider):
#     name = "books"
#     start_urls = ["https://books.toscrape.com"]
#
#     def parse(self, response):
#         for book in response.css("article.product_pod"):
#             yield {
#                 "title": book.css("h3 a::text").get(),
#                 "price": book.css(".price_color::text").get(),
#                 # Join all text nodes: the first one is only whitespace before the icon
#                 "in_stock": "".join(book.css(".instock.availability::text").getall()).strip(),
#             }
#         next_page = response.css("li.next a::attr(href)").get()
#         if next_page:
#             yield response.follow(next_page, self.parse)

# AFTER: Browserbeam (linear flow)
from browserbeam import Browserbeam

client = Browserbeam()
session = client.sessions.create(url="https://books.toscrape.com")
all_books = []

for page_num in range(3):
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text",
            "in_stock": ".instock.availability >> text"
        }]
    )
    all_books.extend(result.extraction["books"])
    print(f"Page {page_num + 1}: {len(result.extraction['books'])} books")

    if page_num < 2:  # no need to navigate again after the last page
        next_url = f"https://books.toscrape.com/catalogue/page-{page_num + 2}.html"
        session.goto(url=next_url)

session.close()
print(f"Total: {len(all_books)} books across 3 pages")

The Scrapy version requires understanding spiders, callbacks, response.follow, item dictionaries, and running scrapy crawl. The Browserbeam version is a Python script with a for loop. For building complete agent workflows on top of this pattern, see the web scraping agent guide.
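
Item pipelines are the part of Scrapy that teams most often worry about losing, but the logic ports over as ordinary functions. The sketch below is one possible shape, not a prescribed pattern: each step takes an item and returns either a cleaned item or `None` to drop it. The step names and sample items are invented for illustration.

```python
from typing import Callable, Optional

Item = dict
Step = Callable[[Item], Optional[Item]]


def clean_price(item: Item) -> Optional[Item]:
    """Turn a price string such as '£51.77' into a float."""
    digits = "".join(ch for ch in item["price"] if ch.isdigit() or ch == ".")
    return {**item, "price": float(digits)}


def drop_out_of_stock(item: Item) -> Optional[Item]:
    """Return None to drop the item, mirroring a pipeline that discards it."""
    return item if "In stock" in item.get("in_stock", "") else None


def run_pipeline(items: list[Item], steps: list[Step]) -> list[Item]:
    """Pass each item through every step in order, skipping dropped items."""
    kept = []
    for item in items:
        for step in steps:
            item = step(item)
            if item is None:
                break
        else:  # no step dropped the item
            kept.append(item)
    return kept


raw_items = [
    {"title": "A Light in the Attic", "price": "£51.77", "in_stock": "In stock"},
    {"title": "Hypothetical Sold-Out Book", "price": "£10.00", "in_stock": "Out of stock"},
]
cleaned = run_pipeline(raw_items, [clean_price, drop_out_of_stock])
print(cleaned)
```

You would call `run_pipeline` on the list of extracted books before writing them to your database.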

Common Mistakes When Switching from Scrapy

1. Keeping Scrapy's Callback Architecture in a Synchronous API

Scrapy's callback pattern (yield response.follow(url, self.parse)) exists because Scrapy uses an async reactor. When you move to a synchronous API, you don't need callbacks. Write linear code: navigate, extract, process, navigate to the next page. Forcing a callback pattern onto a synchronous API creates unnecessary complexity.

2. Not Using Structured Extraction (Still Parsing HTML)

Teams migrate to a new tool but keep using BeautifulSoup to parse HTML. If your new tool offers structured extraction (Browserbeam's extract, ScrapingBee's extract_rules, Firecrawl's structured output), use it. Declarative schemas are easier to maintain than imperative parsing code. For schema design patterns, see the structured web scraping guide.

3. Over-Engineering Concurrency

Scrapy's CONCURRENT_REQUESTS and DOWNLOAD_DELAY settings require careful tuning. With a managed API, the provider handles concurrency and rate limiting. Don't build your own async request queue on top of a managed API. Start with sequential requests and add concurrency only when you've confirmed it works correctly.
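
When you do reach the point of adding concurrency, a small bounded pool from the standard library is usually enough. In this sketch `scrape` is a hypothetical placeholder for one blocking API call, and the URLs are examples. The pattern is to get the sequential version right first, then run the same function under a pool and confirm the results match.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape(url: str) -> dict:
    """Placeholder for one blocking API call (hypothetical)."""
    return {"url": url, "chars": len(url)}


urls = [f"https://example.com/page-{n}" for n in range(1, 6)]

# Step 1: sequential baseline.
sequential = [scrape(u) for u in urls]

# Step 2: same function under a small, bounded pool; map() preserves input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    concurrent = list(pool.map(scrape, urls))

print(concurrent == sequential)  # prints True
```

Keeping `max_workers` low also keeps you well inside whatever rate limits your plan allows.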

4. Ignoring JavaScript Rendering When Moving to a New Tool

Some teams switch from Scrapy to a new tool but keep targeting static HTML endpoints. If you're already switching tools, take the opportunity to target the JavaScript-rendered page instead of the API endpoint you reverse-engineered. The rendered page is more stable (site redesigns change the API, the visible product card stays) and requires less reverse engineering.

5. Choosing Based on Price Instead of Output Format

A tool that costs $29/month but returns structured JSON might save more than a free framework that returns raw HTML you spend 10 hours/week parsing. Factor in engineering time, not just API costs. The cheapest tool is the one that minimizes total cost including your time. For a perspective on how structured output feeds LLM pipelines, see the LLM training data pipeline guide.
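
A back-of-the-envelope calculation makes the tradeoff visible. The hourly rate, the four-week month, and the one hour per week of maintenance on the managed side are all assumptions for illustration. Plug in your own numbers.

```python
def monthly_cost(
    subscription: float,
    hours_per_week: float,
    hourly_rate: float,
    weeks_per_month: int = 4,  # rough approximation
) -> float:
    """Subscription fee plus the cost of engineering time spent on the tool."""
    return subscription + hours_per_week * weeks_per_month * hourly_rate


HOURLY_RATE = 50.0  # assumed engineering cost per hour

free_framework = monthly_cost(subscription=0, hours_per_week=10, hourly_rate=HOURLY_RATE)
managed_api = monthly_cost(subscription=29, hours_per_week=1, hourly_rate=HOURLY_RATE)

print(f"Free framework: ${free_framework:,.0f}/month")  # prints "Free framework: $2,000/month"
print(f"Managed API:    ${managed_api:,.0f}/month")     # prints "Managed API:    $229/month"
```

Under these assumptions the "free" option is the expensive one. The conclusion flips if your parsing time is close to zero.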

Frequently Asked Questions

What are the best Scrapy alternatives for Python in 2026?

The top Scrapy alternatives depend on your needs. For a framework replacement with JavaScript support, Crawlee (by Apify) is the closest match. For a managed API with structured output, Browserbeam returns markdown and JSON directly. For raw browser control, Playwright is the industry standard. For anti-bot protection, ScrapingBee handles the toughest sites. All four render JavaScript, which is Scrapy's primary limitation.

Can Scrapy handle JavaScript-rendered pages?

Technically yes, with plugins like Scrapy-Playwright or Scrapy-Splash. Practically, these plugins add complexity and failure modes. Scrapy-Splash depends on the Splash rendering service, which hasn't been actively maintained. Scrapy-Playwright works better but turns your spider into a hybrid that's harder to debug. If JavaScript rendering is a core requirement, consider a tool that supports it natively: Crawlee, Playwright, or a cloud browser API like Browserbeam.

What is the best Python scraping library for AI agents?

For AI agents, the best Python scraping library returns structured, token-efficient output. Browserbeam returns clean markdown (up to 97% smaller than raw HTML in the comparison above) and typed JSON from declarative schemas. Firecrawl returns markdown and has an autonomous agent endpoint for research tasks. Both are better choices than Scrapy, Playwright, or Selenium for agent workflows, because they eliminate the HTML parsing step entirely.

Scrapy vs BeautifulSoup: which should I use?

Scrapy and BeautifulSoup solve different problems. Scrapy is a full crawling framework (scheduling, retries, pipelines) that can download pages and parse them. BeautifulSoup is a parsing library that only parses HTML you already have. They're often used together: Scrapy downloads pages, BeautifulSoup or Scrapy's built-in selectors parse them. If you need crawling, use Scrapy (or Crawlee). If you just need to parse a single HTML document, use BeautifulSoup.

Is Crawlee better than Scrapy for web crawling?

Crawlee is better for modern web crawling that involves JavaScript-rendered pages. It supports Playwright and BeautifulSoup backends, has built-in request queuing and retries, and deploys to Apify Cloud for managed execution. Scrapy is better for high-volume static HTML crawling where JavaScript rendering isn't needed. Scrapy's ecosystem (plugins, documentation, community) is larger. Crawlee's Python SDK is newer but growing. If your crawl targets are mostly JavaScript-heavy sites, Crawlee wins. If they're mostly static HTML, Scrapy is still solid.

What is the best web scraping framework for beginners?

For beginners who want to learn web scraping fundamentals, BeautifulSoup with the requests library is the simplest starting point. For beginners building a real project, Browserbeam has the lowest barrier: install the SDK, create a session with a URL, and get clean data back without managing browsers. Scrapy has a steeper learning curve (spiders, callbacks, pipelines) but teaches important concepts. Playwright is powerful but requires understanding browser lifecycle management.

How do I migrate from Scrapy to a managed API?

Replace the spider class with a script. Replace start_urls with client.sessions.create(url="..."). Replace response.css() selectors with session.extract(schema=...) for structured data or session.observe() for markdown. Replace response.follow() with session.goto(). Remove middleware (the API handles proxies and rendering). Remove item pipelines (process data inline or write to your database directly). The architectural shift is from a framework (Scrapy manages the event loop) to a library (you manage the control flow).

Start Building Without Scrapy's Overhead

Seven alternatives, two categories, one decision that matters most: do you want a framework or a managed API?

If you want a framework, Crawlee is the modern Scrapy replacement with JavaScript support. Playwright gives you raw browser control. Selenium works if you're already in the ecosystem.

If you want a managed API, the choice depends on output format. Browserbeam returns structured markdown and JSON for AI agent workflows. Firecrawl returns markdown for LLM data pipelines. ScrapingBee handles anti-bot for protected sites. Browserless hosts your Playwright scripts in the cloud.

For most teams building AI agents or data pipelines in 2026, the managed API path saves more time than the framework path. No browser management, no parsing code, no infrastructure scaling. The best API for AI agents is the one that returns data in the format your agent needs.

Grab the SDK and try the extraction example from this post:

pip install browserbeam        # Python
npm install @browserbeam/sdk   # TypeScript
gem install browserbeam        # Ruby

Sign up for a free account and extract your first structured dataset. The API docs have the full reference for extraction schemas, interactive sessions, and all the features covered in this comparison.