How to Build a Web Crawler from Scratch: Python and JavaScript

May 08, 2026 18 min read

Web crawler illustration with Python, JavaScript, and Ruby logos showing a spider bot crawling web pages

By the end of this guide, you'll have a working web crawler that discovers pages, follows links, respects robots.txt, stores URLs in a database, and extracts structured data from every page it visits. We'll build crawlers from scratch in Python and JavaScript, then show how to crawl JavaScript-rendered pages using Browserbeam's cloud browser API.

A web crawler is different from a web scraper. A scraper extracts data from pages you already know about. A crawler discovers those pages in the first place by following links across a site. Most real-world data collection pipelines need both: a crawler to find URLs, and a scraper to pull structured data from each one.

This guide covers crawler architecture, working implementations in multiple languages, production patterns (sitemaps, databases, rate limiting, Docker), and the common mistakes that trip up most developers building their first crawler.

What you'll learn in this guide:

  • The difference between web crawlers and web scrapers (and when you need both)
  • How to design a crawler architecture with URL frontier, fetcher, parser, and storage
  • How to build a complete web crawler in Python with Requests and BeautifulSoup
  • How to build a web crawler in JavaScript with Axios and Cheerio
  • How to handle robots.txt, crawl delays, and politeness rules
  • How to crawl JavaScript-rendered pages with Browserbeam
  • Production patterns: sitemaps, databases, rate limiting, distributed crawling, and Docker

TL;DR: A web crawler discovers pages by following links from a seed URL. You need a URL frontier (queue of URLs to visit), a fetcher (HTTP client), a parser (link extractor), and storage (database or file). Python's Requests + BeautifulSoup and Node.js's Axios + Cheerio handle static pages. For JavaScript-rendered content, Browserbeam crawls pages in a cloud browser and returns structured data without you managing browser infrastructure.


Web Crawler vs Web Scraper: What's the Difference?

This is the most common question developers ask before building either tool. The terms get used interchangeably, but they solve different problems.

What a Web Crawler Does

A web crawler starts with one or more seed URLs and discovers new pages by following links. Its job is exploration. Think of Googlebot: it visits a page, finds all the links on that page, adds them to a queue, and repeats. The output is a list of discovered URLs (and optionally the raw HTML of each page).

Crawlers care about:
- Breadth: How many pages can I find?
- Depth: How many levels of links do I follow?
- Politeness: Am I overwhelming the server?
- Deduplication: Have I already visited this URL?

What a Web Scraper Does

A web scraper takes a known URL (or list of URLs) and extracts specific data from the page. Its job is extraction. You give it a product page, and it pulls out the title, price, and availability. The output is structured data: JSON, CSV, or database rows.

Scrapers care about:
- Selectors: Which elements contain the data I want?
- Structure: What shape should the output take?
- Reliability: Does the selector still work after a site redesign?
- Scale: Can I extract from thousands of pages efficiently?

For a deep dive into scraping techniques, see our Python web scraping guide or Node.js web scraping guide.

When You Need Both

Most data collection projects combine both. The crawler discovers all product pages on a site. The scraper extracts price, title, and stock status from each page. In practice, you can combine them into a single script: crawl and extract in the same loop.

|  | Web Crawler | Web Scraper |
| --- | --- | --- |
| Purpose | Discover URLs | Extract data from URLs |
| Input | Seed URL(s) | List of known URLs |
| Output | List of discovered URLs | Structured data (JSON, CSV) |
| Core logic | Follow links, manage queue | Parse HTML, extract fields |
| Example | Find all product pages on a store | Get price and title from each product page |
| Analogy | Library catalog (finds all books) | Reader (extracts info from a book) |

Web Crawler Architecture

Every web crawler, from a 50-line Python script to Googlebot, shares the same four components. The scale differs, but the architecture stays constant.

Core Components of a Web Crawler

Web crawler architecture diagram: Seed URLs → URL Frontier (queue + visited set) → Fetcher (HTTP client) → Parser (link extractor) → Storage (database), with newly discovered URLs fed back into the frontier.

  1. URL Frontier (the queue): Holds URLs waiting to be crawled. New links discovered by the parser get added here. A visited set prevents duplicates.

  2. Fetcher (HTTP client): Downloads the HTML for each URL. Handles timeouts, retries, status codes, and respects crawl delays.

  3. Parser (link extractor): Parses the HTML, finds all <a href="..."> links, normalizes them to absolute URLs, and filters by domain or path rules.

  4. Storage: Saves crawl results. This could be a simple JSON file, a SQLite database, or a full PostgreSQL instance for production crawlers.

The crawl loop:
1. Pop a URL from the frontier
2. Fetch the page
3. Parse the HTML for links
4. Add new links to the frontier
5. Store the page data
6. Repeat until the frontier is empty or a limit is reached

Crawl Strategies: BFS vs DFS

Breadth-first search (BFS) explores all pages at the current depth before going deeper. Start at the homepage, visit every link on that page, then visit every link on those pages. BFS is the standard for most crawlers because it finds the most important pages first (pages linked directly from the homepage tend to be higher-value).

Depth-first search (DFS) follows one path as deep as possible before backtracking. DFS can get stuck in deep link chains (like pagination). It's rarely used for general crawling.

Priority queues let you weight URLs by importance. Pages linked from many other pages get crawled first. Sitemaps can provide explicit priority hints.

| Strategy | Best For | Risk |
| --- | --- | --- |
| BFS (queue) | General site crawling, discovery | Can be slow for deep content |
| DFS (stack) | Following specific paths | Gets stuck in deep chains |
| Priority queue | Large-scale crawling | More complex to implement |
| Sitemap-first | Known site structure | Misses unlisted pages |
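
If you want to experiment with a priority frontier, Python's heapq is enough for a minimal version. This sketch is illustrative rather than part of the crawler built below; the score function (fewer path segments = higher priority) is a stand-in for whatever signals you care about:

import heapq
import itertools

class PriorityFrontier:
    """Minimal priority-based URL frontier. Lower score = crawled sooner."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (score, next(self._counter), url))

    def pop(self):
        score, _, url = heapq.heappop(self._heap)
        return url

# Example: prefer shallow URLs (fewer path segments) over deep ones
frontier = PriorityFrontier()
for url in ["https://example.com/", "https://example.com/a/b/c", "https://example.com/a"]:
    frontier.push(url, score=url.count("/"))
print(frontier.pop())  # https://example.com/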

How Search Engines Do It

Google's crawler operates on the same principles, just at massive scale. Googlebot uses a priority queue with hundreds of signals (PageRank, freshness, crawl budget). It respects robots.txt, manages its own per-site crawl rate, and re-crawls pages based on how often they change.

The key difference from your crawler: search engines crawl the entire web. Your crawler crawls one site (or a handful of sites). That means you can use simpler data structures and skip the distributed systems complexity for most use cases.


Building a Web Crawler in Python

Let's build a working web crawler from scratch. We'll crawl Books to Scrape, discovering all book listing pages by following pagination links.

Project Setup and Dependencies

mkdir python-crawler && cd python-crawler
pip install requests beautifulsoup4

Two libraries are all you need. Requests handles HTTP. BeautifulSoup parses HTML and extracts links. For a full guide to Python scraping libraries, see our web scraping with Python guide.

The URL Frontier: Queue and Visited Set

The frontier is the heart of any web crawler. We use a deque for the queue (efficient popleft for BFS) and a set for tracking visited URLs:

from collections import deque

frontier = deque(["https://books.toscrape.com/"])
visited = set()

The visited set prevents infinite loops. Without it, your crawler would revisit the same pages forever (page A links to page B, page B links back to page A).

Fetching Pages with Requests

import requests

def fetch(url):
    try:
        response = requests.get(url, timeout=10, headers={
            "User-Agent": "MyCrawler/1.0 (educational project)"
        })
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return None

Always set a timeout. Without one, a slow server can hang your crawler indefinitely. The User-Agent header identifies your crawler to the site owner.

Extracting Links with BeautifulSoup

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def extract_links(html, base_url, allowed_domain):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        parsed = urlparse(url)
        # Stay within the target domain
        if parsed.netloc == allowed_domain:
            # Remove fragments
            clean_url = parsed._replace(fragment="").geturl()
            links.append(clean_url)
    return links

urljoin converts relative URLs (/catalogue/page-2.html) to absolute ones (https://books.toscrape.com/catalogue/page-2.html). We filter by domain to avoid crawling the entire internet.

Storing Results in a Database

For anything beyond a quick test, store crawl results in a database. SQLite requires zero setup:

import sqlite3

def init_db(db_path="crawl_results.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT,
            status_code INTEGER,
            crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def save_page(conn, url, title, status_code):
    conn.execute(
        "INSERT OR IGNORE INTO pages (url, title, status_code) VALUES (?, ?, ?)",
        (url, title, status_code)
    )
    conn.commit()

Putting It All Together

Here's the complete Python web crawler. It crawls Books to Scrape, follows pagination links, and stores every discovered page in SQLite:

import time
import sqlite3
import requests
from collections import deque
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

SEED_URL = "https://books.toscrape.com/"
ALLOWED_DOMAIN = "books.toscrape.com"
MAX_PAGES = 60
CRAWL_DELAY = 1  # seconds between requests

def fetch(url):
    try:
        response = requests.get(url, timeout=10, headers={
            "User-Agent": "MyCrawler/1.0 (educational project)"
        })
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        print(f"Failed: {url} ({e})")
        return None

def extract_links(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        parsed = urlparse(url)
        if parsed.netloc == ALLOWED_DOMAIN:
            clean_url = parsed._replace(fragment="").geturl()
            links.append(clean_url)
    return links

def get_title(html):
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    return tag.get_text(strip=True) if tag else None

def crawl():
    conn = sqlite3.connect("crawl_results.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT,
            status_code INTEGER,
            crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()

    frontier = deque([SEED_URL])
    visited = set()
    pages_crawled = 0

    while frontier and pages_crawled < MAX_PAGES:
        url = frontier.popleft()
        if url in visited:
            continue

        visited.add(url)
        response = fetch(url)
        if response is None:
            continue

        title = get_title(response.text)
        conn.execute(
            "INSERT OR IGNORE INTO pages (url, title, status_code) VALUES (?, ?, ?)",
            (url, title, response.status_code)
        )
        conn.commit()

        new_links = extract_links(response.text, url)
        for link in new_links:
            if link not in visited:
                frontier.append(link)

        pages_crawled += 1
        print(f"[{pages_crawled}/{MAX_PAGES}] {url}")
        time.sleep(CRAWL_DELAY)

    conn.close()
    print(f"Done. Crawled {pages_crawled} pages.")

if __name__ == "__main__":
    crawl()

Run it and you'll have a SQLite database with every page on Books to Scrape, including titles and URLs. The crawler respects a 1-second delay between requests and stops at 60 pages.
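
To sanity-check the results, query the database directly. A quick sketch (assumes the default crawl_results.db path from the crawler above):

import sqlite3

conn = sqlite3.connect("crawl_results.db")
total = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(f"{total} pages stored")
for url, title in conn.execute("SELECT url, title FROM pages LIMIT 5"):
    print(f"{title} -> {url}")
conn.close()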


Building a Web Crawler in JavaScript (Node.js)

The same architecture translates to Node.js with Axios and Cheerio. The async event loop makes concurrent fetching natural.

Setup with Axios and Cheerio

mkdir node-crawler && cd node-crawler
npm init -y
npm pkg set type=module
npm install axios cheerio

For more Node.js scraping patterns, see our web scraping with Node.js guide.

Async Crawl Loop with URL Queue

import axios from "axios";
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const SEED_URL = "https://books.toscrape.com/";
const ALLOWED_DOMAIN = "books.toscrape.com";
const MAX_PAGES = 60;
const CRAWL_DELAY_MS = 1000;

const frontier = [SEED_URL];
const visited = new Set();
const results = [];

function extractLinks(html, baseUrl) {
  const $ = cheerio.load(html);
  const links = [];
  $("a[href]").each((_, el) => {
    const href = $(el).attr("href");
    try {
      const parsed = new URL(href, baseUrl);
      parsed.hash = ""; // strip fragments so the same page isn't queued twice
      if (parsed.hostname === ALLOWED_DOMAIN) {
        links.push(parsed.href);
      }
    } catch {
      // Skip hrefs that aren't valid URLs
    }
  });
  return links;
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function crawl() {
  let pagesCrawled = 0;

  while (frontier.length > 0 && pagesCrawled < MAX_PAGES) {
    const url = frontier.shift();
    if (visited.has(url)) continue;

    visited.add(url);

    try {
      const response = await axios.get(url, {
        timeout: 10000,
        headers: { "User-Agent": "MyCrawler/1.0 (educational project)" },
      });

      const $ = cheerio.load(response.data);
      const title = $("title").text().trim();

      results.push({ url, title, status: response.status });

      const links = extractLinks(response.data, url);
      for (const link of links) {
        if (!visited.has(link)) {
          frontier.push(link);
        }
      }

      pagesCrawled++;
      console.log(`[${pagesCrawled}/${MAX_PAGES}] ${url}`);
      await sleep(CRAWL_DELAY_MS);
    } catch (error) {
      console.log(`Failed: ${url} (${error.message})`);
    }
  }

  writeFileSync("crawl_results.json", JSON.stringify(results, null, 2));
  console.log(`Done. Crawled ${pagesCrawled} pages.`);
}

crawl();

Full Working Crawler: What It Does

This crawler starts at the Books to Scrape homepage, discovers all linked pages within the domain, and saves the results to a JSON file. The pattern is identical to the Python version: frontier queue, visited set, fetch-parse-store loop.

The JavaScript version has one advantage for concurrent crawling: you can fetch multiple pages in parallel with Promise.all while respecting rate limits. For the basic version above, we keep it sequential with a delay to stay polite.


Handling robots.txt and Crawl Etiquette

A responsible crawler checks what it's allowed to crawl before making requests. The robots.txt file at the root of every website tells crawlers which paths are off-limits.

Parsing robots.txt

Here's how to check whether a URL is allowed in Python:

from urllib.robotparser import RobotFileParser

def check_robots(url, user_agent="*"):
    from urllib.parse import urlparse
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch(user_agent, url)

Python's standard library includes RobotFileParser. No extra dependencies needed. Call check_robots(url) before fetching each page.
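
One refinement worth making: don't re-download robots.txt for every URL. The sketch below (the cache dict and allowed_by_robots helper are illustrative, not part of the code above) keeps one parser per domain and treats an unreachable robots.txt as "no rules", which is a policy choice you may want to flip:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # domain -> RobotFileParser, or None if robots.txt was unreachable

def allowed_by_robots(url, user_agent="MyCrawler/1.0"):
    parsed = urlparse(url)
    domain = parsed.netloc
    if domain not in _robots_cache:
        rp = RobotFileParser()
        rp.set_url(f"{parsed.scheme}://{domain}/robots.txt")
        try:
            rp.read()
            _robots_cache[domain] = rp
        except OSError:
            # Network failure while fetching robots.txt; remember that we have no rules
            _robots_cache[domain] = None
    parser = _robots_cache[domain]
    if parser is None:
        return True  # policy choice: no reachable robots.txt means allowed
    return parser.can_fetch(user_agent, url)

# In the crawl loop, before fetching:
# if not allowed_by_robots(url):
#     continue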

Implementing Crawl Delays

The robots.txt file may include a Crawl-delay directive. Respect it:

def get_crawl_delay(robots_url, user_agent="*"):
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    delay = rp.crawl_delay(user_agent)
    return delay if delay else 1  # Default to 1 second

If no crawl-delay is specified, use a minimum of 1 second between requests to the same domain. For larger crawls, 2-3 seconds is safer. Hammering a small site with rapid requests can get your IP blocked or cause real harm to their infrastructure.

Setting a Proper User-Agent

Your crawler should identify itself clearly. A good User-Agent string includes:
- Your crawler name
- A version number
- Contact info or a URL where the site owner can learn more

headers = {
    "User-Agent": "MyCrawler/1.0 (+https://yoursite.com/crawler-info)"
}

Never impersonate a real browser (like Chrome or Firefox) unless you have a specific reason. Site owners use User-Agent strings to understand their traffic. Impersonation is dishonest and can violate terms of service.


Crawling JavaScript-Rendered Pages with Browserbeam

Static crawlers using Requests or Axios only see the initial HTML response. Modern websites built with React, Vue, or Angular load content with JavaScript after the page arrives. A static fetcher gets an empty <div id="root"></div> instead of actual content.

Why Static Crawlers Miss Content

Open Quotes to Scrape (JS version) in your browser. You'll see quotes. Fetch it with requests.get() and you'll get an empty page. The quotes are injected by JavaScript after the initial HTML loads.
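
You can verify this with a static fetch. The sketch below assumes the JavaScript version lives at https://quotes.toscrape.com/js/ and that rendered quotes use the .quote class:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://quotes.toscrape.com/js/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(len(soup.select(".quote")))  # 0 -- the quotes are injected by JavaScript at runtime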

For JavaScript-heavy sites, you need a tool that runs a real browser. You can manage Puppeteer or Playwright locally, or use a cloud browser API like Browserbeam that handles the browser for you and returns structured data.

For a full comparison of browser automation options, see our Puppeteer vs Playwright vs Browserbeam analysis.

Crawling with Browserbeam

Browserbeam runs a real browser in the cloud. You send HTTP requests, it renders pages with JavaScript, and returns structured data. No Chromium to install, no Docker containers, no crash handling. The Python SDK wraps the API in a clean interface.

Here's a crawler that discovers pages and extracts structured data from each one:

from browserbeam import Browserbeam
from collections import deque
from urllib.parse import urljoin

client = Browserbeam(api_key="YOUR_API_KEY")
seed_url = "https://books.toscrape.com/"
frontier = deque([seed_url])
visited = set()
all_books = []

while frontier and len(visited) < 10:
    url = frontier.popleft()
    if url in visited:
        continue
    visited.add(url)

    session = client.sessions.create(url=url)

    # Extract book data from the current page
    result = session.extract(
        books=[{
            "_parent": "article.product_pod",
            "title": "h3 a >> text",
            "price": ".price_color >> text",
            "url": "h3 a >> href"
        }]
    )

    if result.extraction and result.extraction.get("books"):
        all_books.extend(result.extraction["books"])

    # Extract pagination links
    links_result = session.extract(
        next_page=".next a >> href"
    )

    session.close()

    if links_result.extraction and links_result.extraction.get("next_page"):
        next_url = urljoin(url, links_result.extraction["next_page"])
        if next_url not in visited:
            frontier.append(next_url)

print(f"Crawled {len(visited)} pages, found {len(all_books)} books")

The Browserbeam crawler handles JavaScript rendering, cookie banners, and stability detection automatically. Each sessions.create call spins up a real browser, renders the page, and waits for content to stabilize before extraction. No sleep timers, no manual wait logic.


Web Crawler Design Patterns for Production

A basic crawler works fine for 50-100 pages. For larger crawls, you need patterns that handle scale, failures, and efficiency.

Sitemap-Based Crawling

Most websites publish a sitemap at /sitemap.xml. Instead of discovering pages through link-following, parse the sitemap first to get a complete URL list:

import requests
from xml.etree import ElementTree

def get_sitemap_urls(sitemap_url):
    response = requests.get(sitemap_url, timeout=10)
    if response.status_code != 200:
        return []
    root = ElementTree.fromstring(response.content)

    # Handle namespace
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = []
    for url_element in root.findall(".//sm:url/sm:loc", ns):
        urls.append(url_element.text)

    return urls

# Example: crawl a site with a published sitemap
urls = get_sitemap_urls("https://example.com/sitemap.xml")
print(f"Found {len(urls)} URLs in sitemap")

Sitemap-first crawling is faster and more complete than link-following. You know every URL upfront, can prioritize by <lastmod> or <priority>, and skip the parsing step for URL discovery. Use link-following as a fallback for pages not listed in the sitemap (not all sites publish one).
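
One wrinkle: many sites publish a sitemap index at /sitemap.xml that lists child sitemaps instead of pages. Here's a hedged sketch that recurses one level into <sitemapindex> entries, using the same imports as get_sitemap_urls above:

import requests
from xml.etree import ElementTree

def get_all_sitemap_urls(sitemap_url, max_depth=2):
    """Return page URLs; if the file is a <sitemapindex>, recurse into child sitemaps."""
    response = requests.get(sitemap_url, timeout=10)
    if response.status_code != 200:
        return []
    root = ElementTree.fromstring(response.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # A sitemap index lists other sitemap files instead of page URLs
    child_sitemaps = [loc.text for loc in root.findall(".//sm:sitemap/sm:loc", ns)]
    if child_sitemaps and max_depth > 0:
        urls = []
        for child in child_sitemaps:
            urls.extend(get_all_sitemap_urls(child, max_depth - 1))
        return urls

    return [loc.text for loc in root.findall(".//sm:url/sm:loc", ns)]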

Storing Crawl Data in a Database

For production crawlers, SQLite hits limits quickly. PostgreSQL handles concurrent writes, and its ON CONFLICT clause makes upserts simple:

import psycopg2

conn = psycopg2.connect("postgresql://localhost/crawler_db")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS crawl_queue (
        url TEXT PRIMARY KEY,
        status TEXT DEFAULT 'pending',
        discovered_at TIMESTAMP DEFAULT NOW(),
        crawled_at TIMESTAMP,
        depth INTEGER DEFAULT 0
    )
""")

def enqueue_url(cur, url, depth):
    cur.execute("""
        INSERT INTO crawl_queue (url, depth)
        VALUES (%s, %s)
        ON CONFLICT (url) DO NOTHING
    """, (url, depth))

def get_next_url(cur):
    cur.execute("""
        UPDATE crawl_queue
        SET status = 'crawling', crawled_at = NOW()
        WHERE url = (
            SELECT url FROM crawl_queue
            WHERE status = 'pending'
            ORDER BY depth, discovered_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING url, depth
    """)
    return cur.fetchone()

The FOR UPDATE SKIP LOCKED pattern lets multiple crawler workers pull URLs from the same queue without conflicts. Each worker locks one row, processes it, and moves on.

Rate Limiting and Backoff

Respect the target site. Start with 1 request per second. If you get 429 (Too Many Requests) responses, back off exponentially:

import time

def fetch_with_backoff(url, max_retries=3):
    delay = 1
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 429:
            print(f"Rate limited. Waiting {delay}s...")
            time.sleep(delay)
            delay *= 2
            continue
        return response
    return None

For large-scale crawling across multiple sites, maintain per-domain rate limiters. Different sites have different tolerance levels.
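
A per-domain limiter only needs a dict of last-request timestamps. A minimal sketch (the 1-second default is an assumption; tune it per site):

import time
from urllib.parse import urlparse

_last_request = {}  # domain -> timestamp of the most recent request

def wait_for_domain(url, min_delay=1.0):
    """Sleep just long enough to keep min_delay seconds between requests to the same domain."""
    domain = urlparse(url).netloc
    elapsed = time.time() - _last_request.get(domain, 0)
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    _last_request[domain] = time.time()

# In the crawl loop, call wait_for_domain(url) right before each fetch.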

Distributed Crawling with Job Queues

When one machine isn't enough, distribute the crawl across multiple workers. Redis + a job queue (like RQ in Python or BullMQ in Node.js) is the simplest pattern:

from redis import Redis
from rq import Queue

redis_conn = Redis()
queue = Queue(connection=redis_conn)

def crawl_page(url, depth):
    """Worker function: fetch, parse, enqueue new links."""
    response = requests.get(url, timeout=10)
    links = extract_links(response.text, url)
    for link in links:
        if depth < MAX_DEPTH:
            queue.enqueue(crawl_page, link, depth + 1)

# Seed the queue
queue.enqueue(crawl_page, "https://books.toscrape.com/", 0)

Run multiple workers with rq worker and they'll pull jobs from the same queue. Redis handles the coordination. For more on scaling automation, see our scaling web automation guide.

Running a Crawler in Docker

Docker makes deployment reproducible. Here's a minimal Dockerfile for the Python crawler:

# Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY crawler.py .
CMD ["python", "crawler.py"]
# requirements.txt
requests==2.32.3
beautifulsoup4==4.12.3

Build and run:

docker build -t my-crawler .
docker run --rm my-crawler

Docker is especially useful for distributed crawlers: deploy the same image across multiple machines, each pulling from the shared job queue.


Open Source Web Crawlers Compared

You don't have to build from scratch. Several mature frameworks handle crawling out of the box. Here's how they compare.

Scrapy (Python)

Scrapy is the most established Python crawling framework. It handles request scheduling, rate limiting, data pipelines, and export formats. You write "spiders" that define how to follow links and extract data.

Best for: Large-scale crawls (10,000+ pages), complex pipelines with data cleaning and export, projects that need middleware for proxies or cookies.

Limitation: No JavaScript support without plugins like Scrapy-Playwright. Steeper learning curve than a simple Requests script.

Colly (Go)

Colly is a fast, compiled crawling framework for Go. It handles callbacks, rate limiting, and parallel fetching. Compiles to a single binary for easy deployment.

Best for: High-performance crawling where speed matters, teams using Go, projects that need low memory usage.

Limitation: Go's static typing makes quick prototyping slower. No JavaScript support.

Crawlee (Node.js)

Crawlee (from the Apify team) is a modern Node.js crawling library that supports both HTTP crawling (Cheerio) and browser crawling (Playwright). It handles anti-bot measures, auto-scaling, and request queuing.

Best for: JavaScript teams, projects that need both static and browser-based crawling, teams already using Apify's infrastructure.

Limitation: Larger dependency tree. Browser mode requires Playwright setup and resource management.

Comparison: Open Source Crawlers vs Browserbeam

| Feature | Scrapy | Colly | Crawlee | Browserbeam |
| --- | --- | --- | --- | --- |
| Language | Python | Go | Node.js | Any (REST API) |
| JavaScript rendering | Plugin only | No | Yes (Playwright) | Yes (built-in) |
| Infrastructure to manage | Python + Scrapy | Go binary | Node.js + optional browsers | None (cloud) |
| Structured extraction | Custom code | Custom code | Custom code | Declarative schema |
| Anti-bot handling | Manual | Manual | Built-in | Built-in |
| Proxy support | Manual / middleware | Manual | Built-in | Built-in |
| Learning curve | Medium-High | Medium | Medium | Low |
| Best for | Large Python crawls | High-performance Go | JS teams, browser crawling | JS-rendered sites, no infra |

Use open-source frameworks when you need fine-grained control over the crawl process, have specific middleware requirements, or are crawling at extreme scale (millions of pages). Use Browserbeam when you need to crawl JavaScript-rendered pages without managing browser infrastructure, or when you want structured data extraction without writing parsing code.


Common Web Crawler Mistakes

Five pitfalls that trip up most developers building their first crawler. Each one has cost me hours of debugging.

Mistake 1: Ignoring robots.txt

The problem: Crawling paths that the site explicitly disallows.

Why it matters: Ignoring robots.txt can get your IP blocked immediately. Some sites will serve you fake data. In extreme cases, it creates legal exposure. More practically, it's disrespectful to site operators who set these rules for a reason (protecting heavy endpoints, preventing content theft, managing server load).

The fix: Always check robots.txt before crawling. Use Python's urllib.robotparser or the npm robots-parser package. If a path is disallowed, skip it.

Mistake 2: No Duplicate Detection

The problem: Crawling the same URL multiple times because your deduplication is incomplete.

Why it matters: Without deduplication, a crawler can enter infinite loops. Site A links to site B, site B links back to A. URL parameters create infinite variations: /page?sort=asc and /page?sort=desc might be the same content. Query parameter shuffling (?a=1&b=2 vs ?b=2&a=1) creates false "new" URLs.

The fix: Normalize URLs before checking the visited set. Remove fragments, sort query parameters, lowercase the hostname, and remove default ports:

from urllib.parse import urlparse, urlencode, parse_qs

def normalize_url(url):
    parsed = urlparse(url)
    # Sort query parameters
    params = parse_qs(parsed.query, keep_blank_values=True)
    sorted_query = urlencode(sorted(params.items()), doseq=True)
    # Rebuild without fragment, with sorted params
    normalized = parsed._replace(
        fragment="",
        query=sorted_query,
        netloc=parsed.netloc.lower()
    )
    return normalized.geturl()

Mistake 3: Unbounded Crawl Depth

The problem: Following links endlessly, crawling thousands of pages you don't need.

Why it matters: Sites with user-generated content, calendars, or search features can generate infinite URLs. A calendar with "next month" links never ends. Search result pages with paginated results can go thousands of pages deep.

The fix: Set a maximum depth from the seed URL. Track depth for each URL in the frontier. Stop adding links once you hit the limit:

MAX_DEPTH = 3
# frontier stores (url, depth) tuples
frontier = deque([(SEED_URL, 0)])

while frontier:
    url, depth = frontier.popleft()
    if depth > MAX_DEPTH:
        continue
    # ... crawl logic ...
    for link in new_links:
        frontier.append((link, depth + 1))

Mistake 4: Not Handling Errors Gracefully

The problem: A single failed request crashes the entire crawl.

Why it matters: Networks are unreliable. Servers return 500 errors, timeouts happen, DNS fails. A crawler that runs for hours needs to survive individual failures without losing all progress.

The fix: Wrap every fetch in try/except. Log failures but keep crawling. Save progress incrementally (not just at the end). Use retry logic with backoff for transient errors (429, 503, timeouts).
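
Here's one way that can look, sketched as a fetch wrapper that retries transient failures with exponential backoff (the status list is an assumption; adjust it for your targets):

import time
import requests

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def fetch_resilient(url, max_retries=3):
    delay = 1
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code in TRANSIENT_STATUSES:
                raise requests.RequestException(f"transient status {response.status_code}")
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(delay)
            delay *= 2  # exponential backoff before the next attempt
    return None  # caller logs the failure and moves on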

Mistake 5: Storing URLs Without Normalization

The problem: Your database fills with duplicate entries because the same page has multiple URL representations.

Why it matters: https://example.com/page, https://example.com/page/, https://EXAMPLE.COM/page, and https://example.com/page#section all point to the same content. Without normalization, you'll crawl and store each one separately.

The fix: Normalize every URL before storing or checking against the visited set. Strip trailing slashes (or always add them), lowercase the scheme and host, remove fragments, and sort query parameters. Do this at the point of discovery, not at the point of storage.
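
If you want all of those rules in one function, here's a sketch that extends the normalize_url example from Mistake 2 (whether to strip or always add trailing slashes is a convention you pick once and apply everywhere):

from urllib.parse import urlparse, urlencode, parse_qs

def normalize_url_full(url):
    parsed = urlparse(url)
    path = parsed.path.rstrip("/") or "/"  # strip trailing slash but keep a bare "/"
    params = parse_qs(parsed.query, keep_blank_values=True)
    sorted_query = urlencode(sorted(params.items()), doseq=True)
    return parsed._replace(
        scheme=parsed.scheme.lower(),
        netloc=parsed.netloc.lower(),
        path=path,
        query=sorted_query,
        fragment=""
    ).geturl()

print(normalize_url_full("https://EXAMPLE.com/page/?b=2&a=1#section"))
# https://example.com/page?a=1&b=2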


Frequently Asked Questions

How to build a web crawler?

Start with a seed URL, a queue (the "frontier"), and a visited set. Pop a URL from the queue, fetch the page, extract all links, add unseen links back to the queue, and repeat. Python's Requests + BeautifulSoup or Node.js's Axios + Cheerio are the simplest starting points. Add robots.txt checking, rate limiting, and a database as your crawler grows. The complete Python crawler in this guide runs in under 80 lines.

What is the difference between a web crawler and a web scraper?

A web crawler discovers pages by following links across a website. Its output is a list of URLs. A web scraper extracts structured data (prices, titles, dates) from specific pages. Its output is JSON or CSV. Most projects use both: the crawler finds pages, the scraper extracts data from each one. See the comparison table at the top of this guide for a full breakdown.

How does a web crawler handle robots.txt?

Before crawling any URL, the crawler fetches the site's /robots.txt file and checks whether the target path is allowed for its User-Agent. Python includes urllib.robotparser in the standard library. If a path is disallowed, the crawler skips it. The crawler should also honor Crawl-delay directives, which specify the minimum time between requests.

Can I build a web crawler in JavaScript?

Yes. Node.js is a strong choice for web crawling. Axios handles HTTP requests, Cheerio parses HTML with a jQuery-like API, and the async event loop handles concurrent fetching naturally. For JavaScript-rendered pages, add Puppeteer, Playwright, or the Browserbeam TypeScript SDK. The full JavaScript crawler in this guide runs against real websites with working selectors.

What database should I use for a web crawler?

SQLite for small crawls (under 100,000 URLs). It needs zero setup and handles single-writer workloads well. PostgreSQL for production crawlers with multiple workers. Its FOR UPDATE SKIP LOCKED pattern lets concurrent workers pull from the same queue safely. Redis for distributed URL frontiers where speed matters more than durability. The right choice depends on your scale and whether you need concurrent writers.

What are web crawler best practices?

Always respect robots.txt. Set a clear User-Agent string. Wait at least 1 second between requests to the same domain. Set a maximum crawl depth to avoid infinite loops. Normalize URLs before deduplication. Handle errors gracefully with retries and backoff. Save progress incrementally. Close connections and clean up resources. Monitor your crawler for anomalies (sudden spike in 404s means the site changed).

How do I build a distributed web crawler?

Use a shared job queue (Redis + RQ, BullMQ, or Celery) with multiple worker processes. Each worker pulls a URL from the queue, crawls it, discovers new links, and pushes them back to the queue. PostgreSQL or Redis serves as the shared URL frontier. Deploy workers as Docker containers for easy scaling. Start with 2-3 workers and scale based on the target site's tolerance.

Is web crawling legal?

Web crawling of publicly available information is generally legal in most jurisdictions. The 2022 US ruling in hiQ Labs v. LinkedIn affirmed that scraping public data does not violate the Computer Fraud and Abuse Act. That said, respecting robots.txt and terms of service is both ethical and practical. Crawling behind authentication, scraping copyrighted content for redistribution, or ignoring explicit cease-and-desist requests can create legal exposure. When in doubt, consult legal counsel for your specific use case.


Conclusion

You now have working web crawlers in Python and JavaScript, plus patterns for scaling them to production. The core loop is always the same: seed URL, frontier queue, fetch, parse, store, repeat. Everything else is refinement: robots.txt compliance, rate limiting, error handling, database storage, and distribution.

For JavaScript-rendered sites where static fetching returns empty pages, Browserbeam handles the browser so you don't have to. No Chromium installs, no Docker browser images, no crash handling. Just structured data back from every page your crawler discovers.

Here's what to try next:
- Point the Python crawler at a different site and adjust the link filter
- Add the sitemap-first pattern to skip discovery for known sites
- Try the Browserbeam crawler on a JavaScript-heavy site where Requests returns empty HTML
- Add concurrent fetching to the Node.js crawler with Promise.all

If you want structured extraction from crawled pages without writing parsing code, try Browserbeam for free. The API documentation covers every endpoint, and the SDKs install in seconds.

What will you crawl first?

No credit card required. 5,000 free credits included.